Generate strings based on regular expressions
During my PhD thesis, I partly worked on the problem of automatic, accurate test data generation. In order to be complete and self-contained, I addressed all kinds of data types, including strings. This article shows how to generate accurate and relevant strings under several kinds of constraints.
What is a regular expression?
We are talking about formal language theory here. In the known world, there are four kinds of languages. More formally, in 1956, the Chomsky hierarchy was formulated, classifying grammars (which define languages) into four levels:
- unrestricted grammars, matching languages known as Turing languages, with no restriction,
- context-sensitive grammars, matching contextual languages,
- context-free grammars, matching algebraic languages, recognized by pushdown (stack) automata,
- regular grammars, matching regular languages.
Each level includes the next one. The last level is the “weakest”, which must not sound negative here. Regular expressions are often used because of their simplicity and also because they solve most of the problems we encounter daily.
A regular expression is a small language with very few operators and, most of the time, a simple semantics. For instance ab(c|d) means: a word (a datum) starting with ab and followed by c or d. We also have quantification operators (also known as repetition operators), such as ?, * and +. We also have {_x_,_y_} to define a repetition between _x_ and _y_ times. Thus, ? is equivalent to {0,1}, * to {0,} and + to {1,}. When _y_ is missing, it means +∞, i.e. unbounded (or, more exactly, bounded by the limits of the machine). So, for instance, ab(c|d){2,4}e? means: a word starting with ab, followed 2, 3 or 4 times by c or d (so cc, cd, dc, ccc, ccd, cdc and so on) and potentially followed by e.
The goal here is not to teach you regular expressions; this is just a tiny reminder. There are plenty of regular expression dialects. You might know POSIX regular expressions or Perl Compatible Regular Expressions (PCRE). Forget the first one, please: its syntax and semantics are too limited. PCRE is the regular expression language I recommend all the time.
Behind every formal language there is an automaton. A regular expression is compiled into a Finite State Machine (FSM). I am not going to draw or explain them here, but it is interesting to know that behind a regular expression there is a basic automaton. No magic.
Why focus on regular expressions?
This article focuses on regular languages instead of other kinds of languages because we use them very often (even daily). I am going to address context-free languages in another article, be patient young padawan. The needs and constraints with other kinds of languages are not the same, and more complex algorithms must be involved. So we are going to take it easy for this first step.
Understanding PCRE: lexing and parsing regular expressions
The Hoa\Compiler library provides both LL(1) and LL(k) compiler-compilers. The documentation describes how to use it. We discover that the LL(k) compiler comes with a grammar description language called PP. What does it mean? It means, for instance, that the grammar of PCRE can be written in the PP language and that Hoa\Compiler\Llk will transform this grammar into a compiler. That's why we call them “compiler-compilers”.
Fortunately, the Hoa\Regex library provides the grammar of the PCRE language in the hoa://Library/Regex/Grammar.pp file. Consequently, we are able to analyze regular expressions written in the PCRE language! Let's first try in a shell with the hoa compiler:pp tool:
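From memory, the invocation looks like the following (the exact arguments may vary slightly between versions of the hoa CLI tool; the expression is piped on the standard input, referred to by 0, and the AST is printed with the dump visitor):

```sh
$ echo 'ab(c|d){2,4}e?' | hoa compiler:pp hoa://Library/Regex/Grammar.pp 0 --visitor dump
```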
> #expression
> > #concatenation
> > > token(literal, a)
> > > token(literal, b)
> > > #quantification
> > > > #alternation
> > > > > token(literal, c)
> > > > > token(literal, d)
> > > > token(n_to_m, {2,4})
> > > #quantification
> > > > token(literal, e)
> > > > token(zero_or_one, ?)
We read that the whole expression is composed of a single concatenation of two tokens, a and b, followed by a quantification, followed by another quantification. The first quantification is an alternation of (a choice between) two tokens, c and d, repeated 2 to 4 times. The second quantification is the e token that can appear zero or one time. Pretty simple.
The final output of the Hoa\Compiler\Llk\Parser class is an Abstract Syntax Tree (AST).
The documentation of Hoa\Compiler explains all that stuff, you should read it. The LL(k) compiler is cut out into very distinct layers in order to improve hackability. Again, the documentation teaches us that there are four levels in the compilation process: lexical analyzer, syntactic analyzer, trace and AST. The lexical analyzer (also known as a lexer) transforms the textual data being analyzed into a sequence of tokens (formally known as lexemes). It checks whether the data is composed of the right pieces. Then, the syntactic analyzer (also known as a parser) checks that the order of tokens in this sequence is correct (formally, we say that it derives the sequence; see the Matching words section to learn more).
Still in the shell, we can get the result of the lexical analyzer by using the --token-sequence option; thus:
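Again from memory (same caveat about the exact arguments as above):

```sh
$ echo 'ab(c|d){2,4}e?' | hoa compiler:pp hoa://Library/Regex/Grammar.pp 0 --token-sequence
```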
The output is a table with one row per token, listing the token name, the token value and its offset in the analyzed string.
This is the sequence of tokens produced by the lexical analyzer. The tree is not built yet because this is only the first step of the compilation process. However, it is always interesting to understand these different steps and to see how it works.
Now we are able to analyze any regular expression in the PCRE format! The result of this analysis is a tree. You know what is fun with trees? Visiting them.
Visiting the AST
Unsurprisingly, each node of the AST can be visited thanks to the Hoa\Visitor library. Here is an example with the “dump” visitor:
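A sketch of such a program (the class names — Compiler\Llk\Llk, File\Read and Compiler\Visitor\Dump — are the ones Hoa provided at the time of writing; double-check them against your installed version):

```php
<?php

use Hoa\Compiler;
use Hoa\File;

// Load the compiler-compiler with the PCRE grammar shipped by Hoa\Regex.
$compiler = Compiler\Llk\Llk::load(
    new File\Read('hoa://Library/Regex/Grammar.pp')
);

// Lex and parse the regular expression; the result is the AST.
$ast = $compiler->parse('ab(c|d){2,4}e?');

// Visit the AST with the “dump” visitor.
$dump = new Compiler\Visitor\Dump();
echo $dump->visit($ast);
```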
This program will print the same AST dump we have previously seen in the shell.
How do we write our own visitor? A visitor is a class with a single visit method. Let's try a visitor that pretty prints a regular expression, i.e. that transforms:
ab(c|d){2,4}e?
into:
a
b
(
c
|
d
){2,4}
e?
Why a pretty printer? First, it shows how to visit a tree. Second, it shows the structure of the visitor: we filter by node ID (#expression, #quantification, token etc.) and we apply the respective computations. A pretty printer is often a good way to get familiar with the structure of an AST.
Here is the class. It catches only useful constructions for the given example:
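A minimal sketch of such a class could look like this (it only covers the node IDs seen in the dump above and relies on the getId, getChild, getChildren, getValueValue and accept methods of the visited elements):

```php
<?php

use Hoa\Visitor;

class PrettyPrinter implements Visitor\Visit
{
    public function visit(Visitor\Element $element, &$handle = null, $eldnah = null)
    {
        $out = null;

        switch ($element->getId()) {
            // An expression has a single child: forward to it.
            case '#expression':
                $out = $element->getChild(0)->accept($this, $handle, $eldnah);

                break;

            // Print each part of the concatenation on its own line.
            case '#concatenation':
                foreach ($element->getChildren() as $child) {
                    $out .= $child->accept($this, $handle, $eldnah) . "\n";
                }

                break;

            // Surround the choices with parentheses, separated by |.
            case '#alternation':
                $children = [];

                foreach ($element->getChildren() as $child) {
                    $children[] = $child->accept($this, $handle, $eldnah);
                }

                $out = "(\n" . implode("\n|\n", $children) . "\n)";

                break;

            // First child: the quantified sub-expression;
            // second child: the quantifier token (?, {2,4}, …).
            case '#quantification':
                $out =
                    $element->getChild(0)->accept($this, $handle, $eldnah) .
                    $element->getChild(1)->getValueValue();

                break;

            // A literal token: print its value as-is.
            case 'token':
                $out = $element->getValueValue();

                break;
        }

        return $out;
    }
}
```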
And finally, we apply the pretty printer on the AST as previously seen:
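Assuming the $compiler loaded in the dump example above is still around:

```php
$ast           = $compiler->parse('ab(c|d){2,4}e?');
$prettyPrinter = new PrettyPrinter();

echo $prettyPrinter->visit($ast);
```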
Et voilà !
Now, let's put all that stuff together!
Isotropic generation
We can use Hoa\Regex and Hoa\Compiler to get the AST of any regular expression written in the PCRE format. We can use Hoa\Visitor to traverse the AST and apply computations according to the type of each node. Our goal is to generate strings based on regular expressions. What kind of generation are we going to use? There are plenty of them: uniform random, smallest, coverage-based…
The simplest is isotropic generation, also known as random generation. But “random” says nothing: what is the distribution, do we have any uniformity? Isotropic means each choice will be resolved randomly and uniformly. Uniformity has to be defined: does it cover the whole set of nodes or just the immediate children of a node? Isotropic means we consider only the immediate children. For instance, if a node #alternation has c immediate children, the probability of choosing one particular child is:
P(child) = 1 / c
Yes, simple as that!
We can use the Hoa\Math library, which provides the Hoa\Math\Sampler\Random class to sample uniformly random integers and floats. Ready?
Structure of the visitor
The structure of the visitor is the following:
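A sketch of that skeleton (again implementing Hoa\Visitor\Visit; the node cases are filled in one by one in the next sections):

```php
<?php

use Hoa\Math;
use Hoa\Visitor;

class IsotropicSampler implements Visitor\Visit
{
    protected $_sampler = null;

    public function __construct(Math\Sampler\Random $sampler)
    {
        $this->_sampler = $sampler;

        return;
    }

    public function visit(Visitor\Element $element, &$handle = null, $eldnah = null)
    {
        switch ($element->getId()) {
            // Cases for #expression, token, #concatenation,
            // #alternation and #quantification go here.
        }
    }
}
```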
We set a sampler and we start visiting and filtering nodes by their node ID. The following code will generate a string based on the regular expression contained in the $expression variable:
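A sketch of that driver code ($compiler is the compiler loaded from the Regex grammar, exactly as in the dump example above):

```php
$expression = 'a'; // grown step by step below
$sampler    = new IsotropicSampler(new Math\Sampler\Random());

echo $sampler->visit($compiler->parse($expression)), "\n";
```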
We are going to change the value of $expression step by step until we reach ab(c|d){2,4}e?.
Case of #expression
A node of type #expression has only one child. Thus, we simply return the computation of this child:
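In the switch of the visit method, that could be (a sketch):

```php
case '#expression':
    return $element->getChild(0)->accept($this, $handle, $eldnah);
```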
Case of token
We consider only one type of token for now: literal. A literal can contain an escaped character, can be a single character, or can be . (which means everything). We consider only a single character for this example (spoiler: the whole visitor already exists). Thus:
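A minimal version of this case (getValueValue returns the raw value of the token held by the node):

```php
case 'token':
    return $element->getValueValue();
```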
Here, with $expression = 'a'; we get the string a.
Case of #concatenation
A concatenation is just the computation of all children joined into a single string. Thus:
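For instance (a sketch):

```php
case '#concatenation':
    $out = null;

    foreach ($element->getChildren() as $child) {
        $out .= $child->accept($this, $handle, $eldnah);
    }

    return $out;
```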
At this step, with $expression = 'ab'; we get the string ab. Totally crazy.
Case of #alternation
An alternation is a choice between several children. All we have to do is to select a child based on the probability given above. The number of children of the current node can be known thanks to the getChildrenNumber method. We are also going to use the integer sampler. Thus:
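A sketch of this case (getInteger(a, b) is assumed to return a uniformly distributed integer between a and b, which is what Hoa\Math\Sampler\Random provides as far as I know):

```php
case '#alternation':
    return $element->getChild(
        $this->_sampler->getInteger(0, $element->getChildrenNumber() - 1)
    )->accept($this, $handle, $eldnah);
```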
Now, with $expression = 'ab(c|d)'; we get the strings abc or abd at random. Try several times to see for yourself.
Case of #quantification
A quantification is an alternation of concatenations. Indeed, e{2,4} is strictly equivalent to ee|eee|eeee. We have only two quantifications in our example: ? and {_x_,_y_}. We are going to find the values of _x_ and _y_ and then choose a number of repetitions at random between these bounds. Let's go:
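Here is a sketch handling only these two quantifiers, by inspecting the raw value of the quantifier token (the second child of the node) rather than its token name:

```php
case '#quantification':
    $out = null;
    $x   = 0;
    $y   = 0;

    // The second child is the quantifier token: “?” or “{x,y}”.
    $value = $element->getChild(1)->getValueValue();

    if ('?' === $value) {
        $y = 1;
    } elseif (0 !== preg_match('#\{(\d+),(\d+)\}#', $value, $matches)) {
        $x = (int) $matches[1];
        $y = (int) $matches[2];
    }

    // Choose the number of repetitions between the two bounds…
    $n = $this->_sampler->getInteger($x, $y);

    // … and concatenate that many samples of the quantified sub-expression.
    for ($i = 0; $i < $n; ++$i) {
        $out .= $element->getChild(0)->accept($this, $handle, $eldnah);
    }

    return $out;
```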
Finally, with $expression = 'ab(c|d){2,4}e?'; we can get the following strings: abdcce, abdc, abddcd, abcde etc. Nice, isn't it? Want more?
Performance
It is difficult to give numbers because it depends on a lot of parameters: your machine configuration, the PHP VM, whether other programs are running etc. But I have generated 1 million strings in less than 25 seconds on my machine (an old MacBook Pro), which is pretty reasonable.
Conclusion and surprise
So, yes, now we know how to generate strings based on regular expressions! Supporting the whole PCRE format is difficult. That's why the Hoa\Regex library provides the Hoa\Regex\Visitor\Isotropic class, which is a more advanced visitor. It supports classes, negative classes, ranges, all quantifications, all kinds of literals (characters, escaped characters, types of characters such as \w, \d, \h…) etc. Consequently, all you have to do is:
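Something like the following glue code (a sketch; the class and namespace names are the ones Hoa used at the time of writing):

```php
<?php

use Hoa\Compiler;
use Hoa\File;
use Hoa\Math;
use Hoa\Regex;

// Load the compiler from the PCRE grammar, parse an expression and
// visit the resulting AST with the Isotropic visitor.
$compiler  = Compiler\Llk\Llk::load(
    new File\Read('hoa://Library/Regex/Grammar.pp')
);
$ast       = $compiler->parse('ab(c|d){2,4}e?');
$generator = new Regex\Visitor\Isotropic(new Math\Sampler\Random());

echo $generator->visit($ast), "\n";
```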
This algorithm is used in Praspel, a specification language I designed during my PhD thesis. More specifically, this algorithm is used inside realistic domains. I am not going to explain them today, but they allow me to introduce the “surprise”.
Generate strings based on regular expressions in atoum
atoum is an awesome unit test framework. You can use the Atoum\PraspelExtension extension to use Praspel, and therefore realistic domains, inside atoum. You can use realistic domains to validate and to generate data; they are designed for that. Obviously, we can use the Regex realistic domain.
This extension provides several features, including sample, sampleMany and predicate, to respectively generate one datum, generate many data and validate a datum based on a realistic domain. To declare a regular expression, we must write:
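From memory of the extension's API (the realdom property exposes the realistic domain builders inside a test method; double-check against the extension's documentation, and the expression here is just an illustration):

```php
$regex = $this->realdom->regex('/ab(c|d){2,4}e?/');
```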
And to generate a datum, all we have to do is:
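Roughly (again from memory of the extension's API):

```php
$datum = $this->sample($regex);
```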
For instance, imagine you are writing a test called test_mail and you need an email address:
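A sketch of what such a test could look like (the mocked Mailer class, the expression and the assertions are purely illustrative; only realdom->regex and sample come from the Praspel extension as I remember it):

```php
public function test_mail()
{
    $this
        ->given(
            // Hypothetical mocked mailer, only for the illustration.
            $mailer  = new \mock\Mailer(),
            $address = $this->realdom->regex('/[\w\-]+@[\w\-]+\.(net|org)/'),
            $email   = $this->sample($address)
        )
        ->when($result = $mailer->send($email))
        ->then
            ->boolean($result)
                ->isTrue();
}
```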
Easy to read, fast to execute, and it helps to focus on the logic of the test instead of on the test data (also known as fixtures). Note that, most of the time, the regular expressions are already in the code (maybe as constants). It is therefore easier to write and to maintain the tests.
I hope you enjoyed this first part of the series :-)! This work has been published at the International Conference on Software Testing, Verification and Validation: Grammar-Based Testing using Realistic Domains in PHP.