Generate strings based on regular expressions

During my PhD thesis, I have partly worked on the problem of the automatic accurate test data generation. In order to be complete and self-contained, I have addressed all kinds of data types, including strings. This article is the first one of a little series that aims at showing how to generate accurate and relevant strings under several constraints.

What is a regular expression?

We are talking about formal language theory here. In the known world, there are four kinds of languages. More formally, in 1956, the Chomsky hierarchy has been formulated, classifying grammars (which define languages) in four levels:

  1. unrestricted grammars, matching langages known as Turing languages, no restriction,
  2. context-sensitive grammars, matching contextual languages,
  3. context-free grammars, matching algebraic languages, based on stacked automata,
  4. regular grammars, matching regular languages.

Each level includes the next level. The last level is the “weaker”, which must not sound negative here. Regular expressions are used often because of their simplicity and also because they solve most problems we encounter daily.

A regular expression is a small language with very few operators and, most of the time, a simple semantics. For instance ab(c|d) means: a word (a data) starting by ab and followed by c or d. We also have quantification operators (also known as repetition operators), such as ?, * and +. We also have {x,y} to define a repetition between x and y. Thus, ? is equivalent to {0,1}, * to {0,} and + to {1,}. When y is missing, it means \displaystyle +\infty , so unbounded (or more exactly, bounded by the limits of the machine). So, for instance ab(c|d){2,4}e? means: a word starting by ab, followed 2, 3 or 4 times by c or d (so cc, cd, dc, ccc, ccd, cdc and so on) and potentially followed by e.

The goal here is not to teach you regular expressions but this is kind of a tiny reminder. There are plenty of regular languages. You might know POSIX regular expression or Perl Compatible Regular Expressions (PCRE). Forget the first one, please. The syntax and the semantics are too much limited. PCRE is the regular language I recommend all the time.

Behind every formal language there is a graph. A regular expression is compiled into a Finite State Machine (FSM). I am not going to draw and explain them, but it is interesting to know that behind a regular expression there is a basic automaton. No magic.

Why focussing regular expressions?

This article focuses on regular languages instead of other kind of languages because we use them very often (even daily). I am going to address context-free languages in another article, be patient young padawan. The needs and constraints with other kind of languages are not the same and more complex algorithms must be involved. So we are going easy for the first step.

Understanding PCRE: lex and parse them

The Hoa\Compiler library provides both \displaystyle LL(1) and \displaystyle LL(k) compiler-compilers. The documentation describes how to use it. We discover that the \displaystyle LL(k) compiler comes with a grammar description language called PP. What does it mean? It means for instance that the grammar of the PCRE can be written with the PP language and that Hoa\Compiler\Llk will transform this grammar into a compiler. That’s why we call them “compiler of compilers”.

Fortunately, the Hoa\Regex library provides the grammar of the PCRE language in the hoa://Library/Regex/Grammar.pp file. Consequently, we are able to analyze regular expressions written in the PCRE language! Let’s try in a shell at first with the hoa compiler:pp tool:

$ echo 'ab(c|d){2,4}e?' | hoa compiler:pp hoa://Library/Regex/Grammar.pp 0 --visitor dump
>  #expression
>  >  #concatenation
>  >  >  token(literal, a)
>  >  >  token(literal, b)
>  >  >  #quantification
>  >  >  >  #alternation
>  >  >  >  >  token(literal, c)
>  >  >  >  >  token(literal, d)
>  >  >  >  token(n_to_m, {2,4})
>  >  >  #quantification
>  >  >  >  token(literal, e)
>  >  >  >  token(zero_or_one, ?)

We read that the whole expression is composed of a single concatenation of two tokens: a and b, followed by a quantification, followed by another quantification. The first quantification is an alternation of (a choice betwen) two tokens: c and d, between 2 to 4 times. The second quantification is the e token that can appear zero or one time. Pretty simple.

The final output of the Hoa\Compiler\Llk\Parser class is an Abstract Syntax Tree (AST). The documentation of Hoa\Compiler explains all that stuff, you should read it. The \displaystyle LL(k) compiler is cut out into very distinct layers in order to improve hackability. Again, the documentation teach us we have four levels in the compilation process: lexical analyzer, syntactic analyzer, trace and AST. The lexical analyzer (also known as lexer) transforms the textual data being analyzed into a sequence of tokens (formally known as lexemes). It checks whether the data is composed of the good pieces. Then, the syntactic analyzer (also known as parser) checks that the order of tokens in this sequence is correct (formally we say that it derives the sequence, see the Matching words section to learn more).

Still in the shell, we can get the result of the lexical analyzer by using the --token-sequence option; thus:

$ echo 'ab(c|d){2,4}e?' | hoa compiler:pp hoa://Library/Regex/Grammar.pp 0 --token-sequence
  #  …  token name   token value  offset
-----------------------------------------
  0  …  literal      a                 0
  1  …  literal      b                 1
  2  …  capturing_   (                 2
  3  …  literal      c                 3
  4  …  alternation  |                 4
  5  …  literal      d                 5
  6  …  _capturing   )                 6
  7  …  n_to_m       {2,4}             7
  8  …  literal      e                12
  9  …  zero_or_one  ?                13
 10  …  EOF                           15

This is the sequence of tokens produced by the lexical analyzer. The tree is not yet built because this is the first step of the compilation process. However this is always interesting to understand these different steps and see how it works.

Now we are able to analyze any regular expressions in the PCRE format! The result of this analysis is a tree. You know what is fun with trees? Visiting them.

Visiting the AST

Unsurprisingly, each node of the AST can be visited thanks to the Hoa\Visitor library. Here is an example with the “dump” visitor:

use Hoa\Compiler;
use Hoa\File;

// 1. Load grammar.
$compiler = Compiler\Llk\Llk::load(
    new File\Read('hoa://Library/Regex/Grammar.pp')
);

// 2. Parse a data.
$ast      = $compiler->parse('ab(c|d){2,4}e?');

// 3. Dump the AST.
$dump     = new Compiler\Visitor\Dump();
echo $dump->visit($ast);

This program will print the same AST dump we have previously seen in the shell.

How to write our own visitor? A visitor is a class with a single visit method. Let’s try a visitor that pretty print a regular expression, i.e. transform:

ab(c|d){2,4}e?

into:

a
b
(
    c
    |
    d
){2,4}
e?

Why a pretty printer? First, it shows how to visit a tree. Second, it shows the structure of the visitor: we filter by node ID (#expression, #quantification, token etc.) and we apply respective computations. A pretty printer is often a good way for being familiarized with the structure of an AST.

Here is the class. It catches only useful constructions for the given example:

use Hoa\Visitor;

class PrettyPrinter implements Visitor\Visit {

    public function visit ( Visitor\Element $element,
                            &$handle = null,
                            $eldnah  = null ) {

        static $_indent = 0;

        $out    = null;
        $nodeId = $element->getId();

        switch($nodeId) {

            // Reset indentation and…
            case '#expression':
                $_indent = 0;

            // … visit all the children.
            case '#quantification':
                foreach($element->getChildren() as $child)
                    $out .= $child->accept($this, $handle, $eldnah);
              break;

            // One new line between each children of the concatenation.
            case '#concatenation':
                foreach($element->getChildren() as $child)
                    $out .= $child->accept($this, $handle, $eldnah) . "\n";
              break;

            // Add parenthesis and increase indentation.
            case '#alternation':
                $oout = [];

                $pIndent = str_repeat('    ', $_indent);
                ++$_indent;
                $cIndent = str_repeat('    ', $_indent);

                foreach($element->getChildren() as $child)
                    $oout[] = $cIndent . $child->accept($this, $handle, $eldnah);

                --$_indent;
                $out .= $pIndent . '(' . "\n" .
                        implode("\n" . $cIndent . '|' . "\n", $oout) . "\n" .
                        $pIndent . ')';
              break;

            // Print token value verbatim.
            case 'token':
                $tokenId    = $element->getValueToken();
                $tokenValue = $element->getValueValue();

                switch($tokenId) {

                    case 'literal':
                    case 'n_to_m':
                    case 'zero_or_one':
                        $out .= $tokenValue;
                       break;

                    default:
                        throw new RuntimeException(
                            'Token ID ' . $tokenId . ' is not well-handled.'
                        );
                }
              break;

            default:
                throw new RuntimeException(
                    'Node ID ' . $nodeId . ' is not well-handled.'
                );
        }

        return $out;
    }
}

And finally, we apply the pretty printer on the AST like previously seen:

$compiler    = Compiler\Llk\Llk::load(
    new File\Read('hoa://Library/Regex/Grammar.pp')
);
$ast         = $compiler->parse('ab(c|d){2,4}e?');
$prettyprint = new PrettyPrinter();
echo $prettyprint->visit($ast);

Et voilà !

Now, put all that stuff together!

Isotropic generation

We can use Hoa\Regex and Hoa\Compiler to get the AST of any regular expressions written in the PCRE format. We can use Hoa\Visitor to traverse the AST and apply computations according to the type of nodes. Our goal is to generate strings based on regular expressions. What kind of generation are we going to use? There are plenty of them: uniform random, smallest, coverage based…

The simplest is isotropic generation, also known as random generation. But random says nothing: what is the repartition, or do we have any uniformity? Isotropic means each choice will be solved randomly and uniformly. Uniformity has to be defined: does it include the whole set of nodes or just the immediate children of the node? Isotropic means we consider only immediate children. For instance, a node #alternation has \displaystyle c^1 immediate children, the probability \displaystyle C to choose one child is:

\displaystyle P(C) = \frac{1}{c^1}

Yes, simple as that!

We can use the Hoa\Math library that provides the Hoa\Math\Sampler\Random class to sample uniform random integers and floats. Ready?

Structure of the visitor

The structure of the visitor is the following:

use Hoa\Visitor;
use Hoa\Math;

class IsotropicSampler implements Visitor\Visit {

    protected $_sampler = null;

    public function __construct ( Math\Sampler $sampler ) {

        $this->_sampler = $sampler;

        return;
    }

    public function visit ( Visitor\Element $element,
                            &$handle = null,
                            $eldnah  = null ) {

        switch($element->getId()) {

            // …
        }
    }
}

We set a sampler and we start visiting and filtering nodes by their node ID. The following code will generate a string based on the regular expression contained in the $expression variable:

$expression  = '…';
$ast         = $compiler->parse($expression);
$generator   = new IsotropicSampler(new Math\Sampler\Random());
echo $generator->visit($ast);

We are going to change the value of $expression step by step until having ab(c|d){2,4}e?.

Case of #expression

A node of type #expression has only one child. Thus, we simply return the computation of this node:

case '#expression':
    return $element->getChild(0)->accept($this, $handle, $eldnah);
  break;

Case of token

We consider only one type of token for now: literal. A literal can contain an escaped character, can be a single character or can be . (which means everything). We consider only a single character for this example (spoil: the whole visitor already exists). Thus:

case 'token':
    return $element->getValueValue();
  break;

Here, with $expression = 'a'; we get the string a.

Case of #concatenation

A concatenation is just the computation of all children joined in a single piece of string. Thus:

case '#concatenation':
    $out = null;

    foreach($element->getChildren() as $child)
        $out .= $child->accept($this, $handle, $eldnah);

    return $out;
  break;

At this step, with $expression = 'ab'; we get the string ab. Totally crazy.

Case of #alternation

An alternation is a choice between several children. All we have to do is to select a child based on the probability given above. The number of children for the current node can be known thanks to the getChildrenNumber method. We are also going to use the sampler of integers. Thus:

case '#alternation':
    $childIndex = $this->_sampler->getInteger(
        0,
        $element->getChildrenNumber() - 1
    );

    return $element->getChild($childIndex)
                   ->accept($this, $handle, $eldnah);
  break;

Now, with $expression = 'ab(c|d)'; we get the strings abc or abd at random. Try several times to see by yourself.

Case of #quantification

A quantification is an alternation of concatenations. Indeed, e{2,4} is strictly equivalent to ee|eee|eeee. We have only two quantifications in our example: ? and {x,y}. We are going to find the value for x and y and then choose at random between these bounds. Let’s go:

case '#quantification':
    $out = null;
    $x   = 0;
    $y   = 0;

    // Filter the type of quantification.
    switch($element->getChild(1)->getValueToken()) {

        // ?
        case 'zero_or_one':
            $y = 1;
          break;

        // {x,y}
        case 'n_to_m':
            $xy = explode(
                ',',
                trim($element->getChild(1)->getValueValue(), '{}')
            );
            $x  = (int) trim($xy[0]);
            $y  = (int) trim($xy[1]);
          break;
    }

    // Choose the number of repetitions.
    $max = $this->_sampler->getInteger($x, $y);

    // Concatenate.
    for($i = 0; $i < $max; ++$i)
        $out .= $element->getChild(0)->accept($this, $handle, $eldnah);

    return $out;
  break;

Finally, with $expression = 'ab(c|d){2,4}e?'; we can have the following strings: abdcce, abdc, abddcd, abcde etc. Nice isn’t it? Want more?

for($i = 0; $i < 42; ++$i)
    echo $generator->visit($ast), "\n";

/**
 * Could output:
 *     abdce
 *     abdcc
 *     abcdde
 *     abcdcd
 *     abcde
 *     abcc
 *     abddcde
 *     abddcce
 *     abcde
 *     abcc
 *     abdcce
 *     abcde
 *     abdce
 *     abdd
 *     abcdce
 *     abccd
 *     abdcdd
 *     abcdcce
 *     abcce
 *     abddc
 */

Performance

This is difficult to give numbers because it depends of a lot of parameters: your machine configuration, the PHP VM, if other programs run etc. But I have generated 1 million ( \displaystyle 10^6 ) strings in less than 25 seconds on my machine (an old MacBook Pro), which is pretty reasonable.

Time (in milliseconds) to generate a certain number of strings (log-scaled).

Conclusion and surprise

So, yes, now we know how to generate strings based on regular expressions! Supporting all the PCRE format is difficult. That’s why the Hoa\Regex library provides the Hoa\Regex\Visitor\Isotropic class that is a more advanced visitor. This latter supports classes, negative classes, ranges, all quantifications, all kinds of literals (characters, escaped characters, types of characters —\w, \d, \h…—) etc. Consequently, all you have to do is:

use Hoa\Regex;

// …
$generator = new Regex\Visitor\Isotropic(new Math\Sampler\Random());
echo $generator->visit($ast);

This algorithm is used in Praspel, a specification language I have designed during my PhD thesis. More specifically, this algorithm is used inside realistic domains. I am not going to explain it today but it allows me to introduce the “surprise”.

Generate strings based on regular expressions in atoum

atoum is an awesome unit test framework. You can use the Atoum\PraspelExtension extension to use Praspel and therefore realistic domains inside atoum. You can use realistic domains to validate and to generate data, they are designed for that. Obviously, we can use the Regex realistic domain. This extension provides several features including sample, sampleMany and predicate to respectively generate one datum, generate many data and validate a datum based on a realistic domain. To declare a regular expression, we must write:

$regex = $this->realdom->regex('/ab(c|d){2,4}e?/');

And to generate a datum, all we have to do is:

$datum = $this->sample($regex);

For instance, imagine you are writing a test called test_mail and you need an email address:

public function test_mail ( ) {

    $this
        ->given(
            $regex   = $this->realdom->regex('/[\w\-_]+(\.[\w\-\_]+)*@\w\.(net|org)/'),
            $address = $this->sample($regex),
            $mailer  = new \Mock\Mailer(),
        )
        ->when($mailer->sendTo($address))
        ->then
            ->}

Easy to read, fast to execute and help to focus on the logic of the test instead of test data (also known as fixtures). Note that most of the time the regular expressions are already in the code (maybe as constants). It is therefore easier to write and to maintain the tests.

I hope you enjoyed this first part of the series :-)! This work has been published in the International Conference on Software Testing, Verification and Validation: Grammar-Based Testing using Realistic Domains in PHP.

Rüsh Release

Since 2 years, at Hoa, we are looking for the perfect release process. Today, we have finalized the last thing related to this new process: we have found a name. It is called Rüsh Release, standing for Rolling Ünd ScHeduled Release.

The following explanations are useful from the user point of view, not from the developer point of view. It means that we do not explain all the branches and the workflow between all of them. We will settle for the user final impact.

Rolling Release

On one hand, Hoa is not and will never be finished. We will never reach the “Holy 1.0 Grail”. So, one might reckon that Hoa is rolling-released? Let’s dive into this direction. There are plenty rolling release types out there, such as:

  • partially rolling,
  • fully rolling,
  • truly rolling,
  • pseudo-rolling,
  • optionally rolling,
  • cyclically rolling,
  • and synonyms…

I am not going to explain all of them. All you need to know is that Hoa is partly and truly rolling released, or part- and true-rolling released for short. Why? Firstly, “Part-rolling [project] has a subset of software packages that are not rolling”. If we look at Hoa only, it is fully rolling but Hoa depends on PHP virtual machines to be executed, which are not rolling released (for the most popular ones at least). Thus, Hoa is partly rolling released. Secondly, “True-rolling [project] are developed solely using a rolling release software development model”, which is the case of Hoa. Consequently and finally, the master branch is the final public branch, it means that it always contains the latest version, and users constantly fetch updates from it.

Versioning

Sounds good. On the other hand, the majority of programs that are using Hoa use tools called dependency managers. The most popular in PHP is Composer. This is a fantastic tool but with a little spine that hurts us a lot: it does not support rolling release! Most of the time, dependency managers work with version numbers, mainly of the form x.y.z, with a specific semantics for x, y and z. For instance, some people have agreed about semver, standing for Semantic Versioning.

Also, we are not extremist. We understand the challenges and the needs behind versioning. So, how to mix both: rolling release and versioning? Before answering this question, let’s progress a little step forward and learn more about an alternative versioning approach.

Scheduled-based release

Scheduled-based, also known as date-based, release allows to define releases at regular periods of time. This approach is widely adopted for projects that progress quickly, such as Firefox or PHP (see the PHP RFC: Release Process for example). For Firefox, every 6 weeks, a new version is released. Note that we should say a new update to be honest: the term version has an unclear meaning here.

The scheduled-based release seems a good candidate to be mixed with rolling release, isn’t it?

Rüsh Release

Rüsh Release is a mix between part- and true-rolling release and scheduled-based release. The master branch is part- and true-rolling release, but with a semi-automatically versioning:

  • each 6 weeks, if at least one new patch has been merged into the master, a new version is created,
  • before 6 weeks, if several critical or significant patches have been applied, a new version is created.

What is the version format then? We have proposed YY{2,4}.mm.dd, starting from 2000, our “Rüsh Epoch”.

Nevertheless, we are not infallible and we can potentially break backward compatibility. It never happened but we have to face it. This is a problem because neither the part- and true-rolling release nor the scheduled-based release holds the information that the backward compatibility has been broken. Therefore, the master branch must have a compatibility number x, starting from 1 with step of 1. Consequently, the new and last version format is x.Y{2,4}.mm.dd. For today for instance, it is 1.14.09.15.

With the Rüsh Release process, we can freely rolling release our libraries while ensuring the safety and embracing the pros of versioning.

So, now, you will be able to change your composer.json files from:

{
    "require": {
        "hoa/websocket": "dev-master"
    },
    "minimum-stability": "dev"
}

to (learn more about the tilde operator):

{
    "require": {
        "hoa/websocket": "~1.0"
    }
}

\o/

Hello, World!

This is my first post of my first public and Computer Science-oriented blog. When I started Hoa few years ago, along with other projects, they all were mainly a pretext to learn about numerous domains of Computer Science. All the acquired knowledges have taken the form of libraries or programs, go hand in hand with detailed documentations and README. However, while I always like to write an introduction to my documentation chapters (examples with Hoa\Websocket or Hoa\Compiler), these are not the place where to write the making-of or the behind the scenes, i.e. all the implementation details and choices. Consequently, this blog will be Computer Science-oriented and especially technical-oriented.

Moreover, this blog will be a place where I can freely talk about the future of projects where I am involved. For instance, we are going to land a new special release system for Hoa and the official blog of Hoa is not the place where to make pre-announcements. Similarly, Hoa’s blog is still not the place where to talk about current works and advancements. Consequently, this blog will fill those needs.

To conclude:

echo 'Hello world!';