atoum is a popular PHP test framework. TeamCity is a Continuous Integration and Continuous Delivery software developed by JetBrains. Although atoum supports many industry standards to report test execution verdicts, TeamCity uses its own non-standard report format, and thus atoum was not compatible with TeamCity… until now.

The atoum/teamcity-extension provides TeamCity support inside atoum. When executing tests, the reported verdicts are understandable by TeamCity, and activate all its UI features.

## Install

If you have Composer, just run:

$ composer require atoum/teamcity-extension '~1.0'

From this point, you need to enable the extension in your .atoum.php configuration file. The following example enables the extension for every test execution:

$extension = new atoum\teamcity\extension($script);
$extension->addToRunner($runner);

The following example enables the extension only within a TeamCity environment:

$extension = new atoum\teamcity\extension($script);
$extension->addToRunnerWithinTeamCityEnvironment($runner);

This latter installation is recommended. That’s it 🙂.

## Glance

The default CLI report looks like this:

The TeamCity report looks like this in your terminal (note the TEAMCITY_VERSION variable as a way to emulate a TeamCity environment):

Which is less easy to read. However, once it reaches the TeamCity UI, we get the following result:

We are using it at Automattic. Hope it is useful for someone else!

If you find any bugs, or would like any other features, please use GitHub at the following repository: https://github.com/Hywan/atoum-teamcity-extension/.

# Export functions in PHP à la Javascript

Warning: This post is totally useless. It is the result of a fun private company thread.

## Export functions in Javascript

In Javascript, a file can export functions like this:

export function times2(x) {
    return x * 2;
}

And then we can import this function in another file like this:

import {times2} from 'foo';

console.log(times2(21)); // 42

Is it possible with PHP?

## Export functions in PHP

Every entity is public in PHP: Constant, function, class, interface, or trait. They can live in a namespace. So exporting functions in PHP is absolutely useless, but just for fun, let’s keep going.

A PHP file can return an integer, a real, an array, an anonymous function, anything. Let’s try this:

<?php

return function (int $x): int {
    return $x * 2;
};

And then in another file:

<?php

$times2 = require 'foo.php';
var_dump($times2(21)); // int(42)

Great, it works.

What if our file returns more than one function? Let’s use an array (which has most hashmap properties):

<?php

return [
    'times2' => function (int $x): int {
        return $x * 2;
    },
    'answer' => function (): int {
        return 42;
    }
];

To choose what to import, let’s use the list intrinsic. It has several forms: With or without key matching, long (list(…)) and short syntax ([…]). Because we are modern, we will use the short syntax with key matching to selectively import functions:

<?php

['times2' => $mul] = require 'foo.php';

var_dump($mul(21)); // int(42)

Notice that times2 has been aliased to $mul. What a feature!
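To play with the idea without creating two files, here is a minimal sketch where an immediately-invoked closure stands in for `require 'foo.php'`. The `$module` variable and the `$answer` alias are only illustrative names; keyed destructuring with the short syntax assumes PHP 7.1 or later.

```php
<?php

// Stand-in for `require 'foo.php'`: a closure returning the "exports".
$module = (function (): array {
    return [
        'times2' => function (int $x): int {
            return $x * 2;
        },
        'answer' => function (): int {
            return 42;
        },
    ];
})();

// Import both functions at once, aliasing `times2` to `$mul`.
['times2' => $mul, 'answer' => $answer] = $module;

// Compose the two imported functions.
$result = $mul($answer());
```

Any key that is not listed on the left-hand side is simply not imported, which is the closest thing to a selective import this trick offers.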

Is it useful? Absolutely not. Is it fun? For me it is.

# Finite-State Machine as a Type System illustrated with a store product

Hello fellow coders!

In this article, I would like to talk about how to implement a Finite-State Machine (FSM) with the PHP type system. The example is a store product (in an e-commerce solution for instance), something we are likely to meet at least once in our lifetime. Our goal is simply to avoid impossible states and transitions.

I am deeply in love with Type theory; however, I will try to keep the formulas out of this article to focus on the code. Moreover, you might be aware that the PHP runtime type system is rather permissive and “poor” (this is not a formal definition). Fortunately, some tricks can help us express nice constraints.

## The Product FSM

A product in a store might have the following states:

• Active: Can be purchased,
• Inactive: Has been cancelled or discontinued (a discontinued product can no longer be purchased),
• Purchased and renewable,
• Purchased and not renewable,
• Purchased and cancellable.

The transitions between these states can be viewed as a Finite-State Machine (FSM).

We read this graph as: A product is in the state A. If the purchase action is called, then it transitions to the state B. If the once-off purchase action is called, then it transitions to the state C. From the state B, if the renew action is called, it remains in the same state. If the cancel action is called, it transitions to the D state. Same for the C to D states.

Our goal is to respect this FSM. Invalid actions must be impossible to do.

## Finite-State Machine as a Type System

Having an FSM is a good way to define the states and the transitions between them: It is formal and clear. However, it is usually checked at runtime, not at compile-time, i.e. if statements are required to test whether the state of a product can transition into another state, or else throw an exception, and this is decided at runtime. Note that PHP does not really have a compile-time because it is an online compiler (learn more by reading Tagua VM, a safe PHP virtual machine, at slide 29). Our goal is to prevent illegal/invalid states at parse-/compile-time so that the PHP virtual machine, IDE or static analysis tools can prove the state of a product without executing PHP code.
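For contrast, here is a minimal sketch of the runtime-checked approach described above. The `RuntimeProduct` class, the `purchaseOnce` action name and the state letters are assumptions made for illustration; they only mirror the A, B, C and D states of the graph.

```php
<?php

// Hypothetical runtime-checked FSM: every transition is validated with a
// lookup (the "if statement") when the action is applied, not before.
final class RuntimeProduct
{
    // Allowed transitions: state => [action => next state].
    private const TRANSITIONS = [
        'A' => ['purchase' => 'B', 'purchaseOnce' => 'C'],
        'B' => ['renew' => 'B', 'cancel' => 'D'],
        'C' => ['cancel' => 'D'],
        'D' => [],
    ];

    private $state = 'A';

    public function apply(string $action): void
    {
        if (!isset(self::TRANSITIONS[$this->state][$action])) {
            throw new LogicException("Cannot $action from state {$this->state}.");
        }

        $this->state = self::TRANSITIONS[$this->state][$action];
    }

    public function getState(): string
    {
        return $this->state;
    }
}

$product = new RuntimeProduct();
$product->apply('purchase'); // A to B.
$product->apply('renew');    // B to B.
$product->apply('cancel');   // B to D.
```

Nothing stops a caller from writing `$product->apply('purchase')` on a cancelled product: the mistake only shows up as an exception when that line runs, which is exactly what the type-based approach tries to avoid.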

Why is this important? Imagine that we decide to change a product to be once-off purchasable instead of purchasable, then we can no longer renew it. We replace an interface on this product, and boom, the IDE tells us that the code is broken in x places. It detects impossible scenarios ahead of code execution.

No more talking. Here is the code.

### The mighty product

/**
* A product.
*/
interface Product { }

A product is a class implementing the Product interface. It lets us type a generic product, regardless of its state.

### Active and inactive

/**
* A product that is active.
*/
interface Active extends Product
{
public function getProduct(): self;
}

/**
* A product that has been cancelled, or not in stock.
*/
interface Inactive extends Product
{
public function getProduct(): self;
}

The Active and Inactive interfaces are useful to create constraints such as:

• A product can be purchased only if it is active, and
• A product is inactive if and only if it has been cancelled,
• To finally conclude that an inactive product can no longer be purchased, nor renewed, nor cancelled.

Basically, it defines the axiom (initial state) and the final states of our FSM.

The getProduct(): self trick will make sense later. It helps to express the following constraint: “An active product cannot be inactive, and vice-versa”, i.e. both interfaces cannot be implemented by the same value.

### Purchase, renew, and cancel

/**
* A product that can be purchased.
*/
interface Purchasable extends Active
{
public function purchase(): Renewable;
}

Only an active product can be purchased. The action is purchase and it generates a product that is renewable. purchase transitions from the state A to B (regarding the graph above).

/**
* A product that can be cancelled.
*/
interface Cancellable extends Active
{
public function cancel(): Inactive;
}

Only an active product can be cancelled. The action is cancel and it generates an inactive product, so it transitions from the state B to D.

/**
* A product that can be renewed.
*/
interface Renewable extends Cancellable
{
public function renew(): self;
}

A renewable product is also cancellable. The action is renew and this is a reflexive transition from the state B to B.

/**
* A product that can be once-off purchased, i.e. it can be purchased but not
* renewed.
*/
interface PurchasableOnce extends Active
{
public function purchase(): Cancellable;
}

Finally, a once-off purchasable product has one action: purchase that produces a Cancellable product, and it transitions from the state A to C.

### Take a breath

So far we have defined interfaces, but the FSM is not implemented yet. Interfaces only define constraints in our type system. An interface provides a constraint but also defines type capabilities: What operations can be performed on a value implementing a particular interface.
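To make “capabilities” concrete, here is a small sketch with trimmed-down, hypothetical versions of the interfaces (no getProduct, no Active) and a hypothetical `cancelNow` function. It only shows that a type declaration restricts which values may reach an operation.

```php
<?php

// Hypothetical, trimmed-down interfaces: just enough to show the idea.
interface Product {}
interface Inactive extends Product {}
interface Cancellable extends Product
{
    public function cancel(): Inactive;
}

// Only values with the "cancel" capability are accepted here.
function cancelNow(Cancellable $product): Inactive
{
    return $product->cancel();
}

$cancellable = new class implements Cancellable {
    public function cancel(): Inactive
    {
        return new class implements Inactive {};
    }
};

$inactive = cancelNow($cancellable);
$rejected = false;

try {
    // An Inactive value does not have the Cancellable capability.
    cancelNow($inactive);
} catch (TypeError $error) {
    $rejected = true;
}
```

The second call never reaches the body of `cancelNow`: PHP rejects the argument with a TypeError, and an IDE or a static analysis tool can flag the same line without running it.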

### SecretProduct

Let’s consider the SecretProduct as a new super secret product that will revolutionise our store:

/**
* The SecretProduct class is:
*
*   * A product,
*   * Active,
*   * Purchasable.
*
* Note that in this implementation, the SecretProduct instance is mutable: Every
* action happens on the same SecretProduct instance. It makes sense because
* having 2 instances of the same product with different states might be error-prone
* in most scenarios.
*/
class SecretProduct implements Active, Purchasable
{
    public function getProduct(): Active
    {
        return $this;
    }

    /**
     * Purchase the product will return an active product that is renewable,
     * and also cancellable.
     */
    public function purchase(): Renewable
    {
        return new class ($this->getProduct()) implements Renewable {
            protected $product;

            public function __construct(SecretProduct $product)
            {
                $this->product = $product;
                // Do the purchase.
            }

            public function getProduct(): Active
            {
                return $this->product;
            }

            public function renew(): Renewable
            {
                // Do the renew.

                return $this;
            }

            public function cancel(): Inactive
            {
                return new class ($this->getProduct()) implements Inactive {
                    protected $product;

                    public function __construct(SecretProduct $product)
                    {
                        $this->product = $product;
                        // Do the cancel.
                    }

                    public function getProduct(): Inactive
                    {
                        return $this->product;
                    }
                };
            }
        };
    }
}

The SecretProduct is a product that is active and purchasable. PHP verifies that the Active::getProduct method is implemented, and that the Purchasable::purchase method is implemented too.

When the latter is called, it returns an object implementing the Renewable interface (which is also a cancellable active product). The object in this context is an instance of an anonymous class implementing the Renewable interface. So the Active::getProduct, Renewable::renew, and Cancellable::cancel methods must be implemented.

Having an anonymous class is not required at all; it is just simpler for the example. A named class may even be better from the testing point of view.

Note that:

• The real purchase action is performed in the constructor of the anonymous class: This is not a hard rule, this is just convenient; it can be done in the method before returning the new instance,
• The real renew action is performed in the renew method before returning $this,
• And the real cancel action is performed in… we have to dig a little bit more (the principle is exactly the same though):
  • The Cancellable::cancel method must return an object implementing the Inactive interface.
  • It generates an instance of an anonymous class implementing the Inactive interface, and the real cancel action is done in the constructor.

### Assert possible and impossible actions

Let’s try some valid and invalid actions. The following are possible actions:

assert((new SecretProduct())->purchase() instanceof Product);
assert((new SecretProduct())->purchase()->renew() instanceof Product);
assert((new SecretProduct())->purchase()->cancel() instanceof Product);
assert((new SecretProduct())->purchase()->renew()->renew()->cancel() instanceof Product);

It is possible to purchase a product, then renew it zero or many times, and finally to cancel it. It matches the FSM!

The following are impossible actions:

(new SecretProduct())->renew();
(new SecretProduct())->cancel();
(new SecretProduct())->purchase()->cancel()->purchase();
(new SecretProduct())->purchase()->cancel()->renew();
(new SecretProduct())->purchase()->purchase();
(new SecretProduct())->purchase()->cancel()->cancel();

It is impossible:

• To renew or to cancel a product that has not been purchased,
• To purchase or renew a product that has been cancelled,
• To purchase a product more than once,
• To cancel a product more than once.

The following are impossible implementations:

class SecretProduct implements Active, Purchasable, PurchasableOnce { }

A product cannot be purchasable and once-off purchasable at the same time, because Purchasable::purchase is not compatible with PurchasableOnce::purchase.

class SecretProduct implements Inactive, Cancellable { }

An inactive product cannot be purchased nor renewed nor cancelled because Active::getProduct and Inactive::getProduct are not compatible.
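To observe what an impossible action looks like when the code is executed anyway, here is a minimal sketch. The two interfaces are hypothetical, trimmed-down stand-ins for the ones in this article. Since PHP 7, calling a method that does not exist throws an `Error`, which stops the script like a fatal error unless it is caught.

```php
<?php

// Hypothetical, trimmed-down stand-ins for the article's interfaces.
interface Inactive {}
interface Cancellable
{
    public function cancel(): Inactive;
}

$purchased = new class implements Cancellable {
    public function cancel(): Inactive
    {
        return new class implements Inactive {};
    }
};

$message = null;

try {
    // The second cancel() targets an Inactive value, which has no such method.
    $purchased->cancel()->cancel();
} catch (Error $error) {
    $message = $error->getMessage();
}
```

Catching the `Error` is only done here to inspect it; in real code, the point is that such a call should never be written in the first place.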
Wow, these are great guarantees, aren’t they? PHP will raise fatal errors for impossible actions or impossible states. No warnings or notices: Fatal errors. Most of them are correctly inferred by IDEs, so… follow the red crosses in your IDE.

## Restoring a product

One major thing is missing: The state of a product is stored in the database. When loading the product, we must be able to get an instance of a product at its previous state. To avoid repeating code, we will use traits. Rebuilding the state of a product is “just” (it really is) a composition of traits.

Note: In these examples, we are using anonymous classes and traits. It is possible to achieve the same behavior with final named classes. Also we are using a repository, which is convenient for this article, but not necessarily the best solution.

### Repository

The following ProductRepository\load function is just here to give you an idea of how it works.

namespace ProductRepository;

function load(int $id, string $state): Product
{
    // Load the product from the database with $id.
    //
    // The states can be Renewable, Cancellable, or Inactive (check
    // the FSM to double-check). Products that have not been purchased
    // are not in the database.

    // Fake minimal active product.
    $product = new class implements Active {
        public function getProduct(): Active
        {
            return $this;
        }
    };

    switch ($state) {
        // State B.
        case Renewable::class:
            return new class ($product) implements Renewable {
                use ActiveProduct;
                use RenewableProduct;
                use CancellableProduct;
            };

        // State C.
        case Cancellable::class:
            return new class ($product) implements Cancellable {
                use ActiveProduct;
                use CancellableProduct;
            };

        // State D.
        case Inactive::class:
            return new class ($product) implements Inactive {
                use InactiveProduct;
            };

        // Invalid state.
        default:
            throw new \RuntimeException('Invalid product state.');
    }
}

### Traits

The code should look familiar because it is just a split of the SecretProduct implementation.

trait ActiveProduct
{
    protected $product;

    public function __construct(Product $product)
    {
        $this->product = $product;
    }

    public function getProduct(): Active
    {
        return $this->product;
    }
}

trait RenewableProduct
{
    public function renew(): Renewable
    {
        // Do the renew.

        return $this;
    }
}

trait CancellableProduct
{
    public function cancel(): Inactive
    {
        return new class ($this->getProduct()) implements Inactive {
            protected $product;

            public function __construct(Product $product)
            {
                $this->product = $product;
                // Do the cancel.
            }

            public function getProduct(): Inactive
            {
                return $this->product;
            }
        };
    }
}

trait InactiveProduct
{
    protected $product;

    public function __construct(Product $product)
    {
        $this->product = $product;
    }

    public function getProduct(): Inactive
    {
        return $this->product;
    }
}

### Assert possible and impossible actions

The possible actions are:

$product = ProductRepository\load(42, Renewable::class);

assert($product instanceof Product);
assert($product->renew()  instanceof Product);
assert($product->cancel() instanceof Product);

Product 42 is assumed to be in the state B (Renewable::class), so we can renew and cancel it.

The following are impossible actions:

$product = ProductRepository\load(42, Renewable::class);

$product->purchase();
$product->cancel()->cancel();

It is impossible to purchase the product 42 because it is in state B, so it has already been purchased. It is impossible to cancel a product twice.

The same guarantees apply here!

## Conclusion

It is possible to re-implement SecretProduct with the traits we have defined for the ProductRepository, or to use named classes. I leave this as an easy wrap-up exercise for the reader.
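For readers who want a starting point, here is one possible sketch of the named-class variant. The interfaces are repeated (minus PurchasableOnce) so that the snippet is self-contained. The class names `NamedSecretProduct`, `PurchasedSecretProduct` and `CancelledSecretProduct`, as well as the public `$renewals` counter, are hypothetical and only there to make the behaviour observable. Unlike the anonymous-class version, the cancelled product returns itself from getProduct; this is a choice of this sketch.

```php
<?php

// Interfaces from the article, repeated so the sketch is self-contained.
interface Product {}
interface Active extends Product { public function getProduct(): self; }
interface Inactive extends Product { public function getProduct(): self; }
interface Cancellable extends Active { public function cancel(): Inactive; }
interface Renewable extends Cancellable { public function renew(): self; }
interface Purchasable extends Active { public function purchase(): Renewable; }

// State D: nothing can be done with it any more.
final class CancelledSecretProduct implements Inactive
{
    public function getProduct(): Inactive
    {
        return $this;
    }
}

// State B: can be renewed and cancelled, but not purchased again.
final class PurchasedSecretProduct implements Renewable
{
    private $product;
    public $renewals = 0; // Only here so that renewals can be observed.

    public function __construct(Active $product)
    {
        $this->product = $product;
    }

    public function getProduct(): Active
    {
        return $this->product;
    }

    public function renew(): Renewable
    {
        $this->renewals++;

        return $this;
    }

    public function cancel(): Inactive
    {
        return new CancelledSecretProduct();
    }
}

// State A: can only be purchased.
final class NamedSecretProduct implements Purchasable
{
    public function getProduct(): Active
    {
        return $this;
    }

    public function purchase(): Renewable
    {
        return new PurchasedSecretProduct($this);
    }
}
```

Each named class maps to exactly one state of the FSM, so each one can be unit-tested on its own, without going through the whole purchase chain.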

The real conclusion is that we have successfully implemented the Finite-State Machine of a product with a Type System. It is impossible to have an invalid implementation that violates the constraints, such as an inactive renewable product. PHP detects it immediately at runtime. Invalid actions are also impossible, such as purchasing a product twice, or renewing a once-off purchased product. It is also detected by PHP.

All violations take the form of PHP fatal errors.

The product repository is an example of how to restore a product at a particular state, with the help of the defined interfaces, and new small and simple traits.

## One more thing

It is possible to integrate product categories in this type system (like bundles). It is more complex, but possible.

I would highly recommend the following readings:

I would like to particularly emphasize a paragraph from the first article:

So what is a type? The only true definition is this: a type is a label used by a type system to prove some property of the program’s behavior. If the type checker can assign types to the whole program, then it succeeds in its proof; otherwise it fails and points out why it failed.

Seeing types as labels is a very smart way of approaching them.

I would like to thank Marco Pivetta for the reviews!

# sabre/katana

## What is it?

sabre/katana is a contact, calendar, task list and file server. What does it mean? Nowadays, you probably have multiple devices (PC, phones, tablets, TVs…). If you would like to get your address books, calendars, task lists and files synced between all these devices from everywhere, you need a server. All your devices are then considered as clients.

But there is an issue with the server. Most of the time, you might choose Google or maybe Apple, but one may wonder: Can we trust these servers? Can we give them our private data, like all our contacts, our calendars, all our photos…? What if you are a company or an association and you have sensitive data that are really private or strategic? So, can you still trust them? Where are the data stored? Who can look at these data? More and more, there is a huge need for a “personal” server.

Moreover, servers like Google or Apple are often closed: You reach your data with specific clients and they are not available on all platforms. This is for strategic reasons of course. But with sabre/katana, you are not limited. See the above schema: Firefox OS can talk to iOS or Android at the same time.

sabre/katana is this kind of server. You can install it on your machine and manage users in a minute. Each user will have a collection of address books, calendars, task lists and files. This server can talk to a long list of devices, mainly thanks to a scrupulous respect of industry standards:

• Mac OS X:
  • OS X 10.10 (Yosemite),
  • OS X 10.9 (Mavericks),
  • OS X 10.8 (Mountain Lion),
  • OS X 10.7 (Lion),
  • OS X 10.6 (Snow Leopard),
  • OS X 10.5 (Leopard),
  • BusyCal,
  • BusyContacts,
  • Fantastical,
  • Rainlendar,
  • ReminderFox,
  • SoHo Organizer,
  • Spotlife,
  • Thunderbird,
• Windows:
  • eM Client,
  • Microsoft Outlook 2013,
  • Microsoft Outlook 2010,
  • Microsoft Outlook 2007,
  • Microsoft Outlook with Bynari WebDAV Collaborator,
  • Microsoft Outlook with iCal4OL,
  • Rainlendar,
  • ReminderFox,
  • Thunderbird,
• Linux:
  • Evolution,
  • Rainlendar,
  • ReminderFox,
  • Thunderbird,
• Mobile:
  • Android,
  • BlackBerry 10,
  • BlackBerry PlayBook,
  • Firefox OS,
  • iOS 8,
  • iOS 7,
  • iOS 6,
  • iOS 5,
  • iOS 4,
  • iOS 3,
  • Nokia N9,
  • Sailfish.

Did you find your device in this list? Probably yes 😉.

sabre/katana sits in the middle of all your devices and syncs all your data. Of course, it is free and open source. Go check the source!

## List of features

Here is a non-exhaustive list of features supported by sabre/katana. Depending on whether you are a user or a developer, the features that might interest you are radically different. I decided to show you a list from the user point of view. If you would like to get a list from the developer point of view, please see this exhaustive list of supported RFCs for more details.

### Contacts

All usual fields are supported, like phone numbers, email addresses, URLs, birthday, ringtone, texttone, related names, postal addresses, notes, HD photos etc. Of course, groups of cards are also supported.

My photo is not in HD, I really have to update it!

Cards can be encoded into several formats. The most usual format is VCF. sabre/katana allows you to download the whole address book of a user as a single VCF file. You can also create, update and delete address books.

### Calendars

A calendar is just a set of events. Each event has several properties, such as a title, a location, a start date, an end date, some notes, URLs, alarms etc. sabre/katana also supports recurring events (“each last Monday of the month, at 11am…”), in addition to scheduling (see below).

A few words about calendar scheduling. Let’s say you are organizing an event, like New release (we always enjoy release day!). You would like to invite several people but you don’t know if they could be present or not. In your event, all you have to do is to add attendees. How are they going to be notified about this event? Two situations:

1. Either attendees are registered on your sabre/katana server and they will receive an invite inside their calendar application (we call this iTIP),
2. Or they are not registered on your server and they will receive an email with the event as an attached file (we call this iMIP). All they have to do is to open this event in their calendar application.

Notice the gorgeous map embedded inside the email!

Once they have received the event, they can accept it, decline it, or answer “don’t know” (they will try to be present).


Of course, attendees will be notified too if the event has been moved, canceled, refreshed etc.

Calendars can be encoded into several formats. The most usual format is ICS. sabre/katana allows you to download the whole calendar of a user as a single ICS file. You can also create, update and delete calendars.

### Task lists

A task list is exactly like a calendar (from a programmatic point of view). Instead of containing event objects, it contains todo objects.

sabre/katana supports groups of tasks, reminders, progress etc.

Just like calendars, task lists can be encoded into several formats, including ICS. sabre/katana allows you to download the whole task list of a user as a single ICS file. You can also create, update and delete task lists.

### Files

Finally, sabre/katana creates a home collection per user: A personal directory that can contain files and directories and… synced between all your devices (as usual 😄).

sabre/katana also creates a special directory called public/ which is a public directory. Every file and directory stored inside this directory is accessible to anyone who has the correct link. No listing is shown, to protect your public data.

Just like contact, calendar and task list applications, you need a client application to connect to your home collection on sabre/katana.

Then, your public directory on sabre/katana will be a regular directory like any other.

sabre/katana is able to store any kind of file. Yes, any kind. It’s just files. However, it white-lists the kinds of files that can be shown in the browser. Only images, audio, videos, texts, PDF and some vendor formats (like Microsoft Office) are considered safe (for the server). This way, associations can share music, videos or images, companies can share PDF or Microsoft Word documents etc. In the future, sabre/katana might white-list more formats. If a format is not white-listed, the file will be forced to download.
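To illustrate the white-list idea, here is a rough sketch that decides between showing a file in the browser and forcing a download. It is not sabre/katana’s actual code: the `contentDisposition` function, the two constants and the list of MIME types are all assumptions made for this example.

```php
<?php

// Hypothetical white-list; the real list used by sabre/katana is not shown
// in this article.
const INLINE_MIME_PREFIXES = ['image/', 'audio/', 'video/', 'text/'];
const INLINE_MIME_TYPES    = ['application/pdf'];

// Returns the value one could use for a Content-Disposition header:
// "inline" to show the file in the browser, "attachment" to force a download.
function contentDisposition(string $mimeType): string
{
    $mimeType = strtolower($mimeType);

    if (in_array($mimeType, INLINE_MIME_TYPES, true)) {
        return 'inline';
    }

    foreach (INLINE_MIME_PREFIXES as $prefix) {
        if (0 === strpos($mimeType, $prefix)) {
            return 'inline';
        }
    }

    // Anything not explicitly white-listed is downloaded, never rendered.
    return 'attachment';
}
```

The important design point is the default: an unknown format falls through to `attachment`, so adding support for a new format is an explicit decision rather than an accident.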

## How is sabre/katana built?

sabre/katana is based on two big and solid projects:

sabre/dav is one of the most powerful CardDAV, CalDAV and WebDAV frameworks on the planet. Trusted by the likes of Atmail, Box, fruux and ownCloud, it powers millions of users world-wide! It is written in PHP and is open source.

Hoa is a modular, extensible and structured set of PHP libraries. Fun fact: Also open source, this project is also trusted by ownCloud, in addition to Mozilla, joliCode etc. Recently, this project has recorded more than 600,000 downloads and the community is about to reach 1000 people.

sabre/katana is then a program based on sabre/dav for the DAV part and Hoa for everything else, like the logic code inside the sabre/dav’s plugins. The result is a ready-to-use server with a nice interface for the administration.

To ensure code quality, we use atoum, a popular and modern test framework for PHP. So far, sabre/dav has more than 1000 assertions.

## Conclusion

sabre/katana is a server for contacts, calendars, task lists and files. Everything is synced, every time and everywhere. It perfectly connects to a lot of devices on the market. Several features we need and use daily have been presented. This is the easiest way, and a secure one, to host your own private data.

# RFCs should provide executable test suites

Recently, I implemented the xCal and xCard formats inside the sabre/dav libraries. While testing the different RFCs against my implementation, several errata were found. This article first quickly lists them, and then asks how such errors can be present and how they can be easily revealed. If reading my dry humor about RFC errata is boring, Sections 3, 4 and 5 are more interesting. The whole idea is: Why do RFCs not provide executable test suites?

## What are xCal and xCard?

The Web is a read-only medium. It is based on the HTTP protocol. However, there is the WebDAV protocol, standing for Web Distributed Authoring and Versioning. This is an extension to HTTP. Et voilà ! The Web is a read and write medium. WebDAV is standardized in RFC2518 and RFC4918.

Based on WebDAV, we have CalDAV and CardDAV, respectively for reading and writing calendars and addressbooks. They are standardized in RFC4791, RFC6638 and RFC6352. Good! But these protocols only explain how to read and write, not how to represent a real calendar or an addressbook. So let’s leave protocols for formats.

The iCalendar format represents calendar components, like events (VEVENT), tasks (VTODO), journal entries (VJOURNAL, very rare…), free/busy times (VFREEBUSY) etc. The vCard format represents cards. The formats are very similar and share a common ancestry: a horrible line-, colon- and semicolon-based, randomly-escaped format. For instance:

BEGIN:VCALENDAR
VERSION:2.0
CALSCALE:GREGORIAN
PRODID:-//Example Inc.//Example Calendar//EN
BEGIN:VEVENT
DTSTAMP:20080205T191224Z
DTSTART;VALUE=DATE:20081006
SUMMARY:Planning meeting
END:VEVENT
END:VCALENDAR

Horrible, yes. You were warned. These formats are standardized in several RFCs, to list some of them: RFC5545, RFC2426 and RFC6350.

This format is impossible to read, even for a computer. That’s why we have jCal and jCard, which are respectively another representation of iCalendar and vCard but in JSON. JSON is quite popular in the Web today, especially because it eases the manipulation and exchange of data in Javascript. This is just a very simple, and —from my point of view— human readable, serialization format. jCal and jCard are respectively standardized in RFC7265 and RFC7095. Thus, the equivalent of the previous iCalendar example in jCal is:

[
  "vcalendar",
  [
    ["version", {}, "text", "2.0"],
    ["calscale", {}, "text", "GREGORIAN"],
    ["prodid", {}, "text", "-\/\/Example Inc.\/\/Example Calendar\/\/EN"]
  ],
  [
    [
      "vevent",
      [
        ["dtstamp", {}, "date-time", "2008-02-05T19:12:24Z"],
        ["dtstart", {}, "date", "2008-10-06"],
        ["summary", {}, "text", "Planning meeting"]
      ]
    ]
  ]
]

Much better. But this is JSON, which is a rather loose format, so we also have xCal and xCard, which are another representation of iCalendar and vCard, but in XML. They are standardized in RFC6321 and RFC6351. The same example in xCal looks like this:

<icalendar xmlns="urn:ietf:params:xml:ns:icalendar-2.0">
  <vcalendar>
    <properties>
      <version>
        <text>2.0</text>
      </version>
      <calscale>
        <text>GREGORIAN</text>
      </calscale>
      <prodid>
        <text>-//Example Inc.//Example Calendar//EN</text>
      </prodid>
    </properties>
    <components>
      <vevent>
        <properties>
          <dtstamp>
            <date-time>2008-02-05T19:12:24Z</date-time>
          </dtstamp>
          <dtstart>
            <date>2008-10-06</date>
          </dtstart>
          <summary>
            <text>Planning meeting</text>
          </summary>
          <uid>
          </uid>
        </properties>
      </vevent>
    </components>
  </vcalendar>
</icalendar>

More semantics, more meaning, easier to read (from my point of view), namespaces… It is very easy to embed xCal and xCard inside other XML formats.

Managing all these formats is an extremely laborious task. I suggest you take a look at sabre/vobject (see the GitHub repository of sabre/vobject). This is a PHP library to manage all the weird formats. The following example shows how to read from iCalendar and write to jCal and xCal:

// Read iCalendar.
$document = Sabre\VObject\Reader::read($icalendar);

// Write jCal.
echo Sabre\VObject\Writer::writeJson($document);

// Write xCal.
echo Sabre\VObject\Writer::writeXml($document);

Magic, when you know the complexity of these formats (in terms of both parsing and validation)!

## List of errata

Now, let’s talk about all the errata I submitted recently:

The last two are reported, not yet verified.

4241, 4243 and 4246 are just typos in examples. “Just” is a bit of an understatement when you have been reading RFCs for days straight, you have 10 of them open in your browser and you are trying to figure out how everything fits together and whether you are doing everything correctly. Finding typos at that point in your process can be very confusing…

4247 is more subtle. The RFC about xCard comes with an XML Schema. That’s great! It will help us to test our documents and see if they are valid or not! No? No.

Most of the time, I try to relax and deal with the incoming problems. But the date and time format in iCalendar, vCard, jCal, jCard, xCal and xCard can make my blood boil in a second. In what world, exactly, is --10 or ---28 a conceivable date and time format? How long did I sleep? “Well”, I was saying to myself, “do not make a drama, we have the XML Schema!”. No. Because there is an error in the schema. More precisely, in a regular expression:

value-time = element time {
    xsd:string { pattern = "(\d\d(\d\d(\d\d)?)?|-\d\d(\d\d?)|--\d\d)"
                         ~ "(Z|[+\-]\d\d(\d\d)?)?" }
}

Did you find the error? (\d\d?) is invalid; it should be (\d\d)?. Don’t get me wrong: Everyone makes mistakes, but not this kind of error. I will explain why in the next section.
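The difference between the two groups is easy to check by running them. The sketch below isolates the faulty alternative of the pattern and tries it against a few strings with PHP’s PCRE functions. The variable names and the sample strings are chosen for this example, and the `^…$` anchors are added because XML Schema patterns match the whole value while PCRE patterns do not.

```php
<?php

// The alternative in question, as published and as corrected by the erratum.
// Both are anchored here so that only whole-string matches count.
$published = '/^-\d\d(\d\d?)$/';
$corrected = '/^-\d\d(\d\d)?$/';

$samples = ['-10', '-101', '-1030'];

foreach ($samples as $sample) {
    printf(
        "%-6s published: %d, corrected: %d\n",
        $sample,
        preg_match($published, $sample),
        preg_match($corrected, $sample)
    );
}
```

The published group requires one or two extra digits, so it rejects `-10` and accepts the three-digit `-101`. The corrected group makes the two extra digits optional as a pair, so it accepts `-10` and `-1030` and rejects `-101`.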

4245 is not an editorial error but a technical one, under review.

4261 is crazy. It deserves a whole sub-section.

### Welcome to the crazy world of date and time formats

There are two major popular date and time formats: RFC2822 and ISO.8601. Examples:

• Fri, 27 Feb 2015 16:06:58 +0100 and
• 2015-02-27T16:07:16+01:00.

The second one is a good candidate for a computer representation: no locale, only digits, all information is present…
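Both representations can be produced from the same instant with PHP’s built-in date classes, as in the short sketch below. `DATE_ATOM` is used for the ISO 8601 style because the PHP manual notes that the `DATE_ISO8601` constant is not strictly compatible with ISO 8601.

```php
<?php

// One instant, parsed from its ISO 8601 style representation.
$date = new DateTimeImmutable('2015-02-27T16:07:16+01:00');

$rfc2822 = $date->format(DATE_RFC2822); // "Fri, 27 Feb 2015 16:07:16 +0100"
$iso8601 = $date->format(DATE_ATOM);    // "2015-02-27T16:07:16+01:00"
```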

Maybe you noticed there is no link on ISO.8601. Why? Because ISO standards are not free and I don’t want to pay 140€ to buy a standard…

The date and time format adopted by iCalendar and vCard (and the rest of the family) is ISO.8601.2004. I cannot read it. However, since xCard comes with an XML Schema, we can read this (after having applied erratum 4247):

# 4.3.1
value-date = element date {
    xsd:string { pattern = "\d{8}|\d{4}-\d\d|--\d\d(\d\d)?|---\d\d" }
}

# 4.3.2
value-time = element time {
    xsd:string { pattern = "(\d\d(\d\d(\d\d)?)?|-\d\d(\d\d)?|--\d\d)"
                         ~ "(Z|[+\-]\d\d(\d\d)?)?" }
}

# 4.3.3
value-date-time = element date-time {
    xsd:string { pattern = "(\d{8}|--\d{4}|---\d\d)T\d\d(\d\d(\d\d)?)?"
                         ~ "(Z|[+\-]\d\d(\d\d)?)?" }
}

# 4.3.4
value-date-and-or-time = value-date | value-date-time | value-time

Question: Is --10 October or 10 seconds?

--10 can fit into value-date and value-time:

• From value-date, the 3rd element in the disjunction is --\d\d(\d\d)?, so it matches --10,
• From value-time, the last element in the first disjunction is --\d\d, so it matches --10.

If we have a date-and-or-time value, value-date comes first, so --10 is always October. Nevertheless, if we have a time value, --10 is 10 seconds. Crazy now?
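The ambiguity can be checked mechanically. The following sketch, in PHP, tests a value against the value-date and value-time patterns quoted above; the `interpretations` helper is hypothetical, and the ^ and $ anchors are an assumption about whole-value matching.

```php
<?php

// Patterns from sections 4.3.1 and 4.3.2, anchored to match the whole value.
$patterns = [
    'date' => '/^(\d{8}|\d{4}-\d\d|--\d\d(\d\d)?|---\d\d)$/',
    'time' => '/^(\d\d(\d\d(\d\d)?)?|-\d\d(\d\d)?|--\d\d)(Z|[+\-]\d\d(\d\d)?)?$/',
];

// Hypothetical helper: lists every interpretation that accepts $value,
// in the order the patterns are declared.
function interpretations(string $value, array $patterns): array
{
    $accepted = [];

    foreach ($patterns as $name => $pattern) {
        if (1 === preg_match($pattern, $value)) {
            $accepted[] = $name;
        }
    }

    return $accepted;
}

echo implode(', ', interpretations('--10', $patterns)), "\n";  // date, time
echo implode(', ', interpretations('---28', $patterns)), "\n"; // date
```

Only the context (which element we are reading) tells us which interpretation of --10 wins.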

Oh, and XML has its own date and time format, which is well-defined and standardized. Why should we drag this crazy format along?

Oh, and I assume every format depending on ISO.8601.2004 has this bug. But I am not sure because ISO standards are not free.

## How can RFCs have such errors?

So far, RFCs are textual standards. Great. But they are just text, written by humans, and thus they are subject to errors or failures. They are even error-prone. I do not understand: Why does an RFC not come with an executable test suite? I am pretty sure every reader of an RFC will try to create a test suite on their own.

I assume xCal and xCard formats are not yet very popular. Consequently, few people read the RFC and tried to write an implementation. This is my guess. However, it does not change the fact that an executable test suite should (must?) be provided.

## How did I find them?

This is how I found these errors. I wrote a test suite for xCal and xCard in sabre/vobject. I would love to write an implementation-agnostic test suite, but I ran out of time. This is basically a format transformation: R: x → y, where R may or may not be a reflexive operator (depending on the versions of iCalendar and vCard we consider).

For “simple” errata, I found the errors by testing manually. For errata 4247 and 4261 (with the regular expressions), I found the errors by applying the algorithms presented in Generate strings based on regular expressions.

## Conclusion

sabre/vobject supports xCal and xCard.

# Control the terminal, the right way

Nowadays, there are plenty of terminal emulators in the wild. Each one has a specific way to handle controls. How many colours does it support? How to control the style of a character? How to control more than style, like the cursor or the window? In this article, we are going to explain and show in action the right ways to control your terminal with a portable and an easy to maintain API. We are going to talk about stat, tput, terminfo, Hoa\Console… but do not be afraid, it’s easy and fun!

## Introduction

Terminals. They are the ancient interfaces, still not old fashioned yet. They are fast, efficient, work remotely with a low bandwidth, secure and very simple to use.

A terminal is a canvas composed of columns and lines. Only one character fits at a position. According to the terminal, we have some features enabled; for instance, a character might be stylized with a colour, a decoration, a weight etc. Let’s consider the former. A colour belongs to a palette, which contains either 2, 8, 256 or more colours. One may wonder:

• How many colours does a terminal support?
• How to control the style of a character?
• How to control more than style, like the cursor or the window?

Well, this article is going to explain how a terminal works and how we interact with it. We are going to talk about terminal capabilities, terminal information (stored in database) and Hoa\Console, a PHP library that provides advanced terminal controls.

## The basis of a terminal

A terminal, or a console, is an interface that allows us to interact with the computer. This interface is textual. Like a graphical interface, there are inputs: The keyboard and the mouse, and outputs: The screen or a file (a real file, a socket, a FIFO, something else…).

There is a ton of terminals. The most famous ones are:

Whatever the terminal you use, inputs are handled by programs (or processes) and outputs are produced by the latter. We said outputs can be the screen or a file. Actually, everything is a file, so the screen is also a file. However, the user is able to use redirections to choose where the outputs must go.

Let’s consider the echo program that prints all its options/arguments on its output. Thus, in the following example, foobar is printed on the screen:

$ echo 'foobar'

And in the following example, foobar is redirected to a file called log:

$ echo 'foobar' > log

We are also able to redirect the output to another program, like wc that counts stuff:

$ echo 'foobar' | wc -c
7

Now we know there are 7 characters in foobar… no! echo automatically adds a new-line (\n) after each line; so:

$ echo -n 'foobar' | wc -c
6

This is more correct!

## Detecting type of pipes

Inputs and outputs are called pipes. Yes, trivial, this is nothing more than basic pipes!

There are 3 standard pipes:

• STDIN, standing for the standard input pipe,
• STDOUT, standing for the standard output pipe and
• STDERR, standing for the standard error pipe (also an output one).

If the output is attached to the screen, we say this is a “direct output”. Why is it important? Because if we stylize a text, this is only for the screen, not for a file. A file should receive regular text, not all the decorations and styles.

Fortunately, the Hoa\Console\Console class provides the isDirect, isPipe and isRedirection static methods to know whether the pipe is respectively direct, a pipe or a redirection (damn naming…!). Thus, let Type.php be the following program:

echo 'is direct:      ';
var_dump(Hoa\Console\Console::isDirect(STDOUT));

echo 'is pipe:        ';
var_dump(Hoa\Console\Console::isPipe(STDOUT));

echo 'is redirection: ';
var_dump(Hoa\Console\Console::isRedirection(STDOUT));

Now, let’s test our program:

$ php Type.php
is direct:      bool(true)
is pipe:        bool(false)
is redirection: bool(false)

$ php Type.php | xargs -I@ echo @
is direct:      bool(false)
is pipe:        bool(true)
is redirection: bool(false)

$ php Type.php > /tmp/foo; cat !!$
is direct:      bool(false)
is pipe:        bool(false)
is redirection: bool(true)

The first execution is very classic. STDOUT, the standard output, is direct. The second execution redirects the output to another program, then STDOUT is of kind pipe. Finally, the last execution redirects the output to a file called /tmp/foo, so STDOUT is a redirection.

How does it work? We use fstat to read the mode of the file. The underlying fstat implementation is defined in C, so let’s take a look at the documentation of fstat(2). stat is a C structure that looks like:

struct stat {
dev_t    st_dev;              /* device inode resides on             */
ino_t    st_ino;              /* inode's number                      */
mode_t   st_mode;             /* inode protection mode               */
uid_t    st_uid;              /* user-id of owner                    */
gid_t    st_gid;              /* group-id of owner                   */
dev_t    st_rdev;             /* device type, for special file inode */
struct timespec st_atimespec; /* time of last access                 */
struct timespec st_mtimespec; /* time of last data modification      */
struct timespec st_ctimespec; /* time of last file status change     */
off_t    st_size;             /* file size, in bytes                 */
quad_t   st_blocks;           /* blocks allocated for file           */
u_long   st_blksize;          /* optimal file sys I/O ops blocksize  */
u_long   st_flags;            /* user defined flags for file         */
u_long   st_gen;              /* file generation number              */
}

The value of mode returned by the PHP fstat function is equal to st_mode in this structure. And st_mode has the following bits:

#define S_IFMT   0170000 /* type of file mask                */
#define S_IFIFO  0010000 /* named pipe (fifo)                */
#define S_IFCHR  0020000 /* character special                */
#define S_IFDIR  0040000 /* directory                        */
#define S_IFBLK  0060000 /* block special                    */
#define S_IFREG  0100000 /* regular                          */
#define S_IFLNK  0120000 /* symbolic link                    */
#define S_IFSOCK 0140000 /* socket                           */
#define S_IFWHT  0160000 /* whiteout                         */
#define S_ISUID  0004000 /* set user id on execution         */
#define S_ISGID  0002000 /* set group id on execution        */
#define S_ISVTX  0001000 /* save swapped text even after use */
#define S_IRWXU  0000700 /* RWX mask for owner               */
#define S_IRUSR  0000400 /* read permission, owner           */
#define S_IWUSR  0000200 /* write permission, owner          */
#define S_IXUSR  0000100 /* execute/search permission, owner */
#define S_IRWXG  0000070 /* RWX mask for group               */
#define S_IRGRP  0000040 /* read permission, group           */
#define S_IWGRP  0000020 /* write permission, group          */
#define S_IXGRP  0000010 /* execute/search permission, group */
#define S_IRWXO  0000007 /* RWX mask for other               */
#define S_IROTH  0000004 /* read permission, other           */
#define S_IWOTH  0000002 /* write permission, other          */
#define S_IXOTH  0000001 /* execute/search permission, other */

Awesome, we have everything we need! We mask mode with S_IFMT to get the file type. Then we just have to check whether it is a named pipe S_IFIFO, a character special S_IFCHR etc. Concretely:

• isDirect checks that the mode is equal to S_IFCHR, which means it is attached to the screen (in our case),
• isPipe checks that the mode is equal to S_IFIFO: This is a special file that behaves like a FIFO queue (see the documentation of mkfifo(1)), everything which is written is directly read just after and the reading order is defined by the writing order (first-in, first-out!),
• isRedirection checks that the mode is equal to S_IFREG, S_IFDIR, S_IFLNK, S_IFSOCK or S_IFBLK, in other words: All kinds of files on which we can apply a redirection. Why? Because the STDOUT (or another STD* pipe) of the current process is defined as a file pointer to the redirection destination and it can be only a file, a directory, a link, a socket or a block file.

I encourage you to read the implementation of the Hoa\Console\Console::getMode method.
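The three checks above can be sketched with plain PHP, without Hoa. This is not the actual implementation of Hoa\Console\Console: the `streamKind` function and its return values are hypothetical, and the octal values are copied from the listing above.

```php
<?php

// Hypothetical helper: classifies a stream from the `mode` entry of fstat().
function streamKind(int $mode): string
{
    // S_IFMT: keep only the file type bits, drop the permission bits.
    $type = $mode & 0170000;

    switch ($type) {
        case 0020000: // S_IFCHR, character special: attached to the screen
            return 'direct';

        case 0010000: // S_IFIFO, named pipe
            return 'pipe';

        case 0100000: // S_IFREG
        case 0040000: // S_IFDIR
        case 0120000: // S_IFLNK
        case 0140000: // S_IFSOCK
        case 0060000: // S_IFBLK
            return 'redirection';
    }

    return 'unknown';
}

// STDOUT is only defined by the CLI version of PHP.
$stat = fstat(STDOUT);

if (false !== $stat) {
    echo streamKind($stat['mode']), "\n";
}
```

Run directly, through a pipe, or with a redirection to a file, the script should report the three kinds of output discussed above.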

So yes, this is useful to enable styles on text but also to define the default verbosity level. For instance, if a program outputs the result of a computation with some explanations around, the highest verbosity level would output everything (the result and the explanations) while the lowest level would output only the result. Let’s try with the toUpperCase.php program:

$verbose = Hoa\Console\Console::isDirect(STDOUT);
$string  = $argv[1];
$result  = (new Hoa\String\String($string))->toUpperCase();

if(true === $verbose)
    echo $string, ' becomes ', $result, ' in upper case!', "\n";
else
    echo $result, "\n";

Then, let’s execute this program:

$ php toUpperCase.php 'Hello world!'
Hello world! becomes HELLO WORLD! in upper case!

And now, let’s execute this program with a pipe:

$ php toUpperCase.php 'Hello world!' | xargs -I@ echo @
HELLO WORLD!

Useful and very simple, isn’t it?

## Terminal capabilities

We can control the terminal with the inputs, like the keyboard, but we can also control the outputs. How? With the text itself. Actually, an output does not contain only the text but it includes control functions. It’s like HTML: Around a text, you can have an element, specifying that the text is a link. It’s exactly the same for terminals! To specify that a text must be in red, we must add a control function around it.

Fortunately, these control functions have been standardized in the ECMA-48 document: Control Functions for Coded Character Set. However, not all terminals implement the whole standard, and for historical reasons, some terminals use slightly different control functions. Moreover, some information does not belong to this standard (because it is out of its scope), like: How many colours does the terminal support? or does the terminal support the meta key?

Consequently, each terminal has a list of capabilities. This list is split into 3 categories:

• boolean capabilities,
• number capabilities,
• string capabilities.

For instance:

• the “does the terminal support the meta key” is a boolean capability called meta_key where its value is true or false,
• the “number of colours supported by the terminal” is a… number capability called max_colors where its value can be 2, 8, 256 or more,
• the “clear screen control function” is a string capability called clear_screen where its value might be \e[H\e[2J,
• the “move the cursor one column to the right” is also a string capability called cursor_right where its value might be \e[C.

All the capabilities can be found in the documentation of terminfo(5) or in the documentation of xcurses. I encourage you to follow these links and see how rich the terminal capabilities are!

## Terminal information

Terminal capabilities are stored as information in databases. Where are these databases located? In files with a binary format. Favorite locations are:

• /usr/share/terminfo,
• /usr/share/lib/terminfo,
• /lib/terminfo,
• /usr/lib/terminfo,
• /usr/local/share/terminfo,
• /usr/local/share/lib/terminfo,
• etc.
• or the TERMINFO or TERMINFO_DIRS environment variables.

Inside these directories, we have a tree of the form: xx/name, where xx is the ASCII value in hexadecimal of the first letter of the terminal name name, or n/name where n is the first letter of the terminal name. The terminal name is stored in the TERM environment variable. For instance, on my computer:

$ echo $TERM
xterm-256color

$ file /usr/share/terminfo/78/xterm-256color
/usr/share/terminfo/78/xterm-256color: Compiled terminfo entry
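The two directory layouts described above are easy to compute. Here is a small sketch in PHP; `terminfoCandidates` is a hypothetical helper, not the Hoa\Console\Tput::getTerminfo method, and it only builds the relative paths without checking that the files exist.

```php
<?php

// Hypothetical helper: returns the two candidate relative paths of the
// terminfo entry for a terminal name.
function terminfoCandidates(string $term): array
{
    if ('' === $term) {
        throw new InvalidArgumentException('The terminal name is empty.');
    }

    $first = $term[0];

    return [
        dechex(ord($first)) . '/' . $term, // xx/name
        $first . '/' . $term,              // n/name
    ];
}

print_r(terminfoCandidates('xterm-256color'));
```

For xterm-256color, the first candidate is 78/xterm-256color (x is 0x78 in ASCII), which is the path shown in the example above.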

We can use the Hoa\Console\Tput class to retrieve this information. The getTerminfo static method returns the path of the terminal information file. The getTerm static method returns the terminal name. Finally, the whole class allows us to parse a terminal information database (it will use the file returned by getTerminfo by default). For instance:

$tput = new Hoa\Console\Tput();
var_dump($tput->count('max_colors'));

/**
* Will output:
*     int(256)
*/

On my computer, with xterm-256color, I have 256 colours, as expected. If we parse the information of xterm and not xterm-256color, we will have:

$tput = new Hoa\Console\Tput(Hoa\Console\Tput::getTerminfo('xterm'));
var_dump($tput->count('max_colors'));

/**
* Will output:
*     int(8)
*/

## The power in your hand: Control the cursor

Let’s summarize. We are able to parse and know all the terminal capabilities of a specific terminal (including the one of the current user). If we would like a powerful terminal API, we need to control the basics, like the cursor.

Remember. We said that the terminal is a canvas of columns and lines. The cursor is like a pen. We can move it and write something. We are going to (partly) see how the Hoa\Console\Cursor class works.

### I like to move it!

The moveTo static method allows us to move the cursor to an absolute position. For example:

Hoa\Console\Cursor::moveTo($x, $y);

The control function we use is cursor_address. So all we need to do is to use the Hoa\Console\Tput class and call the get method on it to get the value of this string capability. This is a parameterized one: On xterm-256color, its value is \e[%i%p1%d;%p2%dH. We replace the parameters by $x and $y and we output the result. That’s all! We are able to move the cursor to an absolute position on all terminals! This is the right way to do.
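To illustrate what “replacing the parameters” means, here is a minimal sketch of a parameter expander in PHP. It is not the code of Hoa\Console: `expandCapability` is a hypothetical function that only understands the %i, %pN, %d and %% operators, which is just enough for the cursor_address value quoted above. As far as I know, terminfo passes the row as the first parameter and the column as the second one for cursor_address.

```php
<?php

// Minimal sketch: supports only the %i, %pN, %d and %% operators.
function expandCapability(string $capability, int ...$params): string
{
    $out    = '';
    $stack  = [];
    $length = strlen($capability);

    for ($i = 0; $i < $length; ++$i) {
        // Copy regular characters (and a dangling final %) verbatim.
        if ('%' !== $capability[$i] || $i + 1 >= $length) {
            $out .= $capability[$i];
            continue;
        }

        $operator = $capability[++$i];

        switch ($operator) {
            case 'i': // add 1 to the first two parameters
                foreach ([0, 1] as $index) {
                    if (isset($params[$index])) {
                        ++$params[$index];
                    }
                }
                break;

            case 'p': // push parameter N (1-based) on the stack
                $index   = (int) ($capability[++$i] ?? '0') - 1;
                $stack[] = $params[$index] ?? 0;
                break;

            case 'd': // pop a value and print it as a decimal number
                $out .= (string) array_pop($stack);
                break;

            case '%': // literal percent sign
                $out .= '%';
                break;

            default:
                throw new InvalidArgumentException(
                    'Operator %' . $operator . ' is not supported by this sketch.'
                );
        }
    }

    return $out;
}

// Zero-based row 41 and column 6 become the one-based 42 and 7.
$sequence = expandCapability("\e[%i%p1%d;%p2%dH", 41, 6);
var_dump("\e[42;7H" === $sequence); // bool(true)
```

A real expander has to support many more operators (conditionals, arithmetic, other output formats); see terminfo(5) for the full list.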

We use the same strategy for the move static method that moves the cursor relatively to its current position. For example:

Hoa\Console\Cursor::move('right up');

We split the steps and for each step we read the appropriate string capability using the Hoa\Console\Tput class. For right, we read the parm_right_cursor string capability, for up, we read parm_up_cursor etc. Note that parm_right_cursor is different from cursor_right: The first one is used to move the cursor a certain number of times while the second one is used to move the cursor only one time. With performance in mind, we should use the first one if we have to move the cursor several times.

The getPosition static method returns the position of the cursor. This way to interact is a little bit different. We must write a control function on the output, and then, the terminal replies on the input. See the implementation by yourself.

print_r(Hoa\Console\Cursor::getPosition());

/**
* Will output:
*     Array
*     (
*         [x] => 7
*         [y] => 42
*     )
*/

In the same way, we have the save and restore static methods that save the current position of the cursor and restore it. This is very useful. We use the save_cursor and restore_cursor string capabilities.

Also, the clear static method splits some parts to clear. For each part (direction or way), we read from Hoa\Console\Tput the appropriate string capabilities: clear_screen to clear all the screen, clr_eol to clear everything on the right of the cursor, clr_eos to clear everything below the cursor etc.

Hoa\Console\Cursor::clear('left');

See what we learnt in action:

echo 'Foobar', "\n",
'Foobar', "\n",
'Foobar', "\n",
'Foobar', "\n",
'Foobar', "\n";

Hoa\Console\Cursor::save();
sleep(1);  Hoa\Console\Cursor::move('LEFT');
sleep(1);  Hoa\Console\Cursor::move('↑');
sleep(1);  Hoa\Console\Cursor::move('↑');
sleep(1);  Hoa\Console\Cursor::move('↑');
sleep(1);  Hoa\Console\Cursor::clear('↔');
sleep(1);  echo 'Hahaha!';
sleep(1);  Hoa\Console\Cursor::restore();

echo "\n", 'Bye!', "\n";

The result is presented in the following figure.

The resulting API is portable, clean, simple to read and very easy to maintain! This is the right way to do.

### Colours and decorations

Now: Colours. This is mainly the reason why I decided to write this article. We see the same kind of libraries, again and again, doing only colours in the terminal, but unfortunately not in the right way 😞.

A terminal has a palette of colours. Each colour is indexed by an integer, from 0 to potentially $+\infty$. The size of the palette is described by the max_colors number capability. Usually, a palette contains 1, 2, 8, 256 or 16 million colours.

So the first thing to do is to check whether we have more than 1 colour. If not, we must not colorize the given text. Next, if we have fewer than 256 colours, we have to convert the style into a palette containing 8 colours. Likewise, with fewer than 16 million colours, we have to convert into 256 colours.

Moreover, we can define the style of the foreground or of the background with respectively the set_a_foreground and set_a_background string capabilities. Finally, in addition to colours, we can define other decorations like bold, underline, blink or even inverse the foreground and the background colours.

One thing to remember is: With this capability, we only define the style at a given “pixel” and it will apply on the following text. In this case, it is not exactly like HTML where we have a beginning and an end. Here we only have a beginning. Let’s try!

Hoa\Console\Cursor::colorize('underlined foreground(yellow) background(#932e2e)');
echo 'foo';
Hoa\Console\Cursor::colorize('!underlined background(normal)');
echo 'bar', "\n";

The API is pretty simple: We start to underline the text, we set the foreground to yellow and we set the background to #932e2e. Then we output something. We continue with cancelling the underline decoration in addition to resetting the background. Finally we output something else. Here is the result:

What do we observe? My terminal does not support more than 256 colours. Thus, #932e2e is automatically converted into the closest colour in my actual palette! This is the right way to do.
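How can a colour be converted to the closest one in a smaller palette? The sketch below, in PHP, shows one possible approach; it is not necessarily what Hoa\Console\Cursor does. It assumes the usual xterm 256-colour palette, where indexes 16 to 231 form a 6×6×6 cube with the levels 0, 95, 135, 175, 215 and 255 on each channel. The `nearestCubeColour` function is hypothetical, and it ignores the greyscale ramp and the 16 system colours to stay short.

```php
<?php

// Hypothetical helper: maps #rrggbb to the closest index in the 6x6x6
// colour cube (indexes 16 to 231) of the usual xterm 256-colour palette.
function nearestCubeColour(string $hex): int
{
    $hex = ltrim($hex, '#');

    if (1 !== preg_match('/^[0-9a-f]{6}$/i', $hex)) {
        throw new InvalidArgumentException('Expected a colour like #932e2e.');
    }

    $levels  = [0, 95, 135, 175, 215, 255];
    $weights = [36, 6, 1]; // red, green, blue
    $index   = 16;

    foreach (str_split($hex, 2) as $position => $pair) {
        $component = hexdec($pair);
        $best      = 0;

        // Pick the closest level on this channel.
        foreach ($levels as $i => $level) {
            if (abs($component - $level) < abs($component - $levels[$best])) {
                $best = $i;
            }
        }

        $index += $best * $weights[$position];
    }

    return $index;
}

echo nearestCubeColour('#932e2e'), "\n"; // 88
```

Because the cube is a regular grid, picking the closest level channel by channel gives the closest cube colour overall.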

For fun, you can change the colours in the palette with the Hoa\Console\Cursor::changeColor static method. You can also change the style of the cursor, like ▋, _ or |.

A more complete usage of Hoa\Console\Cursor and even Hoa\Console\Window is the Hoa\Console\Readline class that is a powerful readline. More than autocompleters, history, key bindings etc., it has an advanced use of cursors. See this in action:

We use Hoa\Console\Cursor to move the cursor or change the colours and Hoa\Console\Window to get the dimensions of the window, scroll some text in it etc. I encourage you to read the implementation.

## The power in your hand: Sound 🎵

Yes, even sound is defined by terminal capabilities. The famous beep is given by the bell string capability. Would you like to make a beep? Easy:

$tput = new Hoa\Console\Tput();
echo $tput->get('bell');

That’s it!

## Bonus: Window

As a bonus, a quick demo of Hoa\Console\Window because it’s fun.

The video shows the execution of the following code:

Hoa\Console\Window::setSize(80, 35);
var_dump(Hoa\Console\Window::getPosition());

foreach([[100, 100], [150, 150], [200, 100], [200, 80],
         [200,  60], [200, 100]] as list($x, $y)) {

    sleep(1);  Hoa\Console\Window::moveTo($x, $y);
}

sleep(2);  Hoa\Console\Window::minimize();
sleep(2);  Hoa\Console\Window::restore();
sleep(2);  Hoa\Console\Window::lower();
sleep(2);  Hoa\Console\Window::raise();

We resize the window, we get its position, we move the window on the screen, we minimize and restore it, and finally we put it behind all other windows just before raising it.

## Conclusion

In this article, we saw how to control the terminal by: Firstly, detecting the type of pipes, and secondly, reading and using the terminal capabilities. We know where these capabilities are stored and we saw a few of them in action.

This approach ensures your code will be portable, easy to maintain and easy to use. The portability is very important because, like browsers and user devices, we have a lot of terminal emulators released in the wild. We have to care about them.

I encourage you to take a look at the Hoa\Console library and to contribute to make it even more awesome 😄.

# Generate strings based on regular expressions

During my PhD thesis, I partly worked on the problem of automatic and accurate test data generation. In order to be complete and self-contained, I addressed all kinds of data types, including strings. This article is the first one of a little series that aims at showing how to generate accurate and relevant strings under several constraints.

## What is a regular expression?

We are talking about formal language theory here. In the known world, there are four kinds of languages. More formally, the Chomsky hierarchy was formulated in 1956, classifying grammars (which define languages) in four levels:

1. unrestricted grammars, matching languages known as Turing languages, no restriction,
2. context-sensitive grammars, matching contextual languages,
3. context-free grammars, matching algebraic languages, based on pushdown (stack) automata,
4. regular grammars, matching regular languages.

Each level includes the next level. The last level is the “weakest”, which must not sound negative here. Regular expressions are used often because of their simplicity and also because they solve most problems we encounter daily.

A regular expression is a small language with very few operators and, most of the time, simple semantics. For instance ab(c|d) means: a word (a datum) starting with ab and followed by c or d. We also have quantification operators (also known as repetition operators), such as ?, * and +. We also have {x,y} to define a repetition between x and y. Thus, ? is equivalent to {0,1}, * to {0,} and + to {1,}. When y is missing, it means $+\infty$, so unbounded (or more exactly, bounded by the limits of the machine). So, for instance ab(c|d){2,4}e? means: a word starting with ab, followed 2, 3 or 4 times by c or d (so cc, cd, dc, ccc, ccd, cdc and so on) and potentially followed by e.
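As a quick check of this reading, the following sketch uses PHP’s preg_match on a few words. The word list is mine, and the ^ and $ anchors are added so that the whole word has to match.

```php
<?php

$pattern = '/^ab(c|d){2,4}e?$/';
$words   = ['abcc', 'abcdce', 'abdddd', 'abc', 'abccccc', 'abcde!'];

foreach ($words as $word) {
    $verdict = 1 === preg_match($pattern, $word) ? 'accepted' : 'rejected';
    echo $word, ': ', $verdict, "\n";
}

// ? and {0,1} are interchangeable: both patterns accept the same words.
var_dump(
    preg_match('/^ab(c|d){2,4}e{0,1}$/', 'abcde') === preg_match($pattern, 'abcde')
);
```

The first three words are accepted; abc has too few repetitions, abccccc has too many, and abcde! ends with a character the expression does not allow.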

The goal here is not to teach you regular expressions but this is kind of a tiny reminder. There are plenty of regular languages. You might know POSIX regular expressions or Perl Compatible Regular Expressions (PCRE). Forget the first one, please. The syntax and the semantics are too limited. PCRE is the regular language I recommend all the time.

Behind every formal language there is a graph. A regular expression is compiled into a Finite State Machine (FSM). I am not going to draw and explain them, but it is interesting to know that behind a regular expression there is a basic automaton. No magic.

### Why focus on regular expressions?

This article focuses on regular languages instead of other kinds of languages because we use them very often (even daily). I am going to address context-free languages in another article, be patient young padawan. The needs and constraints with other kinds of languages are not the same and more complex algorithms must be involved. So we are going easy for the first step.

## Understanding PCRE: lex and parse them

The Hoa\Compiler library provides both $LL(1)$ and $LL(k)$ compiler-compilers. The documentation describes how to use it. We discover that the $LL(k)$ compiler comes with a grammar description language called PP. What does it mean? It means for instance that the grammar of the PCRE can be written with the PP language and that Hoa\Compiler\Llk will transform this grammar into a compiler. That’s why we call them “compilers of compilers”.

Fortunately, the Hoa\Regex library provides the grammar of the PCRE language in the hoa://Library/Regex/Grammar.pp file. Consequently, we are able to analyze regular expressions written in the PCRE language! Let’s try in a shell at first with the hoa compiler:pp tool:

$ echo 'ab(c|d){2,4}e?' | hoa compiler:pp hoa://Library/Regex/Grammar.pp 0 --visitor dump
> #expression
> > #concatenation
> > > token(literal, a)
> > > token(literal, b)
> > > #quantification
> > > > #alternation
> > > > > token(literal, c)
> > > > > token(literal, d)
> > > > token(n_to_m, {2,4})
> > > #quantification
> > > > token(literal, e)
> > > > token(zero_or_one, ?)

We read that the whole expression is composed of a single concatenation of two tokens: a and b, followed by a quantification, followed by another quantification. The first quantification is an alternation of (a choice between) two tokens: c and d, between 2 to 4 times. The second quantification is the e token that can appear zero or one time. Pretty simple.

The final output of the Hoa\Compiler\Llk\Parser class is an Abstract Syntax Tree (AST). The documentation of Hoa\Compiler explains all that stuff, you should read it. The $LL(k)$ compiler is cut out into very distinct layers in order to improve hackability. Again, the documentation teaches us we have four levels in the compilation process: lexical analyzer, syntactic analyzer, trace and AST. The lexical analyzer (also known as lexer) transforms the textual data being analyzed into a sequence of tokens (formally known as lexemes). It checks whether the data is composed of the good pieces. Then, the syntactic analyzer (also known as parser) checks that the order of tokens in this sequence is correct (formally we say that it derives the sequence, see the Matching words section to learn more).

Still in the shell, we can get the result of the lexical analyzer by using the --token-sequence option; thus:

$ echo 'ab(c|d){2,4}e?' | hoa compiler:pp hoa://Library/Regex/Grammar.pp 0 --token-sequence
#  …  token name   token value  offset
-----------------------------------------
0  …  literal      a                 0
1  …  literal      b                 1
2  …  capturing_   (                 2
3  …  literal      c                 3
4  …  alternation  |                 4
5  …  literal      d                 5
6  …  _capturing   )                 6
7  …  n_to_m       {2,4}             7
8  …  literal      e                12
9  …  zero_or_one  ?                13
10  …  EOF                           15

This is the sequence of tokens produced by the lexical analyzer. The tree is not yet built because this is the first step of the compilation process. However, it is always interesting to understand these different steps and see how they work.
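To get a feeling of what a lexical analyzer does, here is a toy lexer in PHP. It is only a sketch, unrelated to the implementation of Hoa\Compiler: the `tokenize` function is hypothetical, and its token definitions are a tiny, hand-written subset that merely reuses the token names printed above.

```php
<?php

// Toy lexer: tries each token definition, in order, at the current offset.
function tokenize(string $input): array
{
    // Hand-written subset of token definitions (assumption of this sketch).
    $definitions = [
        'capturing_'  => '\(',
        '_capturing'  => '\)',
        'alternation' => '\|',
        'n_to_m'      => '\{[0-9]+,[0-9]+\}',
        'zero_or_one' => '\?',
        'literal'     => '.',
    ];

    $tokens = [];
    $offset = 0;
    $length = strlen($input);

    while ($offset < $length) {
        foreach ($definitions as $name => $pattern) {
            // The A modifier anchors the match at $offset.
            if (1 === preg_match('#' . $pattern . '#A', $input, $match, 0, $offset)) {
                $tokens[] = [$name, $match[0], $offset];
                $offset  += strlen($match[0]);

                continue 2;
            }
        }

        throw new RuntimeException('Unrecognized input at offset ' . $offset . '.');
    }

    return $tokens;
}

foreach (tokenize('ab(c|d){2,4}e?') as list($name, $value, $offset)) {
    printf("%-12s %-6s %2d\n", $name, $value, $offset);
}
```

The names, values and offsets are the same as in the sequence above, except for the final EOF token that this sketch does not emit.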

Now we are able to analyze any regular expression in the PCRE format! The result of this analysis is a tree. You know what is fun with trees? Visiting them.

## Visiting the AST

Unsurprisingly, each node of the AST can be visited thanks to the Hoa\Visitor library. Here is an example with the “dump” visitor:

use Hoa\Compiler;
use Hoa\File;

// 1. Load grammar.
$compiler = Compiler\Llk\Llk::load(
    new File\Read('hoa://Library/Regex/Grammar.pp')
);

// 2. Parse a data.
$ast      = $compiler->parse('ab(c|d){2,4}e?');

// 3. Dump the AST.
$dump     = new Compiler\Visitor\Dump();
echo $dump->visit($ast);

This program will print the same AST dump we have previously seen in the shell.

How to write our own visitor? A visitor is a class with a single visit method. Let’s try a visitor that pretty prints a regular expression, i.e. transforms:

ab(c|d){2,4}e?

into:

a
b
(
    c
    |
    d
){2,4}
e?

Why a pretty printer? First, it shows how to visit a tree. Second, it shows the structure of the visitor: we filter by node ID (#expression, #quantification, token etc.) and we apply respective computations. A pretty printer is often a good way to become familiar with the structure of an AST.

Here is the class. It catches only useful constructions for the given example:

use Hoa\Visitor;

class PrettyPrinter implements Visitor\Visit {

    public function visit ( Visitor\Element $element,
                            &$handle = null,
                            $eldnah  = null ) {

        static $_indent = 0;

        $out    = null;
        $nodeId = $element->getId();

        switch($nodeId) {

            // Reset indentation and…
            case '#expression':
                $_indent = 0;

            // … visit all the children.
            case '#quantification':
                foreach($element->getChildren() as $child)
                    $out .= $child->accept($this, $handle, $eldnah);
                break;

            // One new line between each child of the concatenation.
            case '#concatenation':
                foreach($element->getChildren() as $child)
                    $out .= $child->accept($this, $handle, $eldnah) . "\n";
                break;

            // Add parentheses and increase indentation.
            case '#alternation':
                $oout = [];

                $pIndent = str_repeat('    ', $_indent);
                ++$_indent;
                $cIndent = str_repeat('    ', $_indent);

                foreach($element->getChildren() as $child)
                    $oout[] = $cIndent . $child->accept($this, $handle, $eldnah);

                --$_indent;
                $out .= $pIndent . '(' . "\n" .
                        implode("\n" . $cIndent . '|' . "\n", $oout) . "\n" .
                        $pIndent . ')';
                break;

            // Print token value verbatim.
            case 'token':
                $tokenId    = $element->getValueToken();
                $tokenValue = $element->getValueValue();

                switch($tokenId) {

                    case 'literal':
                    case 'n_to_m':
                    case 'zero_or_one':
                        $out .= $tokenValue;
                        break;

                    default:
                        throw new RuntimeException(
                            'Token ID ' . $tokenId . ' is not well-handled.'
                        );
                }
                break;

            default:
                throw new RuntimeException(
                    'Node ID ' . $nodeId . ' is not well-handled.'
                );
        }

        return $out;
    }
}

And finally, we apply the pretty printer on the AST like previously seen:

$compiler    = Compiler\Llk\Llk::load(
    new File\Read('hoa://Library/Regex/Grammar.pp')
);
$ast         = $compiler->parse('ab(c|d){2,4}e?');
$prettyprint = new PrettyPrinter(); echo$prettyprint->visit($ast); Et voilà ! Now, put all that stuff together! ## Isotropic generation We can use Hoa\Regex and Hoa\Compiler to get the AST of any regular expressions written in the PCRE format. We can use Hoa\Visitor to traverse the AST and apply computations according to the type of nodes. Our goal is to generate strings based on regular expressions. What kind of generation are we going to use? There are plenty of them: uniform random, smallest, coverage based… The simplest is isotropic generation, also known as random generation. But random says nothing: what is the repartition, or do we have any uniformity? Isotropic means each choice will be solved randomly and uniformly. Uniformity has to be defined: does it include the whole set of nodes or just the immediate children of the node? Isotropic means we consider only immediate children. For instance, a node #alternation has \displaystyle c^1 immediate children, the probability \displaystyle C to choose one child is: \displaystyle P(C) = \frac{1}{c^1} Yes, simple as that! We can use the Hoa\Math library that provides the Hoa\Math\Sampler\Random class to sample uniform random integers and floats. Ready? ### Structure of the visitor The structure of the visitor is the following: use Hoa\Visitor; use Hoa\Math; class IsotropicSampler implements Visitor\Visit { protected$_sampler = null;

public function __construct ( Math\Sampler $sampler ) {$this->_sampler = $sampler; return; } public function visit ( Visitor\Element$element,
&$handle = null,$eldnah  = null ) {

switch($element->getId()) { // … } } } We set a sampler and we start visiting and filtering nodes by their node ID. The following code will generate a string based on the regular expression contained in the $expression variable:

    $expression = '…';
    $ast        = $compiler->parse($expression);
    $generator  = new IsotropicSampler(new Math\Sampler\Random());

    echo $generator->visit($ast);

We are going to change the value of $expression step by step until we reach ab(c|d){2,4}e?.
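
Before filling in the cases, the isotropic rule itself can be tried in isolation. The following is only a sketch in plain PHP: chooseChild is a hypothetical helper invented for this example, and it relies on PHP's built-in random_int instead of Hoa\Math\Sampler\Random. It picks one immediate child uniformly and prints the observed frequencies, which should be close to 1/2 for the two children of (c|d).

```php
<?php

declare(strict_types=1);

// Hypothetical helper (not part of Hoa): pick one immediate child uniformly,
// so that each of the c children is selected with probability 1/c.
function chooseChild(array $children)
{
    if ([] === $children) {
        throw new InvalidArgumentException('A node needs at least one child.');
    }

    // Re-index so that positions 0..c-1 are guaranteed to exist.
    $children = array_values($children);

    return $children[random_int(0, count($children) - 1)];
}

// Immediate children of the alternation in (c|d).
$children = ['c', 'd'];
$counts   = array_fill_keys($children, 0);
$draws    = 10000;

for ($i = 0; $i < $draws; ++$i) {
    ++$counts[chooseChild($children)];
}

// Each observed frequency should be close to 1/2.
foreach ($counts as $child => $count) {
    printf("%s: %.3f\n", $child, $count / $draws);
}
```

The important point is that only the immediate children are counted: a child that is itself a large subtree still gets the same probability as a single literal.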

### Case of #expression

A node of type #expression has only one child. Thus, we simply return the computation of this child:

    case '#expression':
        return $element->getChild(0)->accept($this, $handle, $eldnah);
        break;

### Case of token

We consider only one type of token for now: literal. A literal can contain an escaped character, can be a single character or can be . (which means everything). We consider only a single character for this example (spoiler: the whole visitor already exists). Thus:

    case 'token':
        return $element->getValueValue();
        break;

Here, with $expression = 'a'; we get the string a.

### Case of #concatenation

A concatenation is just the computation of all children joined into a single string. Thus:

    case '#concatenation':
        $out = null;

        foreach($element->getChildren() as $child)
            $out .= $child->accept($this, $handle, $eldnah);

        return $out;
        break;

At this step, with $expression = 'ab'; we get the string ab. Totally crazy.

### Case of #alternation

An alternation is a choice between several children. All we have to do is to select a child based on the probability given above. The number of children for the current node can be known thanks to the getChildrenNumber method. We are also going to use the sampler of integers. Thus:

    case '#alternation':
        $childIndex = $this->_sampler->getInteger(
            0,
            $element->getChildrenNumber() - 1
        );

        return $element->getChild($childIndex)
                       ->accept($this, $handle, $eldnah);
        break;

Now, with $expression = 'ab(c|d)'; we get the strings abc or abd at random. Try several times to see for yourself.

### Case of #quantification

A quantification is an alternation of concatenations. Indeed, e{2,4} is strictly equivalent to ee|eee|eeee. We have only two quantifications in our example: ? and {x,y}. We are going to find the values of x and y (for ?, they are 0 and 1) and then choose at random between these bounds. Let’s go:

    case '#quantification':
        $out = null;
        $x   = 0;
        $y   = 0;

        // Filter the type of quantification.
        switch($element->getChild(1)->getValueToken()) {

            // ?
            case 'zero_or_one':
                $y = 1;
                break;

            // {x,y}
            case 'n_to_m':
                $xy = explode(
                    ',',
                    trim($element->getChild(1)->getValueValue(), '{}')
                );
                $x  = (int) trim($xy[0]);
                $y  = (int) trim($xy[1]);
                break;
        }

        // Choose the number of repetitions.
        $max = $this->_sampler->getInteger($x, $y);

        // Concatenate.
        for($i = 0; $i < $max; ++$i)
            $out .= $element->getChild(0)->accept($this, $handle, $eldnah);

        return $out;
        break;
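
The bound extraction and the equivalence between a quantification and an alternation can be checked on their own. The sketch below is plain PHP and does not use Hoa: quantifierBounds and expandQuantifier are hypothetical helpers written for this example, and only the two token types of this article are handled.

```php
<?php

declare(strict_types=1);

// Hypothetical helper: turn a quantifier token into its [min, max] bounds.
function quantifierBounds(string $tokenId, string $tokenValue): array
{
    switch ($tokenId) {
        case 'zero_or_one': // ?
            return [0, 1];

        case 'n_to_m':      // {x,y}
            [$x, $y] = explode(',', trim($tokenValue, '{}'));

            return [(int) trim($x), (int) trim($y)];

        default:
            throw new RuntimeException('Token ID ' . $tokenId . ' is not handled.');
    }
}

// Hypothetical helper: list the alternatives a quantified literal stands for.
function expandQuantifier(string $literal, int $min, int $max): array
{
    $alternatives = [];

    for ($n = $min; $n <= $max; ++$n) {
        $alternatives[] = str_repeat($literal, $n);
    }

    return $alternatives;
}

[$min, $max] = quantifierBounds('n_to_m', '{2,4}');

echo implode('|', expandQuantifier('e', $min, $max)), "\n"; // ee|eee|eeee
```

Choosing the number of repetitions uniformly between the bounds amounts to choosing uniformly among these alternatives, which is why the visitor does not need to build the alternation explicitly.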

Finally, with $expression = 'ab(c|d){2,4}e?'; we can have the following strings: abdcce, abdc, abddcd, abcde etc. Nice, isn’t it? Want more?

    for($i = 0; $i < 42; ++$i)
        echo $generator->visit($ast), "\n";

    /**
     * Could output:
     *     abdce
     *     abdcc
     *     abcdde
     *     abcdcd
     *     abcde
     *     abcc
     *     abddcde
     *     abddcce
     *     abcde
     *     abcc
     *     abdcce
     *     abcde
     *     abdce
     *     abdd
     *     abcdce
     *     abccd
     *     abdcdd
     *     abcdcce
     *     abcce
     *     abddc
     */
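
For readers who want to experiment without installing Hoa, the four cases can be condensed into a single function. This is a minimal sketch under stated assumptions, not the Hoa implementation: the AST is written by hand as nested arrays (a shape invented for this example, with the quantification bounds already extracted), sampleNode is a hypothetical name, and random_int replaces Hoa\Math\Sampler\Random.

```php
<?php

declare(strict_types=1);

// Hand-written stand-in for the AST of ab(c|d){2,4}e?.
// Node shapes (assumed for this sketch only):
//   ['token', value]
//   ['#concatenation', [children]]
//   ['#alternation', [children]]
//   ['#quantification', child, min, max]
$ast = ['#concatenation', [
    ['token', 'a'],
    ['token', 'b'],
    ['#quantification', ['#alternation', [['token', 'c'], ['token', 'd']]], 2, 4],
    ['#quantification', ['token', 'e'], 0, 1],
]];

function sampleNode(array $node): string
{
    switch ($node[0]) {
        case 'token':
            return $node[1];

        case '#concatenation':
            // Join the computation of all children.
            return implode('', array_map('sampleNode', $node[1]));

        case '#alternation':
            // Uniform choice among the immediate children only.
            return sampleNode($node[1][random_int(0, count($node[1]) - 1)]);

        case '#quantification':
            // Uniform choice of the number of repetitions, then the child
            // is sampled again for each repetition.
            $out = '';
            $max = random_int($node[2], $node[3]);

            for ($i = 0; $i < $max; ++$i) {
                $out .= sampleNode($node[1]);
            }

            return $out;

        default:
            throw new RuntimeException('Node ID ' . $node[0] . ' is not handled.');
    }
}

for ($i = 0; $i < 5; ++$i) {
    echo sampleNode($ast), "\n";
}
```

Note that the child of a quantification is sampled once per repetition, which is why (c|d){2,4} can yield mixed strings such as cd or dcc and not only cc or dddd.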

## Performance

It is difficult to give numbers because they depend on many parameters: your machine configuration, the PHP VM, whether other programs are running, etc. Still, I have generated 1 million ($10^6$) strings in less than 25 seconds on my machine (an old MacBook Pro), which is pretty reasonable.
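
To get a rough figure on your own machine, a loop timed with PHP's monotonic clock is enough. The sketch below makes two assumptions: it needs PHP 7.3 or later for hrtime, and $generate is a stand-in closure so that the example runs on its own. In the setting of this article, its body would be a call such as $generator->visit($ast).

```php
<?php

declare(strict_types=1);

// Stand-in generator (an assumption of this sketch); replace its body with
// a call to the real visitor to measure it.
$generate = function (): string {
    return 'ab' . str_repeat('c', random_int(2, 4));
};

$iterations = 100000;
$start      = hrtime(true); // monotonic clock, in nanoseconds

for ($i = 0; $i < $iterations; ++$i) {
    $generate();
}

$elapsed = (hrtime(true) - $start) / 1e9; // seconds

printf(
    "%d strings in %.3f s (%.0f strings/s)\n",
    $iterations,
    $elapsed,
    $iterations / $elapsed
);
```

Run it several times and keep the order of magnitude rather than the exact value, for the reasons given above.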

## Conclusion and surprise

So, yes, now we know how to generate strings based on regular expressions! Supporting the whole PCRE format is difficult. That’s why the Hoa\Regex library provides the Hoa\Regex\Visitor\Isotropic class, which is a more advanced visitor. It supports classes, negative classes, ranges, all quantifications, all kinds of literals (characters, escaped characters, types of characters such as \w, \d, \h…) etc. Consequently, all you have to do is:

    use Hoa\Regex;

    // …
    $generator = new Regex\Visitor\Isotropic(new Math\Sampler\Random());

    echo $generator->visit($ast);

This algorithm is used in Praspel, a specification language I designed during my PhD thesis. More specifically, this algorithm is used inside realistic domains. I am not going to explain it today but it allows me to introduce the “surprise”.

### Generate strings based on regular expressions in atoum

atoum is an awesome unit test framework. You can use the Atoum\PraspelExtension extension to use Praspel and therefore realistic domains inside atoum. You can use realistic domains to validate and to generate data; they are designed for that. Obviously, we can use the Regex realistic domain. This extension provides several features including sample, sampleMany and predicate to respectively generate one datum, generate many data and validate a datum based on a realistic domain. To declare a regular expression, we must write:

    $regex = $this->realdom->regex('/ab(c|d){2,4}e?/');

And to generate a datum, all we have to do is:

    $datum = $this->sample($regex);

For instance, imagine you are writing a test called test_mail and you need an email address:

    public function test_mail ( ) {

        $this
            ->given(
                $regex   = $this->realdom->regex('/[\w\-_]+(\.[\w\-\_]+)*@\w\.(net|org)/'),
                $address = $this->sample($regex),
                $mailer  = new \Mock\Mailer(…),
            )
            ->when($mailer->sendTo($address))
            ->then
                ->…
    }

It is easy to read, fast to execute, and it helps to focus on the logic of the test instead of the test data (also known as fixtures). Note that most of the time the regular expressions are already in the code (maybe as constants). It is therefore easier to write and to maintain the tests.
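
When the regular expression already lives in the code as a constant, the same constant can also be used to check data. Here is a minimal sketch in plain PHP, without atoum or Praspel: matchesEntirely and the WORD constant are hypothetical names invented for this example, and the pattern is assumed to be written without delimiters and without any / character.

```php
<?php

declare(strict_types=1);

// Regular expression kept as a constant, as it could be in application code.
const WORD = 'ab(c|d){2,4}e?';

// Hypothetical helper: does $datum match $regex over its whole length?
function matchesEntirely(string $regex, string $datum): bool
{
    // The non-capturing group keeps top-level alternations inside the anchors;
    // the D modifier makes $ match only at the very end of the subject.
    return 1 === preg_match('/^(?:' . $regex . ')$/D', $datum);
}

// Two strings from the output listed earlier, then two that must be rejected.
var_dump(matchesEntirely(WORD, 'abdce'));    // bool(true)
var_dump(matchesEntirely(WORD, 'abcc'));     // bool(true)
var_dump(matchesEntirely(WORD, 'abc'));      // bool(false): one repetition only
var_dump(matchesEntirely(WORD, 'abcdcdce')); // bool(false): five repetitions
```

Anchoring matters here: without ^ and $, preg_match would accept any string that merely contains a match.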

I hope you enjoyed this first part of the series :-)! This work has been published at the International Conference on Software Testing, Verification and Validation: Grammar-Based Testing using Realistic Domains in PHP.