Faster find algorithms in nom
Tagua VM is an experimental PHP virtual
machine written in Rust and LLVM. It is composed as a set of libraries.
One of them that keeps me busy these days is
tagua-parser
. It contains the
lexical and syntactic analysers for the PHP language, in addition to the
AST (Abstract Syntax Tree). If you would like to know more about this
project, you can see this conference I gave at the PHPTour last week:
Tagua VM, a safe PHP virtual
machine.
The library tagua-parser
is built with parser combinators. Instead of
having a classical grammar, compiled to a parser, we write pure
functions acting as small parsers. We then combine them together. This
post does not explain why this is a sane approach in our context, but
keep in mind this is much easier to test, to maintain, and to optimise.
Because this project is complex enought, we are delegating the parser combinator implementation to nom.
nom is a parser combinators library written in Rust. Its goal is to provide tools to build safe parsers without compromising the speed or memory consumption. To that end, it uses extensively Rust's strong typing, zero copy parsing, push streaming, pull streaming, and provides macros and traits to abstract most of the error prone plumbing.
Recently, I have been working on optimisations in the FindToken
and
FindSubstring
traits from nom itself. These traits provide methods to
find a token (i.e. a lexeme), and to find a substring, crazy naming.
However, this is not totally valid: FindToken
expects to find a single
item (if implemented for u8
, it will look for a u8
in a &[u8]
),
and FindSubstring
really is about finding a substring, so a token of
any length.
It appeared that these methods can be optimised in some cases. Both
default implementations are using Rust iterators: Regular iterator for
FindToken
, and window
iterator for
FindSubstring
, i.e. an iterator over overlapping subslices of a given
length. We have benchmarked big PHP comments, which are analysed by
parsers actively using these two trait implementations.
Here are the result, before and after our optimisations:
test …::bench_span ... bench: 73,433 ns/iter (+/- 3,869)
test …::bench_span ... bench: 15,986 ns/iter (+/- 3,068)
A boost of 78%! Nice!
The pull request has been merged
today, thank you Geoffroy Couprie! The new algorithms heavily rely on
the memchr
crate. So all
the credits should really go to Andrew Gallant! This crate provides a
safe interface libc
's memchr
and memrchr
. It also provides
fallback implementations when either function is unavailable.
The new algorithms are only implemented for &[u8]
though. Fortunately,
the implementation for &str
fallbacks to the former.
This is small contribution, but it brings a very nice boost. Hope it will benefit to other projects!
I am also blowing the dust off of Algorithms on
Strings,
by M. Crochemore, C. Hancart, and T. Lecroq. I am pretty sure it should
be useful for nom and tagua-parser
. If you haven't read this book yet,
I can only encourage you to do so!