The C galaxy
The galaxy we will explore today is the C galaxy. This post will explain what C is (shortly), how to compile any Rust program in C in theory, and how to do that practically with our Rust parser from the Rust side and the C side. We will also see how to test such a binding.
What is C, and why?
C is probably the most used and known programming language in the world. Quoting Wikipedia:
C […] is a general-purpose, imperative computer programming language, supporting structured programming, lexical variable scope and recursion, while a static type system prevents many unintended operations. By design, C provides constructs that map efficiently to typical machine instructions, and therefore it has found lasting use in applications that had formerly been coded in assembly language, including operating systems, as well as various application software for computers ranging from supercomputers to embedded systems.
The impact of C is probably without precedent on the progamming language world. Almost everything is written in C, starting with operating systems. Today, it is one of the few common denominator between any programs on any systems on any machines in the world. In other words, being compatible with C opens a large door to everything. Your program will be able to talk directly to any program easily.
Because languages like PHP or Python are written in C, in our particular Gutenberg parser usecase, it means that the parser can be embedded and used by PHP or Python directly, with almost no overhead. Neat!
Rust 🚀 C
In order to use Rust from C, one may need 2 elements:
- A static library (
.a
file), - A header file (
.h
file).
The theory
To compile a Rust project into a static library, the crate-type
property must contain the staticlib
value. Let's edit the Cargo.toml
file such as:
[]
= "gutenberg_post_parser"
= ["staticlib"]
Once cargo build --release
is run, a libgutenberg_post_parser.a
file
is created in target/release/
. Done. cargo
and rustc
make this
step really a doddle.
Now the header file. It can be written manually, but it's tedious and it
gets easily outdated. The goal is to automatically generate it. Enter
cbindgen
:
cbindgen
can be used to generate C bindings for Rust code. It is currently being developed to support creating bindings for WebRender, but has been designed to support any project.
To install cbindgen
, edit your Cargo.toml
file, such as:
[]
= "build.rs"
[]
= "^0.6.0"
Actually, cbindgen
comes in 2 flavors: CLI executable, or a library. I
prefer to use the library approach, which makes installation easier.
Note that Cargo has been instructed to use the build.rs
file to build
the project. This file is an appropriate place to generate the C headers
file with cbindgen
. Let's write it!
extern crate cbindgen;
With those information, cbindgen
will scan the source code of the
project and will generate C headers automatically in the
dist/gutenberg_post_parser.h
header file. Scanning will be detailed in
a moment, but before that, let's quickly see how to control the content
of the header file. With the code snippet above, cbindgen
will look
for a cbindgen.toml
configuration file in the CARGO_MANIFEST_DIR
directory, i.e. the root of your crate. Mine looks like this:
= """
/*
Gutengerg Post Parser, the C bindings.
Warning, this file is autogenerated by `cbindgen`.
Do not modify this manually.
*/"""
= 4
= "C"
It describes itself quite easily. The documentation details the configuration very well.
cbindgen
will scan the code and will stop on struct
s or enum
s that
have the decorator #[repr(C)]
, #[repr(_size_)]
or
#[repr(transparent)]
, or functions that are marked as extern "C"
and
are public. So when one writes:
pub extern "C" parse
Then cbindgen
will generate this:
// … header comment …
typedef struct Slice;
typedef enum Option_Tag;
typedef struct Some_Body;
typedef struct Option;
void ;
It works; Great!
Note the #[no_mangle]
that decorates the Rust parse
function. It
instructs the compiler to not rename the function, so that the function
has the same name from the perspective of C.
OK, that's all for the theory. Let's practise now, we have a parser to bind to C!
Practise
We want to bind a function named parse
. The function outputs an AST
representing the language being analysed. For the
recall, the
original AST looks like this:
This AST is defined in the Rust parser. The Rust binding to C will
transform this AST into another set of structs and enums for C. It is
mandatory only for types that are directly exposed to C, not internal
types that Rust uses. Let's start by defining Node
:
Some immediate thoughts:
- The structure
Slice_c_char
emulates Rust slices (see below), - The enum
Option_c_char
emulatesOption
(see below), - The field
children
has type*const c_void
. It should be*const Vector_Node
(our definition ofVector
), but the definition ofNode
is based onVector_Node
and vice versa. This cyclical definition case is unsupported bycbindgen
so far. So… yes, it is defined as avoid
pointer, and will be casted later in C, - The fields
namespace
andname
are originally a tuple in Rust. Tuples have no equivalent in C withcbindgen
, so two fields are used instead.
Let's define Slice_c_char
:
This definition borrows the semantics of Rust' slices. The major benefit is that there is no copy when binding a Rust slice to this structure.
Let's define Option_c_char
:
Finally, we need to define Vector_Node
and our own Result
for C.
They mimic the Rust semantics closely:
Alright, all types are declared! It's time to write the parse
function:
pub extern "C"
The function takes a pointer from C. It means that the data to analyse (i.e. the Gutenberg blog post) is allocated and owned by C: The memory is allocated on the C side, and Rust is only responsible of the parsing. This is where Rust shines: No copy, no clone, no memory mess, only pointers to this data will be returned to C as slices and vectors.
The workflow will be the following:
- First thing to do when we deal with C: Check that the pointer is not null,
- Reconstitute an input from the pointer with
CStr
. This standard API is useful to abstract C strings from the Rust point of view. The difference is that a C string terminates by aNULL
byte and has no length, while in Rust a string has a length and does not terminate with aNULL
byte, - Run the parser, then transform the AST into the “C AST”.
Let's do that!
pub extern "C"
Only pointers are used in Vector_Node
: Pointer to the output, and the
length of the output. The conversion is light.
Now let's see the into_c
function. Some parts will not be detailed;
Not because they are difficult but because they are repetitive. The
entire code lands
here.
I want to show namespace
for the warm-up (name
, attributes
and
Phrase
are very similar), and children
because it deals with void
.
Let's convert ast::Node::Block.name.0
into Node::Block.namespace
:
=> Block
Pretty straightforward so far. namespace
is a Slice_c_char
. The
pointer
is the pointer of the name.0
slice, and the length
is the
length of the same name.0
. This is the same process for other Rust
slices.
children
is different though. It works in three steps:
- Collect all children as C AST nodes in a Rust vector,
- Transform the Rust vector into a valid
Vector_Node
, - Transform the
Vector_Node
into a*const c_void
pointer.
=> Block
Step 1 is straightforward.
Step 2 defines what is the behavior when there is no node. In other
words, it defines what an empty Vector_Node
is. The buffer
must
contain a NULL
raw pointer, and the length is obviously 0. Without
this behavior I got various segmentation fault in my code, even if I
checked the length
before the buffer
. Note that Vector_Node
is
allocated on the heap with Box::new
so that the pointer can be easily
shared with C.
Step 3 uses the Box::into_raw
function
to consume the box and to return the wrapped raw pointer of the data it
owns. Rust will not free anything here, it's our responsability (or the
responsability of C to be pedantic). Then the *mut Vector_Node
returned by Box::into_raw
can be freely casted into *const c_void
.
Finally, we instruct the compiler to not drop output
when it goes out
of scope with mem::forget
(at this step of the series, you are very
likely to know what it does).
Personally, I spent few hours to understand why my pointers got random
addresses, or were pointing to a NULL
data. The resulting code is
simple and kind of clear to read, but it wasn't obvious for me what to
do beforehand.
And that's all for the Rust part! The next section will present the C code that calls Rust, and how to compile everything all together.
C 🚀 executable
Now the Rust part is ready, the C part must be written to call it.
Minimal Working Example
Let's do something very quick to see if it links and compiles:
int
To keep the code concise, I left all the error handlers out of the example. The entire code lands here if you're curious.
What happens in this code? The first thing to notice is
#include "gutenberg_post_parser.h"
which is the header file that is
automatically generated by cbindgen
.
Then a filename from argv[1]
is used to read a blog post to parse. The
parse
function is from Rust, just like the Result
and Vector_Node
types.
The Rust enum Result { Ok(Vector_Node), Err }
is compiled to C as:
typedef enum Result_Tag;
typedef struct Ok_Body;
typedef struct Result;
No need to say that the Rust version is easier and more compact to read,
but this isn't the point. To check if Result
contains an Ok
value or
an Err
or, one has to check the tag
field, like we did with
output.tag == Err
. To get the content of the Ok
, we did
output.ok._0
(_0
is a field from Ok_Body
).
Let's compile this with clang
! We assume that
this code above is located in the same directory than the
gutenberg_post_parser.h
file, i.e. in a dist/
directory. Thus:
# Output executable name. \
# Input source file. \
# Directory where to find the static library (*.a). \
# Link with the gutenberg_post_parser.h file. \
# Other libraries to link with.
And that's all! We end up with a gutenberg-post-parser
executable that
runs C and Rust.
More details
In the original source
code,
a recursive function that prints the entire AST on stdout
can be
found, namely print
(original, isn't it?). Here is some side-by-side
comparisons between Rust syntax and C syntax.
The Vector_Node
struct in Rust:
The Vector_Node
struct in C:
typedef struct Vector_Node;
So to respectivelly read the number of nodes (length of the vector) and the nodes in C, one has to write:
const uintptr_t number_of_nodes = nodes->length;
for
This is almost idiomatic C code!
A Node
is defined in C as:
typedef enum Node_Tag;
typedef struct Block_Body;
typedef struct Phrase_Body;
typedef struct Node;
So once a node is fetched, one can write the following code to detect its kind:
if else if
Let's focus on Block
for a second, and let's print the namespace and
the name of the block separated by a slash (/
):
const Block_Body block = node.block;
const Slice_c_char namespace = block.namespace;
const Slice_c_char name = block.name;
;
The special %.*s
form in printf
allows to print a string based on
its length and its pointer.
I think it is interesting to see the cast from void to Vector_Node
for
children
. It's a single line:
const Vector_Node* children = ;
I think that's all for the details!
Testing
I reckon it is also interesting to see how to unit test C bindings directly with Rust. To emulate a C binding, first, the inputs must be in “C form”, so strings must be C strings. I prefer to write a macro for that:
And second, the opposite: The parse
function returns data for C, so
they need to be “converted back” to Rust. Again, I prefer to write a
macro for that:
All right! The final step is to write a unit test. As an example, a
Phrase
will be tested; The idea remains the same for Block
but the
code is more concise for the former.
What happens here? The input
and output
have been prepared. The
former is the C string "foo"
. The latter is the result of parse
.
Then there is a match
to validate the form of the AST. Rust is very
expressive, and this test is a good illustration. The Vector_Node
branch is activated if and only if the length of the vector is 1, which
is expressed with the guard if length == 1
. Then the content of the
phrase is transformed into a Rust string and compared with a regular
assert_eq!
macro.
Note that —in this case— buffer
is of type *const Node
, so it
represents the first element of the vector. If we want to access the
next elements, we would need to use the Vec::from_raw_parts
function
to get a proper Rust API to manipulate this vector.
Conclusion
We have seen that Rust can be embedded in C very easily. In this
example, Rust has been compiled to a static library, and a header file;
the former is native with Rust tooling, the latter is automatically
generated with cbindgen
.
The parser written in Rust manipulates a string allocated and owned by C. Rust only returns pointers (as slices) to this string back to C. Then C has no difficulties to read those pointers. The only tricky part is that Rust allocates some data (like vectors of nodes) on the heap that C must free. The “free” part has been omitted from the article though: It does not represent a big challenge, and a C developer is likely to be used to this kind of situation.
The fact that Rust does not use a garbage collector makes it a perfect
candidate for these usecases. The story behind these bindings is
actually all about memory: Who allocates what, and What is the form of
the data in memory. Rust has a #[repr(C)]
decorator to instruct the
compiler to use a C memory layout, which makes C bindings extremely
simple for the developer.
We have also seen that the C bindings can be unit tested within Rust
itself, and run with cargo test
.
cbindgen
is a precious companion in this adventure, by automating the
header file generation, it reduces the update and the maintenance of the
code to a build.rs
script.
In terms of performance, C should have similar results than Rust, i.e. extremely fast. I didn't run a benchmark to verify this statement, it's purely theoretical. It can be a subject for a next post!
Now that we have successfully embedded Rust in C, a whole new world opens up to us! The next episode will push Rust in the PHP world as a native extension (written in C). Let's go!