« 2017-12-23 | Tinkering | 2017-07-11 » |
Tinkering: 2017-12-19: Regular Expression Parser in C Using Continuation Passing
For a long time, on and off, I have been trying to implement an elegant regular expression parser in C in the minimal amount of (readable) lines of code. The goal was to parse and match in one phase, i.e., without first constructing a tree representing the regular expression. The intuition tells me that it should be possible using recursion, but recursion really only works well for passing the regex, but not for matching the string against it, because the string to be matched has no recursive structure like the pattern, but it is just a string that needs to be matched character by character. Indeed, a straight-forward recursive approach does not work.
What would be needed would be to return alternatives from a C function, i.e., I thought, each parse step could return alternatives of what would be possible to match, so that the caller can then try out the alternatives until one matches. Writing this in C is not directly possible. What I do not want is to pre-parse, so I do not want to return for a whole pattern and then match it. And, well, it still does not work in a straight forward way, e.g., matching "a+(ab)+" against "aaab" does not work by first matching 'a's and then matching 'ab's, but the second part of the pattern requires to backtrack and only match two 'a's before turning to matching 'ab's. This backtrack step runs counter to matching regex substructures like '(...)', so does not map to simple recursion.
Then, it struck me: of course, I need to use continuation passing. This style indeed allows me to 'return' multiple times, which enables me to try out all alternatives until one matches.
Returning multiple times in C? Well, yes, but it requires to write the program in a very non-C-like way. While the result will be elegant in the source language C in some ways, the C compiler will most probably not be up to the task of understanding what's really going on, so will not optimise well, resulting in a lot of stack that will be wasted. But efficiency is not my problem today...
The idea of using continuations means to pass a function parameter to represent how the program continues when the called function returns. This means translating each 'return' into a function call that takes the result. E.g.
int foo(int a) { return a*a; }
To translate this to continuation style, 'foo' receives an additional function parameter to implement 'return'. 'foo' itself returns 'void' instead.
void foo(int a, void (*cont)(int)) { cont(a*a); }
In order to call 'foo', a continuation 'cont' needs to be passed. Well, it turns out, that this is not enough context. Most of the time, continuations need data to continue themselves. In C, this data needs to be encoded as an additional parameter, because there are no closures and lambdas, which would allow to pass a function with its execution context, but this is not possible in pure C. Instead, we pass a data parameter additional to the continuation function. E.g., to implement a function based on the 'foo' function that divides 1050 by the square of an argument, we use this additional data to 'return'. To try this out, the following program computes and then prints '42':
#include <stdio.h> int foo(int a) { return a*a; } int bar(int a) { return 1050 / foo(a); } int main(void) { printf("%d\n", bar(5)); return 0; }
To translate this to continuation passing style (CPS), we introduce that additional data argument as a struct. Most importantly, it will contain another continuation function and its data to continue the 'return' chain. This way, we can tell the inner continuation how to continue inside the continuation. The restructured program looks as follows:
#include <stdio.h> typedef struct data data_t; typedef void (*cont_t)(int result, data_t *); struct data{ cont_t cont; data_t *data; }; void foo(int a, cont_t cont, data_t *data) { cont(a*a, data); } void bar_cont(int a, data_t *data) { data->cont(1050 / a, data->data); } void bar(int a, cont_t cont, data_t *data) { data_t data2 = { cont, data }; foo(a, bar_cont, &data2); } void main_cont(int a, data_t *data) { printf("%d\n", a); } int main(void) { bar(5, main_cont, 0); }
This does the same as the program above – only it is now hardly readable anymore...
So what did we gain? For the regex parsing, we gained that we can now 'return' multiple times. E.g., we can rewrite 'foo' to return both the square and the cube, and the program will then indeed work by printing both '1050 / (x*x)' and '1050 / (x*x*x)':
void foo(int a, cont_t cont, data_t *data) { cont(a*a, data); cont(a*a*a, data); }
This now prints '42' and '8'.
A bit puzzling, maybe, but it works, because we pass 'cont' and 'data', which contain all the information to continue the program after computing 'a*a'. So we can just run that tail of the program twice by calling 'cont' twice. Effectively, that's like 'returing twice'.
We gain another interesting thing: exceptions. By translating 'return' to a function call to 'cont', the continuation, we are left with a return value of 'void'. We can, however, return something else. If every call to a continuation is rewritten as 'return cont(...)', immediately returning sometimg is essentially an exception. Interpreting the result is then 'catching' the exception.
In the above example, we could map 'division by zero' to returning '0' ('no success') immediately, and returning '1' ('success') otherwise. Whenever we want to 'return' multiple times, the exception value needs to be handled properly, too. E.g., 'foo' can check for a successful continuation and then act accordingly. Like so:
#include <stdio.h> typedef struct data data_t; typedef int (*cont_t)(int result, data_t *); struct data{ cont_t cont; data_t *data; }; int foo(int a, cont_t cont, data_t *data) { return cont(a*a, data) && cont(a*a*a, data); } int bar_cont(int a, data_t *data) { if (a == 0) { return 0; /* exception 'division by zero' */ } return data->cont(1050 / a, data->data); } int bar(int a, cont_t cont, data_t *data) { data_t data2 = { cont, data }; return foo(a, bar_cont, &data2); } int main_cont(int a, data_t *data) { printf("%d\n", a); return 1; /* success */ } void compute(int a) { if (!bar(a, main_cont, 0)) { printf("division by zero\n"); } } int main(void) { compute(5); compute(0); }
Coming back to the regex matcher, syntax errors will be mapped to exceptions, while having a match is passed to the continuation as a boolean value. The final step in the regex parser (in 're_match_tail') is to translate the boolean into an actual result to return (RE_E_OK or RE_E_FAIL) so that this function has a 'normal' C style functional API.
The regex example also shows how the 'data_t' parameter passes more context data to the continuation. Basically, all local variables that are live across a function invocation must now be part of the continuation context data in order to be used in the continuation code.
One drawback of the CPS technique in C is its stack usage, especially because it is quite hard to understand at all what is going on, which makes it very hard to predict how much stack is used. Usually, I get SIGSEGV from pointer misuses (typically NULL), but in debugging this program, instead I got an infinite recursion causing SIGSEGV a couple of times.
As already mentioned, C compilers do not really understand this style. We actually store information on the stack so just discarding the stack frames and using tail calls is not always trivial for the compiler. I did some tests and Clang is currently not very good at this, being able to convert only a small fraction of calls into tail calls. GCC did much better in my experiments.
The result of this experiment is a regex parser that is short and runs in a single phase, parsing the regex while matching it, without intermediate representation. It is short, and maybe even elegant, but it is not easy to understand. I had to think quite a lot to get this right, especially the '|' (but to a lesser degree, also the '*', '?', '+', '{...}' part). The result is just a few lines of code for '|' handling, but it really took some time to get it right. To see why, try to fully understand how 're_seq_or_tail' and 're_seq_tail' work.
And if you spot bugs: please tell me. :-)
Of course, I am not the first: other people have done regex parsers using CPS, but usually in programming languages that cope much better with continuations, like Haskell. I wanted to show that it is possible in C, too.
So here it is:
regexp.c regexp.h regexp_internal.h test.c Makefile gpl-3.0.txt
The story continues here with a O(1) stack version of the parser.