yet another introduction to yacc
One of the great tools in the unix toolbox is yacc. Regrettably, the documentation can be somewhat weak. The OpenBSD man page covers command line options, but doesn’t even provide a reference to the grammar of the input file. For that, one must read Stephen Johnson’s paper, Yacc: Yet Another Compiler-Compiler. It’s pretty good, and there’s some other tutorials out there, but perhaps it’s worth highlighting a few tips and tricks.
Here are my notes on using yacc, particularly as regards creating parsers for config files and such. Everybody starts by using yacc to make a toy calculator (probably because that’s the sample Johnson provides), but I’ve never needed to do that myself. I’ll compare pfctl, which has been around for a while and evolved quite a bit, with doas and its rather younger parser.
First off, it’s worth noting that the majority of parsers in OpenBSD use handwritten lexers, not lex. It’s not particularly difficult to tokenize by hand, and as soon as you start doing anything complicated like strings or escape sequences, I find it much easier to mentally unpack a simple while loop. From the lex man page (funny that lex, the tool we don’t use, has a complete manual), an example for a toy Pascal like language: "{"[^}\n]*"}"
Yeesh. That’s not actually a terribly complicated regex, and I do like using regex when appropriate, but they can be hard to work with. C makes it easier to distinguish special characters for our language (anything we add a switch case for), versus special characters for lex and regex. The above would simply be a loop eating characters, only breaking for } and newline. Also, doing it by hand, usually at the bottom of the .y file means everything is in one place, with much less coordination between files and tools.
The format of the input file to yacc is pretty weird. It’s got some C code, then some yacc code with C mixed in, then some more C code. This makes it awkward to explain, but it works well in practice.
About the simplest yacc grammar I can make is one that simply reads and counts characters. It’s got all the requisite parts, which are called the declarations, rules, and program.
%{
#include <stdio.h>
int nchars;
%}
%token CHAR
%%
The top of the file is the declarations. Many files have a special declaration of some C code. Then comes the actual tokens in our grammar. The %% ends the declaration section and begins the rules.
top: chars {
printf("got %d chars\n", nchars);
}
;
chars: /* empty*/
| chars CHAR {
printf("got a char\n");
nchars++;
}
;
%%
We have two very simple rules. The top one, conveniently called top, is where yacc will start by default. We’re going to eat some chars and print out how many. This requires counting chars as we see them. Finally, another %% ends the rules section. Now comes the program.
void
yyerror(const char *fmt, ...)
{
printf("didn't like that. at all.\n");
}
int
yylex(void)
{
return getchar() == EOF ? -1 : CHAR;
}
int
main(int arc, char **argv)
{
yyparse();
}
In order for yacc to work, it requires the user provide yylex and yyerror functions, and somewhere in the program it needs to call yyparse to make the magic happen. This example just parses stdin as simply as possible.
It’s not necessary or desirable to include main in the yacc file for a real program, but it’s a good place to keep yylex. The token types are all declared at the top of the same file.
The top rule in doas/parse.y is good to examine for a slightly more interesting example.
grammar: /* empty */
| grammar '\n'
| grammar rule '\n'
| error '\n'
;
In order to make a list (array, collection, etc.) in yacc, we use this recipe. Always start with a possibly empty match, then accumulate items at the end. This is pretty well known, but worth repeating since it’s so important.
Another trick used here is to match newlines explicitly. Sometimes it’s convenient to have the lexer eat all the whitespace, but a lot of file formats lend themselves to line by line parsing.
Error handling with yacc can be a bit of a chore. By default, it’s pretty brutal, printing one error and exiting, which makes it frustrating for the user. At a minimum, recovering from errors on a line by line basis lets us fix more than one problem at a time. By indicating that errors can be followed by newlines, we can restart the parser.
Speaking of error handling, there’s some question of when is the best time to validate certain options. pfctl and doas take a rather different approach here. pfctl does a lot of conversion during the parse, meaning that if you specify port “ww” by accident instead of “www”, that’s considered a syntax error. Parsing will then skip over the rest of the rule, and if there are any other errors they won’t be reported until the first one is fixed. doas does very little checking in comparison, so it’s not apples to apples, but conversion of usernames to userids and such is done after all the rules are parsed.
This reflects more of a different in tool design and operation, I think, than in development philosophy, but it’s made some things easier in doas. pfctl differentiates between strings and numbers at the token level, which means there’s a few places in the grammar where only a string is accepted, but not a number, requiring “42” as a workaround. Similarly (or the opposite), in doas certain constructs like identity, user or :group, are not reflected in the grammar. Looking for the leading colon to distinguish groupnames isn’t done until rule evaluation. The benefit is that colons can be used unquoted in strings throughout the config file; they are not special tokens.
This turned out to be quite important for doas. Command line arguments and environment variables can contain pretty much any character. We don’t want the config parser to be treating them as special tokens. One might also consider the effect of having lots of keywords. Johnson provides a workaround, the lexical backdoor, but without it one can see some surprising syntax errors. Trying to forward traffic to a host called overload? Oops, that’s a reserved keyword.
In the process of writing this up, I reread Johnson’s original yacc paper and discovered he covers all the same material. It’s all in there, but maybe highlighting a few parts is helpful. I’m grateful to the OpenBSD developers who did pay attention as they read the paper for fixing up the doas parser and adding these missing bits to it.