flex_and_bison

View on GitHub

Notes

flex

… is used to create a scanner with the purpose of scanning input, determine what type of tokens (if any) that input holds, and pass them onwards to a parser. The token matching, if you will, is achieved through regex matching the input. The current matched input is pointed to within the yytext variable. The entire scanner routine is called yylex(), which can be called from a main function, for example. To translate the program we run flex my_program.l, which creates the c program lex.yy.c, and compile the program using cc lex.yy.c -lfl, where the -lfl tells the compiler to link the program with the flex library. When running the program, everytime it needs a token, a call is made to yylex(), which will read the input text and return a token if one is matched against the regex pattern(s). The scanner acts as a coroutine, which means that each time it returns a token, it will remember where it was and pick up from that spot the next time yylex() is called. In the scanner, when a token is ready, it will simply return the token as the value from yylex(). The next time yylex() is called, it will resume scanning the next input characters. If an input doesn’t match against the regex, no token will be produced and nothing will be returned, the scanner simply continue within the same call to yylex() scanning the next input. I.e. If a token is returned, scanning will resume on the next call to yylex(). If nothing is returned, scanning is immediately resumed.

A token returned from the scanner has two parts; the token and the token’s value. The token is a small integer, where the token numbers (the token’s identifier) are arbitrary. Except from that token 0 always means end-of-file. As bison creates the parser it assigns the tokens their numbers automatically starting at number 258 (to avoid collisions with literal character tokens), and creates a .h file containing the definitions of the token numbers. The token’s value tells which out of a group of similar tokens this particular token is.

bison

… is used to create a parser. The task of the parser is to figure out the relationship among the input tokens. One way to display such relationships is by using a parse tree. Given the arithmetic rules, the parse tree for 9 - 5 / 2 + 6 * 4 would look like this: parse tree

The parser needs rules to turn a sequence of tokens into a parse tree. These rules are derived from BNF, the standard for writing context-free grammar. Bison use a simplified version of BNF, using merely a : instead of BNF’s ::= for assignment, and a ; to mark line endings. Just like flex, the C action code is put within a pair of curly brackets at the end of every rule. The parsing is done automatically, all rules matched are remembered and the action code maintains the value associated with each symbol.