Online Documentation for Flex and Bison

Online documentation for Flex and Bison is available. Flex and Bison are slightly improved, upward-compatible versions of lex and yacc, respectively.

Using Lex and Yacc

Lex is a lexical analyzer generator and Yacc is a parser generator. Lex programs recognize regular expressions, and Yacc generates parsers that accept a large class of context-free grammars. The figure below shows how lex and yacc can be combined to perform the lexical analysis and parsing phases of a compiler.

            lexical rules               grammar rules
                |                           |
                |                           |
                |                           |
                |                           |
                V                           V
            ------------                -------------
            |          |                |           |
            |  lex     |                |   yacc    |
            |          |                |           |
            ------------                -------------
                |                           |
                |                           |
                |                           |
                |                           |
                V                           V
            ------------                -------------
            |          |                |           |
Input------>|  yylex   |--------------->|   yyparse |------->parsed
            |          |                |           |       input
            ------------                -------------

Lex

As the figure above shows, you must specify a set of lexical rules to lex in a source file. The general format of the source file is:
{definitions}
%%
{rules}
%%
{programmer subroutines}
The second %% is optional, but the first is required to mark the beginning of the rules. Here is a simple example that shows how to use lex. For details on lex syntax please refer to the lex manual. Consider the following source file.
digit [0-9]
digits {digit}+
whitespace [ \t\n]

%%
"[" {  printf("OPEN_BRAC\n");}
"]" {  printf("CLOSE_BRAC\n");}
"+" {  printf("ADDOP\n");}
"*" {  printf("MULTOP\n");}
{digits} { printf("NUMBER = %s\n", yytext);}
{whitespace} ;
The first three lines of this file are the definitions, and the lines following %% are the rules. This source file has no programmer subroutines. Let's examine the definitions. The first line defines "digit" to be any digit in the range 0 through 9. The second line defines "digits" to be a sequence of one or more "digit"s. The last line says that a "whitespace" is either a blank, a tab, or a newline.
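Definitions may also refer to earlier definitions. For example (purely illustrative, and not part of the source file above), a hypothetical "identifier" could be defined in terms of a "letter" definition like this:

letter [a-zA-Z]
identifier {letter}({letter}|{digit})*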

The rules section consists of a series of rules. Each rule says what action to take when the input matches its pattern. The first rule says that whenever a '[' is seen, the string "OPEN_BRAC" is to be printed out. The fifth rule says that whenever a sequence of digits is seen, the word "NUMBER" should be printed out, followed by the digits themselves. In addition, lex automatically provides you access to the text that was just matched (in the variable yytext). In this case, the number that was just matched is stored in yytext. The last rule states that when whitespace is matched, no action is to be taken.
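You could also add a programmer subroutines section after a second %%. As a minimal sketch, it might look like the following; this is only needed if you do not link with the lex library described below, which already supplies a default main() and yywrap() for you:

%%
int main(void)
{
    yylex();    /* keep reading and matching tokens until end of input */
    return 0;
}

int yywrap(void)
{
    return 1;   /* 1 means there are no further input files to scan */
}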

Let's say you defined the above source in a file called "foo.lex". Then in order to generate the lexical analyzer do the following:

lex foo.lex
cc lex.yy.c -ll
After the first command, lex creates a file called lex.yy.c, which is a C program. The second command compiles this C program and links it with the lex library (-ll), which supplies a default main(); the result is an executable file called a.out, which you can now run on some input. a.out expects its input from standard input (the terminal). So if you run a.out and type in the following string as input:
[ 1 + 2 * 3 ]
it gives the following output:
OPEN_BRAC
NUMBER = 1
ADDOP
NUMBER = 2
MULTOP
NUMBER = 3
CLOSE_BRAC
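Incidentally, if your system provides flex rather than lex (see the note on Flex and Bison at the top), the corresponding commands are usually:

flex foo.lex
cc lex.yy.c -lfl

where -lfl is flex's version of the -ll library; the exact library name can vary from system to system.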
Go ahead and try following the steps above and see what happens when you type in different inputs. Here is one other example. If you type in:
[1+2**3
it gives:
OPEN_BRAC
NUMBER = 1
ADDOP
NUMBER = 2
MULTOP
MULTOP
NUMBER = 3
Note that it just recognizes tokens from the input and it doesn't really know anything about whether the expression is "syntactically correct" or not. In order to parse the input based on a grammar that is meaningful to you, you will have to use yacc. That is described below.

Yacc

The figure above shows that you must feed yacc with grammar rules in order for it to generate a parser which can parse the input. The parser gets its input (a sequence of tokens) from the lexical analyzer (created using lex).

The grammar rules must be put into a source file which has the following format:

{declarations}
%%
{rules}
%%
{programs}
Again, the declarations and programs sections are optional. The rules section is made up of one or more grammar rules of the form:
A : BODY ;
where A represents a non-terminal name and BODY represents a sequence of zero or more names and literals. The ':' and ';' are yacc punctuation. Here is an example of how a yacc source file may look:
%start mystartsymbol

%token 	ADDOP MULTOP NUMBER OPEN_BRAC CLOSE_BRAC

%left ADDOP 
%left MULTOP

%%

mystartsymbol : expr { printf("the value of the expression is %d\n", $1);}
	;

expr :	OPEN_BRAC expr CLOSE_BRAC { $$ = $2; }
     |	expr ADDOP expr { $$ = $1 + $3 ;}
     |	expr MULTOP expr { $$ = $1 * $3 ;}
     |  NUMBER { $$ = $1; }
	;

%%
/* start of programs */
#include <stdio.h>
#include <stdlib.h>   /* for atoi() */
#include "lex.yy.c"

int main(void)
{
 return yyparse();
}

int yyerror(char *s)
{
 fprintf(stderr,"%s\n",s);
 return 0;
}
In the first section (declarations), we declare "mystartsymbol" to be the start symbol of the grammar. The %token declaration says that the symbols which follow are the terminal symbols (tokens) of the grammar. The %left (and similarly %right) declarations can be used to specify (a) the associativity of the operators and (b) their precedence. In this case, ADDOP and MULTOP are declared to be left associative, and MULTOP is given a higher precedence than ADDOP (tokens declared on a later %left or %right line bind more tightly), which resolves ambiguities in the grammar.

The first rule says that a mystartsymbol can be derived from an expr. The "value" of the symbol on the left hand side (LHS) is denoted by $$. The values of the symbols on the RHS are denoted by $1, $2, $3, and so on, numbered from left to right. So in the first rule, the expr on the RHS is denoted by $1. Whatever appears between '{' and '}' is the action to be taken when that rule is matched.

The second rule (which actually consists of four different rules) describes the different ways in which an expr can be parsed. The first one says that any expr that occurs between an OPEN_BRAC and a CLOSE_BRAC is also an expr, and its value is the same as that of the expr on the RHS. The next rule says that an expr followed by ADDOP followed by an expr is also an expr, and its value is the sum of the two exprs on the RHS (the MULTOP rule is analogous, with multiplication). The last rule says that an expr can be derived from a NUMBER, and its value is the value of that NUMBER.

The programmer is responsible for providing two functions, main() and yyerror(char *), which are needed to start the parser and to report errors, respectively. Note that the file "lex.yy.c" (which is the output of running lex on the lex source file) has been included in the programs section. This is one way of combining lex and yacc -- the output of the lexical analyzer is fed directly to the parser, as shown in the figure above.
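As an aside (this alternative is not used in these notes), the two generated files can also be compiled separately instead of using the #include: running yacc with its -d flag produces a header file y.tab.h containing the token definitions, which the lex source file can then include. Roughly:

yacc -d expr.yacc
lex expr.lex
cc y.tab.c lex.yy.c -ll

In that setup the lex file would need a %{ #include "y.tab.h" %} block in its definitions section so that it knows the token codes, and the #include "lex.yy.c" line above would be removed.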

In order for the above yacc program to work correctly, modify the lex program described earlier so that it now looks like this:

digit [0-9]
digits {digit}+
whitespace [ \t\n]

%%
"[" {  return (OPEN_BRAC);}
"]" {  return (CLOSE_BRAC);}
"+" {  return (ADDOP); }
"*" {  return (MULTOP);  }
{digits} { yylval = atoi(yytext); return (NUMBER); }
{whitespace} ;
The printf's have been removed and instead we just return the appropriate token. Also, when the token is a NUMBER, we would like to return the associated value as well, which we do by setting the pre-defined variable yylval. atoi is a standard C library function (declared in <stdlib.h>) that converts a string to an integer (e.g. "12" gets converted to 12).
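The examples here rely on the default situation, in which yylval and all of the $-values are plain ints. The yacc manual describes a %union declaration for the case where you need other value types later on; a hypothetical sketch (not needed for the expression example above) looks like:

%union {
	int    ival;
	double dval;
}
%token <ival> NUMBER
%type  <dval> expr

with a matching assignment such as yylval.ival = atoi(yytext); in the corresponding lex rule.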

Play around with the above lex and yacc programs and try to understand how the lexical analyzer and the parser work and how they interact. To do this, you will need to do the following:

  1. Create a file called expr.lex that contains the lex program just described.
  2. Create a file called expr.yacc that contains the yacc program described above.
  3. Run the commands:
    lex expr.lex
    yacc expr.yacc
    
  4. This will create a file called y.tab.c. Now compile this file by:
    cc y.tab.c -ll
    
  5. This creates a file called a.out, which is the parser integrated with the lexical analyzer.
  6. Create an input file baz.input that contains some input like [ 1 + 2 *3].
  7. Run the program as follows (the expected output is shown after these steps):
    a.out < baz.input	
    
  8. Repeat the previous step with different input files.
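For the sample input [ 1 + 2 *3] from step 6, the parser should print something like:

the value of the expression is 7

(7 rather than 9, because MULTOP was given higher precedence than ADDOP in the yacc declarations.)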

Remarks

There are a number of details that have been left out in the above examples. Also, both lex and yacc have many features that don't appear in these examples, which you will need to use later on. Read the lex/flex and yacc/bison manuals and get familiar with the syntax and behavior of both. In order to understand lex and yacc, it helps to know a little bit of C.

The lex source files described above are available here. The yacc source file is available here. Here is a shell script that can be used to compile the above lex and yacc files. Make sure that the script is executable before you try to run it.
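In case that script is not available to you, a script along the following lines (a sketch, not necessarily identical to the one linked above) would do the job:

#!/bin/sh
lex expr.lex
yacc expr.yacc
cc y.tab.c -ll

Remember that you can make it executable with chmod +x.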