LEX AND YACC - COMPILER WRITER'S TOOLS FOR TURBO PASCAL
                             VERSION 2.0

                            Albert Graef
                            FB Mathematik
                Johannes Gutenberg-Universitaet Mainz

                          November 22, 1989

             (ASCII version of manual: January 12, 1990)


                               ABSTRACT

We describe a reimplementation of the compiler writer's tools Lex  and
Yacc for Borland's Turbo Pascal, running under MS-DOS. These  programs
are useful tools for the development of compilers, lexical  analyzers,
and similar applications, and are intended for experienced Turbo  Pas-
cal programmers with a background in compiler design, and for  courses
in compiler construction.

Note: This  is a  raw ASCII  facsimile of  the original  Lex and  Yacc
manual contained in the TeX DVI file MAN.DVI on the distribution disk.

CONTENTS

Introduction
1. Installation
2. Lex
   2.1 Regular Expressions
   2.2 Actions
   2.3 Lex Library
   2.4 Character Tables
   2.5 Turbo Pascal Tie-ins
   2.6 Implementation Restrictions, Bugs, and Incompatibilities
3. Yacc
   3.1 Actions
   3.2 Lexical Analysis
   3.3 Yacc Library
   3.4 Ambiguity
   3.5 Error Handling
   3.6 Arbitrary Value Types
   3.7 Debugging
   3.8 Yacc Language Grammar
   3.9 Additional Features, Implementation Restrictions and Bugs
Conclusion
References
Appendix

INTRODUCTION

This manual  describes two  popular compiler  writer's tools,  Lex and
Yacc, which have seen extensive use on the UNIX system, and have  been
reimplemented by the author for the MS-DOS operating system.

The original (UNIX) versions of these programs are described in [2,3].
Other public domain and commercial remakes of Lex and Yacc are  avail-
able under MS-DOS, e.g. from DECUS and Mortice Kern Systems.  However,
in difference to these implementations, the programs described in this
manual are for use with Borland's Turbo Pascal, rather than with C. In
particular,  they  support  Turbo  Pascal  as  their  host  and target
programming language.

The Turbo Pascal Lex and Yacc versions are an independent  development
of the author, not containing any fragments from the original sources,
or other UNIX stuff (happily,  the theory underlying Lex and  Yacc has
been published, e.g., in [1], und thus is public domain). However, the
names Lex and Yacc (which, as far as I know, are not copyrighted)  are
justified by the fact that the programs described here are intended to
be approximately compatible with the original versions.

The intended audience for  this manual are experienced  (Turbo Pascal)
programmers with knowledge of the  basics of formal language and  com-
piler  theory  and  students  in  compiler  construction  courses, not
novices or casual programmers. Thus, the manual is particularily  con-
cise and compact, while trying to cover all essential aspects of Turbo
Pascal Lex and Yacc.

As  a  supplementary  text,  we  strongly recommend the famous "dragon
book"  of  Aho,  Sethi  and  Ullman  [1]  which  covers  all important
theoretical and practical aspects  of compiler design and  implementa-
tion.

The manual is organized as follows: Section 1 covers installation  re-
quirements and the installation process. Section 2 treats the  lexical
analyzer  generator  Lex;  subsections  are  devoted  to the format of
regular expressions,  the implementation  of actions,  the Lex library
unit, character tables, Turbo Pascal tie-ins, implementation  restric-
tions, bugs,  and incompatibilities.  Section 3  discusses the  parser
generator Yacc; subsections  deal with actions  in Yacc grammars,  the
lexical  analyzer  routine  used  by  Yacc-generated parsers, the Yacc
library unit, ambiguities in Yacc grammars, syntactic error  handling,
arbitrary value types in actions, the debugging of Yacc-generated par-
sers, the Yacc language syntax, and, finally, additional features, im-
plementation restrictions and bugs.  The appendix contains short  des-
criptions of Lex and Yacc in the style of UNIX manual pages.

Note: Throughout the manual, the terms Lex and Yacc refer to the Turbo
Pascal versions. The original (UNIX) versions are denoted by the terms
UNIX or standard Lex resp. Yacc.

1. INSTALLATION

Installation requirements:
- IBM PC/XT/AT or compatible, 512 KB RAM
- MS-DOS 3.10 or later (may also run under MS-DOS 2.x, but I have  not
  tested it)
- Turbo Pascal 4.0 or later (has been tested with 4.0 and 5.0)

To install Turbo Pascal Lex and Yacc, simply copy the contents of  the
distribution disk to an  appropriate disk and/or directory.  You might
also  wish  to  put  this  directory  on  your  DOS path. The programs
generated with  Lex and  Yacc will  need the  LexLib and YaccLib units
(.tpu files on the distribution disk) when compiled, so you might have
to put them  anywhere the Turbo  Pascal compiler finds  them (e.g., in
the turbo.tpl library).

Here's the contents of the distribution disk:
lex.exe     the Lex program
lexlib.*    source and .tpu file for the LexLib unit
yacc.exe    the Yacc program
yacclib.*   source and .tpu file for the YaccLib unit
read.me     if present, contains addenda to the manual
makefile    makefile for the sample programs
*.l, *.y    sample Lex and Yacc programs
man.dvi     TeX dvi file for the manual

As shipped, the LexLib and YaccLib units are compiled with Turbo  Pas-
cal 4.0. If you're running Turbo Pascal 5.0 or later, you will have to
recompile lexlib.pas and  yacclib.pas with your  version of the  Turbo
Pascal compiler.

You can use the  makefile to compile the  sample programs on the  dis-
tribution disk (see the Turbo Pascal manual for a description of  Bor-
land's make, and the makefile for  a description of its usage and  the
sample programs).

To run Turbo Pascal Lex and Yacc on your grammar sourcefiles, refer to
the manual pages in the appendix for a description of the Lex and Yacc
command formats.

2. LEX

Lex is a program to generate lexical analyzers from a given set of in-
put patterns, specified as regular expressions. Table 1 summarizes the
regular expressions Lex  recognizes. In this  table, c stands  for any
single character, r for a regular expression, and s for a string.

Expression  Matches                        Example
--------------------------------------------------
c           any non-operator character c   a
\c          character c literally          \*
"s"         string s literally             "**"
.           any character but newline      a.*b
^           beginning of line              ^abc
$           end of line                    abc$
[s]         any character in s             [abc]
[^s]        any character not in s         [^abc]
r*          zero or more r's               a*
r+          one or more r's                a+
r?          zero or one r                  a?
r{m,n}      m to n occurrences of r        a{1,5}
r1r2        r1 then r2                     ab
r1|r2       r1 or r2                       a|b
(r)         r                              (a|b)
r1/r2       r1 when followed by r2         abc/123
--------------------------------------------------
Table 1: Lex regular expressions (taken from [1, fig. 3.48]).

A Lex  program, or  grammar, in  general, consists  of three  sections
separated with the delimiter %%:
   definitions
   %%
   rules
   %%
   auxiliary procedures

Both definitions  and rules  section may  be empty,  and the auxiliary
procedures section may be omitted, together with the second %%.  Thus,
the minimal Lex program is
   %%
(no definitions, no rules).

The rules section of a Lex  program is a table of regular  expressions
and corresponding actions,  specifying patterns to  be matched in  the
input, and  (Turbo Pascal)  program statements  to be  executed when a
pattern has been matched:
   expression        statement;
   ...

Here,  expression  and  statement  are  delimited with whitespace. The
statement must be a single  Turbo Pascal statement (use begin  ... end
for compound statements) terminated with a semicolon; if the statement
consists of multiple  lines, the continuation  lines must be  indented
with  at  least  one  blank  or  tab  character. An action may also be
replaced by the symbol |, in which case it is assumed that the  action
for the current rule is the same as that for the next one.

As already indicated, the auxiliary procedures section is optional. If
it is present, it is assumed to contain valid Turbo Pascal code  (such
as supplementing routines, or a  main program) which is simply  tacked
on to the end of the output (Turbo Pascal) program Lex produces.

The definitions section of a  Lex program may contain regular  defini-
tions of the form
   name       expression
defining a  (regular expression)  substitution for  an identifier name
(according  to  Turbo  Pascal  syntax).  name  and  expression must be
separated by whitespace. Note that in difference to Pascal, upper- and
lowercase in identifiers is always distinct.

The value of a regular definition for name can be referred to  lateron
using the notation {name}. Thus, regular definitions provide a sort of
"constant declaration" for regular expressions.

From the source  grammar, Lex produces  an output program,  written in
Turbo Pascal, that defines a parameterless function
   function yylex : integer;
implementing the lexical analyzer.  When called, yylex reads  an input
file (standard input, by default), scanning for the patterns specified
in the source grammar, and executing the corresponding actions as pat-
terns are matched.

Normally, yylex scans the whole input file and returns with value 0 to
the calling  program upon  encountering end-of-file  (actions may also
return other  values to  the calling  program, cf.  2.2). Thus, in the
normal case, a suitable main program calling yylex is something like:
   begin
     if yylex=0 then { done }
   end.

Such a main program must be  supplied by the programmer, e.g., in  the
auxiliary procedures section (there is no default main program in  the
Lex library, as with UNIX Lex).

The   lexical   analyzer   routine   yylex   scans  for  all  patterns
simultaneously. If more than one pattern is matched, the longest match
is preferred; if there still remains more than one pattern making  the
longest match, the  first such rule  in the source  grammar is chosen.
This makes rules like
   if                       writeln('keyword if');
   [A-Za-z][A-Za-z0-9]*     writeln('identifier ', yytext);
work as expected (i.e., input if will be matched by the first, if1  by
the second rule).

A Lex program  may also be  incomplete in that  it does not  specify a
pattern for any possible input.  In such a case, the  lexical analyzer
executes a default  action on unrecognized  parts of the  input, which
consists of copying the input  to an output file (standard  output, by
default). Thus, the trivial Lex program
   %%
yields a routine that copies  the input to the output  file unchanged.
On the other  hand, if the  input has to  be absorbed completely,  the
programmer must supply rules that match everything, e.g.:
   .         |
   \n        ;

Example: The  following Lex  program counts  words (sequences  of non-
whitespace characters) in an input file:
        uses LexLib;
        var count : integer;
   %%
   [^ \t\n]+    inc(count);
   .            |
   \n           ;
   %%
   begin
     count := 0;
     if yylex=0 then writeln('word count: ', count)
   end.

A few remarks about the generated lexical analyzer routine are in  or-
der. Lex generates table-driven lexical analyzers in DFA technique [1,
section 3.7]  which usually  are both  quite compact  and fast (though
hand-coded  lexical  analyzers  will  often  be  more  efficient).  In
particular, the matching time is proportional to the length of the in-
put, unless ambiguity and  lookahead requires a significant  amount of
rescanning. There are  certain pathological regular  expressions which
cause exponential growth of the DFA table, however, they are rare.

Lex-generated lexical  analyzers interface  nicely with  Yacc, because
the  yylex  routine  just  meets  Yacc's  requirements  of its lexical
analyzer routine (actually, it was  designed that way). Thus, a  yylex
routine prepared with  Lex can be  incorporated directly into  a Yacc-
generated parser.

2.1 REGULAR EXPRESSIONS

In the Lex language regular expressions are used to denote string pat-
terns to be matched in the input stream. The basic regular expressions
are (cf. table 1):
- single characters: c stands for the literal character c itself,  un-
  less it is an  operator character (one of  *, +, ?, etc.,  discussed
  below), in which case it must  be quoted with the backslash \.  Non-
  printable characters are specified  using the C-like escapes  listed
  in table 2. Note that \0 marks end-of-file, and \n (newline)  stands
  for end-of-line, i.e. the sequence carriage-return/line-feed in  MS-
  DOS text files.

  Escape  Denotes
  ----------------------------------
  \b      backspace
  \t      tab
  \n      newline
  \nnn    character no. nnn in octal
  \\      backslash
  \c      character c literally
  ----------------------------------
  Table 2: Lex character escapes.

- strings:  "s", where  s is  any character  sequence, stands  for the
  string s. To embed the double quote " in a string, it must be quoted
  with \.
- character  classes: [s]  stands for  all characters  in s,  and [^s]
  denotes the complement of [s]. A - sign in a character class denotes
  ranges, e.g. [a-z] is the class of all lowercase letters. The period
  . is an abbreviation for the class of all characters except newline,
  i.e. [^\n]. Note that a  character class never contains the  end-of-
  file marker \0, unless it is explicitly included.

From these basic  elements, larger regular  expressions may be  formed
using the following operators:
- r*: stands for an arbitrary sequence of r's (0 or more), where r  is
  any regular expression.
- r+: stands for 1 or more r's.
- r?: stands for 0 or 1 r.
- r{m,n}: stands  for m to  n r's (where  m and n  are nonnegative in-
  tegers). r{m} denotes excactly m r's.
- r1r2: stands for r1, followed by r2, where r1 and r2  are  arbitrary
  regular expressions.
- r1|r2: stands for r1 or r2.

The operators have been listed in order of decreasing precedence, i.e.
*, +, ?  and {m,n} bind  stronger than r1r2  (concatenation), which in
turn precedes over | (alternation). Parentheses ( ... ) can be used to
group regular expressions and override default precedences.

As already  mentioned, subexpressions  may be  abbreviated with names,
using regular definitions. If name  has been defined as an  expression
r, {name} specifies the regular expression r. Note that (in difference
to UNIX Lex)  the substituted expression  r must always  be a complete
legal regular expression, which is actually substituted as an  expres-
sion,  not  textually.  This  implies  that  {name}  is  treated  as a
parenthesized expression. Also note, that any references to name  must
follow its definition, i.e. recursive definitions are illegal.

Lex also supplies a number of operators that are used to specify  left
and right context.  The context does  not actually belong  to the pat-
tern, but determines whether the pattern itself may be matched.

Right context, or lookahead is specified using the lookahead  operator
/. r1/r2 stands for  r1, but only if  followed by r2, where  r1 and r2
are arbitrary regular expressions (which may, however, not contain the
lookahead operator themselves). r$ may be used as an abbreviation  for
r/[\0\n], i.e. r followed by line end or end-of-file.

The caret ^ stands for the  beginning of the line, and thus  marks im-
mediate left context. More distant left context may be specified using
user-defined start states. The expression <s1,...,sn>r denotes a  pat-
tern  r  that  is  valid  (i.e.,  may  be matched) only if the lexical
analyzer is in any of the start states s1,...,sn. This requires one or
more start state declarations in the definitions section that list all
used start state  identifiers (which, like  expression names, must  be
legal Turbo Pascal identifiers).
   %start s1 s2 ... sn
and
   %start s1
   ...
   %start sn
are examples of valid start state declarations.

By default, the lexical analyzer is in the default start state,  which
has number  0. Lex  also assigns  unique numbers  to all  user-defined
start states, and the begin_ routine  (cf. 2.2) can be used in  an ac-
tion or any other routine to  put the lexical analyzer in the  desired
start state.

Start states are useful when certain patterns have to be analyzed dif-
ferently, depending on some left context (such as a special  character
at the beginning of the line), or when multiple lexical analyzers  are
working in  concert. Note  that a  rule without  start state prefix is
valid in the default start state, as well as in any user-defined start
state. This may be used to factor out patterns that are to be  matched
in either user-defined state.

All the  context operators  may only  appear in  rules, not in regular
definitions,  and  they  may  appear  only  once.  Of  course, context
operators may be combined, as in <s>^r1/r2 which denotes a pattern  r1
that may only be matched if it  occurs at the beginning of a line,  is
followed by an instance of r2, and if the lexical analyzer is in user-
defined start state s.

2.2 ACTIONS

A rule specifies a regular  expression pattern and an action  (a Turbo
Pascal statement) to be executed when the pattern is matched. Lex sup-
plies a number of variables and routines useful in the programming  of
actions:
- yytext: a string variable, containing the matched text.
- yyleng:  the length  of yytext,  i.e. length(yytext).  Note that the
  first and last character in yytext are yytext[1] and yytext[yyleng],
  respectively.
- yylineno: the current line  number in the input file;  useful, e.g.,
  in giving diagnostics.
- yymore:  appends the  next match  to the  current one (normally, the
  next match will overwrite the current one).
- yyless:  the  counterpart  of  yymore;  yyless(n) causes the current
  match to be  restricted to the  first n characters,  and returns the
  remaining characters at  the end of  the match to  be reread by  the
  lexical analyzer. This supplies  a crude form of  "lookahead". Since
  Lex also supports a more  general form of lookahead (cf.  2.1), this
  routine is largely obsolete.
- reject: rejects the current match and executes whatever was  "second
  choice" after  the current  rule, adjusting  the match  accordingly.
  This routine is useful when all (possibly overlapping) instances  of
  patterns have  to be  detected; see  the digram  program on the dis-
  tribution disk for an example.
- return: return(n), n an  integer value, causes yylex to  return with
  the  indicated  value.  This  routine  is  useful  when  the lexical
  analyzer is used to  partition the input file  for a parser; n  will
  then typically denote a token number (cf. 3.2).
- begin_: begin_(s)  puts the lexical  analyzer in start  state s (cf.
  2.1), where s  is either 0  (default start state)  or names a  user-
  defined start state.

These, and  other Lex-supplied  variables and  routines are  also dis-
cussed in  the interface  of the  LexLib unit  (file lexlib.pas on the
distribution disk).

2.3 LEX LIBRARY

The LexLib (Lex library) unit supplies the input/output routines  used
by the lexical analyzer. The I/O files are implemented as Pascal  text
files  yyin  and  yyout.  These  are  -  by  default - assigned to the
redirectable MS-DOS standard  input/output devices. However,  the user
program may also assign them to any suitable files/devices.

yylex accesses the input/output files through the following routines:
- function input : char;
  returns the next character from the input file.
- procedure unput(c : char);
  returns character c to the input buffer to be reread by a subsequent
  call to input.
- procedure output(c : char);
  appends character c to the output file.

All references  to the  yyin/yyout files  of the  LexLib unit are made
through these three  routines. Thus, a  user program may  well replace
them altogether  by other  routines matching  the specifications. This
makes it  possible for  the lexical  analyzer to  access arbitrary in-
put/output streams, such as special devices or internal memory.

The  LexLib  I/O  routines  and  files  may  also be accessed directly
through actions or the main program. However, care must be heeded  un-
der certain circumstances.  In particular, direct  access to the  yyin
file will bypass the buffering of unput characters and thus may  some-
times not have the desired results.

The yywrap routine is a parameterless boolean function that determines
whether the lexical analyzer  should perform normal wrapup  at end-of-
file. If it  returns true, normal  wrapup is performed;  if it returns
false,  yylex  ignores  the  end-of-file  mark  and  continues lexical
analysis. The LexLib unit supplies  a default version of yywrap  which
always returns true. This routine may be replaced by a customized ver-
sion that  does application  dependent processing  at end-of-file.  In
particular, yywrap  may arrange  for more  input and  return false  to
resume lexical analysis (see the findproc program on the  distribution
disk for an example).

Note that the LexLib unit must be loaded with (almost) any Lex program
such  that  the  lexical  analyzer  routine  may  access I/O and other
routines. To achieve this, the line
   uses LexLib;
should be put at the beginning of the Lex program (see section 2.5  on
how to incorporate Turbo Pascal source lines into the definitions sec-
tion of a Lex program).

Refer also to the LexLib interface description contained in lexlib.pas
on the distribution disk for a discussion of the LexLib I/O system and
other routines.

2.4 CHARACTER TABLES

The standard character encoding supported  both by Lex and the  LexLib
unit is ASCII  (to be more  precise, IBM's 8-bit  extension of ASCII).
However the user may supply his own versions of input, unput and  out-
put (cf.  2.3), supporting  their own  character encoding.  If such  a
customized character set is used, Lex  must be told about it by  means
of a character table in the definitions section of the Lex program.

This table has the format:
   %T
   charno.     string
   ...
   %T

Each line  of the  character table  lists a  character number  (in the
target code)  and the  corresponding ASCII  representation(s) of  this
character. The usual escapes  for non-printable characters are  recog-
nized (cf. table 2).

Example: To map the lower-  and uppercase letters into  1,...,26,  the
digits  0,...,9 into codes  27,...,36, and newline into  37, the  fol-
lowing table may be used:

   %T
   1              Aa
   2              Bb
   3              Cc
        ...
   26             Zz
   27             0
   28             1
        ...
   36             9
   37             \n
   %T

If a character table is used, all characters (at least those  actually
used in the Lex grammar) should be in the table; all character numbers
must be byte values (0..255), and no character may be mapped into  two
different codes.

2.5 TURBO PASCAL TIE-INS

Frequently, a Lex program will  not merely consist of definitions  and
rules,  but  also  use  other  routines  to be loaded with the lexical
analyzer. One example is a main program that calls the yylex  routine.
We have already  mentioned, that such  a main program,  and other sup-
plementary routines, may be placed into the auxiliary procedures  sec-
tion.

For  other  target  language  (i.e.  Turbo  Pascal)  tie-ins,  the Lex
language allows arbitrary code fragments to be included in the defini-
tions and at the beginning of the rules section of the Lex grammar, by
the following means:
- Any line in the source grammar starting in column one is assumed  to
  contain Lex code (definitions or rules), which is processed by Lex.
- Any line indented with at least one space or tab character, and  any
  sequence of lines enclosed between  %{ and %} is assumed  to contain
  Turbo Pascal code, and is copied to the output file unchanged.

Code in the  definitions section is  inserted at the  beginning of the
output program, at  global scope, while  code at the  beginning of the
rules  section  is  inserted  as  local  declarations  into the action
routine that contains the actions of all rules. Thus, an indented line
   var i : integer;
at the beginning of the rules section will declare an integer variable
local to the action statements.

As a side-effect, these conventions allow comments to be included in a
Lex program;  comments should  then follow  host language  (i.e. Turbo
Pascal) syntax.

Example: A  typical setup  for a  Lex program  with necessary declara-
tions, supplementary routines, and main program may look as follows:
         uses LexLib;
   { global declarations }
   { Lex definitions }
   %%
   { local declarations }
   { Lex rules }
   %%
   { supplementary routines }
   begin { main program }
     ...
     if yylex=0 then { done };
     ...
   end.

2.6 IMPLEMENTATION RESTRICTIONS, BUGS, AND INCOMPATIBILITIES

Lex poses some restrictions on the sizes of the input grammar and  in-
ternal tables. Maximum table sizes are  printed out by Lex at the  end
of a translation, along with  statistics about actual table space  re-
quirements. An error message table overflow can also indicate that not
enough main memory is available  to Lex (possibly because of  too many
programs loaded into memory).

Since yytext  is implemented  as a  Turbo Pascal  string variable, the
maximum size for a matched string is 255.

As implemented, the reject routine  (cf. 2.2) does not rescan  the in-
put, but uses  internal state information  to determine the  next pos-
sible  match.  Thus  reject  combined  with modifications of the input
stream may yield unexpected results.

There is a subtle  (and, as far as  I know, undocumented) bug  in Lex,
concerning certain types of lookahead patterns, that sometimes  causes
lookahead not to be restored properly. E.g., when the pattern ab*/b is
matched, the last b  is never returned to  the input, but instead  the
whole matched sequence will be returned in yytext. This actually is  a
"misfeature", and seems to be  comformant with UNIX Lex and  even with
the method of Aho/Sethi/Ullman for handling the lookahead operator [1,
section 3.8], which Lex' treatment of the lookahead operator is  based
on.

When  called,  yylex  partially  initalizes  itself. This implies that
yymore  and  reject  will  not  work  between different invokations of
yylex.

As discussed  in 2.1,  Lex substitutes  expressions, not  text, when a
regular definition is referred to with the {name} notation. This is in
contrast with the (textual) macro  expansion scheme used in UNIX  Lex.
Although the approach taken in  Turbo Pascal Lex is more  restrictive,
we feel  that it  is actually  an improvement.  In particular, regular
definitions can be parsed  and checked for validity  immediately, such
that errors  in regular  definitions can  be detected  as soon as pos-
sible. Also, the meaning of  a regular definition is guaranteed  to be
independent of the context in which it is used.

Another (minor) difference  between Turbo Pascal  and UNIX Lex  in the
syntax of regular definitions is that Turbo Pascal Lex requires  names
for  regular  expressions  to  be  legal  (Turbo  Pascal) identifiers,
whereas UNIX Lex admits arbitrary character strings as names.

3. YACC

Yacc ("Yet Another Compiler-Compiler")  is a parser generator,  i.e. a
program that translates the specification of an input language by  its
BNF (Backus Naur  Form) grammar into  a parser subroutine,  written in
Turbo Pascal.

Similar to Lex, a Yacc program or grammar has the form
   definitions
   %%
   rules
   %%
   auxiliary procedures
where the first  section contains the  definitions of the  basic input
symbols (terminals, also termed tokens) of the specified language, the
second section the  grammar rules for  the nonterminal symbols  of the
language, and the third, and optional, section contains any additional
Turbo Pascal code, such as supplementary routines and a main  program,
which will be tacked on to the end of the Turbo Pascal output program.

The rules section of  a Yacc program is  simply a BNF grammar  for the
target language, possibly  augmented with actions,  program statements
to be executed as certain  syntactic constructs are recognized by  the
parser. By  default, the  left-hand side  of the  first rule marks the
start symbol of the  grammar. It is also  possible to declare a  start
symbol explicitly by means of a declaration of the form
   %start A
in the definitions section, where A is the desired start symbol.

Grammar rules have the general format
   A : b1 ... bn;
where A is the left-hand side  nonterminal of the rule, and b1  ... bn
is the (possibly  empty) right-hand side  sequence of nonterminal  and
terminal symbols bi.

The terminating semicolon may be omitted, and several rules A : ui for
the same left-hand side nonterminal A may be abbreviated as
   A : u1
     | u2
     ...
     | un

Nonterminal symbols are denoted  by identifiers (letters, followed  by
digits and letters, where underscore _ and period . count as  letters,
and upper- and lowercase are distinct).

Terminal symbols may either be literals (single characters enclosed in
single  quotes)  or  identifiers  that  are  declared  explicitly   as
terminals by a %token definition of the form
   %token a1 ... an

By convention,  token identifiers  are given  in uppercase,  such that
they can be distinguished easily from nonterminal symbols.

In literals, the usual C-like escapes are recognized (cf. table 3).

Escape    Denotes
------------------------------------
'\n'      newline
'\r'      carriage return
'\''      single quote '
'\\'      backslash
'\t'      tab
'\b'      backspace
'\f'      form feed
'\nnn'    character no. nnn in octal
------------------------------------
Table 3: Yacc character escapes.

Grammar rules may be  augmented with actions, Turbo  Pascal statements
enclosed between { ... }. Usually, actions appear at the end of rules,
indicating the statements  to be executed  when an instance  of a rule
has been recognized (cf. 3.1).

The Yacc language is free-format:  blanks, tabs, and newlines are  ig-
nored, except when  they serve as  delimiters. Yacc language  comments
have the format:
   /* ... anything except */ ... */

As  with  Lex,  host  language  (i.e.  Turbo  Pascal)  tie-ins  may be
specified by enclosing them in %{ ... %}. Such code fragments will  be
copied unchanged,  and inserted  into the  output file  at appropriate
places (code in the definitions  section at global scope, code  at the
beginning of  the rules  section as  local declarations  of the action
routine).

The class of grammars accepted by Yacc is LALR(1) with  disambiguating
rules (cf. [1, sections 4.7  and 4.8]). Yacc can successfully  produce
parsers for a large class  of formal languages, including most  modern
programming languages (under UNIX, Yacc has been used to produce  par-
sers for C, Fortran, APL, Pascal, Ratfor, Modula-2, and others).

From the source grammar, Yacc  produces an output file containing  the
parser routine
   function yyparse : integer;
together  with  any  additional  Turbo  Pascal  code  supplied  by the
programmer.

yyparse repeatedly calls  a lexical analyzer  routine yylex to  obtain
tokens from the input file, and parses the input accordingly,  execut-
ing appropriate actions as instances of grammar rules are  recognized.
yyparse returns with a value of 0 (successful parse terminated at end-
of-file)  or  1  (fatal  error,  such  as parse stack overflow, or un-
recoverable syntax error).

Thus, a main program  that calls the parser  routine may look as  fol-
lows:
   begin
     ...
     if yyparse=0 then { done } else { error };
     ...
   end.

Main program  and lexical  analyzer routine  must be  supplied by  the
programmer. The yylex routine can  also be prepared with Lex  and then
loaded with the parser, cf. 3.2. The main program is usually  included
at the end of the auxiliary procedures section.

The following is  an example of  a Yacc grammar  for simple arithmetic
expressions. Note that  the symbol NUMBER  is declared as  a token ex-
pected to be returned by the  lexical analyzer as a single input  sym-
bol.

   %token NUMBER
   %%
   expr      : term
             | expr '+' term
             ;
   term      : factor
             | term '*' factor
             ;
   factor    : NUMBER
             | '(' expr ')'
             | '-' factor
             ;

A Lex program implementing the lexical analyzer for this Yacc  program
may look as follows:
         {$I expr.h} { definition of token numbers produced by
                       Yacc }
   %%
   [0-9]+         return(NUMBER);
   .              |
   \n             return(ord(yytext[1]));
                    { other literals returned as their character
                      codes }

3.1 ACTIONS

As already indicated,  grammar rules may  be associated with  actions,
program statements that  are executed as  rules are recognized  during
the parse. Among other things,  actions may print out results,  modify
internal data structures such as a symbol table, or construct a  parse
tree. Actions  may also  return a  value for  the left-hand  side non-
terminal, and process values  returned by previous actions  for right-
hand side symbols of the rule.

For this purpose, Yacc assigns to each symbol of a rule a  correspond-
ing value (of type integer, by default, but arbitrary value types  may
be declared, cf. 3.6): $$ denotes the value of the left-hand side non-
terminal, $i the value of  the ith right-hand side symbol.  Values are
kept on a stack maintained by  the parser as the input is  parsed. The
lifetime of  a value  begins when  the corresponding  syntactic entity
(nonterminal or terminal) has been recognized by the parser, and  ends
when the parser  reduces by an  enclosing rule, thereby  replacing the
values associated with the right-hand side symbols by the value of the
left-hand side nonterminal.

Nonterminals A obtain their values through assignments of the form  $$
:= v, where A is the  left-hand side of the corresponding rule,  and v
is some value,  usually obtained by  a function applied  to the right-
hand side values $i.

Terminals may also have associated values; these are set by the  lexi-
cal analyzer through an assignment to the variable yylval supplied  by
Yacc (cf. 3.2), as necessary.

As an example, here is an extension of the arithmetic expression gram-
mar, featuring actions that evaluate  the input expression and a  rule
for the new start symbol line  that is associated with an action  that
prints out the obtained result.

   %token NUMBER
   %%
   line      : expr '\n'         { writeln($1) }
             ;
   expr      : term              { $$ := $1 }
             | expr '+' term     { $$ := $1 + $3 }
             ;
   term      : factor            { $$ := $1 }
             | term '*' factor   { $$ := $1 * $3 }
             ;
   factor    : NUMBER            { $$ := $1 }
             | '(' expr ')'      { $$ := $2 }
             | '-' factor        { $$ := -$2 }
             ;

Note that the lexical analyzer  must set the values of  NUMBER tokens,
which are referred to by $1 in the rule factor : NUMBER. One can use a
Lex rule like
        var code : integer;
   [0-9]+         begin
                    val(yytext, yylval, code);
                    return(NUMBER)
                  end;
that applies  the Turbo  Pascal standard  procedure val  to evaluate a
NUMBER token.

Actually, we could have omitted the  "copy actions" of the form $$  :=
$1 in the above grammar, since this is the default action automatical-
ly assumed by Yacc for any rule without explicit action.

Yacc also allows actions within rules, i.e. actions that are to be ex-
ecuted before a rule has been fully parsed. A rule like
   A : b { p; } c
will be treated as if it was written
   A : b $act c
   $act : { p; }
introducing  a  new  nonterminal  $act  matched to an empty right-hand
side, and associated with the desired action.

In particular, the action { p; } is treated as if it was a  (nontermi-
nal) grammar symbol,  and thus can also return a value accessible with
the usual $i notation.  The  action  itself  may also access values of
symbols (and other actions) to the left of it. Thus, the rule  A : b {
p; } c is actually treated as if it consisted of three right-hand side
symbols; $1 denotes the value of b, $2  the value of { p; } (set in  p
by an assignment to $$) and $3 is the value of c.

Yacc's syntax-directed evaluation  scheme makes it  particularily easy
to implement synthesized attributes  along the guidelines of  [1, sec-
tion 5.6]. The  evaluation of inherited  attributes can often  also be
simulated by making use of  marker nonterminals, see also [1].  To ac-
cess marker nonterminals outside the  scope of the rule to  which they
belong, Yacc also supports the  notation $i, where i<=0, indicating  a
value to the left of the current rule, belonging to an enclosing  rule
($0 denotes the first symbol to the left, $-1 the second, ...).

Consider
   A : b B
     | b b' { $$ := $1 } B
   B : c { $$ := f($0) }

The anonymous marker nonterminal implemented by the action { $$ :=  $1
} assures that the value of b can always be accessed through $0 in the
third rule. Note that without use of the marker nonterminal, the rela-
tive position of b's value on the stack would not be predictable.

Actions within rules, and access to values in enclosing rules,  supply
flexible  means  to  implement  syntax-directed  evaluation   schemes.
However, care must be heeded that such actions do not give rise to un-
wanted parsing conflicts caused by ambiguities (cf. 3.4).

3.2 LEXICAL ANALYSIS

Yacc-generated parsers use a lexical analyzer routine yylex to  obtain
tokens from the input file. This routine must be supplied by the user,
and is assumed to return an integer value denoting a basic input  sym-
bol. 0 (or negative)  denotes end-of-file, and character  literals are
denoted by their character code. Usually, all other token numbers  are
assigned by Yacc automatically, in  the order in which %token  defini-
tions appear in  the grammar. Token  numbers may also  be assigned ex-
plicitly, by a definition of the form
   %token a n
where a is the terminal symbol  (literal or identifier), and n is  the
desired token number.

If there is a value associated with an input symbol, yylex should  as-
sign this value to the variable yylval supplied by Yacc. Usually, yyl-
val has type integer, but this default can be overwritten (cf. 3.6).

Declarations shared  by parser  and lexical  analyzer are  put in  the
header (.h) file Yacc generates along with the (.pas) output file con-
taining the parser routine. The header file declares the yylval  vari-
able and lists the token numbers; each token identifier is declared as
a corresponding integer constant.

The header file should thus be  included in a context where it  is ac-
cessible by both  the parser and  the lexical analyzer.  For instance,
one may include both header file and lexical analyzer, in that  order,
in the definitions section of the grammar, by means of the Turbo  Pas-
cal include directive ($I):
   %{
   {$I header filename }
   {$I lexical analyzer }
   %}

As has already been indicated, the lexical analyzer generator Lex dis-
cussed in section 2 of this manual is a useful tool to produce lexical
analyzers to be incorporated into Yacc-generated parsers.

3.3 YACC LIBRARY

The Yacc library unit YaccLib contains some default declarations  used
by Yacc-generated parsers. It should therefore be loaded with yyparse,
which can be achieved by the uses clause
   %{
   uses Yacclib;
   %}
at the beginning  of the Yacc  program. Note that  if the program  in-
cludes a lexical analyzer prepared with Lex, the LexLib unit may  also
be required:
   %{
   uses Yacclib, LexLib;
   %}

The routines implemented by the Yacc library are defaults that can  be
customized for the  target application. In  particular, these are  the
yymsg message printing routine and the yydebugmsg debug message print-
ing routine.  The Yacc library also  declares defaults  for the value-
type YYSTYPE (cf. 3.6) and  the size of the parser  stack (yymaxdepth;
see also 3.9).

Refer to the interface description of the YaccLib unit for further in-
formation. The interface also describes some additional variables  and
routines which are not actually  implemented by the Yacc library,  but
are contained in the Yacc output file. Some of these will also be men-
tioned in subsequent sections.

3.4 AMBIGUITY

If a grammar is non-LALR(1),  Yacc will detect parsing conflicts  when
constructing  the  parse  table.  Such  parsing  conflicts are usually
caused by inconsistencies  and ambiguities in  the grammar. Yacc  will
report  the  number  of  detected  parsing  conflicts, and will try to
resolve these  conflicts, using  the methods  outlined in  [1, section
4.8]. Thus, Yacc will generate a parser for any grammar, even if it is
non-LALR. However, if unexpected  parsing conflicts arise, it  is wise
to consult the  parser description (.lst  file, cf. 3.7)  generated by
Yacc  to  determine  whether  the  conflicts  were resolved correctly.
Otherwise the parser may not behave as expected.

An example  of an  ambigious grammar  is the  following, specifying  a
Pascal-like syntax of IF-THEN-ELSE statements:
   %token IF THEN ELSE
   %%
   stmt      : IF expr THEN stmt                /* 1 */
             | IF expr THEN stmt ELSE stmt      /* 2 */

The  ambiguity  in  this  grammar  fragment,  often referred to as the
dangling-else-ambiguity, stems from the fact that it cannot be decided
to which THEN a nested ELSE belongs: is
   IF e1 THEN IF e2 THEN s1 ELSE s2
to be interpreted as:
   (1) IF e1 THEN ( IF e2 THEN s1 ELSE s2 )
or as:
   (2) IF e1 THEN ( IF e2 THEN s1 ) ELSE s2 ?

Let us take a look at how such an ambigious construct would be parsed.
When the parser has seen IF e2 THEN s1, it could recognize (reduce by)
the first rule,  yielding the second  interpretation; but it  could as
well read ahead, shifting  the next symbol ELSE  on top of the  parser
stack, then parse s2,  and finally reduce by  rule 2, which in  effect
yields the first interpretation. Thus, upon seeing the token ELSE, the
parser is in a shift/reduce  conflict: it cannot decide between  shift
(the token ELSE) and reduce (by the first rule).

In the dangling-else example, the grammar - with some effort - can  be
rewritten to eliminate the  ambiguity, see [1, section  4.3]. However,
this is  not possible  in general  (there are  languages that  are in-
trinsically ambigious). Furthermore, an unambigious grammar may  still
be non-LALR (see  [1, exercise 4.40]  for an example).  In particular,
parsing decisions in an LALR parser are based on one-symbol-lookahead,
which limits the  class of grammars  that may be  used to construct  a
`pure' LALR parser.

As implemented, Yacc always resolves shift/reduce conflicts in  favour
of shift.  Thus a  Yacc generated  parser will  correctly resolve  the
dangling-else  ambiguity  (assuming  the  common  rule,  that  an ELSE
belongs to the last unmatched THEN).

Another type of ambiguity arises when the parser has to choose between
two different rules. Consider the following grammar fragment (for sub-
and superscripted expressions in the UNIX equation formatter eqn):
   %token SUB SUP
   %%
   expr      : expr SUB expr SUP expr /* 1 */
             | expr SUB expr          /* 2 */
             | expr SUP expr          /* 3 */

The rationale behind this example is that an expression involving both
sub- and  superscript is  often set  differently from  a superscripted
subscripted expression.

The ambiguity arises in an expression of the form e1 SUB e2 SUP e3. At
the end of the  expression, the parser can  apply both rule 1  (reduce
the whole expression) and rule 3 (reduce the subexpression e2 SUP e3).

This type of conflict is termed reduce/reduce conflict. Yacc  resolves
reduce/reduce conflicts in favour of the rule listed first in the Yacc
grammar. Thus,  "special case  constructs" like  the one  above may be
specified by listing them ahead of the more general rules. In our  ex-
ample, e1 SUB e2  SUP e3 is always  interpreted as an instance  of the
first rule (which presumably is the intended result).

To summarize, in absence of other strategies (to be discussed  below),
Yacc applies the following default disambiguating rules:
- a shift/reduce conflict is resolved in favour of shift.
- in a  reduce/reduce conflict, the  first applicable grammar  rule is
  preferred.

In any case, the number of shift/reduce and reduce/reduce conflicts is
reported by  Yacc, since  they could  indicate inconsistencies  in the
grammar. Also, Yacc  reports rules that  are never reduced  (possibly,
because they are completely ruled out by disambiguating rules). A more
detailed description  of the  detected conflicts  may be  found in the
parser description (cf. 3.7).

The  default  disambiguating  rules  are  often inappropriate in cases
where an ambigious  grammar is deliberately  chosen as a  more concise
representation of  an (unambigious)  language. Consider  the following
grammar for arithmetic expressions:
   %token NUMBER
   %%
   expr      : expr '+' expr
             | expr '*' expr
             | NUMBER
             | '(' expr ')'
             ;

There  are  several  reasons  why  such  an ambigious grammar might be
preferred over the corresponding unambigious grammar:
   %token NUMBER
   %%
   expr      : term
             | expr '+' term
             ;
   term      : factor
             | term '*' factor
             ;
   factor    : NUMBER
             | '(' expr ')'
             ;

In particular, the ambigious grammar is more concise and natural,  and
yields a more efficient parser [1, section 4.8].

The ambiguities  in the  first grammar  may be  resolved by specifying
precedences and associativity of  the involved operator symbols.  This
may be done by means of the following precedence definitions:
- %left operator symbols: specifies left-associative operators
- %right  operator  symbols:   specifies  right-associative  operators
  (e.g., exponentiation)
- %nonassoc operator symbols: specifies non-associative operators (two
  operators of the  same class may  not be combined;  e.g., relational
  operators in Pascal)

Each operator symbol  may be a  literal or a  token-identifier; token-
names  appearing  in  precedence  definitions  may,  but  need  not be
declared with %token as well.

Each precedence declaration introduces a new precedence level,  lowest
precedence first. For  the example above,  assuming '+' and  '*' to be
left-associative, and '*' to take precedence over '+', the correspond-
ing Yacc grammar is:
   %token NUMBER
   %left '+'
   %left '*'
   %%
   expr      : expr '+' expr
             | expr '*' expr
             | NUMBER
             | '(' expr ')'
             ;

This grammar unambigiously specifies how arbitrary expressions are  to
be parsed; e.g., e1+e2+e3 will  be parsed as (e1+e2)+e3, and  e1+e2*e3
as e1+(e2*e3).

Yacc  resolves  shift/reduce  conflicts  using  precedences  and   as-
sociativity in the  following manner. With  each grammar rule,  it as-
sociates the precedence of the righmost terminal symbol (this  default
may be overwritten using a %prec tag, see below). Now, when there is a
conflict between shift a  and reduce r, a  a terminal and r  a grammar
rule, and  both a  and r  have associated  precedences p(a)  and p(r),
respectively, the conflict is resolved as follows:
- if p(a)>p(r), choose `shift'.
- if p(a)<p(r), choose `reduce'.
- if p(a)=p(r), the associativity of a determines the resolution:
  - if a is left-associative: `reduce'.
  - if a is right-associative: `shift'.
  - if a is non-associative: `syntax error'.

Shift/reduce conflicts resolved with  precedence will, of course,  not
be reported by Yacc.

Occasionally, it may be necessary to explicitly assign a precedence to
a rule using  a %prec tag,  because the default  choice (precedence of
rightmost terminal) is  inappropriate. Consider the  following example
in which '-' is used both as binary and unary minus:
   %token NUMBER
   %left '+' '-'
   %left '*' '/'
   %right UMINUS
   %%
   expr      : expr '+' expr
             | expr '-' expr
             | expr '*' expr
             | expr '/' expr
             | NUMBER
             | '(' expr ')'
             | '-' expr          %prec UMINUS
             ;

UMINUS is not an actual input symbol, but serves to give unary minus a
higher precedence than any other operator. Note that, by default,  the
last rule would otherwise have the precedence of (binary) '-'.

3.5 ERROR HANDLING

This section is concerned with the built-in features Yacc provides for
syntactic error handling. By default, when a Yacc-generated parser en-
counters an  errorneous input  symbol, it  will print  out the  string
syntax error via the yymsg  routine and return to the  calling program
with the value 1. Usually, an application program will have to do bet-
ter than this, e.g. print appropriate error messages, and resume pars-
ing after syntax  errors. For this  purpose, Yacc supplies  a flexible
error recovery scheme  based on the  notion of "error  rules", cf. [1,
section 4.8 and 4.9].

The predefined token error is  reserved for error handling; it  is in-
serted at  places were  syntax errors  are expected,  yielding the so-
called error  rules. A  transition on  the error  token is never taken
during a normal parse,  but only when a  syntax error occurs. In  this
case the parser pretends it has seen an error token immediately before
the offending input symbol. Since in general the current state of  the
parser may not admit a transition on this fictitious error symbol, the
parser pops its stack until it finds a suitable state, which admits  a
transition on the  error token. If  no such state  exists, the default
error handler is used which prints out syntax error and terminates the
parse. If there is such a state, the parser shifts the error token  on
top of the stack, and resumes parsing.

To prevent cascades of error  messages, the parser then proceeds  in a
special  "error"  state  in  which  other errorneous input symbols are
quietly ignored. Normal parse is resumed only after three symbols have
been read and accepted by the parser.

As a simple example, consider the rule
   stmt  : error { action }
and assume a syntax error occurs while a statement is parsed. Then the
parser will pop its  stack, until the token  error, as an instance  of
the rule stmt : error, can  be accepted, shift the error token  on top
of the stack, reduce by the error rule immediately (executing action),
and resume parsing in error state. The effect is, that the parser  as-
sumes that a  statement (to which  error reduces) has  been found, and
skips symbols  until it  finds something  which can  legally follow  a
statement.

Similarly, the rule
   stmt  : error ';' { action }
will  cause  the  parser  to  skip  input symbols until a semicolon is
found, after which the parser  reduces by the error rule  and executes
action.

Note that error rules are not restricted to the simple forms indicated
above, but  may consist  of an  arbitrary number  of terminal and non-
terminal symbols.

Occasionally, the three-symbols resynchronization rule is  inadequate.
Consider, for example, an interactive application with an error rule
   input     : error '\n'   { write('reenter last line: ') }
               input        { $$ := $4 }
             ;

An error in an input line will then cause the parser to skip ahead be-
hind the following line end, emit the message reenter last line:,  and
read another line.  However, it will  quietly skip invalid  symbols on
the new line, until three valid  symbols have been found. This can  be
fixed by a call to the routine yyerrok which resets the parser to  its
normal mode of operation:
   input     : error '\n'   { write('reenter last line: ');
                              yyerrok }
               input        { $$ := $4 }
             ;

There are a number of other Yacc-supplied routines which are useful in
implementing better error diagnostics and handling:
- yychar: an integer variable containing the current lookahead token.
- yyclearin: deletes the current lookahead token.
- yynerrs: current total number of errors reported with yymsg.
- yyerror:  simulates  a  syntax  error;  syntactic error recovery is
  started, as if the next symbol was illegal.
- yyaccept: simulates accept action of the parser (yyparse returns 0).
- yyabort:  aborts  the  parse  (yyparse  returns  1),  as  if an un-
  recoverable syntax error occurred.

These, and other routines are  also described in the interface  of the
Yacc library (cf. 3.3).

Syntactic error handling and recovery is a difficult area; see [1] for
a more  comprehensive treatment  of this  topic. The  reader may  also
refer  to  [4]  for  a  more  detailed  explanation  of the Yacc error
recovery  mechanism  and  systematic  techniques  for developing error
rules.

3.6 ARBITRARY VALUE TYPES

As already noted, the default type for the $$ and $i values is integer
(cf. 3.1). This type can be changed by putting a declaration
   %{
   type YYSTYPE = some_type;
   %}
into the definitions section of  the Yacc program, prior to  inclusion
of the header and lexical analyzer file.

Yacc also supports explicit declaration  of a (record) value type,  by
means of a definition
   %union{
      name(s) : type;
      ...
      }

Such a definition will be translated to a corresponding (Turbo Pascal)
record declaration which is put  into the header file, and  determines
the type of stacked $$ values, as well as of the yylval variable  (cf.
3.2). "Union tags"  of the form  <name>, where name  is the name  of a
component of  the record  type, can  then be  assigned to terminal and
nonterminal symbols through %token and %type definitions,  respective-
ly.

Consider, for example, the definition:
   %union{
          integer_val : integer;
          real_val : real;
         }
and the grammar rules:
   %token INT
   %%
   expr      : expr '+' expr     { $$ := $1 + $3 }
             | expr '*' expr     { $$ := $1 * $3 }
             | INT               { $$ := $1 }
             /* ... */

To assign value type real to nonterminal expr, and type integer to the
token INT, one might say:
   %token <integer_val> INT
   %type <real_val> expr

The effect is, that Yacc  will automatically replace references to  $$
and $i values by the appropriate record tags, i.e. $$ := $1 + $3  will
be treated as $$.real_val := $1.real_val + $3.real_val, and the action
$$  :=  $1  associated  with  the  third  rule  will be interpreted as
$$.real_val := $1.integer_val.

Also, when arbitrary  value types are  used, Yacc checks  whether each
value referred to by an action has a well-defined type.

Occasionally, there  are values  whose types  cannot be  determined by
Yacc easily. This is the case for  the $$ value of an action within  a
rule, as well as  for values $i, i<=0  of symbols in enclosing  rules.
For such values the notations <name> and <name>i must be used, respec-
tively, where <name> denotes the appropriate union tag.

The expr grammar on the distribution disk is a `real-life' example  of
the use of arbitrary value types.

3.7 DEBUGGING

As  experience  shows,  debugging  a  parser  can  be  quite  tedious.
Although, with some experience, writing a grammar is a quite easy  and
straightforward  task,  implementing,  for  instance,  a  good   error
recovery scheme may be quite tricky. Yacc supplies two debugging  aids
that help verify a parser.

First  of  all,  the  parser  description  (.lst  file) is useful when
determining  whether  Yacc  correctly  resolved parsing conflicts, and
when tracing the actions of a  parser. The .lst file gives a  descrip-
tion of all generated parser states. For each state, the set of kernel
items and the parse actions are given. The kernel items correspond  to
the grammar rules processed  by the parser in  a given state; the  un-
derscore _ in an item denotes the prefix of the rule that has  already
been seen, and the suffix yet to come. The parser actions specify what
action the parser takes  on a given input  symbol. Here, the period  .
denotes the default action that is taken on any input symbol not  men-
tioned otherwise. Possible actions are:
- shift an  input symbol on  top of the  parser stack, and  change the
  parser state accordingly;
- goto a new state  upon a certain nonterminal recognized  through the
  previous reduction;
- reduce by a certain grammar rule;
- accept, i.e. successfully terminate the parse; and
- error, start syntax error recovery.
The default action in a state may either be reduce or error.

The parser description also lists parsing conflicts and rules that are
never reduced (cf. 3.4).

Consider the ambigious grammar:
   %token NUMBER
   %%
   expr      : expr '+' expr
             | expr '*' expr
             | NUMBER
             ;

This grammar, when fed into  Yacc will cause a number  of shift/reduce
conflicts. For instance, the description of parser state 5 in the .lst
file will read as follows:
   state 5:
           shift/reduce conflict (shift 3, reduce 2) on '*'
           shift/reduce conflict (shift 4, reduce 2) on '+'

           expr : expr '*' expr _  (2)
           expr : expr _ '+' expr
           expr : expr _ '*' expr

           $end    reduce 2
           '*'     shift 3
           '+'     shift 4
           .       error

As is  apparent from  this description,  the conflicts  are caused  by
missing precedences and associativities of  '+' and '*'. Also, it  can
be seen  that for  both '+'  and '*'  Yacc chose  shift, following the
default shift/reduce disambiguating rule.

Now let us resolve the conflicts in the grammar by adding  appropriate
precedence declarations:
   %token NUMBER
   %left '+'
   %left '*'
   %%
   expr      : expr '+' expr
             | expr '*' expr
             | NUMBER
             ;

Now, all conflicts are resolved and for the description of state 5  we
get:
   state 5:

           expr : expr '*' expr _  (2)
           expr : expr _ '+' expr
           expr : expr _ '*' expr

           .       reduce 2

Thus, Yacc correctly resolved the ambiguities by chosing reduction  on
any  input,  which  corresponds  to  the  higher  precedence and left-
associativity of '*'.

There are situations in which a parser seems to behave "strangely"  in
spite of  an "obviously  correct" grammar.  If the  problem cannot  be
found by careful analysis  of the grammar, it  is useful to trace  the
actions performed by the parser, to get an idea of what goes wrong.

For this  purpose a  parser may  be compiled  with defined conditional
yydebug, e.g.:
   yacc parser
   tpc parser /Dyydebug

When run, the parser  will print out the  actions it performs in  each
step (together with parser states, numbers of shifted symbols,  etc.),
which can then be followed on a hardcopy of the parser description.

Debug messages are printed via  the yydebugmsg routine (cf. 3.3).  You
may also wish to tailor this routine to your target application,  such
that it prints more informative messages.

Of course, the  discussion above was  rather sketchy. A  more detailed
treatment of these  topics, however, would  require a presentation  of
the LALR  parsing technique,  which is  well beyond  the scope of this
manual. The reader  instead is referred  to [1, sections  4.7 and 4.8]
for more information about the LALR technique. Also, [4] gives a  more
detailed explanation of (UNIX)  Yacc parse tables and  debug messages,
which can mostly be applied to Turbo Pascal Yacc accordingly.

3.8 YACC LANGUAGE GRAMMAR

This section specifies  the Yacc language  syntax, as a  Yacc grammar.
Actually, the Yacc  language is more  naturally expressed by  an LR(2)
grammar;  the  difficulty  is  to  decide  on  the  base of one-symbol
lookahead whether an identifier at the end of a rule is followed by  a
colon, in which case it starts the next rule. Thus, we distinguish the
token  C_ID  (an  identifier  followed  by  a  colon)  and  "ordinary"
identifiers ID. It is assumed to  be the task of the lexical  analysis
to determine to which of these two classes an identifier belongs.

The following grammar has been  abstracted from the Turbo Pascal  Yacc
grammar actually used to implement Turbo Pascal Yacc.

%token
  ID           /* identifier; also literals enclosed in quotes */
  C_ID         /* identifier followed by a colon */
  NUMBER       /* nonnegative integers */
  TOKEN LEFT RIGHT NONASSOC TYPE START UNION PREC
               /* reserved words: %token, etc. */
  SEP          /* separator %% */
  LCURL RCURL  /* curly braces %{ and %} */
  ',' ':' ';' '|' '{' '}' '<' '>'
               /* single character literals */

%start spec

%%

spec           : defs SEP rules aux_procs
               ;

/* auxiliary procedures section: *************************/

aux_procs      : /* empty: aux_procs is optional */
               | SEP { copy the rest of the file }
               ;
/* definitions section: **********************************/

defs           : /* empty */
               | defs def
               ;

def            : START ID
               | UNION '{' { copy the union definition } '}'
               | LCURL { copy Turbo Pascal tie-in } RCURL
               | TOKEN tag token_list
               | LEFT tag token_list
               | RIGHT tag token_list
               | NONASSOC tag token_list
               | TYPE tag nonterm_list
               ;

tag            : /* empty: union tag is optional */
               | '<' ID '>'
               ;

token_list     : token_num
               | token_list token_num
               | token_list ',' token_num
               ;

token_num      : ID
               | ID NUMBER
               ;

nonterm_list   : nonterm
               | nonterm_list nonterm
               | nonterm_list ',' nonterm
               ;

nonterm        : ID
               ;

/* rules section: ****************************************/

rules          : rule1
               | LCURL { copy Turbo Pascal tie-in } RCURL rule1
               | rules rule
               ;

rule1          : C_ID ':' body preced
               ;

rule           : rule1
               | '|' body preced
               ;

body           : /* empty */
               | body ID
               | body action
               ;

action         : '{' { copy action, substitute $$, etc. } '}'
               ;

preced         : /* empty */
               | PREC ID
               | PREC ID action
               | preced ';'
               ;

3.9 ADDITIONAL FEATURES, IMPLEMENTATION RESTRICTIONS AND BUGS

For backward compatibility, Turbo Pascal Yacc supports all  additional
language  elements  entitled  as  `Old  Features Supported But not En-
couraged' in the UNIX manual:
- literals delimited by double quotes and multiple-character literals.
- \ as a synonym for %, i.e. \\ is %%, \left is %left, etc.
- other synonyms: %< = %left,  %> = %right, %binary = %2  = %nonassoc,
  %term = %0 = %token, %= = %prec.
- actions of the form ={...} and =single statement;.
- host language tie-ins (%{...%})  at the beginning of the  rules sec-
  tion (I think that this last one is really a must).

See the UNIX Yacc manual for further information.

As with Lex, Yacc poses some restrictions on internal table sizes  for
the source grammar and the  constructed parser; these are printed  out
by Yacc  together with  statistics about  actual table  space require-
ments, after a  successful translation of  a grammar. Also,  make sure
that enough main memory is available.

The default  size of  the parser  stack is  yymaxdepth=1024 (cf.  3.3)
which should be sufficient for  any average application, but may  also
be enlarged (and shrinked) as needed. Note that right-recursive  gram-
mar rules  may increase  stack space  requirements; thus  it is a good
idea to use left-recursive (and left-associative) rules wherever  pos-
sible.

Standard  (UNIX)  Yacc  has  a  bug  that  causes some (correct) error
recovery schemes to hang in an endless loop, see [4]. This bug  should
be fixed in  the Turbo  Pascal implementation, at the cost of slightly
increased parse table sizes.

Yes, there is  (at least) one  bug in Turbo  Pascal Yacc, namely  that
%union definitions (cf.  3.6) are translated  to simple Pascal  record
types. They should be variant  records instead. This will be  fixed in
the next release, if  there ever is one.  Note that this bug  does not
affect  the  proper  functioning  of  the  parser; it merely increases
memory requirements for the parser's value stack. Anyhow, you may work
around this by using %union definitions of the (Pascal variant record)
form
   %union { case integer of
            1: ( ... ) ;
            2: ( ... ) ;
             ...
          }

A final remark  about the efficiency  of Yacc-generated parsers  is in
order. The time needed to parse  an input of length n is  proportional
to n.  Although this  may not  convince everyone  (Lex makes a similar
claim, however most Lex-based analyzers seem to be considerably slower
than hand-crafted ones), my experience is that Yacc-generated  parsers
are in  fact fast,  at least  efficient enough  for most  applications
(such as Turbo Pascal Yacc itself). The major bottleneck for  compila-
tion speed almost never seems to be the parser, but almost always  the
lexical analyzer, see [5, section 6.2]. Furthermore, one always has to
consider that manual  implementation of parsers  is usually much  more
costly, compared to the use of a parser generator.

Personally, I  prefer a  parser generator,  because I'm  really a lazy
programmer; and if something seems not to be running at optimal speed,
so what?  We can  always sit  back and  wait for  still more efficient
hardware to come (just kidding).

CONCLUSION

The Turbo Pascal Lex and  Yacc versions described in this  manual have
been designed and tested carefully. I have used them myself to  imple-
ment, among other applications: a lexical analyzer and parser for Pas-
cal (using a public domain ISO  Level 0 grammar, also included in  the
distribution); Turbo  Pascal Yacc  itself, using  bootstrapping; and a
term rewriting system compiler.

Also, quite a lot of  smaller text and data processing  and conversion
routines have been implemented by the author, and others, using  these
programs.

Personally, I feel that these  tools are quite convenient and  useful,
and  can  safe  a  lot  of  trouble  and time in software development,
although they surely could still  be improved in one direction  or the
other.

Compiler  construction  tools  are  not  only  useful for the compiler
writer, but can also be applied in the development of almost any other
software tool that,  in some sense,  defines an input  language. Also,
the use of such  utilities facilitates rapid prototyping,  and enables
the programmer to  clarify language design  issues in early  stages of
software projects.

Turbo Pascal Lex  and Yacc, as  a starting point,  bring to the  Turbo
Pascal programmer some of the merits of theoretically founded compiler
technology, and  thus may  facilitate some  of his  work in  trying to
produce good, and reliable software.

Author's  address:  Albert  Graef,  FB Mathematik, Johannes Gutenberg-
Universitaet Mainz, 6500 Mainz (FRG). Email: Graef@DMZRZU71.bitnet.

REFERENCES

[1]   Aho, Alfred V.; Ravi Sethi; Jeffrey D. Ullman: Compilers : prin-
      ciples, techniques  and tools.  Reading, Mass.:  Addison-Wesley,
      1986.

[2]   Johnson,  S.C.: Yacc  - yet  another compiler-compiler.  Murray
      Hill, N.J.: Bell Telephone Laboratories, 1974. (CSTR-32).

[3]   Lesk, M.E.: Lex  - a lexical  analyser  generator. Murray  Hill,
      N.J.: Bell Telephone Laboratories, 1975. (CSTR-39).

[4]   Schreiner, A.T.; H.G.  Friedman:  Introduction to  compiler con-
      struction with UNIX. Prentice-Hall, 1985.

[5]   Waite,  William M.;  Gerhard Goos:   Compiler construction.  New
      York:  Springer,  1985.   (Texts  and  monographs   in  computer
      science).

APPENDIX: LEX AND YACC MANUAL PAGES

NAME

Lex - lexical analyzer generator (MS-DOS/Turbo Pascal version)

SYNOPSIS

lex lex-file-name[.l] [output-file-name[.pas]]

DESCRIPTION

Lex compiles the regular expression grammar contained in lex-file-name
(default suffix: .l) to the  Turbo Pascal representation of a  lexical
analyzer for the language described  by the input grammar, written  to
output-file-name (default  suffix: .pas;  default: lex-file-name  with
new suffix .pas).

For each pattern in the input grammar an action is given, which is  an
arbitrary Turbo  Pascal statement  to execute  when the  corresponding
pattern is matched in the input stream.

The lexical  analyzer is  implemented as  a table-driven deterministic
finite automaton (DFA) routine named yylex, declared as follows:
   function yylex : integer;

The return value  of yylex may  be 0, denoting  end-of-file; all other
return  values  are  defined  by  the  programmer  and set through ap-
propriate actions.

The yylex routine can be compiled with the Turbo Pascal compiler  (tpc
or turbo). It is  to be called in  the context of a  Turbo Pascal main
program using the LexLib unit (which can be a Yacc-generated parser or
any other program in a  separate file, or incorporated into  the input
specification, and is to be supplied by the programmer).

EXAMPLE

A simple Lex program that counts words in an input file (obtained from
standard input) can be implemented as follows:
        uses LexLib;
        var count : integer;
   %%
   [^ \t\n]+    inc(count);
   .            |
   \n           ;
   %%
   begin
     count := 0;
     if yylex=0 then writeln('word count: ', count)
   end.

To compile and run this program, issue the following commands  (assum-
ing the Lex program to be in file wordcount.l):
   lex wordcount
   tpc wordcount
   wordcount <input-file

DIAGNOSTICS

In  case  of  syntactic  or  semantic  errors  in the source file, Lex
displays  source  line numbers  and   contents,  error  position   and
error message; a  copy of the  error messages is  written to the  file
lex-file-name with new suffix .lst.

NAME

Yacc - yet another compiler-compiler (MS-DOS/Turbo Pascal version)

SYNOPSIS

yacc yacc-file-name[.y] [output-file-name[.pas]]

DESCRIPTION

Yacc  compiles  the  BNF-like  grammar  contained  in   yacc-file-name
(default suffix: .y) to the Turbo Pascal representation of an  LALR(1)
parser  for  the  specified  language,  written  to   output-file-name
(default suffix: .pas; default: yacc-file-name with new suffix .pas).

Also, it generates a header file (output file name with new suffix .h)
containing declarations to  be shared between  parser and the  lexical
analyzer routine  yylex (discussed  below), and  a report  file (yacc-
file-name with  new suffix  .lst) that  contains a  description of the
generated parser.

The grammar rules in the specification can be augmented with  actions,
Turbo Pascal statements to execute when an instance of the correspond-
ing grammar rule has been matched in the input.

The parser  is implemented  as a  table-driven deterministic pushdown-
automaton routine yyparse, that performs a non-backtracking, bottom-up
shift/reduce parse. yyparse is declared as follows:
   function yyparse : integer;

The return value of this  function is either 0 (normal  termination of
the parse) or 1 (exception occurred during the parse, e.g. stack over-
flow,  unrecoverable  syntax   error,  or  programmer   action  called
yyabort).

The yyparse  routine can  be compiled  with the  Turbo Pascal compiler
(tpc or turbo). It  is to be called  in the context of  a Turbo Pascal
main program using the YaccLib unit and a lexical analyzer routine
   function yylex : integer;
which can  also be  prepared with Lex;  these  must be supplied by the
programmer.  The header file Yacc generates summarizes declarations to
be shared between parser and the yylex routine.

If the yyparse routine  is compiled with defined  conditional yydebug,
i.e.
   tpc  filename /Dyydebug
yyparse will trace all parsing actions on standard output.

EXAMPLE

The sample desktop calculator  supplied on the distribution  disk con-
sists of the main program and input grammar in file expr.y and a lexi-
cal analyzer in the Lex source file exprlex.l. It can be compiled  and
run by issuing the commands:
   yacc expr
   lex exprlex
   tpc expr
   expr

To trace the steps made by the parser, compile expr.pas with
   tpc expr /Dyydebug

DIAGNOSTICS

When encountering syntactic  or semantic errors  in an input  grammar,
Yacc  gives  diagnostics  on  standard  output  and in the report file
(yacc-file-name with new suffix .lst).

Upon successful compilation of the input grammar, Yacc will report the
number of  shift/reduce and  reduce/reduce conflicts  encountered when
constructing the parser (if there are any); also, Yacc will report the
number of grammar rules that are never used in a reduction, and  issue
warnings when nonterminal  grammar symbols do  not appear on  the left
side  of  at  least  one  grammar  rule.  Such  items,  in particular:
shift/reduce and reduce/reduce conflicts, are discussed in more detail
in the report (.lst) file.
