Parser generator in Python: kjParsing
=====================================

:Author: Aaron Watters
:Version: $Revision: 1.3 $

This   is   the   documentation   for  the  kjParsing  package,  an
experimental parser generator implemented in Python which generates
parsers  implemented  in  Python.  It  won't  serve  as  a complete
reference on programming language syntax and interpretation, but it
will  review  terminology for the knowledgeable and I hope it will
pique the interest of the less experienced.

Introduction
------------

The  kjParsing  package  is a parser generator written in Python which
generates parsers for use in Python.

These modules and their documentation and demo files may be of use for
classes  on  parsing,  compiling, or formal languages, and may also be
helpful  to  people  who  like  to create experimental interpreters or
translators or compilers.

The  package consists of three Python modules: kjParser, kjParseBuild,
and  kjSet.  Together  these modules are called the kjParsing package.
The  package  also  includes  some  documentation and demo files and a
COPYRIGHT   file   which  explains  the  conditions  for  copying  and
propagating  this  code  and  the  fact  that  the  author  assumes no
responsibility  for  any  difficulties  resulting from the use of this
package by anyone (including himself).

What a Parser Does
------------------

Parsers  can  be  part  of  a  lot  of  different  things:  compilers,
interpreters,   translators,   or   code   generators,  among  others.
Nevertheless,  at  an abstract level parsers all translate expressions
of a language into computational actions.

Parsers  generated  by  the kjParseBuild module may do three different
sorts of actions:

Value Computation
  The parser may build a data structure as the result of the expression.
  For example the silly LispG grammar from the file "DLispShort.py"
  can construct integers, strings and lists from string representations.

    >>> from DLispShort import LispG, Context
    >>> LispG.DoParse1( ' ("list with string and int" 23) ', Context)
    ['list with string and int', 23]
    >>>

Environment Modification
  The parser may modify the context of the computation. For example the
  LispG grammar allows the assignment of values to internal variable
  names.

    >>> LispG.DoParse1( '(setq Variable (4 5 9))', Context)
    [4, 5, 9]
    >>> Context['Variable']
    [4, 5, 9]
    >>>

  (Here the second result indicates that the string 'Variable' has been
  associated with the value [4,5,9] in the Context structure, which in
  this case is a simple python dictionary.)

External Side Effects
  The parser may also perform external actions. For example the LispG
  grammar has the ability to print values to the terminal.

    >>> LispG.DoParse1( '( (print Variable) (print "bye bye")  )', Context )
    [4, 5, 9]
    bye bye
    [[4, 5, 9], 'bye bye']
    >>>

  (Here the first two lines are the results of printing and the last is
  the value of the expression.)

  More  realistic  parsers  will  perform  more  interesting actions, of
  course.

To  implement  a parser using kjParseBuild you must define the grammar
to  parse  and associate each rule and terminal of the grammar with an
action  which  defines  the  computational  meaning  of  each language
construct.

The grammar generation process consists of two phases:

Generation
  During this phase you must define the syntax of the language and
  function bindings that define the semantics of the language. When
  you've debugged the syntax and semantics you can dump the grammar
  object representing the syntax only to a grammar file which can be
  reloaded without re-analyzing the language syntax. For large grammars
  each regeneration may require significant time and computational
  resources.

Use
  During this phase you may load the grammar file without re-analyzing
  the grammar on each use. However, the semantics functions must still
  be rebound on each load. The reloaded grammar object augmented with
  interpretation functions may be used to parse strings of the language.

  Note that the functions that define the semantics of the language
  must be bound in both phases.

  A function for _`building a simple grammar`::

   1   # from file DLispShort.py (with small differences)
   2   def buildSimpleGrammar():
   3       import kjParseBuild
   4       LispG = kjParseBuild.NullCGrammar()
   5       LispG.SetCaseSensitivity(0)
   6       DeclareTerminals(LispG)
   7       LispG.Keywords("setq print")
   8       LispG.punct("().")
   9       LispG.Nonterms("Value ListTail")
   10      LispG.comments([LISPCOMMENTREGEX])
   11      LispG.Declarerules(GRAMMARSTRING)
   12      LispG.Compile()
   13      LispG.MarshalDump('testlisp_mar.py')
   14      BindRules(LispG)
   15      return LispG


Defining a Grammar
------------------

A  programming language grammar is conventionally divided into several
components:

Keywords
  These are special strings that "highlight" a language construct.
  Familiar keywords from Python and Pascal and C are "if", "else",
  and "while".

Terminals
  These are special patterns of characters that indicate a value in the
  language. For example many programming languages will classify the
  string 123 as an instance of the integer terminal and the string
  snark (not contained in quotes) as an instance of the terminal
  identifier or variable. Terminals are usually restricted to very
  simple constructs like identifiers, numbers, and strings. More complex
  things (such as a "date" data type) might be better handled by
  nonterminals and rules.

Nonterminals
  These are "place holders" for language constructs of the grammar.
  They represent parts of the grammar which sometimes expand to great
  size and complexity. For instance the C language grammar presented by
  Kernighan and Ritchie has a nonterminal translationUnit which
  represents a complete C language module, and a nonterminal
  conditionalExpression which represents a truth valued expression of
  the language.

Punctuations
  These are special characters or strings which are recognized as
  separate entities even if they aren't physically separated from other
  strings by white space. For example, most languages would "see" the
  string if0 as a single token (probably an identifier) even if if is a
  keyword, whereas if(0) would be recognized the same as if ( 0 )
  because parentheses are normally considered punctuations. Except for
  the special treatment at recognition, punctuations are similar to
  keywords.
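
A minimal sketch of the difference between keywords and punctuations,
assuming Python's standard ``re`` module rather than the lexical
analyser built into kjParser. The ``tokenize`` function, the token
kinds, and the keyword set are hypothetical and exist only for this
illustration::

  import re

  # Hypothetical keyword set for a small C-like language.
  KEYWORDS = {"if", "else", "while"}

  # Punctuations are recognized even without surrounding white space;
  # words are only keywords if the whole word matches a keyword.
  TOKEN_RE = re.compile(
      r"\s*(?:(?P<punct>[()])"
      r"|(?P<int>[0-9]+)"
      r"|(?P<word>[A-Za-z_][A-Za-z_0-9]*))"
  )

  def tokenize(text):
      tokens = []
      pos = 0
      text = text.rstrip()
      while pos < len(text):
          m = TOKEN_RE.match(text, pos)
          if m is None:
              raise ValueError("cannot tokenize at position %d" % pos)
          kind = m.lastgroup
          value = m.group(kind)
          if kind == "word":
              kind = "keyword" if value in KEYWORDS else "ident"
          tokens.append((kind, value))
          pos = m.end()
      return tokens

With these definitions ``tokenize("if0")`` gives a single identifier
token, while ``tokenize("if(0)")`` and ``tokenize("if ( 0 )")`` give the
same four tokens.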

The  syntax of a language describes how to recognize the components of
the  language. To define a language syntax using kjParseBuild you must
create  a  null  compilable  grammar object to contain the grammar (in
`building a simple grammar`_ this is done on line 4 using the class constructor
kjParseBuild.NullCGrammar() creating the grammar object LispG) and
define the components of the grammar and the rules for recognizing the
components. The component definitions and rule declarations, as well
as the specification of case sensitivity and comment patterns, are
performed on lines 5 through 11 of `building a simple grammar`_ for the LispG
grammar.

Declaring Case Sensitivity and Comments
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There  are  two nuances to parsing not yet mentioned: case sensitivity
and comments.

Some  grammars  are  not  case  sensitive  in  recognizing keywords or
identifiers.  For  example  ANSI  standard  SQL  (which  is  not  case
sensitive  for  keywords  or  identifiers)  recognizes Select, select,
SELECT,  and  SeLect  all  as  the keyword SELECT. To specify the case
sensitivity of the grammar for keywords only use::

 GRAMMAROBJECT.SetCaseSensitivity(TrueOrFalse)

where  TrueOrFalse  is  0  for  no  case  sensitivity  or  1  for case
sensitivity. This must be done before any keyword declarations for the
grammar. All other syntax declarations may be done in any order before
the compilation of the grammar object. In `building a simple grammar`_ the
LispG grammar object is declared to be case insensitive on line 5.
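
As a rough illustration of what case insensitivity means for keyword
recognition, here is a small sketch. The ``keyword_token`` function is
hypothetical and is not part of kjParsing; it assumes, as described
later in this document, that a grammar which is not case sensitive
reports keyword names in upper case::

  def keyword_token(text, keywords, case_sensitive):
      """Return the keyword name that text stands for, or None."""
      if case_sensitive:
          return text if text in keywords else None
      upper = text.upper()
      for keyword in keywords:
          if keyword.upper() == upper:
              return upper
      return None

  # All spellings collapse to one keyword when case is ignored.
  spellings = ["Select", "select", "SELECT", "SeLect"]
  names = [keyword_token(s, ["select"], False) for s in spellings]

Here every entry of ``names`` is ``'SELECT'``, whereas with
``case_sensitive`` set to a true value only the spelling ``select``
would be recognized.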

Comments  are  patterns in the input string which are ignored (or more
precisely  interpreted  as  white space) by the language. To declare a
sequence  of  regular  expressions to be interpreted as a comment in a
grammar use::

  GRAMMAROBJECT.comments(LIST_OF_COMMENT_REGULAR_EXPR_STRINGS)

For example, line 10 of `building a simple grammar`_ declares the constant
string previously declared as::

  LISPCOMMENTREGEX = ";.*"

to represent a comment of the grammar LispG. For the syntax of regular
expression  strings  you  must  look  elsewhere,  but  as a hint ";.*"
represents  any  string  commencing  with a semicolon, followed by any
sequence of characters up to, but not including, a newline.
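
To see what the pattern matches, the following sketch applies it with
Python's standard ``re`` module. This is an assumption made for
illustration: the regular expression engine used inside kjParsing may
differ in details, and ``blank_comments`` is a hypothetical helper, not
a kjParsing function::

  import re

  LISPCOMMENTREGEX = ";.*"

  def blank_comments(text):
      # Replace each comment by one space, so that it acts as white
      # space. The "." never matches a newline, so line breaks survive.
      return re.sub(LISPCOMMENTREGEX, " ", text)

  cleaned = blank_comments("(setq x 1) ; set x\n(print x)")

Note that this naive sketch would also treat a semicolon inside a
quoted string as the start of a comment; a real lexical analyser has to
decide which pattern gets the first chance to match.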

Declaring Keywords, Punctuations, and Terminals
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To declare keywords for your grammar use::

  GRAMMAROBJECT.Keywords( STRING )

where STRING is a white space separated string of keywords. Line 7 of
`building a simple grammar`_ declares setq and print as keywords of LispG.

To declare nonterminals for your grammar, similarly, use::

  GRAMMAROBJECT.Nonterms( STRING )

where  STRING  is a white space separated string of nonterminal names.
Line 9 of `building a simple grammar`_ declares Value and ListTail as
nonterminals of LispG.

Similarly, use::

  GRAMMAROBJECT.punct( STRING )

to  declare a sequence of punctuations for the grammar, except that in
this case the string must not contain any white space. Line 8 of
`building a simple grammar`_ declares parentheses and dot to be punctuations of
LispG.

If  you  have a lot of keywords, punctuations, or nonterminals you can
make  many  separate calls to the appropriate declaration methods with
different strings.

These  declarations  will  cause the grammar to recognize the declared
keyword  strings  (when separated from other strings by white space or
punctuations) and punctuations as special tokens of the grammar at the
lowest  level  of parsing. The parsing process derives nonterminals of
the grammar at a higher level as discussed below.

A  small  difficulty with kjParseBuild is that the strings ``@R``, ``::``,
``>>``,
and  ``##``  cannot  be  used as names of keywords for the grammar because
they  are  used  to specify rule syntax in the "metagrammar". If you
need  these  in  your  grammar  they may be implemented as "trivial"
terminals. For example::

             Grammar.Addterm("poundpound", "##", echo)

I'm  unsure  whether  this  patch is good enough. Does anyone have any
advice  for me? If this is a bad problem for some grammar the keywords
of the meta grammar can be changed of course, but this is a hack.

Declaring Terminals
~~~~~~~~~~~~~~~~~~~

Defining the terminals of a grammar::

    # from DLispShort.py
    def DeclareTerminals(Grammar):
         Grammar.Addterm("int", INTREGEX, intInterp)
         Grammar.Addterm("str", STRREGEX, stripQuotes)
         Grammar.Addterm("var", VARREGEX, echo)

This shows the declarations for installing the int, str, and
var  terminals  in  the  grammar. This is given as a separate function
because  the declarations define both the syntax and semantics for the
terminals, and therefore must be called both during grammar generation
and  after loading the generated grammar object. To declare a terminal
for a grammar use::

  GRAMMAROBJECT.Addterm(NAMESTR, REGEXSTR, FUNCTION)

This  declaration associates both a regular expression string REGEXSTR
and an interpretation function FUNCTION to the terminal of the grammar
named by the string NAMESTR. The FUNCTION defines the semantics of the
terminal as described below and the REGEXSTR specifies a regular
expression for recognizing the string. For example, the second Addterm
call in the DeclareTerminals listing above associates the str terminal
with the regular expression string::

  STRREGEX = '"[^\n"]*"'

which  matches  any string starting with double quotes and ending with
double quotes which contains neither double quotes nor a newline.
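
The behaviour of this pattern can be checked with a short sketch. It
assumes Python's standard ``re`` module, which may not be the engine
kjParsing uses internally, and ``match_str`` is a hypothetical helper
introduced only for this example::

  import re

  STRREGEX = '"[^\n"]*"'

  def match_str(text):
      """Return the str terminal at the start of text, or None."""
      m = re.match(STRREGEX, text)
      return m.group(0) if m else None

  good = match_str('"list with string and int" 23')
  broken = match_str('"no closing quote\non this line"')

Here ``good`` is the quoted string including its quotes, and ``broken``
is ``None`` because the pattern refuses to cross a newline.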

Declaring Rules of the Grammar
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. _`grammar string`:

A grammar definition string::

  # from DLispShort.py
  GRAMMARSTRING ="""
     Value ::  ## indicates Value is the root nonterminal for the grammar
       @R SetqRule :: Value >> ( setq var Value )
       @R ListRule :: Value >> ( ListTail
       @R TailFull :: ListTail >> Value ListTail
       @R TailEmpty :: ListTail >> )
       @R Varrule :: Value >> var
       @R Intrule :: Value >> int
       @R Strrule :: Value >> str
       @R PrintRule :: Value >> ( print Value )
  """

To declare the rules of a grammar use the simple rule definition
language which comes with kjParseBuild, for example as shown in the
`grammar string`_ above. Line 11 of `building a simple grammar`_ uses the
string defined above to associate the rules with the grammar using::

  GRAMMAROBJECT.DeclareRules(RULE_DEFINITION_STRING)

This   declaration   does   not   analyse  the  string;  analysis  and
syntax/semantics errors are reported by ``*.Compile()`` described below.

The   rule  definition  language  allows  you  to  identify  the  root
nonterminal of your grammar and specify a sequence of named derivation
rules for the grammar. It also allows comments which start with ``##`` and
end with a newline. An acceptable string for the rule definition
language looks like::

          RootNonterminalName :: NamedRule1 NamedRule2 ...

Here the Root nonterminal name should be the nonterminal that "stands
for"  any  complete  string  of the language. Furthermore, each named
rule looks like::

              @R NameString :: GoalNonterm >> RuleBody

where the name string for the rule is a string without whitespace, the
goal  nonterminal  is  the  nonterminal that the rule derives, and the
rule  body  is  a  sequence of keywords, punctuations and nonterminals
separated  by  white  space.  Rule names are used for mapping rules to
semantic interpretations and should be unique.

Note  that  punctuations  for  the  grammar  you  are defining are not
punctuations  for  the  rule  definition language (which has none), so
they must be separated from other tokens by white space. The keywords
of the rule definition language (``@R``, ``::``, and ``>>``) must also be
separated from other tokens by whitespace in the rule definition string.
Furthermore,  all  punctuations, keywords, nonterminals, and terminals
used  in the rules must be declared for the grammar before the grammar
is compiled (if one isn't the compilation will fail with an error).

As a bit of sugar you may break up the declarations of rules::

  LispG.DeclareRules("Value::\n")
  LispG.DeclareRules("  @R SetqRule :: Value >> ( setq var Value )\n")
  LispG.DeclareRules("  @R ListRule :: Value >> ( ListTail\n")
  ...

This might be useful for larger grammars.
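
To make the shape of the rule definition language concrete, here is a
minimal sketch of a reader for it. It is not the parser kjParseBuild
uses (which is driven by its own metagrammar); ``parse_rule_string`` and
``SAMPLE`` are hypothetical names, and the sketch does no checking of
whether the symbols have been declared::

  def parse_rule_string(text):
      # Drop "##" comments, then split everything on white space.
      words = []
      for line in text.splitlines():
          words.extend(line.split("##", 1)[0].split())
      if len(words) < 2 or words[1] != "::":
          raise ValueError("expected 'RootName ::' at the start")
      root, rest = words[0], words[2:]
      rules = []
      i = 0
      while i < len(rest):
          # Each rule starts with: @R Name :: Goal >>
          head = rest[i:i + 5]
          if (len(head) < 5 or head[0] != "@R"
                  or head[2] != "::" or head[4] != ">>"):
              raise ValueError("malformed rule near word %d" % i)
          # The body runs until the next @R or the end of the string.
          j = i + 5
          while j < len(rest) and rest[j] != "@R":
              j += 1
          rules.append((head[1], head[3], rest[i + 5:j]))
          i = j
      return root, rules

  SAMPLE = """
     Value ::  ## root nonterminal
       @R ListRule :: Value >> ( ListTail
       @R TailEmpty :: ListTail >> )
  """
  root, rules = parse_rule_string(SAMPLE)

Each entry of ``rules`` is a (rule name, goal nonterminal, body) triple,
which shows why every token of the body, including punctuations, has to
be separated by white space.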

A Brief Discussion of Derivations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The rules for a grammar don't really describe how to parse a string of
the  language,  they  actually  describe how to derive a string of the
grammar.  For  this  reason  it  is possible to create a grammar which
derives  the  same  string  in  two  different ways; such grammars are
termed ambiguous. If you try to generate a parser for an ambiguous
grammar the parser generation process will complain.

For  a  more precise definition of the derivation of a language string
from  a  grammar  see the "further readings" below. For illustrative
purposes,  and  to  help  explain  how  to define semantics functions,
consider the following derivation of the string::

                     ( 123 ( setq x "this" ) )

using the rules declared above (`grammar string`_):

  +------------------------------------+------------+
  |  Derivation                        | Rule used  |
  +====================================+============+
  |  Value1 >> ( ListTail1             | ListRule   |
  +------------------------------------+------------+
  |  ListTail1 >> Value2 ListTail2     | TailFull   |
  +------------------------------------+------------+
  |  Value2 >> [int = 123]             | Intrule    |
  +------------------------------------+------------+
  |  ListTail2 >> Value3 ListTail3     | TailFull   |
  +------------------------------------+------------+
  |  Value3 >> (setq [var='x'] Value4) | SetqRule   |
  +------------------------------------+------------+
  |  Value4 >> [str='this']            | Strrule    |
  +------------------------------------+------------+
  |  ListTail3 >> )                    | TailEmpty  |
  +------------------------------------+------------+

To  obtain the string derived we simply substitute the representations
derived  for  each  of  the numbered nonterminals and terminals of the
derivation. So the leftmost derivation steps for (123 (setq x "this"))
are:

  +-----+------------------------------------+-------------+
  | (1) |  Value1                            |             |
  +-----+------------------------------------+-------------+
  | (2) |  ( ListTail1                       | (ListRule)  |
  +-----+------------------------------------+-------------+
  | (3) |  ( Value2 ListTail2                | (TailFull)  |
  +-----+------------------------------------+-------------+
  | (4) |  ( 123 ListTail2                   | (Intrule)   |
  +-----+------------------------------------+-------------+
  | (5) |  ( 123 Value3 ListTail3            | (TailFull)  |
  +-----+------------------------------------+-------------+
  | (6) |  ( 123 ( setq x Value4 ) ListTail3 | (SetqRule)  |
  +-----+------------------------------------+-------------+
  | (7) |  ( 123 ( setq x "this" ) ListTail3 | (Strrule)   |
  +-----+------------------------------------+-------------+
  | (8) |  ( 123 ( setq x "this" ) )         | (TailEmpty) |
  +-----+------------------------------------+-------------+
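
The substitution steps can be replayed mechanically. The sketch below
is only an illustration of the derivation idea and has nothing to do
with how kjParseBuild analyses a grammar; ``RULES``, ``STEPS``, and
``derive`` are hypothetical names, and each step always rewrites the
leftmost nonterminal::

  # Rule name -> (goal nonterminal, body), copied from the grammar string.
  RULES = {
      "ListRule": ("Value", ["(", "ListTail"]),
      "TailFull": ("ListTail", ["Value", "ListTail"]),
      "TailEmpty": ("ListTail", [")"]),
      "Intrule": ("Value", ["int"]),
      "Strrule": ("Value", ["str"]),
      "SetqRule": ("Value", ["(", "setq", "var", "Value", ")"]),
  }
  NONTERMINALS = {"Value", "ListTail"}

  # Each step names a rule and the text chosen for any terminals in it.
  STEPS = [
      ("ListRule", {}),
      ("TailFull", {}),
      ("Intrule", {"int": "123"}),
      ("TailFull", {}),
      ("SetqRule", {"var": "x"}),
      ("Strrule", {"str": '"this"'}),
      ("TailEmpty", {}),
  ]

  def derive(steps):
      form = ["Value"]
      history = [" ".join(form)]
      for rule_name, lexemes in steps:
          goal, body = RULES[rule_name]
          # Find the leftmost nonterminal and check the rule applies.
          index = next(i for i, s in enumerate(form) if s in NONTERMINALS)
          if form[index] != goal:
              raise ValueError("rule %s does not apply" % rule_name)
          form[index:index + 1] = [lexemes.get(s, s) for s in body]
          history.append(" ".join(form))
      return history

  history = derive(STEPS)

The last entry of ``history`` is the derived string, and the entries
before it correspond to the numbered steps of the table above (without
the subscripts on the nonterminal names).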


Compiling the Grammar Syntax, and Storing the Compilation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once you have defined all the keywords, comments, terminals,
nonterminals, punctuations, and rules of your grammar you may create
the data structures needed for parsing by compiling the grammar using::

  GRAMMAROBJECT.Compile()

Line 12 of `building a simple grammar`_ performs the compilation for the LispG
grammar.

If the compilation succeeds you may use::

  GRAMMAROBJECT.MarshalDump( OUTPUTFILE )

to  store  the  compiled grammar structure to a file that may be later
loaded without recompiling the grammar. Here MarshalDump will create a
binary   "marshalled"   representation   for   the  grammar  in  the
OUTPUTFILE. For example line 13 of `building a simple grammar`_ marshals a
representation for LispG to the file testlisp_mar.py. TESTLisp.GRAMMAR()
will  then  reconstruct  the  internal structure of LispG as a grammar
object and return the grammar object as the result of the function.

Nevertheless, compilation of the grammar by itself does not yield a
grammar that will do any useful parsing. (Actually, it will do
"parsing" using default actions, implemented as a function which
simply returns the list argument.) Rules must be associated with
computational actions before useful parsing can be done.
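
The split between stored syntax and rebound semantics can be
illustrated with Python's standard ``marshal`` module. This is a sketch
under assumptions: ``syntax_tables`` is a made-up stand-in for whatever
kjParseBuild really stores, and the observation that plain function
objects cannot be marshalled is offered as a plausible reason for
rebinding, not as a statement about the package internals::

  import marshal

  # Hypothetical stand-in for compiled syntax tables: plain data only.
  syntax_tables = {
      "root": "Value",
      "rules": [("Intrule", "Value", ["int"]),
                ("TailEmpty", "ListTail", [")"])],
  }

  def EchoValue(list, Context):
      return list[0]

  stored = marshal.dumps(syntax_tables)   # what a dump could write out
  reloaded = marshal.loads(stored)        # what a later load would read

  # Function objects are not marshallable, so they are bound again.
  try:
      marshal.dumps(EchoValue)
      functions_marshal = True
  except ValueError:
      functions_marshal = False

  bindings = {"Intrule": EchoValue}       # rebuilt after every load

After this runs, ``reloaded`` equals ``syntax_tables`` while
``functions_marshal`` is false, which mirrors the two-phase process:
the syntax survives the round trip and the semantics are attached anew.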

Defining a Semantics
~~~~~~~~~~~~~~~~~~~~

Two  sorts  of  objects  require  semantic  actions  that define their
meaning:  rules and terminals. All semantic actions must be defined as
Python  functions  and  bound  in  the  grammar  before parsing can be
performed.

Before  you  can  define  the semantics of your language in Python you
better  have a pretty good idea of what the components of the language
are   supposed   to   represent,   of  course.  Using  your  intuitive
understanding of the language you can:

Decide what the context of the computation should be and how it should
be implemented as a Python structure. If the process of parsing must
modify the context, then the context structure must be a
"mutable" Python structure. In the case of LispG the context is
simply a structure that maps "internal" variable names to values,
implemented as a simple Python dictionary mapping name strings to the
appropriate value.

Decide what kind of Python value each terminal of the grammar
represents. In the case of LispG:

str
  should represent a string value corresponding to the string recognized
  (minus the surrounding quotes).
int
  should represent an integer value corresponding to the string
  recognized.
var
  should represent the string representing the variable name recognized
  (the name must be translated to a corresponding value at a higher
  level since the terminal interpretation functions don't have access to
  the context structure).

Decide  what  kind  of  Python  structure  or  value  each nonterminal
represents. In the case of the LispG grammar:

Value
  represents a Python integer, string, or list.
ListTail
  represents a Python list containing the members of the tail of a list.

Decide  how  each  rule should derive a structure corresponding to the
Goal (left hand side) of the rule based on the values corresponding to
the  terminals and nonterminals on the right hand side of the rule. In
the case of the LispG grammar (refer to the `grammar string`_ for rule
definitions):

SetqRule
  should return whatever the Value nonterminal in the body represents.
ListRule
  should return the list represented by the ListTail nonterminal of the
  body.
TailFull
  should return the result of adding the value corresponding to the
  Value nonterminal of the body to the front of the list corresponding
  to the ListTail nonterminal of the body.
TailEmpty
  should return the empty list.
Varrule
  should return the value from the computational context that
  corresponds to the variable name represented by the var terminal of
  the body.
Intrule
  should return the integer corresponding to the int terminal of the
  body.
Strrule
  should return the string corresponding to the str terminal of the
  body.
PrintRule
  should return the value represented by the Value nonterminal of the
  body.

Decide  what  side  effects,  if  any,  each  rule  should have on the
computational context or externally. In the case of the LispG grammar:

SetqRule
  should associate the variable name represented by var to the value
  represented by Value in the body.
PrintRule
  should print the value corresponding to the Value nonterminal to the
  screen.

The  other  rules  of  LispG  should have no internal or external side
effects.

More  complex languages may require much more complex contexts, values
and  side  effects,  including function definitions, modules, database
table   accesses,   user   authorization  verifications,  and/or  file
creation, among other possibilities.

Having determined the intuitive semantics of the language you may now
implement the semantic functions and bind them in your
grammar.

Semantics for Terminals
~~~~~~~~~~~~~~~~~~~~~~~

To  define the meaning of a terminal you must create a Python function
that  translates  a  string  (which  the  parser  has recognized as an
instance  of  the  terminal)  into an appropriate value. For instance,
when the LispG grammar recognizes a string::

  "this is a string"

the  interpretation  function  should  translate the recognized string
into the Python string it represents: namely, the same string but with
the double quotes stripped off. The following "string interpretation
function" performs this simple interpretation::

    # from DLispShort.py
    def stripQuotes( str ):
        return str[1:len(str)-1]

Similarly,  when  the  parser  recognizes  a string as an integer, the
associated  interpretation function should translate the string into a
Python integer.

The binding of interpretation functions to terminal names is performed
by the Addterm method previously mentioned. For example, the second
Addterm call in the DeclareTerminals listing associates the stripQuotes
function with the terminal named str.

All  functions  passed to Addterm should take a single string argument
which  represents  the  recognized  string,  and  return a value which
represents the semantic interpretation for the input string.
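
A minimal sketch of three such interpretation functions follows. The
names intInterp, stripQuotes, and echo are the ones used in the
DeclareTerminals listing, but only stripQuotes is shown in this
document; the bodies of intInterp and echo below are guesses at the
obvious implementation, not copies from DLispShort.py::

  def intInterp(text):
      # Assumed body: turn the recognized digits into a Python integer.
      return int(text)

  def stripQuotes(text):
      # Same effect as the version above: drop the surrounding quotes.
      return text[1:-1]

  def echo(text):
      # Assumed body: a variable name is interpreted as itself.
      return text

  values = [intInterp("23"), stripQuotes('"bye bye"'), echo("Variable")]

Each function takes the recognized string and returns the value that
will later appear in the list handed to a rule's reduction function.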

Semantics for Rules
~~~~~~~~~~~~~~~~~~~

The  semantics  of  rules is more interesting since they may have side
effects  and  require  the  kind of recursive thinking that gives most
people  headaches. The semantics for rules are specified by functions.
To perform the semantic action associated with a rule, the "reduction
function"  should  perform  any  side  effects  (to the computational
context  or  externally) and return a result value that represents the
interpretation for the nonterminal at the head of the rule. The reduction
functions for the rules::

  # from DLispShort.py
  def EchoValue( list, Context ):
      return list[0]

  def VarValue( list, Context ):
      varName = list[0]
      if varName in Context:
         return Context[varName]
      else:
         raise NameError("no such lisp variable in context "+varName)

  def NilTail( list, Context ):
      return []

  def AddToList( list, Context ):
      return [ list[0] ] + list[1]

  def MakeList( list, Context ):
      return list[1]

  def DoSetq( list, Context):
      Context[ list[2] ] = list[3]
      return list[3]

  def DoPrint( list, Context ):
      print(list[2])
      return list[2]

Binding named rules to interpretation functions::

    # from DLispShort.py
    def BindRules(LispG):
        LispG.Bind( "Intrule", EchoValue )
        LispG.Bind( "Strrule", EchoValue )
        LispG.Bind( "Varrule", VarValue )
        LispG.Bind( "TailEmpty", NilTail )
        LispG.Bind( "TailFull", AddToList )
        LispG.Bind( "ListRule", MakeList )
        LispG.Bind( "SetqRule", DoSetq )
        LispG.Bind( "PrintRule", DoPrint )


The Python functions that define the semantics of the rules of LispG
appear above, followed by the BindRules function whose declarations
bind the rule names to the functions in the grammar object LispG.

Each "reduction function" for a rule must take two arguments: a list
representing  the  body  of  the  rule  and  a context structure which
represents  the  computational  context  of  the computation. The list
argument  will  have the same length as the body of the rule, counting
the   keywords   and   punctuations  as  well  as  the  terminals  and
nonterminals.

For example the SetqRule has a body with five tokens::

  @R SetqRule :: Value >> ( setq var Value )

so  the  DoSetq  function should expect the parser to deliver a Python
list argument with five elements of form::

  list = [ '(', 'SETQ', VARIABLE_NAME, VALUE_RESULT, ')' ]

Note that the "names" of keywords and punctuations appear in the
appropriate positions (0, 1, and 4) of the list corresponding to their
positions  in  SetqRule.  Furthermore,  the  position  occupied by the
terminal  var in SetqRule has been replaced by a string representing a
variable name in the list and the position occupied by the nonterminal
Value in SetqRule has been replaced by a Python value.

More  generally,  the  parser  will call reduction functions for rules
with a list representing the "interpreted body of the rule" where

keywords and punctuations
  are interpreted as themselves (i.e., their names), except that letters
  will be in upper case if the grammar is not case sensitive;

terminals
  are interpreted as values previously returned by a call to the
  appropriate terminal interpretation function; and

nonterminals
  are interpreted as values previously returned by a reduction function
  for a rule that derived this nonterminal.

Although the occurrence of the keyword names in the list may seem
useless,  it  may have its purposes. For example, a careful programmer
might  check them during debugging to make sure the right function was
bound to the right rule.
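
As a sketch of that kind of check, the following wraps the DoSetq
function in a debugging version that inspects the keyword and
punctuation positions before doing the real work. CheckedDoSetq is a
hypothetical name, and the list is built by hand here in the form the
parser is described as delivering for a grammar that is not case
sensitive::

  def DoSetq(list, Context):
      # "list" shadows the builtin, following the style of the document.
      Context[list[2]] = list[3]
      return list[3]

  def CheckedDoSetq(list, Context):
      # Debugging aid: confirm this function was bound to the setq rule.
      assert len(list) == 5
      assert list[0] == '(' and list[1] == 'SETQ' and list[4] == ')'
      return DoSetq(list, Context)

  Context = {}
  result = CheckedDoSetq(['(', 'SETQ', 'x', 'this', ')'], Context)

After the call ``result`` is ``'this'`` and ``Context`` maps ``'x'`` to
``'this'``; binding CheckedDoSetq to any other rule by mistake would
trip one of the assertions on the first parse.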

To determine how to implement the semantics of a rule you must refer
to the semantic decisions you made earlier. For example, above we
specified that the setq construct should bind the variable name
received (list[2]) to the value (list[3]) in the Context, and return
the value (list[3]) as the result of the expression. Translated into
the more concise language of Python this is exactly what the DoSetq
reduction function shown above does.

To bind a rule name to a (previously declared) reduction function use::

  GRAMMAROBJECT.Bind( RULENAME, FUNCTION )

where RULENAME is the string name for the rule previously declared for
the grammar GRAMMAROBJECT and FUNCTION is the appropriate reduction
function for the rule. These bindings for LispG are shown in the
BindRules listing above.
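
One way to picture what binding does is a table from rule names to
functions, with a default action that simply returns the list argument
for rules that have not been bound. The RuleTable class below is a
hypothetical stand-in written for this explanation; it is not how
kjParseBuild stores its bindings::

  class RuleTable:
      """Hypothetical stand-in for the rule bindings of a grammar."""

      def __init__(self, rule_names):
          # Every declared rule starts out with the default action.
          self.functions = dict.fromkeys(rule_names, self.default_action)

      @staticmethod
      def default_action(list, Context):
          return list

      def Bind(self, rule_name, function):
          if rule_name not in self.functions:
              raise KeyError("no such rule: " + rule_name)
          self.functions[rule_name] = function

      def reduce(self, rule_name, list, Context):
          return self.functions[rule_name](list, Context)

  def EchoValue(list, Context):
      return list[0]

  table = RuleTable(["Intrule", "Strrule"])
  before = table.reduce("Intrule", [123], {})
  table.Bind("Intrule", EchoValue)
  after = table.reduce("Intrule", [123], {})

Before binding, the reduction returns the whole list ``[123]``; after
binding it returns the integer ``123``.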

A Bit on the Parsing Process
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following is not a precise definition of the actions of a parser,
but it may help you understand how the parsing process works and the
order in which rules are recognized and functions are evaluated.

Parsing ``(123 (setq x "this"))``:

+-+---------------------------+----------------------+----------------------+
| |Tokens seen S              | input remaining      | rule R and           |
| |                           |                      | function call        | 
+=+===========================+======================+======================+
|0|                           |(123 (setq x "this")) |                      |
+-+---------------------------+----------------------+----------------------+
|1|   ( 123                   |(setq x "this"))      |Intrule               |
+-+---------------------------+----------------------+----------------------+
| |                           |                      |Value2 =              |
| |                           |                      |EchoValue([123],C)    |
+-+---------------------------+----------------------+----------------------+
|2|( Value2 ( setq x "this"   |))                    |Strrule               |
+-+---------------------------+----------------------+----------------------+
| |                           |                      |Value4 =              |
| |                           |                      |EchoValue(['this'],C) |
+-+---------------------------+----------------------+----------------------+
|3|( Value2 ( setq x Value4 ) |)                     |SetqRule              |
+-+---------------------------+----------------------+----------------------+
| |                           |                      |Value3 = DoSetq(['(', |
| |                           |                      |'SETQ','x',Value4,')']|
| |                           |                      |,C)                   |
+-+---------------------------+----------------------+----------------------+
|4|( Value2 Value3 )          |                      |TailEmpty             |
+-+---------------------------+----------------------+----------------------+
| |                           |                      |ListTail3 =           |
| |                           |                      |NilTail([')'],C)      |
+-+---------------------------+----------------------+----------------------+
|5|( Value2 Value3 ListTail3  |                      |TailFull              |
+-+---------------------------+----------------------+----------------------+
| |                           |                      |ListTail2 =           |
| |                           |                      |AddToList([Value3,    |
| |                           |                      |ListTail3],C)         |
+-+---------------------------+----------------------+----------------------+
|6|( Value2 ListTail2         |                      |TailFull              |
+-+---------------------------+----------------------+----------------------+
| |                           |                      |ListTail3 =           |
| |                           |                      |AddToList([Value2,    |
| |                           |                      |ListTail2],C)         |
+-+---------------------------+----------------------+----------------------+
|7|( ListTail3                |                      |ListRule              |
+-+---------------------------+----------------------+----------------------+
| |                           |                      |Value1 =              |
| |                           |                      |MakeList(['(',        |
| |                           |                      |ListTail3],C)         |
+-+---------------------------+----------------------+----------------------+
|8|Value1                     |                      |                      |
+-+---------------------------+----------------------+----------------------+


Technically, each entry of S is tagged with the kind of token it
represents (keyword, nonterminal, or terminal) and the name of the
token it represents (e.g., Value, str) as well as the values shown.

The table above illustrates the sequence of reduction actions performed
by  LispG  when parsing the input string (123 (setq x "this")). We can
think  of  this parse as "reversing" the derivation process shown in
Figure  Derive  using  the rule reduction functions to obtain semantic
interpretations for the nonterminals.

At the lowest level of parsing, a lexical analyser examines the unread
portion of the input string and tries to match a prefix of it with a
keyword or with a regular expression for a terminal (ignoring comments
and whitespace, except as separators). The analyser "passes" the
recognized token to the higher level parser together with its
interpreted value. The interpreted value of a terminal is determined
by the appropriate interpretation function, and the interpreted value
of a keyword is simply its name (in upper case, if the grammar is not
case sensitive). For example, the LispG lexical analyser recognizes
'(' as a keyword with the value '(' and "this" as an instance of the
terminal str with the value 'this'.
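
As a rough illustration of this lexical step (not the kjParser
implementation), the sketch below tries keywords first and then the
regular expression of each terminal. The names KEYWORDS, TERMINALS,
match_keyword and tokenize are hypothetical, the regular expressions
are assumptions about what a LispG-like grammar might declare, and
comments are not handled::

    import re

    # Hypothetical declarations for a LispG-like grammar (assumptions).
    KEYWORDS = ["(", ")", "setq"]
    TERMINALS = [
        # (terminal name, regular expression, interpretation function)
        ("int", re.compile(r"\d+"), int),
        ("str", re.compile(r'"[^"\n]*"'), lambda s: s[1:-1]),
        ("var", re.compile(r"[A-Za-z][A-Za-z0-9_]*"), lambda s: s),
    ]
    WHITESPACE = re.compile(r"\s+")

    def match_keyword(text, pos, case_sensitive):
        """Return (keyword value, end position) or None."""
        for kw in KEYWORDS:
            chunk = text[pos:pos + len(kw)]
            if case_sensitive:
                same = chunk == kw
            else:
                same = chunk.upper() == kw.upper()
            if not same:
                continue
            end = pos + len(kw)
            # An alphabetic keyword must not run into more name characters.
            if kw.isalnum() and end < len(text) and text[end].isalnum():
                continue
            return (kw if case_sensitive else kw.upper()), end
        return None

    def tokenize(text, case_sensitive=False):
        """Return a list of (kind, name, value) triples."""
        tokens = []
        pos = 0
        while pos < len(text):
            ws = WHITESPACE.match(text, pos)
            if ws:
                pos = ws.end()          # whitespace only separates tokens
                continue
            found = match_keyword(text, pos, case_sensitive)
            if found is not None:
                name, pos = found
                tokens.append(("keyword", name, name))
                continue
            for name, pattern, interpret in TERMINALS:
                m = pattern.match(text, pos)
                if m:
                    tokens.append(("terminal", name, interpret(m.group())))
                    pos = m.end()
                    break
            else:
                raise ValueError("unrecognized input at position %d" % pos)
        return tokens

With these assumed declarations, tokenize('(123 (setq x "this"))')
yields eight tokens, among them ("keyword", "SETQ", "SETQ") and
("terminal", "str", "this").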

The higher level parser accepts tokens T from the lexical analyser and
does one of two things with each of them:

* If the most recent token values V that the parser has saved on its
  "tokens seen" stack S "look like" the body B of a rule R, and the
  current token is a token that could follow the nonterminal N at the
  head of R, then the parser evaluates the reduction function F
  associated with R, using the values V from the stack S that match
  the body B together with the computational context C. The resulting
  value F(V,C) replaces the values V on the stack S.

* Otherwise the current token is shifted onto the "tokens seen" stack
  S and the parser moves on to the next token.

The above is a lie. Actually, the parsing process is much smarter than
this, but from a user's perspective this simplification may be helpful.

Figure Parse shows "reduction" steps and not the "shifts", and glosses
over the lexical analysis and other nuances, but it illustrates the
idea of the parsing process nonetheless. For example, at step 2 the
parser recognizes the last token on the stack S (an instance of the
"str" terminal with value "this") as matching the body of StrRule, and
replaces it with an instance of the nonterminal Value with value
determined by the reduction of StrRule. In this case StrRule is
associated with the reduction function EchoValue, so the result of the
reduction is given by EchoValue(['this'], C), where C is the context
structure for the parse.

At Step 3 the most recent entries of S::

  V = ['(', 'SETQ', 'x', Value4, ')']

match  the  body of the rule SetqRule, so they are replaced on S by an
instance of the Value nonterminal with value determined by::

  Value3 = DoSetq( V, C )

Finally,  at  step  8,  the  interpretation associated with Value1 (an
instance  of  the root nonterminal for LispG) is considered the result
of the computation.
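
The walk-through above can be imitated with a deliberately naive
shift-reduce loop. This is a sketch of the simplified description
only, not of the SLR machinery the package really uses. The bodies of
the reduction functions are guesses modeled on the names in the table,
the FOLLOW sets are written out by hand, variable lookup is left out,
and rules are tried longest body first so that the setq form wins over
the bare ')' rule; all of these are assumptions made for the example::

    # Reduction functions F(V, C); the bodies are assumptions.
    def EchoValue(V, C): return V[0]
    def NilTail(V, C): return []
    def AddToList(V, C): return [V[0]] + V[1]
    def MakeList(V, C): return V[1]
    def DoSetq(V, C):
        C[V[2]] = V[3]              # record the binding in the context
        return V[3]

    EOF = "<eof>"
    # (rule name, head nonterminal, body, reduction function)
    RULES = [
        ("SetqRule", "Value", ["(", "SETQ", "var", "Value", ")"], DoSetq),
        ("ListRule", "Value", ["(", "ListTail"], MakeList),
        ("TailFull", "ListTail", ["Value", "ListTail"], AddToList),
        ("TailEmpty", "ListTail", [")"], NilTail),
        ("IntRule", "Value", ["int"], EchoValue),
        ("StrRule", "Value", ["str"], EchoValue),
    ]
    # Tokens that may follow each nonterminal (hand-written for this toy).
    FOLLOW = {
        "Value": {"int", "str", "(", ")", EOF},
        "ListTail": {"int", "str", "(", ")", EOF},
    }

    def naive_parse(tokens, C):
        """Return (result value, list of rule names in reduction order)."""
        stack = []                  # the "tokens seen" stack S
        trace = []
        pending = list(tokens) + [(EOF, None)]
        while True:
            lookahead = pending[0][0]
            for rule_name, head, body, F in RULES:
                top = stack[-len(body):]
                names = [name for name, _ in top]
                if names == body and lookahead in FOLLOW[head]:
                    V = [value for _, value in top]
                    del stack[-len(body):]
                    stack.append((head, F(V, C)))   # reduce
                    trace.append(rule_name)
                    break
            else:
                if lookahead == EOF:
                    break
                stack.append(pending.pop(0))        # shift
        if [name for name, _ in stack] != ["Value"]:
            raise SyntaxError("input is not a single Value")
        return stack[0][1], trace

    # Tokens for (123 (setq x "this")) as (name, value) pairs.
    TOKENS = [("(", "("), ("int", 123), ("(", "("), ("SETQ", "SETQ"),
              ("var", "x"), ("str", "this"), (")", ")"), (")", ")")]

Calling naive_parse(TOKENS, {}) performs the reductions IntRule,
StrRule, SetqRule, TailEmpty, TailFull, TailFull, ListRule, in that
order, and returns the list [123, 'this']; the context ends up holding
the binding of x to 'this'.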

Parsing with a Grammar
~~~~~~~~~~~~~~~~~~~~~~

Before you can perform a parse you will usually need to create a
computational context for the parse. In the case of LispG the context
is simply a dictionary, so to create a context for parsing we may
initialize::

  Context = {}

There are two methods which provide the primary interfaces for the
parsing process for a grammar::

  RESULT = GRAMMAROBJECT.DoParse1(STRING, CONTEXT)
  (RESULT, CONTEXT) = GRAMMAROBJECT.DoParse(STRING, CONTEXT)

The second makes explicit, in code that uses parsing, the possibility
that a parse may alter the context of the parse; aside from that the
two methods are identical. Examples of DoParse1 usage with LispG were
given earlier.
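
The relationship between the two entry points can be pictured with a
toy stand-in. Everything here is hypothetical: ToyGrammar is not a
kjParser class, its method names are invented for the illustration,
and its "parse" merely splits NAME=VALUE strings. The point is only
that one form returns the result alone while the other also hands back
the (possibly altered) context::

    class ToyGrammar:
        """Hypothetical stand-in for a grammar object."""

        def parse_with_context(self, string, context):
            # "Parse" NAME=VALUE and record the binding in the context.
            name, _, value = string.partition("=")
            context[name.strip()] = value.strip()
            return (value.strip(), context)

        def parse_result_only(self, string, context):
            # Same parse; only the result is returned.
            return self.parse_with_context(string, context)[0]

    grammar = ToyGrammar()
    context = {}
    result = grammar.parse_result_only("x = this", context)
    result2, context2 = grammar.parse_with_context("y = that", context)

Either way the same dictionary object is updated in place, so after
these calls context holds bindings for both x and y.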

Storing and Reloading a Grammar Object
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The  process  of  compiling  a  grammar  may take significant time and
consume  significant  quantities  of  memory.  To  free up memory from
structures  in  a  compilable  grammar object that aren't needed after
compilation use GRAMMAR.CleanUp().

Once you have debugged the syntax and semantics of your grammar you
may store syntactic information for the grammar using the Reconstruct
method already mentioned. The declarations created by Reconstruct
define only the syntax of the grammar; the semantics must be rebound
separately. It is much better, however, to store the grammar in a
binary format and reload it with UnMarshalGram, as shown below.

For  example, line 13 of `building a simple grammar`_ creates a file
testlisp_mar.py  containing  a  function GRAMMAR() which will reconstruct
the syntax for the LispG grammar::

    # from DLispShort.py
    def unMarshalLispG():
      import kjParser
      LispG = kjParser.UnMarshalGram('testlisp_mar')
      DeclareTerminals(LispG)
      BindRules(LispG)
      return LispG

This function can then be used in another file, provided
GrammarBuild() given in `building a simple grammar`_ has been executed at
some point in the past, thusly::

    import DLispShort
    LGrammar = DLispShort.unMarshalLispG()

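Why the terminals and rules have to be rebound after reloading can be
seen with Python's standard marshal module, which serializes plain
data such as strings, lists and dictionaries but not function objects.
The sketch below is an assumption-laden illustration, not the file
format used by the package: the tables dictionary, BINDINGS and rebind
are hypothetical names::

    import marshal

    # Hypothetical stand-in for the syntactic tables of a compiled grammar.
    tables = {
        "keywords": ["(", ")", "SETQ"],
        "rules": [
            ("ListRule", "Value", ["(", "ListTail"]),
            ("TailEmpty", "ListTail", [")"]),
        ],
    }

    # Store: plain data marshals to bytes that could be written to a file.
    blob = marshal.dumps(tables)

    # Reload: the syntax comes back intact ...
    restored = marshal.loads(blob)

    # ... but the semantics must be attached again by rule name.
    BINDINGS = {
        "ListRule": lambda V, C: V[1],
        "TailEmpty": lambda V, C: [],
    }

    def rebind(tables, bindings):
        """Pair every stored rule with its reduction function."""
        return [(name, head, body, bindings[name])
                for name, head, body in tables["rules"]]

    bound_rules = rebind(restored, BINDINGS)

Here restored compares equal to tables, and each entry of bound_rules
carries a callable again, which is the role DeclareTerminals and
BindRules play in unMarshalLispG above.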

Errors raised
~~~~~~~~~~~~~

You may see the following errors:

LexTokenError
  This usually means the lowest level of the parser ran into a string it
  couldn't recognize.

BadPunctError
  You tried to make a whitespace character a punctuation. This is not
  currently allowed.

EOFError, SyntaxError
  You tried to parse a string that is not valid for the grammar.

TokenError
  During parser generation you used a string in the rule definitions
  that wasn't previously registered as a terminal, nonterminal, or
  punctuation.

NotSLRError
  You attempted to build a grammar that is not "SLR" according to the
  definition of Aho and Ullman. Either the grammar is ambiguous, or it
  doesn't have a derivation for the root nonterminal, or it is too
  tricky for the generator.

Furthermore, NondetError, ReductError, FlowError, ParseInitError,
UnkTermError or errors raised by other modules shouldn't happen. If an
error that shouldn't happen happens there are two possibilities:
(1) you have fiddled with the code or data structures and you broke
something, or (2) there is a serious bug in the module.

Possible Gotchas
~~~~~~~~~~~~~~~~

This  package  has  a  number  of  known  deficiencies,  and there are
probably many that are yet to be discovered.

Syntax errors are not reported nicely. Sorry.

Currently, there is no way to resolve grammar ambiguities. For
example, the C construct::

  if (x)
  if (y)
    x = 0;
  else
    y = 1;

could have the else associated with either the first or the second if;
the grammar doesn't indicate which. This is normally resolved by
instructing the parser generator to prefer one binding or the other.
No method for providing such a preference is implemented here, yet.
Let me know if you need such a method or if you have any suggestions.
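
To see that the two readings really differ, the sketch below encodes
each one as a small tree and evaluates it. The tuple encoding and the
run function are hypothetical and exist only for this illustration;
they have nothing to do with how kjParser represents parses::

    def run(tree, env):
        """Evaluate a tiny statement tree against env (a dict)."""
        kind = tree[0]
        if kind == "assign":
            _, name, value = tree
            env[name] = value
        elif kind == "if":
            _, cond, then_branch, else_branch = tree
            if env[cond]:
                run(then_branch, env)
            elif else_branch is not None:
                run(else_branch, env)
        return env

    set_x = ("assign", "x", 0)      # x = 0;
    set_y = ("assign", "y", 1)      # y = 1;

    # Reading 1: the else belongs to the inner if (the choice C makes).
    inner = ("if", "x", ("if", "y", set_x, set_y), None)

    # Reading 2: the else belongs to the outer if.
    outer = ("if", "x", ("if", "y", set_x, None), set_y)

    after_inner = run(inner, {"x": 1, "y": 0})   # {'x': 1, 'y': 1}
    after_outer = run(outer, {"x": 1, "y": 0})   # {'x': 1, 'y': 0}

Starting from x = 1 and y = 0, the first reading sets y to 1 while the
second leaves both variables unchanged, so a parser has to be told
which tree to build.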

Keywords  of the meta-grammar cannot name tokens of the object grammar
(see footnote above).

If  you  want  keywords  to be recognized without case sensitivity you
must declare G.SetCaseSensitivity(0) before any keyword declarations.

Name  and  regular  expression  collisions  are not always checked and
reported.  If  you  name  two rules the same, for example, you may get
undefined behavior.

The  lexical analysis implementation is not as fast as it could be (of
course).  It  also  sees  all  white space as a "single space" so, for
example,  if indentation is significant in your grammar (as in Python)
you'll need a different lexical analyzer. Also if x=+y means something
different  from  x  = + y (as it did in the original C, I believe) you
may  have  trouble. Happily the lexical component can be easily "plug
replaced" by another implementation if needed.
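
The consequence of treating all white space as a separator can be
shown with a small stand-alone tokenizer. It is a hypothetical sketch
(the TOKEN pattern and the tokens function are not part of the
package), but it discards white space in the same spirit, so the
layout of the input leaves no trace in the token stream::

    import re

    # A name, an integer, or one of the operators = and +,
    # each optionally preceded by white space that is thrown away.
    TOKEN = re.compile(r"\s*([A-Za-z_]\w*|\d+|[=+])")

    def tokens(text):
        """Return the list of token strings in text."""
        out = []
        pos = 0
        text = text.rstrip()
        while pos < len(text):
            m = TOKEN.match(text, pos)
            if m is None:
                raise ValueError("unrecognized input at position %d" % pos)
            out.append(m.group(1))
            pos = m.end()
        return out

    same_spacing = tokens("x=+y") == tokens("x = + y")
    same_indent = tokens("if x\n    y") == tokens("if x\ny")

Both comparisons come out true: the parser downstream cannot tell the
compact spelling from the spaced one, nor an indented line from an
unindented one.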

Also, the system currently only handles SLR grammars (as defined by
Aho and Ullman), as mentioned above. If you get a NotSLRError during
grammar compilation you need a better parser generator. I may provide
one, if I have motivation and time.

I  know of no outright bugs. Trust me, they're there. Please find them
for  me and tell me about them. I'm not a big expert on parsing so I'm
sure I've made some errors, particularly at the lexical level.

Further Reading
~~~~~~~~~~~~~~~

A  standard  reference  for  parsing  and  compiler,  interpreter, and
translator implementation is Principles of Compiler Design, by Aho and
Ullman (Addison Wesley).

