Building the MiniLatex the Parser

MiniLatex Demo App

In this article we discuss the overall design of the MiniLatex parser. Its function is to translate source text into an abstract syntax tree (AST). This is a recursive, tree-like structure that “understands” the grammar of LaTeX.  Recursiveness is what makes it possible to have environments nested with environments which may themselves have macros, etc., etc.  Once an AST has been generated, it can be rendered into HTML in a relatively straightforward way and so displayed on a web page. We will discuss how this is done in a future article.

The AST is defined by an Elm union type which captures the various kinds Latex source text — plain text, comments, macros, inline or display math text, environments, etc. We describe this structure in section 1. The parser is built out of a small set of primitive parsers using parser combinators.  We describe how this is done in section 2.  In section 3, we briefly discuss the MiniLatex grammar, noting that is not context-free. (We will discuss it at length in future article).  In section 4, we discuss the parser for environments, which lie in the non-context-free part of the grammar.  It turns out that Elm’s andThen combinator handles this case with aplomb.  Parenthetically, the type signature of andThen is, up to a permutation of arguments, the same as that of the bind operator for monads.

I would like to acknowledge the generous help of the members of the Elm Slack group throughout the work on the MiniLatex parser, with special thanks to Ilias van Peer.

The code snippets below are written in Elm, as is the parser-renderer itself.  You can try out MiniLatex using MiniLatex Demo App  — there is an option for viewing the AST produced by parsing the text.

1. AST

Let’s begin with the AST, which is an Elm union type, defined below as a LatexExpression. Such an expression is a recursive structure consisting of a number of possible elements — a list of LatexExpressions,  an InlineMath element, an Environment, etc.  In many cases, an element can itself contain LatexExpressions.  This is the case, for example, of for Environment whose body is a LatexExpression.

type LatexExpression = 
     LXString String
   | Comment String
   | InlineMath String
   | DisplayMath String 
   | Macro String (List LatexExpression) (List LatexExpression)
   | Environment String (List LatexExpression) LatexExpression
   | SMacro String (List LatexExpression) (List LatexExpression) LatexExpression 
   | Item Int LatexExpression
   | LatexList (List LatexExpression)
   | LXError Error

One of the design goals is to keep the type definition for the AST small. Although the definition has grown slightly since the project began, so far, so good.

Examples

The role of the parser is to convert source text to an AST.  For example:

parse “$a^2 = 1$”       =>   InlineMath “a^2 = 1”

parse “$$a^2 = 1$$”   =>  DisplayMath “a^2 = 1”

parse “\emph{foo}”   =>  Macro “emph” [ ] ([LatexList ([LXString “foo”])])

 In the last example, the empty list [ ] is a placeholder for optional macro arguments, of which there are none.  For a final example, consider the source text

\begin{theorem}
There are are infinitely many primes.
\end{theorem}

When parsed, it yields the LatexExpression

Environment "theorem" [ ] 
  (LatexList 
    (
      [LXString "There are are infinitely many primes."]
    )
   )

Again, the empty list [ ] is a slot for optional arguments.

 

2. Parser Combinators

The parser is constructed using the elm-tools/parser a parser combinator library written by Evan Czaplicki.  From the standpoint of general taxonomy, the MiniLatex parser is a recursive descent parser of type LL(*) where the asterisk refers to potentially unbounded lookahead.  More about that in the comments of section 5.

Let’s look at the parser, drilling down from top to bottom.  The top-level parsing function, listed below, first gobbles whitespace using the ws parser, then parses as many LatexExpressions as it can, finally mapping them to a LatexList.

latexList : Parser LatexExpression
  latexList =
    inContext "latexList" <|
      (succeed identity
        |. ws
        |= repeat oneOrMore latexExpression
        |>map LatexList
      )
The inContext "latexList" phrase plays a role in producing good error messages.  See Evan Czaplicki’s discussion of that subject.  Here the symbols |. and |= are parser combinators which, roughly speaking, mean “ignore the parsed result” and “keep the parsed result,” respectively.  These combinators allow one to write parsers as part of “parser pipeline,” as illustrated in the introduction to elm-tools/parser.  Their type signatures are
  (|.) : Parser keep -> Parser ignore -> Parser keep
and
  (|=) : Parser (a -> b) -> Parser a -> Parser b
 while for map we have
  map: (a -> b) -> Parser a -> Parser b

What about the latexExpression parser in the above definition?  Listed below, it uses the oneOf combinator to try the various alternatives open to it until one succeeds or they all fail.  Compare the alternatives below to the definition of the AST type.

latexExpression : Parser LatexExpression
  latexExpression =
    oneOf [ texComment
           , lazy (\_ -> environment)
           , displayMathDollar
           , displayMathBrackets
           , inlineMath ws
           , macro ws
           , smacro
           , words
          ]

By way of elaboration, macro ws is the macro parser called with argument ws. An alternative is macro spaces, where spaces gobbles spaces but not newlines.  Different whitespace options are needed in different contexts.

 

We are getting closer to the bottom of the hierarchy.  Let’s look now at the macro parser.  It is built to return a value of type

Macro String (List LatexExpression) (List LatexExpression)
The expression  Macro is a  type constructor.  That is, given three arguments of the correct type, it returns a value of that type.  Look at the definition below for the macro parser.  It has a pipeline in which the three parsers macroName, repeat zeroOrMore optionalArg, and repeat zeroOrMore arg are arranged in succession.  They feed their results to the Macro type constructor.  The last parser in the pipeline gobbles white space so that the parser will not “stall.”  The end result is a parser which recognizes a Macro and returns the corresponding AST for it.
macro : Parser () -> Parser LatexExpression
  macro wsParser =
    inContext "macro" <|
      (succeed Macro
        |= macroName
        |= repeat zeroOrMore optionalArg
        |= repeat zeroOrMore arg
        |. wsParser
      )

3. A Grammar

As is well known, neither TeX nor LaTeX have a published grammar.  Indeed, TeX itself is Turing complete, so that anything as tame as a context-free grammar is out of the question.  But perhaps the “outer” part of LaTeX, the part loosely defined as that outside expressions of the form $ … $ and $$ … $$ has a context-free grammar?  This too, is not possible because of the unbounded number of constructs of the form

   \begin{x} ... \end{x}

where x \in \text{Environments}.  If the set of environments were finite, one could rewrite the grammar to be context-free at the expense of creating something cluttered and unpleasant, with one production for every element in the set of environments.  This solution works only if the set of environments is fixed in advance, a limitation that turns out to be undesirable.  Conclusion: we must embrace context-sensitivity in the grammar.

In fact, the situation is not so bad.  There is an almost one-to-one correspondence between productions in the grammar and parsing functions, and most of the productions are of the form P -> b, where P is a nonterminal and b is a sequence of terminals and non-terminals.  These are productions of the context-free type.  There are also a small number of productions of the form Pa -> b where P and b are as before and a is a terminal — a symbol furnishing context.

We will discuss the grammar in a future article.

4. The environment parser

Parser combinators are well-suited to handling context sensitivity.  Let’s see how to use them to parse environments.  The idea is to split the work between a parser envName that recognizes the beginning of an environment and a second parser environmentOfType that recognizing the rest. The first parser is of type Parser.String. If it succeeds on text “\begin{theorem}”, it returns the string “theorem” . The second is a function of type String -> Parser.LatexExpression.  Then environmentOfType "theorem" parses text like “2 + 2 = 4\end{theorem}”, returning

  LatexList ([LXString "2 + 2 = 4"]))

The two parsers can be put together like this

  andThen environmentOfType envName

The result is a parser of type Parser.LatexExpression which succeeds on the text

\begin{theorem}
2 + 2 = 4
\end{theorem}

returning

  Environment "theorem" [ ] (LatexList ([LXString "2 + 2 = 4"]))

To see why this works, consider the type signature

  andThen : (a -> Parser b) -> Parser a -> Parser b

If one sets a = String and b = LatexExpression, one sees that

  andThen environmentOfType envName

is a correctly typed expression.  The actual code used is slightly more complicated (see Note below), but this is the general idea. The main point is that andThen gives a way of running parsers in sequence and using the output of the first parser as an input to the second.  Note that up to permutation of arguments, andThen has the same type signature as does bind, the defining operator of monads.

 Note

Here is the actual code for the environment parser.
  environment : Parser LatexExpression
    environment =
      inContext "environment" <|
        lazy (\_ -> andThen environmentOfType envName)
The use of lazy is needed in the current version of Elm in order for recursion to work properly.  Its signature is
  lazy: (() -> Parser a) -> Parser a

5. On optimizations

MiniLatex breaks the source text up into logical paragraphs, then parses these paragraphs.  A logical paragraph is either an ordinary paragraph or an outer begin-end block. An advantage of this strategy is that the extent of errors is limited, as is the lookahead of the parser.  Thus it is effectively of type LL(N), where N is the maximum number of characters in a paragraph.  Another advantage is that one can implement an editing strategy in which only changed paragraphs are re-parsed and re-rendered.  This leads to a fast-render option that is far, far faster than a full render.  A disadvantage is that the span of grammatical constructions is limited to logical paragraphs.  For this reason a logical paragraph is either a normal paragraph or an outer begin-end block.  We note that parsing is expensive but rendering is quite cheap.

 

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s