An Introduction To Scala Parser Combinators - Part 2: Parsing Literal Expressions

Jan 28, 2011 17:21 · 2774 words · 14 minutes read Scala Parser Combinators

In the previous post we covered the low-level basics of parser combinators. This time we will look at a more mature example that will hopefully illustrate the benefits of parser combinators and will do something useful. It will be a more extensive example of an expression parser, the classical example for parsers combinators. Unlike most other examples however, we will not simply evaluate expressions, but turn them into a simple syntax tree that can be used for more than just evaluating the expressions. The resulting parser & syntax tree will be surprisingly small, but as we will cover the process in depth I decided to split up the description into a series of posts. There are some points that are quite easy once you have wrapped your head around them, but when I started working with parsers I was disappointed in the lack of depth many tutorials offered. So this may be a bit TL;DR, yet I hope that for those that are really trying to get started with combinator parsing this will provide enough examples to get started with really challenging parsing tasks.

The Task

The problem we want to solve is the parsing of Java-like expressions. We want to allow expressions with integers, floats, booleans, strings, method calls and unary, binary and ternary operators.

I used a parser just like this in my l-system evaluator, as l-system symbols may be parameterized with expressions, l-system rules may be evaluated based on a boolean expression which in turn may use the results of the evaluation of the expressions attached to the l-system symbols on the left side of the rule. All in all, there is a lot of evaluation and re-evaluation going on, so we must be able to re-evaluate the expressions that are the result of our parsing run. The simple pocket calculator behavior that just outputs a number used in most parser examples will not be enough. To give an example, we want to be able to parse an expression like the following:

!(x*0.005 + max(sin(y), 1) >= 2)

After parsing, we want a result with which we can evaluate the expression with any values for x and y.

Our primary specification of our parser will simply be the Java Language specification chapter on expressions which we more or less just copy into our parser. This will show one of the major advantages of parser combinators: Specifications in BNF can be almost copied verbatim into our code (with a little care taken in respect to infinite recursions).

This tutorial will be split up in multiple parts, where each part expands the parsing syntax a bit until we finally arrive at the full fledged expression parser. As a little teaser for reading on: The final parser will parse all Java expressions, including all operators like +,-,*,/,>,>=,<=,== etc. pp., will allow String, Int, Float, Boolean literals and will be only 24 (twenty-four) lines long! The overall syntax tree (i.e. the class hierarchy modelling the result and taking care of the evaluation) will be only about 80-100 lines. How cool is that?

We will however make some simplifications:

we will only support numbers, booleans and strings, no objects, data structures, arrays etc.
we will allow integer & floating point literals, but internally we will only use Double to represent numbers (like JavaScript)
there will be no type checking, each literal value will be able to be converted into any other value, specifically:
- numbers → strings: We simply use the string representation of the number
- numbers → booleans: if the number is 0, it is false, everything else is true
- booleans → strings: true → "true", false → "false"
- booleans → numbers: true → 1.0, false → 0.0
- strings → numbers/booleans: attempt to parse string as target type literal, if it fails, we bail out with a runtime exception.
there will be no way to define functions, only functions out of a predefined set may be called
there will be no function overloading, just one function per name

These limitations prevent this tutorial from blowing out of proportion, otherwise I would need to write a book “Writing a Java Compiler with Parser Combinators” :) Also, some type checking limitations stem from the use in an L-System Generator, where dynamic typing seemed more fitting.

The Domain Model (Part 1)

Now that we know more or less what the input will be like and what we want to do with the result, let’s build a domain model that we want to use for evaluation and as output for our parser. I prefer to do this first, as it makes developing the parser easier. If you do not have specified your result properly before starting to build the parser (we could, for example, just output strings in the first version), there is a lot of refactoring necessary when switching to the final output. Also, we want to use unit tests to make sure our parser does what we expect it to do, and changing the parser result later would mean fiddling around with all tests again, which wastes a lot of time.

We start with a small subset of our final domain model for the first parsing attempts. We will add to our domain model as the need arises to express more complex expressions.

So, everything we parse will be an expression. An expression should be able to be turned into a result once we know the values for all variables (in my L-Generator use case, the variable bindings change constantly, so I needed to re-evaluate expressions constantly). This is what I came up with:

sealed trait Expression {
    def eval(env:Map[String,Literal]): Literal
}

It is a completely abstract trait (for Java People: exactly the same as a Java interface). It is sealed which means that it may only be implemented by classes in the same source file as in which it is declared. As environment we just use a map which maps variable names to Literals. A literal can be asked for a value without any further recursion or evaluation of variable names, so we choose at as result for our evaluation and as the target for the variable mapping.

A short intermission for Java People:

As in Scala classes/interfaces/objects/traits/whatever tend to be much shorter as in Java, and as one usually splits one’s code up into more smaller, better reusable parts as in Java, the limitation that only one type is defined per source file was dropped in Scala. So it is like the old C-days, you can put as much into a source file as you want. This is both a blessing and a curse. If properly applied (as I hope it is in this example) it prevents unnecessary short source files with only one or two lines (typical size for many Scala classes). On the other hand one can produce a truly nasty mess that way, so please use this feature with care and consideration.

This different handling of source files gives however rise to a very useful modifier, sealed. Unlike the Java final, sealed allows a class/trait/object to be implemented/subclassed but only in the same file as it is declared itself. This is useful as it is clear in the Rest of the API what subtypes may occur. Together with Pattern Matching this is a powerful tool to prevent runtime errors, as the compiler checks for sealed classes in pattern matches if all alternatives have been considered.

sealed trait Marker
case class Foo(i:Int) extends Marker
case class Bar(s:String) extends Marker
case class Baz(d:Double) extends Marker

def test(m:Marker) = m match {
    case Foo(i) => ...
    case Bar(s) => ...
    //Oops, forgot to check for Baz, this will cause a compiler warning.
    //This of course only works because Marker is sealed so the compiler
    //knows there can be no subclass coming from some other code at runtime 
    //it currently (at compile time) does not know about.
}

Now we need our Literals. A Literal can be used everywhere an expression can be used, it can also be evaluated with a variable mapping (though of course it will ignore it). So our Literal type looks like this:

sealed trait Literal extends Expression {
    def eval(env:Map[String, Literal]) = this
    def doubleValue:Double
    def boolValue:Boolean
    def stringValue:String
}

“Evaluating” a literal just means returning itself, so we define it here for all literals directly in the trait. As we dynamically want to turn a number into a string, a boolean into a number, etc, we provide method signatures for all Literals for this. Now we just need to implement our three literal types:

case class NumberLiteral(literal:Double) extends Literal{
    def doubleValue = literal
    def boolValue   = literal != 0.0
    def stringValue = literal.toString
    override def toString  = literal.toString
}

case class BooleanLiteral(literal:Boolean) extends Literal{
    def doubleValue = if (literal) 1.0 else 0.0
    def boolValue   = literal
    def stringValue = literal.toString
    override def toString  = literal.toString
}

case class StringLiteral(s:String) extends Literal{
    val literal = s.substring(1,s.length-1)//TODO: apply missing escapes
    def doubleValue = literal.toDouble
    def boolValue   = if (literal.toLowerCase == "false") false else true
    def stringValue = literal
    override def toString  = s
}

The implementations are pretty straightforward, there are just some points that are worth discussing:

we override toString so we get a string representation that can be parsed again.
for strings we want the string literal as it is parsed from the source as constructor argument, i.e. we want the leading and trailing " preserved. We omit string escape rules for now, i.e. we do not properly “unescape” \\n, \, etc. (This is left as an excercise to the reader)

Variables are really easy to represent:

case class Variable(name:String) extends Expression {
    def eval(env:Map[String,Literal]) = env(name)
    override def toString = name
}

Evaluating a variable will simply mean a lookup in the environment. Note that we do not provide any proper error handling. We are focusing on parsing here (don’t worry, we’ll get there soon), so if a variable is used but is not present in the scope, evaluation will simply fail at runtime.

This will be it for now, today we will limit ourselves to parsing & evaluating literals and variables.

Finally: The first parsers

Having defined our desired result, we can now write a parser with surprising ease. We will base this parser not on the BaseParsers trait as in the previous post, but on the JavaTokenParsers trait which will allow us to take even more shortcuts and keep our parser nice & simple.

The JavaTokenParsers are in turn based on RegexParsers, so our input type is already chosen for us: we parse Char Sequences, i.e. Strings. Also, the RegexParsers ancestor performs a highly useful task for us: it skips whitespace for us. We will explain what this means more precisely when we build the first more complex parsers, to oversimplify: in languages like Java which do not really care about whitespace, we can write our parsers more or less like all whitespace was implicitly parsed & ignored between our tokens for us.

Parsing Literals & Variables

We will start off with little parsers for literals and then combine them bit by bit to parse complex expressions. So let’s warm up with the first simple parser: boolean literals!

def boolean:Parser[Expression] = ("true" | "false")  ^^ {s => new BooleanLiteral(s.toBoolean)}

Let’s translate this from left to right:

We define a method named boolean that returns a Parser that in produces Expression instances as parse result. The scala compiler would of course infer the result type for us here, but while I make liberal use of type inference in my other code, I can only strongly advise you to use explicit return type for parsers, as it is easy to accidentaly end up with the wrong result type when combining more complex parsers.
The parser is either parses the literal string "true" (strings are implicitly converted to parsers that parse that literal string) or (| means or) the parses the literal string "false".
If parsing was successful, the result (a string) is plugged into a function (^^ takes the results from the parser to its left and plugs it into the function to its right)
The function takes the string and turns it into a boolean literal. toBoolean is a method of scala.collection.immutable.StringOps, to which the passed string is implicitly converted. (This implicit conversion is present by default in all Scala source because it is defined in the scala.Predef object)

The hardest part to wrap your head around here is the ^^ operator. It takes a Parser[S], a method S => T and produces a Parser[T] from it. It is like plugging an adapter to the first parser. It does not change what is parsed, but what the result of the parsing process is. (It is, by the way, the binding operation of the parsing monad or the third part of the Kleisli triple known to Haskell people as >>= (necessary to know if you want to show off in front of other geeks))

So if we left out the ^^... part, we would have a Parser[String]:

def booleanString:Parser[String] = ("true" | "false")

The parser resulting from applying ^^ works like this:

when given input, it parses the input with the parser to its left.
if that parsing fails, it just returns the error produced by the left parser
if parsing succeeds, it takes the Success(x) produced by the left parser, takes out the x, and calls the method to its right with x as argument.
then it returns the result of the method wrapped in a new Success

There is also a useful alternative to ^^ which does not care about the result of the parsing and always returns the same result, called ^^^. We could for example write a parser that only parses true with it:

def booleanTrue:Parser[Expression] = "true" ^^^ {new BooleanLiteral(true)}

Note how we do not have a function but a value to the right of ^^^.

The parsers for the other literals are even simpler. As we inherit from JavaTokenParsers there are already parsers provided for us for the other literals. They in turn use regular expressions to match the input and return the part of the input they matched. If you want to use regular expressions for your own input, I recommend having a look at the source of JavaTokenParsers. It is extremely simple however, you just define a regular expression parser with a regular expression instance like you define a string literal parser with a string literal. In the scope of RegexParsers and its decendants, regular expressions are implicitly converted into parsers that accept anything matching the given regular expression. We will however use the short cut and use the parsers provided for us:

def string  :Parser[Expression] = super.stringLiteral ^^ {
    s => new StringLiteral(s)}
def double  :Parser[Expression] = (decimalNumber | floatingPointNumber) ^^ {
    s => new NumberLiteral(s.toDouble)}
def int     :Parser[Expression] = wholeNumber ^^ {
    s => new NumberLiteral(s.toInt)}

def literal :Parser[Expression] = boolean | string | double | int

def variable:Parser[Expression] = ident ^^ {
    s => new Variable(s)}

The ident parser parses any valid Java identifier and is a good example of a regex parser. It is in turn defined like this:

def ident: Parser[String] = """[a-zA-Z_]\\w*""".r

The .r method turns a string into a regular expression. This in turn is converted into a parser parsing strings and returning the matched string.

Another Intermission for Java People: Unlike Java, everything in Scala is an Object. Also the “primitives” like ints, doubles. To be consistent with the naming of the other objects, they use uppercase identifiers, so it is Int not int. Therefore, int, double etc. are not reserved words in Scala anymore, so we give the method that parses a Java int the name int. (Some earlier Scala Versions still allowed int instead of Int but gave a deprecated warning. Since Scala 2.8 or so this has been dropped and int can be used like any other identifier as a method name)

To get our first working parser to parse expressions consisting of a single literal or variable, we write the main parser that parses our total expression:

def expression:Parser[Expression] = literal | variable

What Next?

Now that we have our first little parsers, we of course want to make sure they are working as we expect it. In the next post, I will give some simple examples of how to write unit tests for parsers. Parser Development lends itself wonderfully to TDD. I omitted testing in this post however, as it is already quite long and I wanted to focus on parsing essentials first. After that we will of course move on and implement operators and method calls. Stay tuned!