An Introduction To Scala Parser Combinators - Part 1: Parser Basics

Thursday, 13.01.2011, 20:30

An introduction into how Scala Parser Combinators work under the hood and how to write your own Parsers & Combinators. Takes a low level approach to present the inner workings of Parser Combinators.

What Is A Parser (In A Functional Sense)?

Let’s approach this bit by bit. We are thinking in FP terms, so a Parser is obviously a function that turns something into something (it parses the input and produces an output). So in sluggish Pseudo-Scala, it is something like this:

def parse(input:Input):Output = ...

However, parsing may fail or succeed, so we need some feedback on that. We want to keep it purely functional, so we are not using exception handling. What we will do is introduce an abstract class that helps us tell a successful parsing from an unsuccessful one:

sealed abstract class Result
case class Success(result:Output) extends Result
case class Failure(message:String) extends Result

Btw.: We have just defined an abstract class and two child classes with working equals, toString, hashCode and getters in three lines. Put that in Java’s pipe and smoke it! But I digress. Now we just refine our parse method a bit:

def parse(input:Input):Result = ...

So now we have defined what a Parser is and have worked out a way to communicate results. Still, that does not seem that useful yet, does it? We still have to do all the work in our one parse method, we need no fancy-schmanzy FP for that, you can do that in Fortran with a big array and a bunch of GOTOs.

Well, we have only covered the “Parser” in “Parser Combinators”. The overall idea is to combine a lot of little, simple Parsers into a powerful complex one. So our Parser probably only parses a small part of the input and leaves something behind for another Parser to continue. That something is again of the same type as the original input and needs to be attached to the result, so we modify our Result class a bit:

sealed abstract class Result
case class Success(result:Output, next:Input) extends Result
case class Failure(message:String, next:Input) extends Result

So when our Parser successfully parses a bit of input, it will return the resulting output and the input that is left after it has consumed everything it accepted. If parsing fails, it returns an error message and the input it was originally given, so another Parser can try and continue where it gave up.

So What Are The “Combinators” In “Parser Combinators”?

When we say that something is done with “Parser Combinators” it is actually done with “Parsers combined with Parser Combinators”. That however sounds like it is straight out of the Department Of Redundancy Department, so we stick with the former. We have just established that “Parsers” are functions that parse input into output together with some error handling and tracking what the current input is.

A Combinator’s job is to combine simple Parsers and turn them into a new, more complex Parser. As we program in a functional style, the result is again a function:

def combine(parserA:(Input => Result), 
            parserB:(Input => Result)):(Input=>Result) = ...

That’s it. That would be the signature of a Combinator combining two Parsers into a new one. Now we have already covered more or less the basics of Parser Combinators. Of course we have ommited the implementations of our functions and still it does not look all that impressive, but all we need to do now is to fill in some details we have skipped over so far and then we can start hacking away at our first Parsers and Combinators.

How It Is Really Done In The Scala API

We slightly simplified our view of Parser Combinators when compared to the real implementation in Scala, so let’s fill in the missing details and see how the Parsers and Combinators are defined in scala.util.parsing.combinator.Parsers


A short intermission for Java people:
It is at this point important to remember how Scala realizes “Functions as first class members”, i.e. how Scala “tricks” the VM to pass functions around as values. If you pass around a function, what you actually pass around is an object that implements an interface with one method: apply (well, actually more than one but this is not important right now). The exact signature of the apply method depends on the signature of the function the object represents, so imagine the following function:

def foo(i:Int):String = "The number is " + i

If you pass it around, you actually pass around an instance that is more or less equivalent to this Java class:

//it extends "Function1" as it is a function with one parameter
//the first type parameter is the argument type, the second the return type
//(Scala has a special type, "Unit" for functions/methods that do not return a value)
public class foo extends Function1<String, Int> {
    public String apply(int i) {
        return "The number is " + i
    }
}

What is more: as we actually use classes to represent our functions, we can do everything we can do with other classes as well. Specifically, we can inherit and add members. Even state. That however can turn nasty, so I do not condone it.

To the Scala Buffs: please forgive me if I have oversimplyfied this process, but in essence that should be correct and this understanding helped me a lot in getting a grasp around Scala in general and Parser Combinators in particular.


In Scala, a Parser is a function with the following signature:

def apply(i:Input):ParseResult[T]

We will come later to how Input is defined, let’s take a closer look at ParseResult[T] first. The type parameter T is the type we want as a result of the parsing process. Obviously, we want to define what we want as a result ourselves, so it needs to be a generic type. Typically when working with Parser Combinators you define a Domain Model that represents your possible result types and build instances during parsing.

ParseResult is slightly more complex than our own Result but not too different (here is a shortened version of the definition):

sealed abstract class ParseResult[+T] { 
    def get: T
    val next: Input
}
case class Success[+T](result: T, override val next: Input) extends ParseResult[T] {
    def get: T = result
}
sealed abstract class NoSuccess(val msg: String, override val next: Input) extends ParseResult[Nothing] { 
    def get: Nothing = error("No result when parsing failed")
}
case class Failure(override val msg: String, override val next: Input) extends NoSuccess(msg, next)
case class Error(override val msg: String, override val next: Input) extends NoSuccess(msg, next)

There are some more goodies attached to these classes in the real implementation, but we will skip them for now. I recommend studying the sources in scala/util/parsing/combinator/Parsers.scala. For now it suffices to note the differences to our Result:

  • The “no success” case is further split up in Failure and Error where Error designates a fatal result, i.e. something so bad we should not be able to recover. Failure is just a “normal” error where the Parser could not parse the input.
  • The get method can be called on any instance to retrieve the parsing result, on NoSuccess instances it will cause an Exception. The get method is used to peel the actual result out of the wrapper. If you deal with the ParseResult instances yourself, it is important to use pattern matching so you only try and get results from the right ones. But we will see how to work with the results later.

The definition of the Parser type consists more or less of this:

abstract class Parser[+T] extends (Input => ParseResult[T]) {
    def apply(in: Input): ParseResult[T] 
        
    //many more interesting functions follow here...
}

Here we see a peculiarity of Scala: A Parser is a function as well as a class. Because it is a class, it is possible to add a lot of other functions to it. Those functions are our Combinators. Each Parser we use or build ourselves is a subclass of this abstract Parser defintion. This is why we can call all the predefined combinator functions on it, many of which have the signature we used as an exemplified combinator above. Of course not every combinator must be a member, if you want to write your own combinators you do not have to make them a member of Parser by using your own subclass, implicit conversion or whatever.

So now we are missing only the Input. To understand how Input works we need to have a look at how Parser Combinators are organized in the Scala API. For now it suffices to look at scala/util/parsing/combinator/Parsers.scala where all necessary types and methods are defined. It basically defines one big trait called Parsers. Within Parsers you find all the types we have discussed so far. Most importantly we find two type definitions:

type Elem
type Input = Reader[Elem]

We see that the definition of Elem is an abstract type. Why this is so and exactly why they do not make Parsers a generic trait like Parsers[Elem] would be too much too discuss here. If you are interested, you should read this article about cows eating fish. What is important for us now is how we have to use the Parsers trait. It basically defines a little Microcosmos in which all the Parser Combinator stuff lives. If you want to write your own Parser (don’t worry, we will get to that shortly), you define your own Parsers trait like this:

trait MyParsers extends Parsers {
    type Elem = MyElem
        
    //add your magic here
}

A very common case would of course be Strings as input, in which case a single Elem would be a char:

trait MyStringParsers extends Parsers {
    type Elem = Char
}

An example of a predifined Parsers microcosmos for Strings is the RegexParsers trait in the scala.util.parsing.combinator package. (If you are parsing Strings, most of the time you will want to extend that trait and not Parsers) To use your own Parser trait you just mix it into the class you want to do the parsing with.

For now we will not delve further into the workings of the Reader[Elem]. We will start with parsing Strings in the next section for which Scala provides the necessary String Readers. We will look at how Readers work bit by bit as we need it.

Writing A Simple Parser By Hand

So, finally, we will start actually building a Parser and simple Combinators. In most Parser Combinator tutorials I have read, people simply use some of the predefined Parsers and Combinators to solve a relatively complex parsing problem in a couple of lines. This is of course what the whole Parser Combinator API in Scala is for, and most of the time you will only need to use the predifined ones, as they really cover 95% of all use cases. You will almost always just combine existing Parsers, not write them from scratch. However I only really understood how Parser Combinators work when I had studied the source and had written some I was missing myself, so let’s build some simple Parsers and Combinators ourselves, from the ground up.

We will be solving a really simple parsing task: accept any input that consists of the chars 'a', 'b', '0' and '1'. As a result we simply want a list of the matched chars. So we start off with our skeleton Parser trait:

trait Ab01Parsers extends Parsers {
    type Elem = Char
}

Now we add a simple Parser to the trait that parses the char 'a':

//though it is a function, we actually declare it like an 
//object, as we need to inherit all the other goodies
//from the base Parser. We return a char wrapped in a 
//ParseResult if successful, so the type is Parser[Char]
val aParser = new Parser[Char] {
    def apply(in:Input):ParseResult[Char] =
        //"first" returns the first (duh!) element the reader
        //provides for us. If there are no more, the result depends on the
        //reader. Preferably some special token indicating the end of input
        //is returned. CharSequenceReaders return \u001a             
        val c = in.first
        //Hooray, we parsed a char. Now we need to provide input
        //for all following Parsers. "rest" does that for us. It 
        //returns everything *but* first
        if (c == 'a') Success(c, in.rest)
        //                            ↑
        //Oh Noez, parsing failed. We provide a useful message and 
        //the input the next Parser should continue with: the same
        //as we got, i.e. the next Parser will just try and pick off
        //where we failed.
        else          Failure("Expected 'a' got '" + c + "'", in)
        //                                                     ↑
}

Phew, a lot of work for a single char. But remember, this was just to understand how it works under the hood. When you use Parser Combinators “properly” you can use all the goodies already provided by the Scala API! (Obviously there is already a Parser to parse a single char...)

Now we could provide the same Parser for 'b', '0' and '1', but that would of course be pretty daft. Instead, we extend our Parser to parse any char we want:

//note that this is now not a single object/function
//but a function that returns a Parser based on the given 
//argument (a function returning a function)
def charParser(expected:Char) = new Parser[Char] {
    def apply(in:Input):ParseResult[Char] = {
        val c = in.first
        if (c == expected) Success(c, in.rest)
        else               Failure("Expected '" + expected + "' got '" + c + "'", in)
    }
}

Now we have a Parser that can parse any letter we want, but how do we build one that parses any one of the four chars we allow? We could of course write a new Parser that does that, but let’s try and solve that by combining four basic Parsers, one for each char. So we need some Parser that first tries one Parser, if that fails, the other. So let’s write a function that turns two of our char Parsers into one new Parser:

def or(p1:Parser[Char], p2:Parser[Char]):Parser[Char] = Parser { in =>
    //try the first Parser...
    p1(in) match {
        //if it succeeded, we are done!
        case s:Success => s
        //if not, we use the result of the second Parser
        case _         => p2(in)
    }
}

Now we can build a Parser that parses one of our desired chars like this:

val ab01NotNice = or(charParser('a'), or(charParser('b'), or(charParser('0'), charParser('1'))))

Well, this works, but that does not look particularly nice, does it? We would want a nice infix notation here. So our or method needs to be a member of Parser. In order to do that, we have to override the Parser class of the parent Parsers trait we inherit from so we can add the methods we want. If we do that, we get the following:

abstract class Parser[T] extends super.Parser[T] {
    def or(right:Parser[T]):Parser[T] = {
        val left = this
        new Parser[T] {
            def apply(in:Input) =
                left(in) match {
                    case s:Success[T] => s
                    case _            => right(in)
            }
        }
    }
}

Now, by redeclaring Parser in our trait, all the previously defined Parsers are instances of our Parser class and therefore have the or method. So now we can write the or method like this:

def ab01 = charParser('a') or charParser ('b') or charParser('0') or charParser ('1')

Now that is nice and readable. We can now parse one of our chars, but we wanted to parse a whole string of them! So we need a way to say “apply this Parser again and again as often as possible”. So we need another Combinator that turns a Parser into our repeated Parser:

def repeat[T](p:Parser[T]) = new Parser[List[T]] {
    def apply(in:Input):Success[List[T]] = {
        //apply the given Parser
        p (in) match {
            //if that succeeded, recurse and prepend the result
            case Success(t, next) => val s = apply(next)
                                     Success(t::s.get, s.next)
            //if it failed, end recursion and return the empty list
            case _                => Success(Nil, in)
        }
    }
}

Finally, we can build the Parser that parses any string consisting of 'a', 'b', '0' and '1':

val myParser = repeat(ab01)

To run the Parser, we can do the following:

def run(s:String):Option[List[Char]] = {
    //wrap our input into a reader
    val input = new CharSequenceReader(s)
    //run the Parser on the input
    myParser(input) match {
        //if all input has been successfully consumed, return result
        //this checks if all input is gone   ↓
        case Success(list, next) if (next.atEnd) => Some(list)
        //either an error or there is still input left
        case _                                   => None
    }
}

This will return the resulting list only if all the input was consumed. Of course this will swallow all error messages we got, so this is only proof-of-concept. You can easily modify this to return what you like, e.g. Either[String, List[Char]] for either an error message or the result.

Doing It Properly Using The Already Prepared Parsers

The previous example was of course overly complicated, but it helps understand how Parser Combinators actually work under the hood. Most tutorials try to impress people too much at first with a lot of high-level fancy parsing that is very slick, but does not really teach how it actually works. Also I found out that being able to write your own Parsers manually is really, really useful sometimes, but more of that in another article. It is also important to know what actually goes on under the hood when your Parser runs, as debugging then becomes a lot easier.

But, without further ado, here the “proper” way to implement the above Parser in Scala:

trait Ab01ProperParsers extends Parsers {
    type Elem = Char
    //"elem" parses exactly one element of the defined "Elem" type.
    //In our case this is a Char, just like our manual Parser.
    //The "|" Combinator is the same as our "or" Combinator
    //so this reads "element 'a' or element 'b' or ..."
    val ab01:Parser[Char] = elem('a') | elem('b') | elem('0') | elem('1')
    //The postfix "*" Combinator repeats the Parser it as applied to as often as possible.
    //The asterisk was chosen as this works like the Kleene Star
    val myParser:Parser[List[Char]] = ab01*
}

(Of course you could just use RegexParsers, but for the sake of this example we pretend we do not know about the Regular Expression Parsers)

So far what we have achieved to parse has not been too impressive, we could have done this with some loops or a RegExp. In the next article we will tackle something a bit more mature, a complete Parser for Java expressions made up of primitive standard Parsers. This will (hopefully) show you how useful end easy to use Parser combinators are!