Thursday, June 16, 2011

Parcon: a new parser combinator library

Parcon is a Python parser combinator library I'm working on. I've released it on PyPI here.

(I've also released Pargen, a formatter combinator library, as a submodule of Parcon, but I'll write a separate blog post on Pargen later.)

I wrote Parcon to improve on some things that I think Pyparsing does wrong. One of those is what I consider to be Pyparsing's lack of useful error messages. For example, consider a grammar that parses an open parenthesis, any number of "a" or "b", and a close parenthesis. In Pyparsing, it looks like this:
from pyparsing import Literal, ZeroOrMore
expr = "(" + ZeroOrMore(Literal("a") | "b") + ")"

Simple enough. If you call expr.parseString("(abbab)"), it returns just fine. If, however, you call expr.parseString("(a"), you get an exception with a message something like this:

Expected ")" (at char 2), (line:1, col:3)

This message omits information: "a", "b", or ")" would all be valid characters here, but only ")" is shown. The corresponding Parcon grammar:
from parcon import SignificantLiteral, ZeroOrMore
expr = "(" + ZeroOrMore(SignificantLiteral("a") | SignificantLiteral("b")) + ")"

provides a more informative error message when expr.parseString("(a") is called:

At position 2: expected one of "a", "b", ")"

This includes all possible options, not just the last one.

This shortcoming of Pyparsing becomes more obvious when parsing grammars that consist of a number of alternatives, each of which starts with a particular string. Pyparsing will only provide the last such expected string, while Parcon will provide all of them.
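
To make the idea concrete, here is a minimal, self-contained sketch of how a combinator library could gather every expectation at the furthest failure position. This is not Parcon's actual implementation, and it does not use Parcon or Pyparsing at all; the helper names (literal, first, then, zero_or_more, parse_string) are hypothetical and exist only for this illustration.

```python
class ParseError(Exception):
    pass


def literal(text):
    """Match `text` exactly; on failure, record what was expected and where."""
    def parse(s, pos):
        if s.startswith(text, pos):
            return True, pos + len(text), []
        return False, pos, [(pos, '"%s"' % text)]
    return parse


def first(*parsers):
    """Try each alternative in order, keeping the expectations of those that failed."""
    def parse(s, pos):
        expected = []
        for p in parsers:
            ok, end, exp = p(s, pos)
            expected += exp
            if ok:
                return True, end, expected
        return False, pos, expected
    return parse


def then(*parsers):
    """Run parsers in sequence, carrying expectations forward."""
    def parse(s, pos):
        expected = []
        for p in parsers:
            ok, pos, exp = p(s, pos)
            expected += exp
            if not ok:
                return False, pos, expected
        return True, pos, expected
    return parse


def zero_or_more(parser):
    """Repeat `parser` until it fails; that final failure is still remembered."""
    def parse(s, pos):
        expected = []
        while True:
            ok, end, exp = parser(s, pos)
            expected += exp
            if not ok or end == pos:
                return True, pos, expected
            pos = end
    return parse


def parse_string(parser, s):
    ok, end, expected = parser(s, 0)
    if ok and end < len(s):
        ok = False
        expected.append((end, "end of input"))
    if ok:
        return end
    # Report every distinct expectation recorded at the furthest position.
    furthest = max(p for p, _ in expected)
    names = []
    for p, name in expected:
        if p == furthest and name not in names:
            names.append(name)
    raise ParseError("At position %d: expected one of %s"
                     % (furthest, ", ".join(names)))


expr = then(literal("("), zero_or_more(first(literal("a"), literal("b"))), literal(")"))
```

With this sketch, parse_string(expr, "(a") raises a ParseError that lists "a", "b", and ")" together, because the failed attempts inside the repetition are kept rather than thrown away.
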

Four other shortcomings in Pyparsing that Parcon improves on:

  • In Pyparsing, parsers are mutable: parse actions can be added to them and so on. This makes it hard to reuse parsers reliably: one piece of code might add a parse action to a parser without the other users of that parser realizing it. Pyparsing provides a copy function to get around this, but this requires using copy on any parser that might possibly be reused, which is especially tedious in libraries consisting simply of sets of predefined parsers.

    Parcon obviates this by making parsers immutable, with the sole exception of Forward. Parse actions, in particular, are created using the Transform parser, which is constructed as Transform(parser, function); it passes the result of the specified parser through the specified function, returning the result of that function. parser[function] is shorthand for this, so parser[function] is the rough equivalent of pyparsing_parser.addParseAction(function), except that the original parser isn't modified by this in any way.
  • Pyparsing's Literal, by default, does not suppress itself. From my experience writing parsers, suppressed literals are quite a bit more common than significant literals. Parcon's Literal is suppressed by default; SignificantLiteral is Parcon's non-suppressed alternative.
  • Pyparsing can automatically strip out whitespace within a grammar. This, however, doesn't cover cases where comments and the like also need to be removed automatically. Parcon allows a whitespace parser to be specified when calling parseString; this parser is applied between every other parser in the grammar, and its results are discarded. (If no whitespace parser is specified, it defaults to Whitespace(), a Parcon parser that matches carriage returns, newlines, spaces, and tabs.)

    Of course, this could end up removing, for example, spaces inside string literals parsed by a Parcon grammar. Parcon provides a parser called Exact to prevent this: Exact(parser) is a parser that acts exactly like the parser it's created with, except that it sets the whitespace parser to Invalid() (a parser that never matches anything) while that parser runs.
  • Pyparsing does not provide any sort of monadic Bind parser, which would be needed to parse, for example, a binary protocol packet consisting of a certain number of bytes representing the length of the packet, followed by that many bytes consisting of the packet's data. (Yes, Parcon can parse binary data just as well as it can parse textual data.) Parcon provides both Bind and Return parsers, which, together, make Parcon a monadic parser combinator library. This opens up numerous possibilities for grammars that can be written using Parcon.
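
The length-prefixed packet case is a good way to see why Bind matters, so here is a small Python 3 sketch of the idea. It is a toy written from scratch, not Parcon's implementation or API: the functions ret, take, bind, and transform are hypothetical names, and parsers here are plain functions from (data, position) to either None or a (new position, value) pair. It also shows a Transform-style wrapper built from bind and ret that returns a new parser instead of modifying the one it wraps.

```python
def ret(value):
    """Return: consume no input and succeed with `value`."""
    def parse(data, pos):
        return pos, value
    return parse


def take(n):
    """Consume exactly n bytes; fail (None) if fewer than n remain."""
    def parse(data, pos):
        if pos + n > len(data):
            return None
        return pos + n, data[pos:pos + n]
    return parse


def bind(parser, make_next):
    """Bind: run `parser`, then build the next parser from its result."""
    def parse(data, pos):
        result = parser(data, pos)
        if result is None:
            return None
        next_pos, value = result
        return make_next(value)(data, next_pos)
    return parse


def transform(parser, function):
    """Transform expressed via bind and ret; `parser` itself is left untouched."""
    return bind(parser, lambda value: ret(function(value)))


# One length byte, followed by that many bytes of payload.
length = transform(take(1), lambda b: b[0])  # Python 3: indexing bytes gives an int
packet = bind(length, take)

print(packet(b"\x03abcXYZ", 0))  # (4, b'abc')
```

The key point is that the second parser (take(n)) cannot be written down until the first one has produced n, which is exactly what a fixed sequence of parsers cannot express.
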
If these features sound cool to you, open a terminal, type pip install parcon, and give it a whirl! Documentation and examples are provided here. Enjoy!
