Wednesday, July 13, 2011

Generating syntax diagrams from Parcon parsers (and from regular expressions)

Parcon 0.1.23 is out, and it has what I think is the coolest feature of Parcon yet: syntax diagram generation. Parcon can now automatically generate syntax diagrams for any Parcon grammar, and from regular expressions as well.

Let's take a look at the expression evaluator example provided on Parcon's PyPI page:
from parcon import number, Forward, InfixExpr
import operator
expr = Forward()
term = number[float] | "(" + expr + ")"
term = InfixExpr(term, [("*", operator.mul), ("/", operator.truediv)])
term = InfixExpr(term, [("+", operator.add), ("-", operator.sub)])
expr << term(name="expr")

To generate a syntax diagram for this expression evaluator, all we have to do is this:
expr.draw_productions_to_png({}, "syntax-expr.png")

Which produces the following image: (click on the image to view a larger version)

At this point, you might be wondering how the syntax diagram generator is able to work out what names to give all of the productions in the resulting syntax diagram. It can't tell what Python variables hold each parser; Python doesn't provide any means for figuring that out. So how does it do it?

The answer is that little bit at the end of the last line of the expression evaluator: (name="expr"). parser(name="example") is short for Name("example", parser). Name is a parser that behaves exactly like the parser it's constructed with, but it tags, so to speak, that parser with a name to be used when generating syntax diagrams. The number parser (which we imported from Parcon) is named by Parcon itself, hence number and digit appearing as productions.
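The mechanism is easy to picture with a toy example. The classes below are stand-ins of my own, not Parcon's actual implementation: they just illustrate how a Name-style wrapper can delegate parsing untouched while carrying metadata for a diagram generator to read.

```python
class Parser:
    """A minimal stand-in parser that matches a fixed literal."""
    def __init__(self, literal):
        self.literal = literal
    def parse(self, text):
        if text.startswith(self.literal):
            return self.literal
        raise ValueError("no match")

class Name:
    """Tags a parser with a name without changing its behavior."""
    def __init__(self, name, parser):
        self.name = name        # used only when drawing diagrams
        self.parser = parser
    def parse(self, text):
        return self.parser.parse(text)  # delegate entirely

named = Name("greeting", Parser("hello"))
print(named.parse("hello world"))  # prints: hello
print(named.name)                  # prints: greeting
```

The parse behavior is identical with or without the wrapper; only the metadata differs, which is exactly what a diagram generator needs.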

So far, so good. But what if we wanted part of our grammar to be written as a regular expression? Regular expressions do have their advantages. Parcon supports this via the Regex parser, but you might think that the resulting syntax diagram would simply include a box containing the regular expression itself.

This, however, is not what happens. Parcon is smart enough that, for most regular expressions, it can actually parse through the regular expression and create a syntax diagram for it. For example:
r = Regex("hello (world|james|alex)\\.( How are you\\?)?")(name="example regex")
r.draw_productions_to_png({}, "test8.png")

The resulting image, believe it or not, is this:

Parcon can decompose most regular expressions in this way. There are a few that it can't; regular expressions that use lookahead are one example. Such regular expressions are simply included as a box containing the entire regex.

Now, of course, comes the "gotcha" in syntax diagram generation: not all Parcon parsers can be converted to syntax diagrams. Some, like Except, don't really have any sensible way in which they could be drawn as a syntax diagram. Some, like Bind, don't have enough information to actually generate a syntax diagram. Only parsers that subclass parcon.railroad.Railroadable (or one of its subclasses, such as parcon._GRParser) can be converted to syntax diagrams.

In fact, if you try to create a syntax diagram from a Parcon grammar that uses any of these parsers, you'll get an exception. But don't panic and start thinking that you'll have to avoid these parsers altogether; this is where Description comes in.

Description is similar to Name, both in use and in functionality; parser(description="example") and parser(desc="example") are both shortcuts for Description("example", parser). However, there's one major difference between Name and Description: the contents of Name instances are themselves converted to syntax diagrams and included in the resulting .png file, whereas the contents of Description instances are not. Thus:
person = Literal("world")(name="person")
greeting = (Literal("hello") + person)(name="greeting")
greeting.draw_productions_to_png({}, "greeting-name.png")

results in:

person = Literal("world")(description="person")
greeting = (Literal("hello") + person)(name="greeting")
greeting.draw_productions_to_png({}, "greeting-description.png")

results in:

This can be used to allow parsers that can't be converted to syntax diagrams to take part in your grammar: as long as they're wrapped in a Description, you can use them just fine, since Parcon won't ever try to convert them into a syntax diagram.

That's it for today, folks. The syntax diagram generator can actually be used by itself, independently of Parcon, to generate arbitrary syntax diagrams, but I'll save that for another blog post. I'll also save an explanation of the empty dict I keep passing to draw_productions_to_png for another post.

Monday, July 4, 2011

Parcon website back up, and railroad diagrams

Parcon's website is back up. I apologize to everyone that was affected by the downtime. It turned out that the circuit breaker behind Parcon's website's server had tripped and, as a result, the server had gone down. Everything should be back to normal now.

In other news, I'm currently working on adding support to Parcon for generation of railroad diagrams. While graphs (see two blog posts ago) are geared more toward developers writing various Python things that use Parcon, railroad diagrams will be geared more toward those creating/writing data to be parsed by a specific set of Parcon parsers.

I'll write another post as soon as I've got railroad diagram generation working.

Sunday, July 3, 2011

Webserver dead

Sadly, it seems that Alex's webserver isn't currently running.
I've been told that he should be able to fix it tomorrow, so hopefully it will be back up by then.

Alex will (or at least should) update this after it's back up.

Edit: The webserver is back up.

Thursday, June 30, 2011

Visualizing parsers

Parcon 0.1.22 is out, with a really cool new feature: parser visualization. Parcon now includes the ability to generate graphs from parsers that describe what those parsers do.

There are still some kinks to be worked out, and I'll save a detailed blog post on Parcon graphing for when those are fixed, but the basic thing works. For example, the expression evaluator Parcon example (which is present on the PyPI page for Parcon), when converted to a graph via this:

results in this (click on the image to view a larger version):

In order for graphing to work, you have to have Graphviz installed, and specifically you need to have the dot program installed. You do not, however, need this installed to use Parcon, and in fact, you don't even need it to call graph(); you just need it when you actually call draw().

You'll probably notice some of the kinks. Two of the major ones are that InfixExpr doesn't display any representation of the functions it was passed, and the child parsers for Then and First are displayed out of order. Once I get those, and a few others, solved, I'll write a more detailed post on visualizing parsers with Parcon.

But you've got to admit, even as it is it's pretty cool.

Sunday, June 26, 2011

Parsing into a dict

If you've ever used Pyparsing before, you probably know that you can "tag" certain parts of a Pyparsing grammar, and the results will provide a dictionary mapping those tags to the values that their corresponding grammar components parsed to:
>>> from pyparsing import *
>>> parser = Word(alphas).setResultsName("first") + Word(alphas).setResultsName("second")
>>> results = parser.parseString("one two")
>>> results["first"]
'one'
>>> results["second"]
'two'

So, how does one go about doing this in Parcon? The answer is not as obvious as it is with Pyparsing, since Parcon parsers result in a single value, not a composite of tokens and a dictionary in the form of Pyparsing's ParseResults. This doesn't mean that tagging's impossible, however, or even difficult. It's actually relatively simple:
>>> from parcon import *
>>> parser = (alpha_word["first"] + alpha_word["second"])[dict]
>>> results = parser.parse_string("one two")
>>> results["first"]
'one'
>>> results["second"]
'two'
>>> results
{'second': 'two', 'first': 'one'}
>>> type(results)
<type 'dict'>

What's actually going on here?

The first thing that you need to know is that parser["tag"] is short for Tag("tag", parser). This only works when "tag" is a string (unicode strings also work); Tag must be used explicitly for other types of values.

What Tag does is wrap the result of whatever parser it's passed in a Pair, with the key set to "tag" (or whatever value was specified as the tag) and the value set to whatever the underlying parser resulted in.

Pair is a subclass of tuple (it's actually created by collections.namedtuple). Parcon's concatenation of tuples when using + does not apply to namedtuples or any value that simply subclasses tuple, so instances of Pair will be preserved even when using + to string things together.

From this, we can tell that the parser:
alpha_word["first"] + alpha_word["second"]

will result in (Pair('first', 'one'), Pair('second', 'two')) when handed "one two" as input. So how do we get this into a dictionary?

Simple. You'll notice that in the example near the top of this post, the word-parsing snippet was wrapped in parentheses and then transformed with [dict] (which, as you probably know, is short for Translate). Here's the magical property: Pair, though it's a class of its own, subclasses tuple, which means that dict will recognize a tuple of Pairs as a tuple of two-element tuples and convert it into a dictionary. Bingo. We simply pass the result through a Translate with dict as the transformation function, and our result becomes:
{'second': 'two', 'first': 'one'}

which is exactly what we want.
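Because Pair is just a namedtuple, the dict trick can be demonstrated with nothing but the standard library (Python 3 here; the Pair defined below is a stand-in for Parcon's):

```python
from collections import namedtuple

# A stand-in for Parcon's Pair: a namedtuple, hence a tuple subclass.
Pair = namedtuple("Pair", ["key", "value"])

result = (Pair("first", "one"), Pair("second", "two"))
# dict() accepts any iterable of two-element sequences, Pairs included:
print(dict(result))  # prints: {'first': 'one', 'second': 'two'}
```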

Of course, if you have a bunch of nested concatenations and list-producing parsers (such as ZeroOrMore), you'll probably want to change [dict] to [flatten][dict] to flatten them all out. flatten treats Pairs (and any other subclass of tuple) as individual objects and so does not flatten them out, so this works as expected.
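Here's a rough sketch of that flattening behavior, consistent with the description above but not Parcon's actual implementation: lists and plain tuples are flattened recursively, while strict subclasses of tuple such as Pair are kept whole.

```python
from collections import namedtuple

Pair = namedtuple("Pair", ["key", "value"])

def flatten(value):
    # Recurse into lists and plain tuples; anything else, including
    # strict subclasses of tuple such as Pair, is kept as one object.
    if type(value) in (list, tuple):
        out = []
        for item in value:
            out.extend(flatten(item))
        return out
    return [value]

nested = [Pair("a", 1), [Pair("b", 2), [Pair("c", 3)]]]
print(flatten(nested))  # a flat list of three intact Pairs
```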

Thursday, June 16, 2011

Pyparsing examples in Parcon

A lot of people at #python have encouraged me to port some of Pyparsing's examples to Parcon, so I've decided to write up a blog post with some of those examples. Some of them parse the same input but produce output that's slightly different from what the corresponding Pyparsing example produces; such differences are noted accordingly.

Let's start with Pyparsing's hello, world example first:
from parcon import alpha_word
greet = alpha_word + "," + alpha_word + "!"
hello = "Hello, World!"
print hello, "->", greet.parse_string(hello)

This prints:
Hello, World! -> ('Hello', 'World')

The main difference between the Pyparsing version and the Parcon version is that the result doesn't contain the "," and the "!". In my experience, the output of literals in the grammar is typically ignored, so Parcon by default discards literal values. If the value of a literal is important, SignificantLiteral("...") should be used instead; changing "," to SignificantLiteral(",") and "!" to SignificantLiteral("!") would have made the Parcon example follow Pyparsing's behavior.

Now let's try another one. This one is Chemical Formulas, the second example on this page. It doesn't include the atomic weight calculator at present, but I'll add this in at some point. Here's the parser:
from parcon import *
element = Word(lower_chars, init_chars=upper_chars)
integer = Word(digit_chars)[int]
element_ref = element + Optional(integer, 1)
chemical_formula = +element_ref
# Examples:
print chemical_formula.parse_string("H2O") # [('H', 2), ('O', 1)]
print chemical_formula.parse_string("H2SO4") # [('H', 2), ('S', 1), ('O', 4)]
print chemical_formula.parse_string("NaCl") # [('Na', 1), ('Cl', 1)]
print chemical_formula.parse_string("Au") # [('Au', 1)]

The last example I'm providing in this post is wordsToNum, the fourth example on the same page that chemicalFormula appeared on. It takes a human-readable number specification like "twelve thousand three hundred forty five" and parses it into the representative number (12345, in that case).
from operator import mul
from functools import partial
# This part is copied from the Pyparsing version of wordsToNum, with
# some newlines removed
unitDefinitions = [
    ("zero",       0), ("oh",         0), ("zip",        0), ("zilch",      0),
    ("nada",       0), ("bupkis",     0), ("one",        1), ("two",        2),
    ("three",      3), ("four",       4), ("five",       5), ("six",        6),
    ("seven",      7), ("eight",      8), ("nine",       9), ("ten",       10),
    ("eleven",    11), ("twelve",    12), ("thirteen",  13), ("fourteen",  14),
    ("fifteen",   15), ("sixteen",   16), ("seventeen", 17), ("eighteen",  18),
    ("nineteen",  19)]
tensDefinitions = [
    ("twenty",  20), ("thirty",  30), ("forty",   40),
    ("fourty",  40), # for the spelling-challenged...
    ("fifty",   50), ("sixty",   60), ("seventy", 70), ("eighty",  80),
    ("ninety",  90)]
majorDefinitions = [
    ("thousand",    int(1e3)),  ("million",     int(1e6)),  ("billion",     int(1e9)),
    ("trillion",    int(1e12)), ("quadrillion", int(1e15)), ("quintillion", int(1e18))]
# Now we get into the Parcon-specific code.
def make(text, number):
    return AnyCase(text)[lambda x: number]

unit = Longest(*[make(t, n) for t, n in unitDefinitions])
ten = Longest(*[make(t, n) for t, n in tensDefinitions])
mag = Longest(*[make(t, n) for t, n in majorDefinitions])
product = partial(reduce, mul)
section = (Optional(unit[lambda t: t*100] + "hundred") + -ten + -unit)[flatten][sum]
number = ((section + mag)[product][...] + Optional(section, 0))[flatten][sum]
number = Exact(number, Whitespace() | "-" | "and")
# Some examples (all of which return the corresponding int value):
print number.parse_string("zero")
print number.parse_string("one")
print number.parse_string("five")
print number.parse_string("ten")
print number.parse_string("seventeen")
print number.parse_string("twenty")
print number.parse_string("twenty one")
print number.parse_string("fifty five")
print number.parse_string("one hundred")
print number.parse_string("one hundred three")
print number.parse_string("two hundred ten")
print number.parse_string("six hundred forty two")
print number.parse_string("eight hundred fifty")
print number.parse_string("one thousand")
print number.parse_string("one thousand one")
print number.parse_string("one thousand five")
print number.parse_string("one thousand thirty")
print number.parse_string("one thousand forty two")
print number.parse_string("one thousand one hundred")
print number.parse_string("one thousand one hundred fifty nine")
print number.parse_string("five thousand one hundred fifty nine")
print number.parse_string("twenty thousand one hundred fifty nine")
print number.parse_string("forty one thousand one hundred fifty nine")
print number.parse_string("two hundred forty one thousand one hundred fifty nine")
print number.parse_string("one million")
print number.parse_string("one million two hundred forty one thousand one hundred fifty nine")
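As a sanity check on the arithmetic the number grammar performs, consider the largest example above broken into (section, magnitude) products plus a trailing section. The pairs below are derived by hand, not produced by the parser:

```python
from functools import reduce
from operator import mul

# "one million two hundred forty one thousand one hundred fifty nine":
# each (section, magnitude) pair is multiplied out, the products are
# summed, and the trailing section is added on at the end.
sections = [(1, 1000000), (241, 1000)]  # "one million", "two hundred forty one thousand"
trailing = 159                          # "one hundred fifty nine"
total = sum(reduce(mul, pair) for pair in sections) + trailing
print(total)  # prints: 1241159
```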

That's it for this post!

Parcon: a new parser combinator library

Parcon is a Python parser combinator library I'm working on. I've released it on PyPI here.

(I've also released Pargen, a formatter combinator library, as a submodule of Parcon, but I'll write a separate blog post on Pargen later.)

I wrote Parcon to improve on some things that I think Pyparsing does wrong. One of those things is Pyparsing's lack of, in my opinion, useful error messages. For example, let's consider a grammar that parses an open parenthesis, any number of "a" or "b", and a close parenthesis. This looks like this in Pyparsing:
expr = "(" + ZeroOrMore(Literal("a") | "b") + ")"

Simple enough. If you call expr.parseString("(abbab)"), it returns just fine. If, however, you call expr.parseString("(a"), you get an exception with a message something like this:

Expected ")" (at char 2), (line:1, col:3)

This message omits information: "a", "b", or ")" would all be valid characters here, but only ")" is shown. The corresponding Parcon grammar:
expr = "(" + ZeroOrMore(SignificantLiteral("a") | SignificantLiteral("b")) + ")"

provides a more informative error message when expr.parse_string("(a") is called:

At position 2: expected one of "a", "b", ")"

This includes all possible options, not just the last one.

This shortcoming of Pyparsing becomes more obvious when parsing grammars consisting of a number of alternatives, each of which starts with a particular string. Pyparsing will only report the last such expected string, while Parcon will report all of them.

Four other shortcomings in Pyparsing that Parcon improves on:

  • In Pyparsing, parsers are mutable: parse actions can be added to them and so on. This makes it hard to reuse parsers reliably: a parse action might be added to a parser by one piece of code with others not realizing it. Pyparsing provides a copy function to get around this, but this requires using copy on any parser that might possibly be reused, which is especially tedious in libraries consisting simply of sets of predefined parsers.

    Parcon obviates this by making parsers immutable, with the sole exception of Forward. Parse actions, in particular, are created using the Transform parser, which is constructed as Transform(parser, function); it passes the result of the specified parser through the specified function, returning the result of that function. parser[function] is shorthand for this, so parser[function] is the rough equivalent of pyparsing_parser.addParseAction(function), except that the original parser isn't modified by this in any way.
  • Pyparsing's Literal, by default, does not suppress itself. From my experience writing parsers, suppressed literals are quite a bit more common than significant literals. Parcon's Literal is suppressed by default; SignificantLiteral is Parcon's non-suppressed alternative.
  • Pyparsing can automatically strip whitespace from within a grammar. This, however, doesn't account for cases where comments and the like also need to be removed automatically. Parcon allows a whitespace parser to be specified when calling parse_string; this parser will be applied between every other parser in the grammar, and its results will be discarded. (If it isn't specified, this parser defaults to Whitespace(), a Parcon parser that parses carriage returns, newlines, spaces, and tabs.)

    Of course, this could have the result of removing, for example, spaces in string literals being parsed by a Parcon grammar. Parcon provides a parser called Exact to prevent this: Exact(parser) is a parser that acts exactly like the parser it's created with, except that it sets the whitespace parser to Invalid() (a parser that never matches anything) while parsing the parser it was constructed with.
  • Pyparsing does not provide any sort of monadic Bind parser, which would be needed to parse, for example, a binary protocol packet consisting of a certain number of bytes representing the length of the packet, followed by that many bytes consisting of the packet's data. (Yes, Parcon can parse binary data just as well as it can parse textual data.) Parcon provides both Bind and Return parsers, which, together, make Parcon a monadic parser combinator library. This opens up numerous possibilities for grammars that can be written using Parcon.
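The immutability point in the first bullet is easy to sketch in plain Python. The classes below are toy stand-ins of my own, not Parcon's internals; they just show how parser[function] can build a new Transform-style parser while leaving the original untouched:

```python
class Literal:
    """A toy parser that matches a fixed string."""
    def __init__(self, text):
        self.text = text
    def parse(self, s):
        if s.startswith(self.text):
            return self.text
        raise ValueError("expected %r" % self.text)
    def __getitem__(self, function):
        # parser[function] builds a NEW parser; self is never mutated.
        return Transform(self, function)

class Transform:
    """Passes the wrapped parser's result through a function."""
    def __init__(self, parser, function):
        self.parser = parser
        self.function = function
    def parse(self, s):
        return self.function(self.parser.parse(s))

digits = Literal("42")
as_int = digits[int]        # new parser; digits itself is unchanged
print(as_int.parse("42"))   # prints: 42
print(digits.parse("42"))   # prints: 42  (still the string "42")
```

Because each [function] application returns a fresh object, a library of predefined parsers can be shared freely without defensive copying.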
If these features sound cool to you, open a terminal, type pip install parcon, and give it a whirl! Documentation and examples are provided here. Enjoy!