Sunday, June 26, 2011

Parsing into a dict

If you've ever used Pyparsing before, you probably know that you can "tag" certain parts of a Pyparsing grammar, and the results will provide a dictionary mapping those tags to the values that their corresponding grammar components parsed to:
>>> from pyparsing import *
>>> parser = Word(alphas).setResultsName("first") + Word(alphas).setResultsName("second")
>>> results = parser.parseString("one two")
>>> results["first"]
'one'
>>> results["second"]
'two'

So, how does one go about doing this in Parcon? The answer is not as obvious as it is with Pyparsing, since Parcon parsers result in a single value, not a composite of tokens and a dictionary in the form of Pyparsing's ParseResults. This doesn't mean that tagging's impossible, however, or even difficult. It's actually relatively simple:
>>> from parcon import *
>>> parser = (alpha_word["first"] + alpha_word["second"])[dict]
>>> results = parser.parse_string("one two")
>>> results["first"]
'one'
>>> results["second"]
'two'
>>> results
{'second': 'two', 'first': 'one'}
>>> type(results)
<type 'dict'>

What's actually going on here?

The first thing that you need to know is that parser["tag"] is short for Tag("tag", parser). This only works when "tag" is a string (unicode strings also work); Tag must be used explicitly for other types of values.

Tag wraps the result of whatever parser it's passed in a Pair, with the key set to "tag" (or whatever value was specified as the tag) and the value set to whatever the underlying parser resulted in.

Pair is a subclass of tuple (it's actually created by collections.namedtuple). Parcon's concatenation of tuples when using + does not apply to namedtuples or any value that simply subclasses tuple, so instances of Pair will be preserved even when using + to string things together.
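
To see why the exact type matters, here's a minimal stand-alone sketch in plain Python, with no Parcon involved. The Pair below is only a stand-in built with collections.namedtuple (the field names key and value are assumptions), and concat is a hypothetical helper that imitates the behavior of + described above. It is not Parcon's actual implementation.

```python
from collections import namedtuple

# Stand-in for Parcon's Pair; the field names are assumed for illustration.
Pair = namedtuple("Pair", ["key", "value"])


def concat(left, right):
    """Hypothetical imitation of +: merge plain tuples, keep anything else whole."""
    def as_items(value):
        # Only an exact tuple is spliced in; subclasses such as Pair stay intact.
        return value if type(value) is tuple else (value,)
    return as_items(left) + as_items(right)


# A plain tuple on the left is spliced into the result.
plain = concat(("a", "b"), "c")

# Pairs are tuple subclasses, so each one survives as a single item.
tagged = concat(Pair("first", "one"), Pair("second", "two"))
```

Here plain ends up as a flat three-item tuple, while tagged is a two-item tuple whose items are still Pair instances. The check is on the exact type rather than isinstance, which is what keeps the Pairs from being pulled apart.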

From this, we can tell that the parser:
alpha_word["first"] + alpha_word["second"]

will result in (Pair('first', 'one'), Pair('second', 'two')) when handed "one two" as input. So how do we get this into a dictionary?

Simple. You'll notice that in the example near the top of this post, that word-parsing parser snippet was wrapped in parentheses and then transformed with [dict], which, as you probably know, is short for Transform. Here's the magical property: Pair, though it may be its own class, subclasses tuple, which means that dict will treat a tuple of Pairs as a sequence of (key, value) tuples and convert it into a dictionary. Bingo. We simply pass the result through a Transform with dict as the transformation function, and our result becomes:
{'second': 'two', 'first': 'one'}

which is exactly what we want.
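
The dict step can be tried on its own, without Parcon. This is just a sketch: Pair is again a stand-in created with collections.namedtuple (field names assumed), and parsed is written out by hand to match the result shape described above.

```python
from collections import namedtuple

# Stand-in for Parcon's Pair; the field names are assumed for illustration.
Pair = namedtuple("Pair", ["key", "value"])

# The shape the concatenated parser is described as producing for "one two".
parsed = (Pair("first", "one"), Pair("second", "two"))

# dict() accepts any iterable of two-item sequences, so Pairs qualify.
results = dict(parsed)

print(results["first"])   # one
print(results["second"])  # two
```

Nothing here is specific to Pair. Any two-item tuple subclass would convert the same way, which is why no special dictionary-building support is needed.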

Of course, if you have a bunch of nested concatenations and list-producing parsers (such as ZeroOrMore), you'll probably want to change [dict] to [flatten][dict] to flatten them all out. flatten treats Pairs (and any other subclass of tuple) as individual objects and so does not flatten them out, so this works as expected.
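
As a rough illustration of that flattening behavior, here's a hypothetical flatten_sketch written in plain Python. It is an assumption about how such a function could work, not Parcon's own flatten, and Pair is once more a namedtuple stand-in with assumed field names.

```python
from collections import namedtuple

# Stand-in for Parcon's Pair; the field names are assumed for illustration.
Pair = namedtuple("Pair", ["key", "value"])


def flatten_sketch(value):
    """Hypothetical flatten: expand plain lists and tuples, keep subclasses whole."""
    if type(value) in (list, tuple):
        flat = []
        for item in value:
            flat.extend(flatten_sketch(item))
        return flat
    # Anything else, including a Pair, counts as a single item.
    return [value]


# A nested result, such as concatenations mixed with list-producing parsers
# might yield.
nested = (Pair("first", "one"), [Pair("rest", "two"), (Pair("last", "three"),)])

results = dict(flatten_sketch(nested))
```

After flattening, every Pair sits at the top level of one list, so dict can consume them all and results maps "first", "rest", and "last" to their values.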

