Python: How best to parse a simple grammar?

Posted by Rosarch on Stack Overflow See other posts from Stack Overflow or by Rosarch
Published on 2010-05-31T18:36:29Z Indexed on 2010/05/31 18:43 UTC
Read the original article Hit count: 257

Filed under:

ply

Ok, so I've asked a bunch of smaller questions about this project, but I still don't have much confidence in the designs I'm coming up with, so I'm going to ask a question on a broader scale.

I am parsing pre-requisite descriptions for a course catalog. The descriptions almost always follow a certain form, which makes me think I can parse most of them.

From the text, I would like to generate a graph of course pre-requisite relationships. (That part will be easy, after I have parsed the data.)

Some sample inputs and outputs:

"CS 2110" => ("CS", 2110) # 0

"CS 2110 and INFO 3300" => [("CS", 2110), ("INFO", 3300)] # 1
"CS 2110, INFO 3300" => [("CS", 2110), ("INFO", 3300)] # 1
"CS 2110, 3300, 3140" => [("CS", 2110), ("CS", 3300), ("CS", 3140)] # 1

"CS 2110 or INFO 3300" => [[("CS", 2110)], [("INFO", 3300)]] # 2

"MATH 2210, 2230, 2310, or 2940" => [[("MATH", 2210), ("MATH", 2230), ("MATH", 2310)], [("MATH", 2940)]] # 3

If the entire description is just a course, it is output directly.
If the courses are conjoined ("and"), they are all output in the same list
If the course are disjoined ("or"), they are in separate lists
Here, we have both "and" and "or".

One caveat that makes it easier: it appears that the nesting of "and"/"or" phrases is never greater than as shown in example 3.

What is the best way to do this? I started with PLY, but I couldn't figure out how to resolve the reduce/reduce conflicts. The advantage of PLY is that it's easy to manipulate what each parse rule generates:

def p_course(p):
 'course : DEPT_CODE COURSE_NUMBER'
 p[0] = (p[1], int(p[2]))

With PyParse, it's less clear how to modify the output of parseString(). I was considering building upon @Alex Martelli's idea of keeping state in an object and building up the output from that, but I'm not sure exactly how that is best done.

def addCourse(self, str, location, tokens): self.result.append((tokens[0][0], tokens[0][1]))

def makeCourseList(self, str, location, tokens):

dept = tokens[0][0] new_tokens = [(dept, tokens[0][1])] new_tokens.extend((dept, tok) for tok in tokens[1:])

self.result.append(new_tokens)

For instance, to handle "or" cases:

    def __init__(self):
            self.result = []
            # ...

self.statement = (course_data + Optional(OR_CONJ + course_data)).setParseAction(self.disjunctionCourses)

def disjunctionCourses(self, str, location, tokens): if len(tokens) == 1: return tokens

print "disjunction tokens: %s" % tokens

How does disjunctionCourses() know which smaller phrases to disjoin? All it gets is tokens, but what's been parsed so far is stored in result, so how can the function tell which data in result corresponds to which elements of token? I guess I could search through the tokens, then find an element of result with the same data, but that feel convoluted...

What's a better way to approach this problem?

Developer IT

Python: How best to parse a simple grammar? - Developer IT

Python: How best to parse a simple grammar?

python

parsing

nlp

pyparsing

ply

Related posts about python

unmet dependencies in Ubuntu 12.04

How can I get sikuli-ide to work?

Getting PATH right for python after MacPorts install

call python with system() in R to run a python script emulating the python console

Python - Calling a non python program from python?

Related posts about parsing

Hot to fix nautilus desktop on linux mint

Is parsing JSON faster than parsing XML

Looking for a tutorial on Recursive Descent Parsing.

Parsing XML with Hpricot, a Gem of a Ruby Gem

Parsing scripts that use curly braces

Categories cloud