Python: Trouble with YACC

Posted by Rosarch on Stack Overflow See other posts from Stack Overflow or by Rosarch
Published on 2010-05-29T18:22:59Z Indexed on 2010/05/29 18:32 UTC
Read the original article Hit count: 140

Filed under:
|
|

I'm parsing sentences like:

"CS 2310 or equivalent experience"

The desired output:

[[("CS", 2310)], ["equivalent experience"]]

YACC tokenizer symbols:

tokens = [
    'DEPT_CODE',
    'COURSE_NUMBER',
    'OR_CONJ',
    'MISC_TEXT',
]

t_DEPT_CODE = r'[A-Z]{2,}'
t_COURSE_NUMBER  = r'[0-9]{4}'

t_OR_CONJ = r'or'

t_ignore = ' \t'

terms = {'DEPT_CODE': t_DEPT_CODE,
         'COURSE_NUMBER': t_COURSE_NUMBER,
         'OR_CONJ': t_OR_CONJ}

for name, regex in terms.items():
    terms[name] = "^%s$" % regex

def t_MISC_TEXT(t):
    r'\S+'
    for name, regex in terms.items():
        # print "trying to match %s with regex %s" % (t.value, regex)
        if re.match(regex, t.value):
            t.type = name
            return t

    return t

(MISC_TEXT is meant to match anything not caught by the other terms.)

Some relevant rules from the parser:

precedence = (
    ('left', 'MISC_TEXT'),
)


def p_statement_course_data(p):
    'statement : course_data'
    p[0] = p[1]

def p_course_data(p):
    'course_data : course'
    p[0] = p[1]


def p_course(p):
    'course : DEPT_CODE COURSE_NUMBER'
    p[0] = make_course(p[1], int(p[2]))


def p_or_phrase(p):
    'or_phrase : statement OR_CONJ statement'
    p[0] = [[p[1]], [p[3]]] 


def p_misc_text(p):
    '''text_aggregate : MISC_TEXT MISC_TEXT
                      | MISC_TEXT text_aggregate
                      | text_aggregate MISC_TEXT '''
    p[0] = "%s %s" % (p[0], [1])

def p_text_aggregate_statement(p):
    'statement : text_aggregate'
    p[0] = p[1]

Unfortunately, this fails:

# works as it should
>>> token_list("CS 2110 or equivalent experience")
[LexToken(DEPT_CODE,'CS',1,0), LexToken(COURSE_NUMBER,'2110',1,3), LexToken(OR_CONJ,'or',1,8), LexToken(MISC_TEXT,'equivalent',1,11), LexToken(MISC_TEXT,'experience',1,22)]

# fails. bummer.
>>> parser.parse("CS 2110 or equivalent experience")
Syntax error in input: LexToken(MISC_TEXT,'equivalent',1,11)

What am I doing wrong? I don't fully understand how to set precedence rules.

Also, this is my error function:

def p_error(p):
    print "Syntax error in input: %s" % p

Is there a way to see which rule the parser was trying when it failed? Or some other way to make the parser print which rules its trying?

© Stack Overflow or respective owner

Related posts about python

Related posts about nlp