JavaCC: How can one exclude a string from a token? (A.k.a. understanding token ambiguity.)
- by java.is.for.desktop
Hello, everyone!
I have already had many problems understanding how ambiguous tokens can be handled elegantly (or at all) in JavaCC. Let's take this example:
I want to parse an XML processing instruction.
The format is: "<?" <target> <data> "?>", where target is an XML name and data can be anything except ?>, because that is the closing delimiter.
So, let's define this in JavaCC:
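To make the format concrete, here is a minimal hand-written sketch in plain Java (not JavaCC) of the split I am after. The class name ProcInstSplit and the method split are hypothetical, and I am assuming the target simply ends at the first space or tab instead of checking the full XML name rules:

```java
public class ProcInstSplit {
    /**
     * Splits "<?target data?>" into {target, data}.
     * Assumption: the target ends at the first space or tab;
     * real XML name validation is left out of this sketch.
     */
    public static String[] split(String input) {
        if (!input.startsWith("<?")) {
            throw new IllegalArgumentException("missing <?");
        }
        // The data runs up to the FIRST "?>" after the opening "<?".
        int end = input.indexOf("?>", 2);
        if (end < 0) {
            throw new IllegalArgumentException("missing ?>");
        }
        String body = input.substring(2, end);

        // Find the end of the target.
        int ws = 0;
        while (ws < body.length() && body.charAt(ws) != ' ' && body.charAt(ws) != '\t') {
            ws++;
        }
        if (ws == 0) {
            throw new IllegalArgumentException("missing target");
        }

        // Skip the whitespace run between target and data.
        int dataStart = ws;
        while (dataStart < body.length()
                && (body.charAt(dataStart) == ' ' || body.charAt(dataStart) == '\t')) {
            dataStart++;
        }
        return new String[] { body.substring(0, ws), body.substring(dataStart) };
    }

    public static void main(String[] args) {
        String[] pi = split("<?mytarget here-goes-some-data?>");
        System.out.println("target: " + pi[0]);
        System.out.println("data: " + pi[1]);
    }
}
```

This is the result I would like the generated parser to give me as two separate tokens, without writing this kind of code by hand.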
(I use lexical states, in this case DEFAULT and PROC_INST)
TOKEN : { <#NAME : (very-long-definition-from-xml-1.1-goes-here) > }
TOKEN : { <WSS : (" " | "\t")+ > } // WSS = whitespace
<DEFAULT> TOKEN : {<PI_START : "<?" > : PROC_INST}
<PROC_INST> TOKEN : {<PI_TARGET : <NAME> >}
<PROC_INST> TOKEN : {<PI_DATA : ~[] >} // accept everything
<PROC_INST> TOKEN : {<PI_END : "?>" > : DEFAULT}
Now the part which recognizes processing instructions:
void PROC_INSTR() : { Token t; } {
(
<PI_START>
(t=<PI_TARGET>){System.out.println("target: " + t.image);}
<WSS>
(t=<PI_DATA>){System.out.println("data: " + t.image);}
<PI_END>
) {}
}
Let's test it with <?mytarget here-goes-some-data?>:
The target is recognized: "target: mytarget".
But now I get my favorite JavaCC parsing error:
!! procinstparser.ParseException: Encountered "" at line 1, column 15.
!! Was expecting one of:
!!
Encountered nothing? Was expecting nothing? Or what? Thank you, JavaCC!
I know that I could use JavaCC's MORE keyword, but this would give me the whole processing instruction as one token, so I'd have to parse/tokenize it further myself. Why should I do that? Am I writing a parser that does not parse?
The problem is (I guess) that since <PI_DATA> recognizes "everything", my definition is wrong. I should tell JavaCC to recognize "everything except ?>" as processing instruction data.
But how can it be done?
NOTE: I can only exclude single characters using ~["a","b","c"]; I can't exclude strings such as ~["abc"] or ~["?>"]. Another great anti-feature of JavaCC.
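To pin down what I mean by "everything except ?>": excluding a string means looking at two characters at a time, not one. Excluding the single character ? would be too strict, because a lone ? inside the data is perfectly fine. Here is a minimal sketch in plain Java (not JavaCC; PiDataScan and readData are hypothetical names) of the scanning rule I would like the tokenizer to follow:

```java
public class PiDataScan {
    /**
     * Returns the characters of s up to, but not including, the first "?>".
     * At each position the two-character closing delimiter is checked first;
     * only if it does not start here is one character consumed as data.
     */
    public static String readData(String s) {
        StringBuilder data = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            if (s.startsWith("?>", i)) {
                return data.toString(); // closing delimiter found, data ends here
            }
            data.append(s.charAt(i)); // any other character, including a lone '?'
            i++;
        }
        throw new IllegalArgumentException("missing ?>");
    }
}
```

With this rule, readData("a?b??>rest") yields a?b? because only the final ? is immediately followed by >.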
Thank you.