F# How to tokenise user input: separating numbers, units, words?

Posted by David White on Stack Overflow
Published on 2011-01-11T03:08:23Z Indexed on 2011/01/11 3:54 UTC

I am fairly new to F#, but have spent the last few weeks reading reference materials. I wish to process a user-supplied input string, identifying and separating the constituent elements. For example, for this input:

XYZ Hotel: 6 nights at 220EUR / night plus 17.5% tax

the output should resemble something like a list of tuples:

[ ("XYZ", Word); ("Hotel:", Word);
("6", Number); ("nights", Word);
("at", Operator); ("220", Number);
("EUR", CurrencyCode); ("/", Operator); ("night", Word);
("plus", Operator); ("17.5", Number); ("%", PerCent); ("tax", Word) ]
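For the list above to type-check, the labels need to be cases of some type. A minimal sketch, assuming a discriminated union whose case names are taken from the example (the names `TokenKind`, `Token` and `expected` are hypothetical, not from any library):

```fsharp
// Hypothetical token kinds, named after the labels used in the example output.
type TokenKind =
    | Word
    | Number
    | Operator
    | CurrencyCode
    | PerCent

// Each token pairs the matched text with its kind.
type Token = string * TokenKind

// The desired result for the example input, written as a typed value.
let expected : Token list =
    [ ("XYZ", Word); ("Hotel:", Word)
      ("6", Number); ("nights", Word)
      ("at", Operator); ("220", Number)
      ("EUR", CurrencyCode); ("/", Operator); ("night", Word)
      ("plus", Operator); ("17.5", Number); ("%", PerCent); ("tax", Word) ]
```

Further kinds (physical units, counts and so on) would presumably become additional cases of the same union.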

Since I'm dealing with user input, it could be anything, so expecting users to comply with a grammar is out of the question. I want to identify the numbers (integers, floats, negatives, ...), the units of measure (optional, but possibly SI or Imperial physical units, currency codes, or counts such as "night" / "nights" in my example), mathematical operators (as math symbols or as words, including "at", "per", "of", "discount", etc.), and all other words.
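One possible first step, before any classification, is to cut the raw string into lexemes so that "220EUR" becomes "220" and "EUR". The sketch below uses a .NET regular expression for this; the pattern and the names `lexemePattern` and `splitLexemes` are assumptions made for illustration, and the rule that keeps a trailing colon attached to a word exists only to reproduce "Hotel:" from the example.

```fsharp
open System.Text.RegularExpressions

// Assumed lexeme shapes (illustrative only):
//   1. a number such as 6, 17.5 or -3; the minus sign is accepted only when
//      it starts a whitespace-separated chunk, so "5-3" yields "5", "-", "3"
//   2. a run of letters, optionally followed by a colon (keeps "Hotel:" whole)
//   3. any other single non-space character, such as "/" or "%"
let lexemePattern =
    Regex(@"(?:(?<!\S)-)?\d+(?:\.\d+)?|\p{L}+:?|\S")

// Returns the lexemes in the order they appear; whitespace is discarded.
let splitLexemes (input: string) : string list =
    lexemePattern.Matches(input)
    |> Seq.cast<Match>
    |> Seq.map (fun m -> m.Value)
    |> List.ofSeq

let lexemes =
    splitLexemes "XYZ Hotel: 6 nights at 220EUR / night plus 17.5% tax"
```

Because the last alternative accepts any non-space character, no input is rejected; unexpected symbols simply come out as one-character lexemes to be classified later.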

I have the impression that I should use active patterns -- is that correct? -- but I'm not exactly sure how to start. Any pointers to appropriate reference material or similar examples would be great.
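To illustrate what the active-pattern approach could look like, here is a minimal sketch that classifies lexemes which have already been separated. The lookup sets, the pattern names (`NumberLexeme`, `OperatorLexeme`, `CurrencyLexeme`) and the `classify` function are all hypothetical, and the tiny word lists stand in for much larger tables a real application would need.

```fsharp
open System
open System.Globalization

type TokenKind = Word | Number | Operator | CurrencyCode | PerCent

// Hypothetical lookup tables, deliberately small.
let operatorWords = set [ "at"; "per"; "of"; "plus"; "discount" ]
let operatorSymbols = set [ "+"; "-"; "*"; "/" ]
let currencyCodes = set [ "EUR"; "USD"; "GBP" ]

// Partial active pattern: succeeds when the lexeme parses as a decimal number.
let (|NumberLexeme|_|) (s: string) =
    match Decimal.TryParse(s, NumberStyles.Float, CultureInfo.InvariantCulture) with
    | true, value -> Some value
    | _ -> None

// Succeeds for a math symbol or for an operator word (case-insensitive).
let (|OperatorLexeme|_|) (s: string) =
    if Set.contains s operatorSymbols
       || Set.contains (s.ToLowerInvariant()) operatorWords
    then Some ()
    else None

// Succeeds for an upper-case currency code from the table.
let (|CurrencyLexeme|_|) (s: string) =
    if Set.contains s currencyCodes then Some () else None

// The first matching case wins; anything unrecognised falls through to Word.
let classify (lexeme: string) : string * TokenKind =
    match lexeme with
    | NumberLexeme _ -> lexeme, Number
    | "%" -> lexeme, PerCent
    | CurrencyLexeme -> lexeme, CurrencyCode
    | OperatorLexeme -> lexeme, Operator
    | _ -> lexeme, Word

let tokens =
    [ "XYZ"; "Hotel:"; "6"; "nights"; "at"; "220"; "EUR"; "/"
      "night"; "plus"; "17.5"; "%"; "tax" ]
    |> List.map classify
```

The order of the match cases encodes the priority between categories, and the final wildcard case means arbitrary user input never fails to classify, which fits the requirement that no grammar can be imposed.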
