Regular expression is too greedy
- by alastairs
I'm writing a regular expression to match data from the IMDb soundtracks data file. My regexes are mostly working, although in places they capture too much text in my named groups. Take the following regex, for example:
"^ Performed by '?(?<performer>.*)('? \(qv\))?$"
The performer group includes the string ' (qv) as well as the performer's name. Unfortunately, because the records are not consistently formatted, some performers' names are surrounded by single quotation marks whilst others are not. This means they are optional as far as the regex is concerned.
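To make the symptom reproducible, here is a minimal sketch in Python. It rests on a few assumptions: the pattern is ported to Python's `(?P<name>...)` spelling for named groups (the question's own regex flavor is not stated), `extract_performer` is a hypothetical helper name, and the sample lines are invented to resemble the format described rather than copied from the data file.

```python
import re

# Python port of the pattern from the question (assumption: the original
# flavor writes named groups as (?<name>...); Python needs (?P<name>...)).
PATTERN = re.compile(r"^ Performed by '?(?P<performer>.*)('? \(qv\))?$")


def extract_performer(line):
    """Return the text captured by the performer group, or None if no match."""
    match = PATTERN.match(line)
    return match.group("performer") if match else None


# Hypothetical sample lines, invented for illustration.
quoted = " Performed by 'Elia's Lonely Friends Band' (qv)"
unquoted = " Performed by Some Performer"

print(repr(extract_performer(quoted)))
print(repr(extract_performer(unquoted)))
```

On the quoted sample, the greedy `.*` runs to the end of the line, so the optional trailing group matches nothing and the captured name still ends in `' (qv)`.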
I've tried marking the last group as a greedy (atomic, non-backtracking) group using the (?>...) group specifier, but this appeared to have no effect on the results.
I can improve the results by changing the performer group to match a small range of characters, but this reduces my chances of parsing the name out correctly. Furthermore, if I were to just exclude the apostrophe character, I would then be unable to parse, e.g., band names containing apostrophes, such as Elia's Lonely Friends Band who performed Run For Your Life featured in Resident Evil: Apocalypse.
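The trade-off with excluding apostrophes can be sketched the same way. This is again a Python port with `(?P<name>...)` named groups, using a hypothetical helper name and invented sample lines, so treat it as an illustration of the problem rather than of the real data.

```python
import re

# Variant of the question's pattern whose performer group excludes the
# apostrophe character (assumption: ported to Python's named-group syntax).
NO_APOSTROPHE = re.compile(r"^ Performed by '?(?P<performer>[^']*)('? \(qv\))?$")


def extract_performer_strict(line):
    """Return the performer group for the apostrophe-free variant, or None."""
    match = NO_APOSTROPHE.match(line)
    return match.group("performer") if match else None


# Hypothetical sample lines, invented for illustration.
simple = " Performed by 'Some Performer' (qv)"
with_apostrophe = " Performed by 'Elia's Lonely Friends Band' (qv)"

print(repr(extract_performer_strict(simple)))
print(repr(extract_performer_strict(with_apostrophe)))
```

The quoted name without an inner apostrophe is captured cleanly, but the line containing Elia's Lonely Friends Band no longer matches at all, because the character class stops at the apostrophe inside the name.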