Python Regular Expressions: Capture lookahead value (capturing text without consuming it)
- by Lattyware
I wish to use regular expressions to split words into groups of (vowels, not_vowels, more_vowels), using a marker to ensure every word begins and ends with a vowel.
import re
MARKER = "~"
VOWELS = {"a", "e", "i", "o", "u", MARKER}
word = "dog"
if word[0] not in VOWELS:
word = MARKER+word
if word[-1] not in VOWELS:
word += MARKER
re.findall("([%]+)([^%]+)([%]+)".replace("%", "".join(VOWELS)), word)
In this example we get:
[('~', 'd', 'o')]
The issue is that I wish the matches to overlap - the last set of vowels should become the first set of the next match. This appears possible with lookaheads, if we replace the regex as follows:
re.findall("([%]+)([^%]+)(?=[%]+)".replace("%", "".join(VOWELS)), word)
We get:
[('~', 'd'), ('o', 'g')]
Which means we are matching what I want. However, it now doesn't return the last set of vowels. The output I want is:
[('~', 'd', 'o'), ('o', 'g', '~')]
I feel this should be possible (if the regex can check for the second set of vowels, I see no reason it can't return them), but I can't find any way of doing it beyond the brute force method, looping through the results after I have them and appending the first character of the next match to the last match, and the last character of the string to the last match. Is there a better way in which I can do this?
The two things that would work would be capturing the lookahead value, or not consuming the text on a match, while capturing the value - I can't find any way of doing either.