Python Regular Expressions: Capture lookahead value (capturing text without consuming it)

Posted by Lattyware on Stack Overflow
Published on 2012-04-09T23:16:19Z

I wish to use regular expressions to split words into groups of (vowels, not_vowels, more_vowels), using a marker to ensure every word begins and ends with a vowel.

import re

MARKER = "~"
VOWELS = {"a", "e", "i", "o", "u", MARKER}

word = "dog"

if word[0] not in VOWELS:
    word = MARKER+word

if word[-1] not in VOWELS:
    word += MARKER

re.findall("([%]+)([^%]+)([%]+)".replace("%", "".join(VOWELS)), word)

In this example we get:

[('~', 'd', 'o')]

The issue is that I wish the matches to overlap - the last set of vowels should become the first set of the next match. This appears possible with lookaheads, if we replace the regex as follows:

re.findall("([%]+)([^%]+)(?=[%]+)".replace("%", "".join(VOWELS)), word)

We get:

[('~', 'd'), ('o', 'g')]

Which means we are matching what I want. However, it now doesn't return the last set of vowels. The output I want is:

[('~', 'd', 'o'), ('o', 'g', '~')]

I feel this should be possible (if the regex can check for the second set of vowels, I see no reason it can't return them), but I can't find any way of doing it beyond the brute-force method: looping through the results after I have them, appending the first group of the next match to each match, and appending the last character of the string to the last match. Is there a better way to do this?
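For reference, the brute-force post-processing described above might look roughly like the sketch below. The helper names `pad` and `overlapping_groups_brute_force` are made up for illustration, `VOWELS` is written as a string instead of a set, and the word is assumed to be non-empty. One deliberate deviation: the last match is closed with the whole trailing run of vowels rather than only the last character, so that words ending in several vowels still come out right.

```python
import re

MARKER = "~"
VOWELS = "aeiou" + MARKER  # a string here, not the set used in the question


def pad(word):
    """Add the marker so the word begins and ends with a 'vowel' (hypothetical helper)."""
    if word[0] not in VOWELS:
        word = MARKER + word
    if word[-1] not in VOWELS:
        word += MARKER
    return word


def overlapping_groups_brute_force(word):
    """Build (vowels, not_vowels, more_vowels) triples by patching up the pairs."""
    word = pad(word)
    # The lookahead version from the question: yields (vowels, not_vowels) pairs.
    pairs = re.findall("([%]+)([^%]+)(?=[%]+)".replace("%", VOWELS), word)
    # The padded word always ends in a vowel run; that run closes the final match.
    final_run = re.search("[%]+$".replace("%", VOWELS), word).group()
    # Every other match borrows the first group of the match that follows it.
    trailing = [following[0] for following in pairs[1:]] + [final_run]
    return [(v, c, t) for (v, c), t in zip(pairs, trailing)]


print(overlapping_groups_brute_force("dog"))
```

This works, but it walks the results a second time and relies on each match starting exactly where the previous lookahead succeeded.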

Two things would work: capturing the value matched by the lookahead, or capturing the text on a match without consuming it. I can't find any way of doing either.
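One possible approach to the first option, as a sketch: Python's `re` module allows a capturing group inside a lookahead assertion, so the vowels can be captured without being consumed. The function name `overlapping_groups` is made up for illustration, `VOWELS` is written as a string instead of a set, and the word is assumed to be non-empty.

```python
import re

MARKER = "~"
VOWELS = "aeiou" + MARKER  # a string here, not the set used in the question


def overlapping_groups(word):
    """Return (vowels, not_vowels, more_vowels) triples with overlapping vowels."""
    if word[0] not in VOWELS:
        word = MARKER + word
    if word[-1] not in VOWELS:
        word += MARKER
    # The third group sits inside the lookahead, so its text is captured
    # but not consumed, and the next match can start on those same vowels.
    pattern = "([%]+)([^%]+)(?=([%]+))".replace("%", VOWELS)
    return re.findall(pattern, word)


print(overlapping_groups("dog"))  # [('~', 'd', 'o'), ('o', 'g', '~')]
```

Because the lookahead is zero-width, the scan resumes right after the consonants, which is what lets the last set of vowels double as the first set of the next match.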

© Stack Overflow or respective owner