Python re module becomes 20 times slower when using more than 100 different regexes
- by Wiil
My problem involves parsing log files and removing the variable parts from each line so that the lines can be grouped. For instance:
s = re.sub(r'(?i)User [_0-9A-Za-z]+ is ', r"User .. is ", s)
s = re.sub(r'(?i)Message rejected because : (.*?) \(.+\)', r'Message rejected because : \1 (...)', s)
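To make that concrete, here is a self-contained version of the idea; the two sample log lines are invented for illustration:

import re

lines = [
    "User bob_42 is connected",                            # hypothetical sample line
    "Message rejected because : too big (quota exceeded)", # hypothetical sample line
]
for s in lines:
    s = re.sub(r'(?i)User [_0-9A-Za-z]+ is ', r"User .. is ", s)
    s = re.sub(r'(?i)Message rejected because : (.*?) \(.+\)',
               r'Message rejected because : \1 (...)', s)
    print(s)  # -> "User .. is connected" / "Message rejected because : too big (...)"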
I have more than 120 matching rules like those above.
There are no performance issues when applying 100 different regexes in succession, but a huge slowdown appears as soon as I apply a 101st regex.
The exact same behavior occurs when I replace my rule set with:
for a in range(100):
    s = re.sub(r'(?i)caught here' + str(a) + ':.+', r'( ... )', s)
It gets 20 times slower when I use range(101) instead:
# range(100)
% ./dashlog.py file.bz2
== Took 2.1 seconds. ==
# range(101)
% ./dashlog.py file.bz2
== Took 47.6 seconds. ==
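For anyone who wants to reproduce this without my log files, here is a minimal, self-contained benchmark sketch; the input string, line count, and timing harness are my own invention, but on Python 2.x it shows the same cliff between 100 and 101 distinct patterns:

import re
import time

def bench(n_patterns, n_lines=2000):
    # Apply n_patterns distinct substitution patterns to each "line",
    # exactly like the loop above.
    start = time.time()
    for _ in range(n_lines):
        s = 'some log line'  # hypothetical input; whether it matches is irrelevant
        for a in range(n_patterns):
            s = re.sub(r'(?i)caught here' + str(a) + ':.+', r'( ... )', s)
    return time.time() - start

print('100 patterns: %.2fs' % bench(100))
print('101 patterns: %.2fs' % bench(101))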
Why is this happening? And is there any known workaround?
(This happens with Python 2.6.6 and 2.7.2, on both Linux and Windows.)
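The numbers are consistent with re's internal cache of compiled patterns: in CPython 2.x the cache holds at most re._MAXCACHE = 100 entries and is flushed completely once it overflows, so cycling through 101 distinct patterns would force every re.sub call to recompile its pattern. If that is indeed the cause here, a workaround sketch is to compile each rule once and reuse the pattern objects, which bypasses the module-level cache entirely:

import re

# Compile every rule once at start-up; compiled pattern objects are
# used directly and never touch re's 100-entry module cache.
RULES = [
    (re.compile(r'(?i)User [_0-9A-Za-z]+ is '), r'User .. is '),
    (re.compile(r'(?i)Message rejected because : (.*?) \(.+\)'),
     r'Message rejected because : \1 (...)'),
    # ... remaining rules ...
]

def normalize(line):
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line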