Python re module becomes 20 times slower when called on more than 100 different regexes

Posted by Wiil on Stack Overflow
Published on 2013-06-26T16:12:56Z

My problem is about parsing log files and removing the variable parts of each line so that the lines can be grouped. For instance:

s = re.sub(r'(?i)User [_0-9A-Za-z]+ is ', r'User .. is ', s)
s = re.sub(r'(?i)Message rejected because : (.*?) \(.+\)', r'Message rejected because : \1 (...)', s)

I have more than 120 matching rules like those above.
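
For context, here is a minimal, self-contained sketch of how such a rule set might be applied to a single log line. The names RULES and normalize_line and the sample lines are hypothetical and only illustrate the approach; the real script is not shown in this post.

```python
import re

# Hypothetical rule set: (pattern, replacement) pairs like the two shown above.
RULES = [
    (r'(?i)User [_0-9A-Za-z]+ is ', r'User .. is '),
    (r'(?i)Message rejected because : (.*?) \(.+\)',
     r'Message rejected because : \1 (...)'),
]

def normalize_line(line):
    """Apply every rule in order so that variable parts are masked."""
    for pattern, replacement in RULES:
        line = re.sub(pattern, replacement, line)
    return line

if __name__ == "__main__":
    print(normalize_line("User bob_42 is online"))
    print(normalize_line("Message rejected because : spam (code 550)"))
```

Lines that differ only in the masked parts then normalize to the same string and can be counted together.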

I see no performance issue when applying up to 100 different regexes in succession, but there is a huge slowdown as soon as I apply 101.

Exactly the same behavior occurs when I replace my rule set with:

for a in range(100):
    s = re.sub(r'(?i)caught here'+str(a)+':.+', r'( ... )', s)

It gets about 20 times slower with range(101) instead:

# range(100)
% ./dashlog.py file.bz2
== Took  2.1 seconds.  ==

# range(101)
% ./dashlog.py file.bz2
== Took  47.6 seconds.  ==
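
The comparison above can be reproduced without the log file using a small timing sketch along these lines. The helper names apply_rules and time_rules and the synthetic input lines are assumptions made for illustration; they are not part of dashlog.py, and the timings will depend on the Python version.

```python
import re
import time

def apply_rules(line, n_rules):
    """Run the synthetic rule loop from the question on one line."""
    for a in range(n_rules):
        line = re.sub(r'(?i)caught here' + str(a) + ':.+', r'( ... )', line)
    return line

def time_rules(lines, n_rules):
    """Return (elapsed seconds, rewritten lines) for the given rule count."""
    start = time.time()
    out = [apply_rules(line, n_rules) for line in lines]
    return time.time() - start, out

if __name__ == "__main__":
    # Synthetic stand-in for the real log file (assumption).
    sample = ["some log line number %d" % i for i in range(2000)]
    for n in (100, 101):
        elapsed, _ = time_rules(sample, n)
        print("range(%d): %.2f seconds" % (n, elapsed))
```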

Why is this happening? And is there a known workaround?

(Happens on Python 2.6.6/2.7.2 on Linux/Windows.)
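
One workaround that may be worth trying is to compile each rule once with re.compile and reuse the compiled objects, so that nothing depends on the re module's internal cache of compiled patterns. This rests on the assumption that the slowdown comes from that cache: in Python 2 it holds at most 100 patterns and is emptied when full, so a loop over 101 distinct patterns would recompile every one of them on each pass. The sketch below is illustrative only; COMPILED_RULES and normalize_compiled are hypothetical names.

```python
import re

# Hypothetical rule set, compiled once at start-up instead of on every call.
COMPILED_RULES = [
    (re.compile(r'(?i)caught here' + str(a) + ':.+'), r'( ... )')
    for a in range(101)
]

def normalize_compiled(line):
    """Apply the precompiled rules; no lookup in re's internal cache is needed."""
    for pattern, replacement in COMPILED_RULES:
        line = pattern.sub(replacement, line)
    return line
```

Because the compiled pattern objects are kept in a list, the number of rules can grow past 100 without triggering recompilation.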
