Python re module becomes 20 times slower when called on more than 100 different regexes
Posted by Wiil on Stack Overflow, 2013-06-26
My problem is about parsing log files and removing variable parts from each line so that the lines can be grouped. For instance:
s = re.sub(r'(?i)User [_0-9A-z]+ is ', r"User .. is ", s)
s = re.sub(r'(?i)Message rejected because : (.*?) \(.+\)', r'Message rejected because : \1 (...)', s)
I have more than 120 matching rules like those above, applied one after another to every log line.
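A simplified sketch of what the script does (the RULES list and the normalize function are illustrative names, not my actual code):

import re

# Roughly 120 (pattern, replacement) pairs like the two shown above.
RULES = [
    (r'(?i)User [_0-9A-z]+ is ', r'User .. is '),
    (r'(?i)Message rejected because : (.*?) \(.+\)',
     r'Message rejected because : \1 (...)'),
    # ... about 120 more rules ...
]

def normalize(line):
    # Apply every substitution rule in turn to strip the variable parts,
    # so that lines differing only in those parts end up identical.
    for pattern, repl in RULES:
        line = re.sub(pattern, repl, line)
    return line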
I found no performance issue while applying 100 different regexes in succession, but a huge slowdown appears as soon as a 101st regex is added.
The exact same behavior happens when I replace my rule set with:
for a in range(100):
    s = re.sub(r'(?i)caught here'+str(a)+':.+', r'( ... )', s)
It gets 20 times slower with range(101) instead:
# range(100)
% ./dashlog.py file.bz2
== Took  2.1 seconds.  ==
# range(101)
% ./dashlog.py file.bz2
== Took  47.6 seconds.  ==
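For reference, a stripped-down script along these lines reproduces the jump on its own (the fake input lines and the REGEX_COUNT constant are only for illustration; the timings above come from my real log file):

import re
import time

REGEX_COUNT = 101  # 100 stays fast; 101 triggers the slowdown for me

# Fake log lines standing in for the real bz2-compressed input.
lines = ['caught here%d: some variable text' % (i % 200) for i in range(1000)]

start = time.time()
for line in lines:
    for a in range(REGEX_COUNT):
        line = re.sub(r'(?i)caught here' + str(a) + r':.+', r'( ... )', line)
print('== Took %4.1f seconds.  ==' % (time.time() - start))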
Why is this happening? And is there any known workaround?
(Happens on Python 2.6.6/2.7.2 on Linux/Windows.)