Poor performance / speed of regex with lookahead

Posted by Hugo Zaragoza on Stack Overflow See other posts from Stack Overflow or by Hugo Zaragoza
Published on 2012-03-22T21:45:13Z Indexed on 2012/03/24 5:29 UTC
Read the original article Hit count: 223

Filed under:
|
|
|

I have been observing extremely slow execution times with expressions with several lookaheads.

I suppose that this is due to underlying data structures, but it seems pretty extreme and I wonder if I do something wrong or if there are known work-arounds.

The problem is determining if a set of words are present in a string, in any order. For example we want to find out if two terms "term1" AND "term2" are somewhere in a string. I do this with the expresion:

(?=.*\bterm1\b)(?=.*\bterm2\b)

But what I observe is that this is an order of magnitude slower than checking first just

\bterm1\b

and just then

\bterm2\b

This seems to indicate that I should use an array of patterns instead of a single pattern with lookaheads... is this right? it seems wrong...

Here is an example test code and resulting times:

public static void speedLookAhead() {
    Matcher m, m1, m2;
    boolean find;
    int its = 1000000;

    // create long non-matching string
    char[] str = new char[2000];
    for (int i = 0; i < str.length; i++) {
      str[i] = 'x';
    }
    String test = str.toString();

    // First method: use one expression with lookaheads
    m = Pattern.compile("(?=.*\\bterm1\\b)(?=.*\\bterm2\\b)").matcher(test);
    long time = System.currentTimeMillis();
    ;
    for (int i = 0; i < its; i++) {
      m.reset(test);
      find = m.find();
    }
    time = System.currentTimeMillis() - time;
    System.out.println(time);

    // Second method: use two expressions and AND the results
    m1 = Pattern.compile("\\bterm1\\b").matcher(test);
    m2 = Pattern.compile("\\bterm2\\b").matcher(test);
    time = System.currentTimeMillis();
    ;
    for (int i = 0; i < its; i++) {
      m1.reset(test);
      m2.reset(test);
      find = m1.find() && m2.find();
    }
    time = System.currentTimeMillis() - time;
    System.out.println(time);
  } 

This outputs in my computer:

1754
150

© Stack Overflow or respective owner

Related posts about java

Related posts about regex