Poor performance / speed of regex with lookahead
- by Hugo Zaragoza
I have been observing extremely slow execution times with expressions with several lookaheads.
I suppose that this is due to underlying data structures, but it seems pretty extreme and I wonder if I do something wrong or if there are known work-arounds.
The problem is determining if a set of words are present in a string, in any order. For example we want to find out if two terms "term1" AND "term2" are somewhere in a string. I do this with the expresion:
(?=.*\bterm1\b)(?=.*\bterm2\b)
But what I observe is that this is an order of magnitude slower than checking first just
\bterm1\b
and just then
\bterm2\b
This seems to indicate that I should use an array of patterns instead of a single pattern with lookaheads... is this right? it seems wrong...
Here is an example test code and resulting times:
public static void speedLookAhead() {
Matcher m, m1, m2;
boolean find;
int its = 1000000;
// create long non-matching string
char[] str = new char[2000];
for (int i = 0; i < str.length; i++) {
str[i] = 'x';
}
String test = str.toString();
// First method: use one expression with lookaheads
m = Pattern.compile("(?=.*\\bterm1\\b)(?=.*\\bterm2\\b)").matcher(test);
long time = System.currentTimeMillis();
;
for (int i = 0; i < its; i++) {
m.reset(test);
find = m.find();
}
time = System.currentTimeMillis() - time;
System.out.println(time);
// Second method: use two expressions and AND the results
m1 = Pattern.compile("\\bterm1\\b").matcher(test);
m2 = Pattern.compile("\\bterm2\\b").matcher(test);
time = System.currentTimeMillis();
;
for (int i = 0; i < its; i++) {
m1.reset(test);
m2.reset(test);
find = m1.find() && m2.find();
}
time = System.currentTimeMillis() - time;
System.out.println(time);
}
This outputs in my computer:
1754
150