Algorithm to match list of regular expressions

Posted by DSII on Stack Overflow See other posts from Stack Overflow or by DSII
Published on 2010-06-05T19:26:50Z Indexed on 2010/06/05 19:32 UTC
Read the original article Hit count: 455

Filed under:
|
|
|

I have two algorithmic questions for a project I am working on. I have thought about these, and have some suspicions, but I would love to hear the community's input as well.

  1. Suppose I have a string, and a list of N regular expressions (actually they are wildcard patterns representing a subset of full regex functionality). I want to know whether the string matches at least one of the regular expressions in the list. Is there a data structure that can allow me to match the string against the list of regular expressions in sublinear (presumably logarithmic) time?

  2. This is an extension of the previous problem. Suppose I have the same situation: a string and a list of N regular expressions, only now each of the regular expressions is paired with an offset within the string at which the match must begin (or, if you prefer, each of the regular expressions must match a substring of the given string beginning at the given offset).

To give an example, suppose I had the string:

This is a test string

and the regex patterns and offsets:

(a) his.* at offset 0 (b) his.* at offset 1

The algorithm should return true. Although regex (a) does not match the string beginning at offset 0, regex (b) does match the substring beginning at offset 1 ("his is a test string").

Is there a data structure that can allow me to solve this problem in sublinear time?

One possibly useful piece of information is that often, many of the offsets in the list of regular expressions are the same (i.e. often we are matching the substring at offset X many times). This may be useful to leverage the solution to problem #1 above.

Thank you very much in advance for any suggestions you may have!

© Stack Overflow or respective owner

Related posts about c++

Related posts about regex