Generate regular expression to match strings from the list A, but not from list B

Posted by Vlad on Stack Overflow See other posts from Stack Overflow or by Vlad
Published on 2012-10-14T22:13:37Z Indexed on 2012/10/15 3:38 UTC
Read the original article Hit count: 168

Filed under:
|
|
|
|

I have two lists of strings ListA and ListB. I need to generate a regular expression that will match all strings in ListA and will not match any string in ListB.

  • The strings could contain any combination of characters, numbers and punctuation.
  • If a string appears on ListA it is guaranteed that it will not be in the ListB.
  • If a string is not in either of these two lists I don't care what the result of the matching should be.

The lists typically contain thousands of strings, and strings are fairly similar to each other.

I know the trivial answer to this question, which is just generate a regular expression of the form (Str1)|(Str2)|(Str3) where StrN is the string from ListA. But I am looking for a more efficient way to do this.

Ideal solution would be some sort of tool that will take two lists and generate a Java regular expression for this.

Update 1: By "efficient", I mean to generate expression that is shorter than trivial solution. The ideal algorithm would generate the shorted possible expression. Here are some examples.

ListA = { C10 , C15, C195 }
ListB = { Bob, Billy }

The ideal expression would be

/^C1.+$/

Another example, note the third element of ListB

ListA = { C10 , C15, C195 }
ListB = { Bob, Billy, C25 }

The ideal expression is

/^C[^2]{1}.+$/

The last example

ListA = { A , D ,E , F , H } ListB = { B , C , G , I }

The ideal expression is the same as trivial solution which is

/^(A|D|E|F|H)$/

Also, I am not looking for the ideal solution, anything better than trivial would help. I was thinking along the lines of generating the list of trivial solutions, and then try to merge the common substrings while watching that we don't wander into ListB territory.

*Update 2: I am not particularly worried about the time it takes to generate the RegEx, anything under 10 minutes on the modern machine is acceptable

© Stack Overflow or respective owner

Related posts about java

Related posts about regex