Generate regular expression to match strings from the list A, but not from list B
- by Vlad
I have two lists of strings ListA and ListB. I need to generate a regular expression that will match all strings in ListA and will not match any string in ListB.
The strings could contain any combination of characters, numbers and punctuation.
If a string appears on ListA it is guaranteed that it will not be in the ListB.
If a string is not in either of these two lists I don't care what the result of the matching should be.
The lists typically contain thousands of strings, and strings are fairly similar to each other.
I know the trivial answer to this question, which is just generate a regular expression of the form (Str1)|(Str2)|(Str3) where StrN is the string from ListA. But I am looking for a more efficient way to do this.
Ideal solution would be some sort of tool that will take two lists and generate a Java regular expression for this.
Update 1: By "efficient", I mean to generate expression that is shorter than trivial solution. The ideal algorithm would generate the shorted possible expression. Here are some examples.
ListA = { C10 , C15, C195 }
ListB = { Bob, Billy }
The ideal expression would be
/^C1.+$/
Another example, note the third element of ListB
ListA = { C10 , C15, C195 }
ListB = { Bob, Billy, C25 }
The ideal expression is
/^C[^2]{1}.+$/
The last example
ListA = { A , D ,E , F , H }
ListB = { B , C , G , I }
The ideal expression is the same as trivial solution which is
/^(A|D|E|F|H)$/
Also, I am not looking for the ideal solution, anything better than trivial would help. I was thinking along the lines of generating the list of trivial solutions, and then try to merge the common substrings while watching that we don't wander into ListB territory.
*Update 2: I am not particularly worried about the time it takes to generate the RegEx, anything under 10 minutes on the modern machine is acceptable