How can I detect common substrings in a list of strings
        Posted  
        
            by 
                danio
            
        on Stack Overflow
        
        See other posts from Stack Overflow
        
            or by danio
        
        
        
        Published on 2009-09-11T13:19:35Z
        Indexed on 
            2013/11/12
            3:55 UTC
        
        
        Read the original article
        Hit count: 194
        
algorithm
|pattern-matching
Given a set of strings, for example:
EFgreen
EFgrey
EntireS1
EntireS2
J27RedP1
J27GreenP1
J27RedP2
J27GreenP2
JournalP1Black
JournalP1Blue
JournalP1Green
JournalP1Red
JournalP2Black
JournalP2Blue
JournalP2Green
I want to be able to detect that these are three sets of files:
- EntireS[1,2]
- J27[Red,Green]P[1,2]
- JournalP[1,2][Red,Green,Blue]
Are there any known ways of approaching this problem - any published papers I can read on this?
The approach I am considering is for each string look at all other strings and find the common characters and where differing characters are, trying to find sets of strings that have the most in common, but I fear that this is not very efficient and may give false positives.
Note that this is not the same as 'How do I detect groups of common strings in filenames' because that assumes that a string will always have a series of digits following it.
[Edited 15/09/09 to add more sample strings]
© Stack Overflow or respective owner