Python: (sampling with replacement): efficient algorithm to extract the set of DISSIMILAR N-tuples from a set
- by Homunculus Reticulli
I have a set of items, from which I want to select DISSIMILAR tuples (more on the definition of dissimilar touples later). The set could contain potentially several thousand items, although typically, it would contain only a few hundreds.
I am trying to write a generic algorithm that will allow me to select N items to form an N-tuple, from the original set. The new set of selected N-tuples should be DISSIMILAR.
A N-tuple A is said to be DISSIMILAR to another N-tuple B if and only if:
Every pair (2-tuple) that occurs in A DOES NOT appear in B
Note: For this algorithm, A 2-tuple (pair) is considered SIMILAR/IDENTICAL if it contains the same elements, i.e. (x,y) is considered the same as (y,x).
This is a (possible variation on the) classic Urn Problem. A trivial (pseudocode) implementation of this algorithm would be something along the lines of
def fetch_unique_tuples(original_set, tuple_size):
while True:
# randomly select [tuple_size] items from the set to create first set
# create a key or hash from the N elements and store in a set
# store selected N-tuple in a container
if end_condition_met:
break
I don't think this is the most efficient way of doing this - and though I am no algorithm theorist, I suspect that the time for this algorithm to run is NOT O(n) - in fact, its probably more likely to be O(n!). I am wondering if there is a more efficient way of implementing such an algo, and preferably, reducing the time to O(n).
Actually, as Mark Byers pointed out there is a second variable m, which is the size of the number of elements being selected. This (i.e. m) will typically be between 2 and 5.
Regarding examples, here would be a typical (albeit shortened) example:
original_list = ['CAGG', 'CTTC', 'ACCT', 'TGCA', 'CCTG', 'CAAA', 'TGCC', 'ACTT', 'TAAT', 'CTTG', 'CGGC', 'GGCC', 'TCCT', 'ATCC', 'ACAG', 'TGAA', 'TTTG', 'ACAA', 'TGTC', 'TGGA', 'CTGC', 'GCTC', 'AGGA', 'TGCT', 'GCGC', 'GCGG', 'AAAG', 'GCTG', 'GCCG', 'ACCA', 'CTCC', 'CACG', 'CATA', 'GGGA', 'CGAG', 'CCCC', 'GGTG', 'AAGT', 'CCAC', 'AACA', 'AATA', 'CGAC', 'GGAA', 'TACC', 'AGTT', 'GTGG', 'CGCA', 'GGGG', 'GAGA', 'AGCC', 'ACCG', 'CCAT', 'AGAC', 'GGGT', 'CAGC', 'GATG', 'TTCG']
# Select 3-tuples from the original list should produce a list (or set) similar to:
[('CAGG', 'CTTC', 'ACCT')
('CAGG', 'TGCA', 'CCTG')
('CAGG', 'CAAA', 'TGCC')
('CAGG', 'ACTT', 'ACCT')
('CAGG', 'CTTG', 'CGGC')
....
('CTTC', 'TGCA', 'CAAA')
]
[[Edit]]
Actually, in constructing the example output, I have realized that the earlier definition I gave for UNIQUENESS was incorrect. I have updated my definition and have introduced a new metric of DISSIMILARITY instead, as a result of this finding.