Python: (sampling with replacement): efficient algorithm to extract the set of DISSIMILAR N-tuples from a set

Posted by Homunculus Reticulli on Stack Overflow See other posts from Stack Overflow or by Homunculus Reticulli
Published on 2012-04-06T10:41:17Z Indexed on 2012/04/06 17:29 UTC
Read the original article Hit count: 211

Filed under:
|
|

I have a set of items, from which I want to select DISSIMILAR tuples (more on the definition of dissimilar touples later). The set could contain potentially several thousand items, although typically, it would contain only a few hundreds.

I am trying to write a generic algorithm that will allow me to select N items to form an N-tuple, from the original set. The new set of selected N-tuples should be DISSIMILAR.

A N-tuple A is said to be DISSIMILAR to another N-tuple B if and only if:

  • Every pair (2-tuple) that occurs in A DOES NOT appear in B

Note: For this algorithm, A 2-tuple (pair) is considered SIMILAR/IDENTICAL if it contains the same elements, i.e. (x,y) is considered the same as (y,x).

This is a (possible variation on the) classic Urn Problem. A trivial (pseudocode) implementation of this algorithm would be something along the lines of

def fetch_unique_tuples(original_set, tuple_size):
    while True:
        # randomly select [tuple_size] items from the set to create first set
        # create a key or hash from the N elements and store in a set
        # store selected N-tuple in a container
        if end_condition_met:
            break

I don't think this is the most efficient way of doing this - and though I am no algorithm theorist, I suspect that the time for this algorithm to run is NOT O(n) - in fact, its probably more likely to be O(n!). I am wondering if there is a more efficient way of implementing such an algo, and preferably, reducing the time to O(n).

Actually, as Mark Byers pointed out there is a second variable m, which is the size of the number of elements being selected. This (i.e. m) will typically be between 2 and 5.

Regarding examples, here would be a typical (albeit shortened) example:

original_list = ['CAGG', 'CTTC', 'ACCT', 'TGCA', 'CCTG', 'CAAA', 'TGCC', 'ACTT', 'TAAT', 'CTTG', 'CGGC', 'GGCC', 'TCCT', 'ATCC', 'ACAG', 'TGAA', 'TTTG', 'ACAA', 'TGTC', 'TGGA', 'CTGC', 'GCTC', 'AGGA', 'TGCT', 'GCGC', 'GCGG', 'AAAG', 'GCTG', 'GCCG', 'ACCA', 'CTCC', 'CACG', 'CATA', 'GGGA', 'CGAG', 'CCCC', 'GGTG', 'AAGT', 'CCAC', 'AACA', 'AATA', 'CGAC', 'GGAA', 'TACC', 'AGTT', 'GTGG', 'CGCA', 'GGGG', 'GAGA', 'AGCC', 'ACCG', 'CCAT', 'AGAC', 'GGGT', 'CAGC', 'GATG', 'TTCG']


# Select 3-tuples from the original list should produce a list (or set) similar to:

    [('CAGG', 'CTTC', 'ACCT')
     ('CAGG', 'TGCA', 'CCTG')
     ('CAGG', 'CAAA', 'TGCC')
     ('CAGG', 'ACTT', 'ACCT')
     ('CAGG', 'CTTG', 'CGGC')
     ....
     ('CTTC', 'TGCA', 'CAAA')
    ]

[[Edit]]

Actually, in constructing the example output, I have realized that the earlier definition I gave for UNIQUENESS was incorrect. I have updated my definition and have introduced a new metric of DISSIMILARITY instead, as a result of this finding.

© Stack Overflow or respective owner

Related posts about python

Related posts about algorithm