Search for a string allowing for one mismatch at any location in the string, Python
- by Vincent
I am working with DNA sequences of length 25 (see examples below). I have a list of 230,000 of them and need to look for each sequence in the entire genome (the Toxoplasma gondii parasite). I am not sure how large the genome is, but it is much more than 230,000 sequences.
I need to look for each of my 25-character sequences, for example AGCCTCCCATGATTGAACAGATCAT.
The genome is formatted as one continuous string, i.e. (CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAGTGCGGAGCCTGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTT.........)
I don't care where or how many times it is found, just yes or no. I think this part is simple: str.find("AGCCTCCCATGATTGAACAGATCAT")
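A minimal sketch of the exact-match check, assuming the whole genome has already been read into a single Python string. The function name found_exactly and the short toy genome are made up for illustration. Since only a yes/no answer is needed, the `in` operator is enough; str.find would also work, but it returns -1 rather than False when nothing is found.

```python
def found_exactly(probe, genome):
    """Return True if probe occurs at least once anywhere in genome."""
    return probe in genome


# Toy stand-in for the real genome string (hypothetical data).
genome = "CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGG"

print(found_exactly("GGCTTGCGGAG", genome))
print(found_exactly("AGCCTCCCATGATTGAACAGATCAT", genome))
```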
But I also want to find close matches, defined as a mismatch at exactly one position (any position, but only one), and record the position in the sequence where the mismatch occurs. I am not sure how to do this. The only thing I can think of is using a wildcard and performing the search with the wildcard in each position in turn, i.e. searching 25 times per sequence.
For example
AGCCTCCCATGATTGAACAGATCAT
AGCCTCCCATGATAGAACAGATCAT
is a close match with a mismatch at position 13 (counting from 0)
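The wildcard idea could be sketched roughly like this, assuming Python's standard re module and a genome held in one string. The function name find_one_mismatch and the toy genome are hypothetical. Each position of the sequence is replaced in turn by a character class that excludes the original base, so a hit means exactly one mismatch, and the loop index is the mismatch position (counting from 0).

```python
import re


def find_one_mismatch(probe, genome):
    """Return (genome_index, mismatch_position) for the first window of
    genome that differs from probe at exactly one position, or None."""
    for pos in range(len(probe)):
        # Exact text before and after pos; any character except the
        # original one at pos itself.
        pattern = (
            re.escape(probe[:pos])
            + "[^" + re.escape(probe[pos]) + "]"
            + re.escape(probe[pos + 1:])
        )
        match = re.search(pattern, genome)
        if match:
            return match.start(), pos
    return None


probe = "AGCCTCCCATGATTGAACAGATCAT"
# Toy genome (hypothetical): the close match from the example, with flanks.
genome = "CATGG" + "AGCCTCCCATGATAGAACAGATCAT" + "GGCTT"

print(find_one_mismatch(probe, genome))
```

This scans the genome up to 25 times per sequence, so it is only a starting point and may be slow for 230,000 sequences against a full genome.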
Speed is not a big issue, since I am (I hope) only doing this 3 times, but it would be nice if it was fast.
There are programs that find matches and partial matches, but I am looking for a type of partial match that is not available in those applications.
Here is a similar post for Perl, but it only compares sequences rather than searching a continuous string:
Related post