Imputing missing data in aligned sequences
- by Kwame Oduro
I want a simple perl script that can help me impute missing nucleotides in aligned sequences: As an example, my old_file contains the following aligned sequences:
seq1
ATGTC
seq2
ATGTC
seq3
ATNNC
seq4
NNGTN
seq5
CTCTN
So I now want to infer all Ns in the file and get a new file with all the Ns inferred based on the majority nucleotide at a particular position. My new_file should look like this:
seq1
ATGTC
seq2
ATGTC
seq3
ATGTC
seq4
ATGTC
seq5
CTCTC
A script with usage: "impute_missing_data.pl old_file new_file" or any other approach will be helpful to me.
Thank you.