I have a file called genes.txt that I'd like to read into a data.frame. It has a lot of lines, each with three tab-delimited fields:
mike$ wc -l genes.txt
42476 genes.txt
To read this file into a data.frame in R, I use read.table like this:
genes = read.table(
    genes_file,
    sep="\t",
    na.strings="-",
    fill=TRUE,
    col.names=c("GeneSymbol","synonyms","description")
)
This seems to work fine (genes_file points at genes.txt). However, the number of rows in my data.frame is significantly lower than the number of lines in my text file:
> nrow(genes)
[1] 27896
and things I can find in the text file:
mike$ grep "SELL" genes.txt
SELL CD62L|LAM1|LECAM1|LEU8|LNHR|LSEL|LYAM1|PLNHR|TQ1 selectin L
don't seem to be in the data.frame:
> grep("SELL",genes$GeneSymbol)
integer(0)
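One way to narrow down where the rows go missing is to compare the number of physical lines with the number of fields that R's scanner sees on each line. The sketch below is only a diagnostic under assumptions: it uses a small toy file with made-up rows (not real rows from genes.txt) and tests the guess that apostrophes in the description column are being treated as quote characters. toy_file and the row contents are hypothetical.

```r
# Toy stand-in for genes.txt: hypothetical rows, not taken from the real file.
toy_file <- tempfile(fileext = ".txt")
writeLines(c(
  "GENE1\tSYN1|SYN2\tfirst toy gene",
  "GENE2\t-\t5' toy nucleotidase",
  "GENE3\tSYN3\tplain toy description",
  "GENE4\t-\ttoy protein 3' end"
), toy_file)

# Number of physical lines, the R equivalent of `wc -l`.
n_lines <- length(readLines(toy_file))

# Fields per line using count.fields' default quote set, which includes
# the apostrophe (the same default quote set that read.table uses).
fields_default <- count.fields(toy_file, sep = "\t")

# Fields per line with quoting switched off entirely.
fields_noquote <- count.fields(toy_file, sep = "\t", quote = "")

print(n_lines)
print(fields_default)
print(fields_noquote)
```

If the two count.fields results differ on the real file, or the default one contains NA entries, then quote handling is a likely cause of the missing rows.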
It turns out that
genes = read.delim(
    genes_file,
    header=FALSE,
    na.strings="-",
    fill=TRUE,
    col.names=c("GeneSymbol","synonyms","description")
)
works just fine. Why does read.delim work when read.table does not?
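Since read.delim is a thin wrapper around read.table, a reasonable place to look is the arguments whose defaults differ between the two. The sketch below lists those defaults with formals() and then, on a toy file of made-up rows (toy_file and its contents are hypothetical, not from genes.txt), calls read.table with read.delim's quote and comment.char defaults. It is meant as a way to test the idea, not as a confirmed explanation for the real file.

```r
# Compare the defaults of the arguments that differ between the two readers.
args_to_compare <- c("header", "sep", "quote", "fill", "comment.char")
defaults <- data.frame(
  argument   = args_to_compare,
  read.table = sapply(args_to_compare,
                      function(a) deparse(formals(read.table)[[a]])),
  read.delim = sapply(args_to_compare,
                      function(a) deparse(formals(read.delim)[[a]])),
  row.names  = NULL
)
print(defaults)

# Toy stand-in for genes.txt: hypothetical rows containing apostrophes.
toy_file <- tempfile(fileext = ".txt")
writeLines(c(
  "GENE1\tSYN1|SYN2\tfirst toy gene",
  "GENE2\t-\t5' toy nucleotidase",
  "GENE3\tSYN3\tplain toy description",
  "GENE4\t-\ttoy protein 3' end"
), toy_file)

# read.table with read.delim's quote and comment.char defaults passed explicitly.
genes_toy <- read.table(
  toy_file,
  sep = "\t",
  quote = "\"",
  comment.char = "",
  na.strings = "-",
  fill = TRUE,
  col.names = c("GeneSymbol", "synonyms", "description")
)
print(nrow(genes_toy))
```

If passing quote="\"" and comment.char="" to the original read.table call brings the row count of the real data.frame in line with wc -l, the differing defaults would account for the behaviour.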
If it's of use, you can recreate genes.txt using the following commands which you should run from a command line
curl -O ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
gzip -cd gene_info.gz | awk -F'\t' '$1==9606{print $3 "\t" $5 "\t" $9}' > genes.txt
Be warned, though, that gene_info.gz is roughly 101 MB.