R: Why does read.table stop reading a file?

Posted by Mike Dewar on Stack Overflow
Published on 2010-06-10T16:25:28Z Indexed on 2010/06/10 16:42 UTC

I have a file called genes.txt that I'd like to turn into a data.frame. It has many lines, each with three tab-delimited fields:

mike$ wc -l genes.txt
   42476 genes.txt

I'd like to read this file into a data.frame in R. I use the command read.table, like this:

genes = read.table(
    genes_file, 
    sep="\t", 
    na.strings="-", 
    fill=TRUE,
    col.names=c("GeneSymbol","synonyms","description")
)

This seems to work fine (genes_file points at genes.txt). However, the number of rows in my data.frame is significantly smaller than the number of lines in my text file:

> nrow(genes)
[1] 27896

And entries that I can find in the text file:

mike$ grep "SELL" genes.txt 
SELL    CD62L|LAM1|LECAM1|LEU8|LNHR|LSEL|LYAM1|PLNHR|TQ1    selectin L

don't seem to be in the data.frame:

> grep("SELL",genes$GeneSymbol)
integer(0)

It turns out that

genes = read.delim(
    genes_file,
    header=FALSE,
    na.strings="-",
    fill=TRUE,
    col.names=c("GeneSymbol","synonyms","description")
)

works just fine. Why does read.delim work when read.table does not?

If it's of use, you can recreate genes.txt using the following commands, which you should run from a command line:

curl -O ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
gzip -cd gene_info.gz | awk -F'\t' '$1==9606{print $3 "\t" $5 "\t" $9}' > genes.txt

Be warned, though, that gene_info.gz is roughly 101 MB.
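
One difference between the two calls that may be worth checking: read.table defaults to quote="\"'" and comment.char="#", whereas read.delim defaults to quote="\"" and comment.char="". The sketch below counts lines containing an apostrophe or a hash, the two characters that only read.table treats specially. It is a diagnostic under the assumption that such characters occur in the description field. The sample rows and the helper name count_lines_with are made up for illustration and are not taken from genes.txt.

```shell
# Count the lines of a file that contain a given literal string.
# $1 = string to look for, $2 = file; grep -c prints the number of matching lines.
count_lines_with() {
    grep -cF -- "$1" "$2"
}

# Hypothetical three-line sample standing in for genes.txt (tab-delimited).
sample=$(mktemp)
printf '%s\t%s\t%s\n' "SELL" "CD62L|LAM1" "selectin L" > "$sample"
printf '%s\t%s\t%s\n' "GENEA" "-" "5'-nucleotidase example" >> "$sample"
printf '%s\t%s\t%s\n' "GENEB" "-" "protein #2 example" >> "$sample"

# Apostrophes open a quoted string under read.table's default quote set;
# hashes start a comment under its default comment.char.
apostrophe_lines=$(count_lines_with "'" "$sample")
hash_lines=$(count_lines_with "#" "$sample")

echo "lines with apostrophe: $apostrophe_lines"
echo "lines with hash: $hash_lines"

rm -f "$sample"
```

To run this against the real file, pass genes.txt in place of "$sample". If either count is nonzero, calling read.table with quote="" and comment.char="" may be worth trying to see whether the row count then matches wc -l.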

© Stack Overflow or respective owner