lapply slower than for-loop when used for a BiomaRt query. Is that expected?

Posted by ptocquin on Stack Overflow. Published on 2012-09-05T19:49:47Z.

I would like to query a database using the biomaRt package. I have a set of loci and want to retrieve some related information for each of them, say its description.

I first tried lapply but was surprised by how long the task took. I then tried a plain for-loop and got a faster result.

Is that expected, or is something wrong with my code or with my understanding of apply? I have read other posts on *apply vs. for-loop performance (here, for example) and was aware that lapply should not be expected to be faster, but I don't understand why it is actually slower here.

Here is a reproducible example.

1) Loading the library and selecting the database:

library("biomaRt")
athaliana <- useMart("plants_mart_14")
athaliana <- useDataset("athaliana_eg_gene",mart=athaliana)

2) Querying the database:

loci <- c("at1g01300", "at1g01800", "at1g01900", "at1g02335", "at1g02790", 
"at1g03220", "at1g03230", "at1g04040", "at1g04110", "at1g05240"
)

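I created a function to use with lapply:

foo <- function(loci) {
  # getBM arguments, in order: attributes, filters, values, mart
  getBM(attributes = "description", filters = "tair_locus",
        values = loci, mart = athaliana)
}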

When I use this function on the first element:

> system.time(foo(loci[1]))
   user  system elapsed
  0.020   0.004   1.599

When I use lapply to retrieve the data for all values:

> system.time(lapply(loci, foo))
   user  system elapsed
  0.220   0.000  16.376
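
A rough sanity check on these two timings, assuming every query costs about as much as the single one timed above (an assumption, since server response times vary): ten sequential queries should then take roughly ten times the single-query elapsed time. The variable names below are only for illustration.

```r
# Elapsed times in seconds, copied from the two timings reported above.
single_call_elapsed <- 1.599   # one query for a single locus
lapply_elapsed      <- 16.376  # lapply over all ten loci
n_loci              <- 10

# If each query costs about the same, ten sequential queries should take:
expected_elapsed <- n_loci * single_call_elapsed
expected_elapsed

# Elapsed time left over once the per-query cost is accounted for.
lapply_elapsed - expected_elapsed
```

The leftover is well under a second, and the user and system times are tiny compared with the elapsed time, which suggests that nearly all of the time is spent waiting for the server on each request rather than inside lapply itself.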

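I then created a new function that uses a for-loop:

foo2 <- function(loci) {
  # Iterate over the locus IDs themselves; with `for (i in loci)`, loci[i]
  # would index an unnamed vector by name and yield NA.
  for (locus in loci) {
    getBM("description", "tair_locus", locus, athaliana)
  }
}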

Here is the result:

> system.time(foo2(loci))
   user  system elapsed
  0.204   0.004  10.919
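
When a for-loop iterates directly over a character vector, the loop variable holds the element values, not their positions. Using that variable as an index on an unnamed vector therefore returns NA instead of the locus. The sketch below needs no database and only illustrates this base R behaviour; how getBM treats an NA value is not covered here, so a timing obtained with such an index may not be comparable with the lapply timing.

```r
loci <- c("at1g01300", "at1g01800", "at1g01900")

# `for (i in loci)` hands out the element values themselves, not positions.
for (i in loci) {
  # Indexing an unnamed vector by a character string looks for an element
  # with that name, finds none, and returns NA.
  print(loci[i])
}

# Two ways to visit each locus as intended:
for (locus in loci) print(locus)            # iterate over the values
for (k in seq_along(loci)) print(loci[k])   # iterate over the positions
```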

Of course, this will be applied to a much larger list of loci, so I need the best-performing option. Thank you for your help.

EDIT: Following the recommendation of @MartinMorgan

Simply passing the whole vector loci to getBM in a single call greatly improves query efficiency. Simpler is better.

> system.time(lapply(loci, foo))
   user  system elapsed
  0.236   0.024 110.512

> system.time(foo2(loci))
   user  system elapsed
  0.208   0.040 116.099

> system.time(foo(loci))
   user  system elapsed
  0.028   0.000   6.193
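
For a much larger list of loci, a single request may become unwieldy, and one possible middle ground is to send the values in a few large chunks. The sketch below is only an illustration under stated assumptions: `query_in_chunks` is a hypothetical helper, `fake_query` is a stand-in that sleeps to imitate network latency instead of contacting a database, and the chunk sizes are arbitrary.

```r
# Hypothetical helper: split `values` into chunks of at most `chunk_size`
# elements, call `query_fn` once per chunk, and stack the results.
query_in_chunks <- function(values, query_fn, chunk_size = 500) {
  chunk_id <- ceiling(seq_along(values) / chunk_size)
  chunks <- split(values, chunk_id)
  results <- lapply(chunks, query_fn)
  do.call(rbind, unname(results))
}

# Stand-in for a remote query (NOT biomaRt): counts requests and sleeps
# briefly to imitate a fixed per-request latency.
n_requests <- 0
fake_query <- function(values) {
  n_requests <<- n_requests + 1
  Sys.sleep(0.05)
  data.frame(locus = values,
             description = paste("description of", values),
             stringsAsFactors = FALSE)
}

loci <- sprintf("at1g%05d", 1:25)   # made-up identifiers for the demo
res <- query_in_chunks(loci, fake_query, chunk_size = 10)

nrow(res)    # one row per locus
n_requests   # number of requests actually sent
```

With the real database, a function like foo above could be passed as `query_fn`. The right chunk size would have to be found by experiment.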
