I'm trying to do a simple genomic track intersection in R, and I'm running into major performance problems, probably related to my use of for loops.
In this situation, I have pre-defined windows at 100 bp intervals, and I'm trying to calculate how much of each window is covered by the annotations in mylist. Graphically, it looks something like this:
         0    100   200   300   400   500   600
windows: |-----|-----|-----|-----|-----|-----|
mylist:   |-|   |-----------|
So I wrote the following to do just that, but it's fairly slow and has become a bottleneck in my code:
## coverage counter for each 100-bp window
windows <- numeric(6)

## second track: a list of (start, end) intervals
mylist <- vector("list")
mylist[[1]] <- c(1, 20)
mylist[[2]] <- c(120, 320)

## do the intersection: for each interval, walk over the windows it
## touches and add the length of the overlap to each one
for (i in 1:length(mylist)) {
  st <- floor(mylist[[i]][1] / 100) + 1  # first window touched
  sp <- floor(mylist[[i]][2] / 100) + 1  # last window touched
  for (j in st:sp) {
    b <- max((j - 1) * 100, mylist[[i]][1])  # overlap start
    e <- min(j * 100, mylist[[i]][2])        # overlap end
    windows[j] <- windows[j] + e - b + 1
  }
}
print(windows)
[1] 20 81 101 21 0 0
Naturally, this is being used on data sets that are much larger than the example I provide here. Through some profiling, I can see that the bottleneck is in the for loops, but my clumsy attempt to vectorize it using *apply functions resulted in code that runs an order of magnitude more slowly.
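For concreteness, my *apply rewrite was roughly along these lines (a sketch of the approach rather than my exact code): instead of walking only the windows each interval touches, it visits every window/interval pair.
## rough sketch of the *apply version: for every window, scan every
## interval in mylist and sum up the overlap lengths
windows2 <- sapply(seq_along(windows), function(j) {
  lo <- (j - 1) * 100
  hi <- j * 100
  sum(sapply(mylist, function(iv) {
    b <- max(lo, iv[1])
    e <- min(hi, iv[2])
    if (e >= b) e - b + 1 else 0
  }))
})
It produces the same result on the toy example above, but it examines every window/interval combination rather than only the overlapping ones, which I suspect is where the slowdown comes from.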
I suppose I could write something in C, but I'd like to avoid that if possible. Can anyone suggest another approach that will speed this calculation up?