Faster way to compare rows in a data frame

Posted by aguiar on Stack Overflow
Published on 2013-11-01T03:49:40Z Indexed on 2013/11/01 3:53 UTC

Consider the data frame below. I want to compare each row with every row below it and keep the pairs of rows that have equal values in more than 3 columns.

I wrote the code below, but it is very slow on a large data frame.

How can I do this faster?

data <- as.data.frame(matrix(c(10,11,10,13,9,10,11,10,14,9,10,10,8,12,9,10,11,10,13,9,13,13,10,13,9), nrow=5, byrow=TRUE))
rownames(data) <- c("sample_1","sample_2","sample_3","sample_4","sample_5")

>data
          V1 V2 V3 V4 V5
sample_1  10 11 10 13  9
sample_2  10 11 10 14  9
sample_3  10 10  8 12  9
sample_4  10 11 10 13  9
sample_5  13 13 10 13  9

tab <- data.frame(sample = NA, duplicate = NA, matches = NA)
dfrow <- 1
# Compare each row i with every row j below it
for(i in seq_len(nrow(data) - 1)) {
    sample <- data[i, ]
    for(j in (i+1):nrow(data)) {
        # Count the columns in which rows i and j hold the same value
        matches <- 0
        for(V in 1:ncol(data)) {
            if(data[j,V] == sample[,V]) {
                matches <- matches + 1
            }
        }
        # Keep only the pairs that agree in more than 3 columns
        if(matches > 3) {
            duplicate <- data[j, ]
            pair <- cbind(rownames(sample), rownames(duplicate), matches)
            tab[dfrow, ] <- pair
            dfrow <- dfrow + 1
        }
    }
}

>tab
   sample    duplicate    matches
1 sample_1   sample_2     4
2 sample_1   sample_4     5
3 sample_2   sample_4     4
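
One possible way to speed this up, offered as a sketch rather than a definitive answer: convert the data frame to a matrix, build all row pairs with combn(), and count equal columns for every pair in a single vectorized comparison. The names m, pairs, keep and tab2 are introduced here for illustration only, and the sketch assumes all columns are numeric and contain no NA values.

```r
# Same example data as in the question
data <- as.data.frame(matrix(c(10,11,10,13,9,
                               10,11,10,14,9,
                               10,10,8,12,9,
                               10,11,10,13,9,
                               13,13,10,13,9), nrow = 5, byrow = TRUE))
rownames(data) <- paste0("sample_", 1:5)

# Matrix indexing is much cheaper than repeated data[i, j] on a data frame
m <- as.matrix(data)

# 2 x choose(n, 2) matrix: each column is one pair (i, j) with i < j
pairs <- combn(nrow(m), 2)

# Compare row i with row j element-wise for all pairs at once,
# then count the TRUE values per pair
matches <- unname(rowSums(m[pairs[1, ], , drop = FALSE] ==
                          m[pairs[2, ], , drop = FALSE]))

# Keep only the pairs that agree in more than 3 columns
keep <- matches > 3
tab2 <- data.frame(sample    = rownames(m)[pairs[1, keep]],
                   duplicate = rownames(m)[pairs[2, keep]],
                   matches   = matches[keep],
                   stringsAsFactors = FALSE)
tab2
```

On the example data this should give the same three pairs as the loop above, with matches stored as numbers rather than strings. Note that the sketch holds all choose(n, 2) pairs in memory at once, so for a very large data frame it may be necessary to process the rows in chunks, or to keep the outer loop over i and vectorize only the inner comparison.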

© Stack Overflow or respective owner
