Subset a data.frame by list and apply function on each part, by rows

Posted by aL3xa on Stack Overflow See other posts from Stack Overflow or by aL3xa
Published on 2010-02-28T13:47:03Z Indexed on 2010/03/13 11:15 UTC
Read the original article Hit count: 653

Filed under:

lapply

This may seem as a typical plyr problem, but I have something different in mind. Here's the function that I want to optimize (skip the for loop).

# dummy data
set.seed(1985)
lst <- list(a=1:10, b=11:15, c=16:20)
m <- matrix(round(runif(200, 1, 7)), 10)
m <- as.data.frame(m)


dfsub <- function(dt, lst, fun) {
    # check whether dt is `data.frame`
    stopifnot (is.data.frame(dt))
    # check if vectors in lst are "whole" / integer
    # vector elements should be column indexes
    is.wholenumber <- function(x, tol = .Machine$double.eps^0.5)  abs(x - round(x)) < tol
    # fall if any non-integers in list
    idx <- rapply(lst, is.wholenumber)
    stopifnot(idx)
    # check for list length
    stopifnot(ncol(dt) == length(idx))
    # subset the data
    subs <- list()
    for (i in 1:length(lst)) {
            # apply function on each part, by row
            subs[[i]] <- apply(dt[ , lst[[i]]], 1, fun)
    }
    # preserve names
    names(subs) <- names(lst)
    # convert to data.frame
    subs <- as.data.frame(subs)
    # guess what =)
    return(subs)
}

And now a short demonstration... actually, I'm about to explain what I primarily intended to do. I wanted to subset a data.frame by vectors gathered in list object. Since this is a part of code from a function that accompanies data manipulation in psychological research, you can consider m as a results from personality questionnaire (10 subjects, 20 vars). Vectors in list hold column indexes that define questionnaire subscales (e.g. personality traits). Each subscale is defined by several items (columns in data.frame). If we presuppose that the score on each subscale is nothing more than sum (or some other function) of row values (results on that part of questionnaire for each subject), you could run:

> dfsub(m, lst, sum)
    a  b  c
1  46 20 24
2  41 24 21
3  41 13 12
4  37 14 18
5  57 18 25
6  27 18 18
7  28 17 20
8  31 18 23
9  38 14 15
10 41 14 22

I took a glance at this function and I must admit that this little loop isn't spoiling the code at all... BUT, if there's an easier/efficient way of doing this, please, let me know!

Developer IT