Calculating a Sample Covariance Matrix for Groups with plyr

Posted by John A. Ramey on Stack Overflow See other posts from Stack Overflow or by John A. Ramey
Published on 2010-04-28T07:13:09Z Indexed on 2010/04/28 12:23 UTC
Read the original article Hit count: 325

Filed under:

I'm going to use the sample code from http://gettinggeneticsdone.blogspot.com/2009/11/split-apply-and-combine-in-r-using-plyr.html for this example. So, first, let's copy their example data:

mydata=data.frame(X1=rnorm(30), X2=rnorm(30,5,2),
SNP1=c(rep("AA",10), rep("Aa",10), rep("aa",10)),
SNP2=c(rep("BB",10), rep("Bb",10), rep("bb",10)))

I am going to ignore SNP2 in this example and just pretend the values in SNP1 denote group membership. So then, I may want some summary statistics about each group in SNP1: "AA", "Aa", "aa".

Then if I want to calculate the means for each variable, it makes sense (modifying their code slightly) to use:

> ddply(mydata, c("SNP1"), function(df)
data.frame(meanX1=mean(df$X1), meanX2=mean(df$X2)))
  SNP1      meanX1   meanX2
1   aa  0.05178028 4.812302
2   Aa  0.30586206 4.820739
3   AA -0.26862500 4.856006

But what if I want the sample covariance matrix for each group? Ideally, I would like a 3D array, where the I have the covariance matrix for each group, and the third dimension denotes the corresponding group. I tried a modified version of the previous code and got the following results that have convinced me that I'm doing something wrong.

> daply(mydata, c("SNP1"), function(df) cov(cbind(df$X1, df$X2)))
, ,  = 1


SNP1         1          2
  aa 1.4961210 -0.9496134
  Aa 0.8833190 -0.1640711
  AA 0.9942357 -0.9955837

, ,  = 2


SNP1          1        2
  aa -0.9496134 2.881515
  Aa -0.1640711 2.466105
  AA -0.9955837 4.938320

I was thinking that the dim() of the 3rd dimension would be 3, but instead, it is 2. Really this is a sliced up version of the covariance matrix for each group. If we manually compute the sample covariance matrix for aa, we get:

           [,1]       [,2]
[1,]  1.4961210 -0.9496134
[2,] -0.9496134  2.8815146

Using plyr, the following gives me what I want in list() form:

> dlply(mydata, c("SNP1"), function(df) cov(cbind(df$X1, df$X2)))
$aa
           [,1]       [,2]
[1,]  1.4961210 -0.9496134
[2,] -0.9496134  2.8815146

$Aa
           [,1]       [,2]
[1,]  0.8833190 -0.1640711
[2,] -0.1640711  2.4661046

$AA
           [,1]       [,2]
[1,]  0.9942357 -0.9955837
[2,] -0.9955837  4.9383196

attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
  SNP1
1   aa
2   Aa
3   AA

But like I said earlier, I would really like this in a 3D array. Any thoughts on where I went wrong with daply() or suggestions? Of course, I could typecast the list from dlply() to a 3D array, but I'd rather not do this because I will be repeating this process many times in a simulation.

As a side note, I found one method (http://www.mail-archive.com/[email protected]/msg86328.html) that provides the sample covariance matrix for each group, but the outputted object is bloated.

Thanks in advance.

© Stack Overflow or respective owner

Related posts about r