Calculating a Sample Covariance Matrix for Groups with plyr
Posted
by John A. Ramey
on Stack Overflow
See other posts from Stack Overflow
or by John A. Ramey
Published on 2010-04-28T07:13:09Z
Indexed on
2010/04/28
12:23 UTC
Read the original article
Hit count: 325
r
I'm going to use the sample code from http://gettinggeneticsdone.blogspot.com/2009/11/split-apply-and-combine-in-r-using-plyr.html for this example. So, first, let's copy their example data:
mydata=data.frame(X1=rnorm(30), X2=rnorm(30,5,2),
SNP1=c(rep("AA",10), rep("Aa",10), rep("aa",10)),
SNP2=c(rep("BB",10), rep("Bb",10), rep("bb",10)))
I am going to ignore SNP2 in this example and just pretend the values in SNP1 denote group membership. So then, I may want some summary statistics about each group in SNP1: "AA", "Aa", "aa".
Then if I want to calculate the means for each variable, it makes sense (modifying their code slightly) to use:
> ddply(mydata, c("SNP1"), function(df)
data.frame(meanX1=mean(df$X1), meanX2=mean(df$X2)))
SNP1 meanX1 meanX2
1 aa 0.05178028 4.812302
2 Aa 0.30586206 4.820739
3 AA -0.26862500 4.856006
But what if I want the sample covariance matrix for each group? Ideally, I would like a 3D array, where the I have the covariance matrix for each group, and the third dimension denotes the corresponding group. I tried a modified version of the previous code and got the following results that have convinced me that I'm doing something wrong.
> daply(mydata, c("SNP1"), function(df) cov(cbind(df$X1, df$X2)))
, , = 1
SNP1 1 2
aa 1.4961210 -0.9496134
Aa 0.8833190 -0.1640711
AA 0.9942357 -0.9955837
, , = 2
SNP1 1 2
aa -0.9496134 2.881515
Aa -0.1640711 2.466105
AA -0.9955837 4.938320
I was thinking that the dim() of the 3rd dimension would be 3, but instead, it is 2. Really this is a sliced up version of the covariance matrix for each group. If we manually compute the sample covariance matrix for aa, we get:
[,1] [,2]
[1,] 1.4961210 -0.9496134
[2,] -0.9496134 2.8815146
Using plyr, the following gives me what I want in list() form:
> dlply(mydata, c("SNP1"), function(df) cov(cbind(df$X1, df$X2)))
$aa
[,1] [,2]
[1,] 1.4961210 -0.9496134
[2,] -0.9496134 2.8815146
$Aa
[,1] [,2]
[1,] 0.8833190 -0.1640711
[2,] -0.1640711 2.4661046
$AA
[,1] [,2]
[1,] 0.9942357 -0.9955837
[2,] -0.9955837 4.9383196
attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
SNP1
1 aa
2 Aa
3 AA
But like I said earlier, I would really like this in a 3D array. Any thoughts on where I went wrong with daply() or suggestions? Of course, I could typecast the list from dlply() to a 3D array, but I'd rather not do this because I will be repeating this process many times in a simulation.
As a side note, I found one method (http://www.mail-archive.com/[email protected]/msg86328.html) that provides the sample covariance matrix for each group, but the outputted object is bloated.
Thanks in advance.
© Stack Overflow or respective owner