Efficient way to get highly correlated pairs from large data set in Python or R

Posted by Akavall on Stack Overflow
Published on 2012-06-29T20:55:35Z Indexed on 2012/06/29 21:16 UTC

I have a large data set (say, 10,000 variables with about 1,000 elements each). We can think of it as a 2D list, something like:

[[variable_1],
 [variable_2],
 ............
 [variable_n]
]

I want to extract highly correlated variable pairs from that data. I want "highly correlated" to be a parameter that I can choose.

I don't need all pairs to be extracted, and I don't necessarily want the most correlated pairs. As long as there is an efficient method that gets me highly correlated pairs, I am happy.

Also, it would be nice if no variable showed up in more than one pair, although this is not crucial.
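To make the "one pair per variable" wish concrete, here is a minimal sketch. It assumes the candidate pairs are already available as (i, j, r) tuples, where i and j are variable indices and r is their correlation. The function name disjoint_pairs is hypothetical, and the greedy rule (strongest correlation first) is just one simple choice, not a guaranteed optimal matching.

```python
def disjoint_pairs(pairs):
    """Greedily keep pairs so that no variable index is used twice.

    `pairs` is an iterable of (i, j, r) tuples (hypothetical format);
    pairs with the largest |r| are considered first.
    """
    used = set()      # variable indices already assigned to a pair
    selected = []
    for i, j, r in sorted(pairs, key=lambda p: abs(p[2]), reverse=True):
        if i not in used and j not in used:
            selected.append((i, j, r))
            used.update((i, j))
    return selected


candidates = [(0, 1, 0.95), (1, 2, 0.99), (2, 3, 0.91), (0, 3, -0.93)]
print(disjoint_pairs(candidates))
```

Since the most correlated pairs are not strictly required, the sort could be dropped and the pairs taken in whatever order they are found.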

Of course, there is a brute-force way of finding such pairs, but it is too slow for me.
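For reference, a brute-force baseline could look roughly like the sketch below. It assumes NumPy, Pearson correlation (the measure is not specified above), and a data array with one variable per row. The function name, the threshold default, and the block_size parameter are all hypothetical. It still compares every pair, so the cost grows quadratically with the number of variables; working in blocks only avoids holding the full 10,000 x 10,000 correlation matrix in memory at once.

```python
import numpy as np


def correlated_pairs_bruteforce(data, threshold=0.9, block_size=1000):
    """Return (i, j, r) for every pair i < j with |Pearson r| >= threshold."""
    data = np.asarray(data, dtype=float)
    n_vars, n_obs = data.shape

    # Standardise each row so that (row_i . row_j) / n_obs equals Pearson r.
    centered = data - data.mean(axis=1, keepdims=True)
    std = centered.std(axis=1, keepdims=True)
    std[std == 0] = np.inf  # constant variables end up with r = 0 everywhere
    z = centered / std

    pairs = []
    for start in range(0, n_vars, block_size):
        stop = min(start + block_size, n_vars)
        # Correlations of variables [start, stop) against all variables.
        corr = z[start:stop] @ z.T / n_obs
        rows, cols = np.nonzero(np.abs(corr) >= threshold)
        for r, c in zip(rows, cols):
            i = start + r
            if i < c:  # report each unordered pair once, skip the diagonal
                pairs.append((int(i), int(c), float(corr[r, c])))
    return pairs
```

The threshold argument plays the role of the "highly correlated" parameter, applied here to the absolute value of the correlation so that strongly negative pairs are reported as well.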

I've googled around for a bit and found some theoretical work on this issue, but I wasn't able to find a package that does what I am looking for. I mostly work in Python, so a Python package would be most helpful, but an R package that does this would be great too.

Does anyone know of a package that does the above in Python or R? Or any other ideas?

Thank you in advance.

© Stack Overflow or respective owner
