Efficient way to get highly correlated pairs from large data set in Python or R
Posted by Akavall on Stack Overflow
Published on 2012-06-29T20:55:35Z
I have a large data set (say, 10,000 variables with about 1,000 elements each). We can think of it as a 2D list, something like:
[[variable_1],
[variable_2],
............
[variable_n]
]
I want to extract highly correlated variable pairs from that data. I want "highly correlated" to be a parameter that I can choose.
I don't need all pairs to be extracted, and I don't necessarily want the most correlated pairs. As long as there is an efficient method that gets me highly correlated pairs I am happy.
Also, it would be nice if a variable did not show up in more than one pair, although this might not be crucial.
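To make the "no variable in more than one pair" preference concrete, here is a minimal sketch of a greedy filter. It assumes candidate pairs are already available as `(i, j, r)` tuples (two variable indices and their correlation); the function name `disjoint_pairs` and the tuple layout are hypothetical, chosen only for illustration.

```python
def disjoint_pairs(pairs):
    """Greedily keep pairs so that no variable index appears twice.

    `pairs` is an iterable of (i, j, r) tuples (hypothetical layout):
    two variable indices and their correlation coefficient.
    """
    used = set()
    chosen = []
    # Visit the strongest correlations first (by absolute value).
    for i, j, r in sorted(pairs, key=lambda p: abs(p[2]), reverse=True):
        if i not in used and j not in used:
            chosen.append((i, j, r))
            used.update((i, j))
    return chosen


candidates = [(0, 1, 0.95), (1, 2, 0.99), (3, 4, -0.97), (0, 3, 0.91)]
print(disjoint_pairs(candidates))
```

The greedy pass does not guarantee the largest possible set of disjoint pairs, but that matches the stated goal of getting some highly correlated pairs rather than the best ones.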
Of course, there is a brute-force way of finding such pairs, but it is too slow for me.
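For reference, a brute-force baseline could be sketched as follows, assuming NumPy is available and the data fits in memory as an array with one variable per row. The function name `correlated_pairs`, the `threshold` parameter, and the `block` size are hypothetical, not from any existing package. It still compares every pair, so it is O(n²) in the number of variables, but it works in blocks so the full n × n correlation matrix is never held in memory at once.

```python
import numpy as np


def correlated_pairs(data, threshold=0.9, block=1000):
    """Yield (i, j, r) with i < j and |r| >= threshold (hypothetical helper)."""
    x = np.array(data, dtype=float)  # copy, one variable per row
    # Standardize each row so that the dot product of two rows is Pearson's r.
    x -= x.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # constant rows end up with r = 0 instead of NaN
    x /= norms
    n = x.shape[0]
    for start in range(0, n, block):
        stop = min(start + block, n)
        # Correlations of rows [start, stop) against all rows from start onward.
        corr = x[start:stop] @ x[start:].T
        rows, cols = np.nonzero(np.abs(corr) >= threshold)
        for r, c in zip(rows, cols):
            i, j = start + int(r), start + int(c)
            if i < j:  # skip the diagonal and duplicate (j, i) entries
                yield i, j, float(corr[r, c])


rng = np.random.default_rng(0)
demo = rng.normal(size=(6, 200))
demo[3] = 2 * demo[0] + 1   # perfectly correlated with variable 0
demo[5] = -demo[2]          # perfectly anti-correlated with variable 2
for i, j, r in correlated_pairs(demo, threshold=0.95, block=4):
    print(i, j, round(r, 3))
```

Because the function is a generator, iteration can stop as soon as enough pairs have been collected, which fits the requirement that not all pairs need to be extracted.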
I've googled around for a bit and found some theoretical work on this issue, but I wasn't able to find a package that does what I am looking for. I mostly work in Python, so a Python package would be most helpful, but if there is an R package that does this, that would be great too.
Does anyone know of a package that does the above in Python or R? Or any other ideas?
Thank you in advance.
© Stack Overflow or respective owner