incremental way of counting quantiles for large set of data
- by Gacek
I need to count the quantiles for a large set of data.
Let's assume we can get the data only through some portions (i.e. one row of a large matrix). To count the Q3 quantile one need to get all the portions of the data and store it somewhere, then sort it and count the quantile:
List<double> allData = new List<double>();
foreach(var row in matrix) // this is only example. In fact the portions of data are not rows of some matrix
{
allData.AddRange(row);
}
allData.Sort();
double p = 0.75*allData.Count;
int idQ3 = (int)Math.Ceiling(p) - 1;
double Q3 = allData[idQ3];
Now, I would like to find a way of counting this without storing the data in some separate variable.
The best solution would be to count some parameters od mid-results for first row and then adjust it step by step for next rows.
Note:
These datasets are really big (ca 5000 elements in each row)
The Q3 can be estimated, it doesn't have to be an exact value.
I call the portions of data "rows", but they can have different leghts! Usually it varies not so much (+/- few hundred samples) but it varies!
This question is similar to this one: http://stackoverflow.com/questions/1058813/on-line-iterator-algorithms-for-estimating-statistical-median-mode-skewness
But I need to count quantiles.
ALso there are few articles in this topic, i.e.:
http://web.cs.wpi.edu/~hofri/medsel.pdf
http://portal.acm.org/citation.cfm?id=347195&dl
But before I would try to implement these, I wanted to ask you if there are maybe any other, qucker ways of counting the 0.25/0.75 quantiles?