Discover periodic patterns in a large data-set

Posted by Miner on Stack Overflow See other posts from Stack Overflow or by Miner
Published on 2010-04-06T01:27:39Z Indexed on 2010/04/06 1:33 UTC
Read the original article Hit count: 356

Filed under:
|

I have a large sequence of tuples on disk in the form (t1, k1) (t2, k2) ... (tn, kn)

ti is a monotonically increasing timestamp and ki is a key (assume a fixed length string if needed). Neither ti nor ki are guaranteed to be unique. However, the number of unique tis and kis is huge (millions). n itself is very large (100 Million+) and the size of k (approx 500 bytes) makes it impossible to store everything in memory.

I would like to find out periodic occurrences of keys in this sequence.

For example, if I have the sequence (1, a) (2, b) (3, c) (4, b) (5, a) (6, b) (7, d) (8, b) (9, a) (10, b)

The algorithm should emit (a, 4) and (b, 2). That is a occurs with a period of 4 and b occurs with a period of 2.

If I build a hash of all keys and store the average of the difference between consecutive timestamps of each key and a std deviation of the same, I might be able to make a pass, and report only the ones that have an acceptable std deviation(ideally, 0). However, it requires one bucket per unique key, whereas in practice, I might have very few really periodic patterns. Any better ways?

© Stack Overflow or respective owner

Related posts about algorithm

Related posts about algorithm-design