Non standard interaction among two tables to avoid very large merge
- by riko
Suppose I have two tables A and B.
Table A has a multi-level index (a, b) and one column (ts).
b determines univocally ts.
A = pd.DataFrame(
[('a', 'x', 4),
('a', 'y', 6),
('a', 'z', 5),
('b', 'x', 4),
('b', 'z', 5),
('c', 'y', 6)],
columns=['a', 'b', 'ts']).set_index(['a', 'b'])
AA = A.reset_index()
Table B is another one-column (ts) table with non-unique index (a).
The ts's are sorted "inside" each group, i.e., B.ix[x] is sorted for each x.
Moreover, there is always a value in B.ix[x] that is greater than or equal to
the values in A.
B = pd.DataFrame(
dict(a=list('aaaaabbcccccc'),
ts=[1, 2, 4, 5, 7, 7, 8, 1, 2, 4, 5, 8, 9])).set_index('a')
The semantics in this is that B contains observations of occurrences of an event of type indicated by the index.
I would like to find from B the timestamp of the first occurrence of each event type after the timestamp specified in A for each value of b. In other words, I would like to get a table with the same shape of A, that instead of ts contains the "minimum value occurring after ts" as specified by table B.
So, my goal would be:
C:
('a', 'x') 4
('a', 'y') 7
('a', 'z') 5
('b', 'x') 7
('b', 'z') 7
('c', 'y') 8
I have some working code, but is terribly slow.
C = AA.apply(lambda row: (
row[0],
row[1],
B.ix[row[0]].irow(np.searchsorted(B.ts[row[0]], row[2]))), axis=1).set_index(['a', 'b'])
Profiling shows the culprit is obviously B.ix[row[0]].irow(np.searchsorted(B.ts[row[0]], row[2]))). However, standard solutions using merge/join would take too much RAM in the long run.
Consider that now I have 1000 a's, assume constant the average number of b's per a (probably 100-200), and consider that the number of observations per a is probably in the order of 300. In production I will have 1000 more a's.
1,000,000 x 200 x 300 = 60,000,000,000 rows
may be a bit too much to keep in RAM, especially considering that the data I need is perfectly described by a C like the one I discussed above.
How would I improve the performance?