Non-standard interaction between two tables to avoid a very large merge
Posted by riko on Stack Overflow
Published on 2012-12-17T16:49:25Z
Suppose I have two tables A and B. Table A has a multi-level index (a, b) and one column (ts); b uniquely determines ts.
import numpy as np
import pandas as pd

A = pd.DataFrame(
    [('a', 'x', 4),
     ('a', 'y', 6),
     ('a', 'z', 5),
     ('b', 'x', 4),
     ('b', 'z', 5),
     ('c', 'y', 6)],
    columns=['a', 'b', 'ts']).set_index(['a', 'b'])
AA = A.reset_index()
Table B is another one-column (ts) table with a non-unique index (a). The ts's are sorted within each group, i.e., B.loc[x] is sorted for each x. Moreover, for each group there is always a value in B.loc[x] that is greater than or equal to the corresponding values in A.
B = pd.DataFrame(
dict(a=list('aaaaabbcccccc'),
ts=[1, 2, 4, 5, 7, 7, 8, 1, 2, 4, 5, 8, 9])).set_index('a')
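To make those two assumptions concrete on the toy data, here is a small sanity check (not part of the problem itself, just a verification of the invariants):

```python
import pandas as pd

A = pd.DataFrame(
    [('a', 'x', 4), ('a', 'y', 6), ('a', 'z', 5),
     ('b', 'x', 4), ('b', 'z', 5), ('c', 'y', 6)],
    columns=['a', 'b', 'ts']).set_index(['a', 'b'])
B = pd.DataFrame(
    dict(a=list('aaaaabbcccccc'),
         ts=[1, 2, 4, 5, 7, 7, 8, 1, 2, 4, 5, 8, 9])).set_index('a')

for a in B.index.unique():
    # sortedness: each per-a slice of B.ts is monotonic,
    # which is what makes a binary search per group valid
    assert B.ts.loc[a].is_monotonic_increasing
for (a, b), ts in A.ts.items():
    # coverage: some observation in B.loc[a] is >= every requested ts
    assert B.ts.loc[a].max() >= ts
```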
The semantics here is that B contains observations of occurrences of an event of the type indicated by the index. I would like to find in B, for each value of b, the timestamp of the first occurrence of each event type at or after the timestamp specified in A. In other words, I would like to get a table with the same shape as A that, instead of ts, contains the "minimum value occurring at or after ts" as specified by table B.
So, my goal would be:
C:
('a', 'x') 4
('a', 'y') 7
('a', 'z') 5
('b', 'x') 7
('b', 'z') 7
('c', 'y') 8
I have some working code, but it is terribly slow.

C = AA.apply(lambda row: pd.Series(
        [row['a'], row['b'],
         B.ts.loc[row['a']].iloc[np.searchsorted(B.ts.loc[row['a']].values,
                                                 row['ts'])]],
        index=['a', 'b', 'ts']),
    axis=1).set_index(['a', 'b'])
Profiling shows the culprit is, unsurprisingly, the per-group lookup and positional indexing into B that is performed for every row of AA. However, standard solutions using merge/join would take too much RAM in the long run.
Consider that I now have 1000 a's, assume the average number of b's per a stays constant (probably 100-200), and consider that the number of observations per a is probably on the order of 300. In production I will have 1000 times as many a's.

1,000,000 × 200 × 300 = 60,000,000,000 rows

may be a bit too much to keep in RAM, especially considering that the data I need is perfectly described by a C like the one I discussed above.
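For reference, the merge-based route I mean would be something like the following on the toy data (m and the _obs suffix are just illustrative names). It yields the right C, but the intermediate m is exactly the table whose row count explodes:

```python
import pandas as pd

A = pd.DataFrame(
    [('a', 'x', 4), ('a', 'y', 6), ('a', 'z', 5),
     ('b', 'x', 4), ('b', 'z', 5), ('c', 'y', 6)],
    columns=['a', 'b', 'ts']).set_index(['a', 'b'])
B = pd.DataFrame(
    dict(a=list('aaaaabbcccccc'),
         ts=[1, 2, 4, 5, 7, 7, 8, 1, 2, 4, 5, 8, 9])).set_index('a')

# Pair every row of A with every observation sharing the same a:
# len(m) ~ rows_of_A x observations_per_a -- the term that blows up.
m = A.reset_index().merge(B.reset_index(), on='a', suffixes=('', '_obs'))
# Keep observations at or after A's timestamp, take the earliest per (a, b).
C = m[m.ts_obs >= m.ts].groupby(['a', 'b']).ts_obs.min().rename('ts')
```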
How would I improve the performance?
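One direction I have been considering is to extract each group's sorted observation array from B once up front, so the per-row work reduces to a binary search. A rough sketch (first_at_or_after is just an illustrative helper; it relies on the coverage assumption stated above and would raise an IndexError for a timestamp past a group's last observation):

```python
import numpy as np
import pandas as pd

A = pd.DataFrame(
    [('a', 'x', 4), ('a', 'y', 6), ('a', 'z', 5),
     ('b', 'x', 4), ('b', 'z', 5), ('c', 'y', 6)],
    columns=['a', 'b', 'ts']).set_index(['a', 'b'])
B = pd.DataFrame(
    dict(a=list('aaaaabbcccccc'),
         ts=[1, 2, 4, 5, 7, 7, 8, 1, 2, 4, 5, 8, 9])).set_index('a')

# Build each group's sorted observation array once, instead of doing
# a label-based lookup into B for every row of A.
obs = {a: grp.values for a, grp in B.ts.groupby(level=0)}

def first_at_or_after(a, ts):
    arr = obs[a]
    # searchsorted on a sorted array: index of first observation >= ts
    return arr[np.searchsorted(arr, ts)]

AA = A.reset_index()
C = AA.assign(ts=[first_at_or_after(a, ts)
                  for a, ts in zip(AA.a, AA.ts)]).set_index(['a', 'b'])
```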