My setup:
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)
Neo4j 2.0.0-M06 Enterprise
First I made sure I warmed up the cache by executing the following:
START n=node(*) RETURN COUNT(n);
START r=relationship(*) RETURN count(r);
The size of the table is 63,677 nodes and 7,169,995 relationships
Now I have the following query:
START u1=node:node_auto_index('uid:39')
MATCH (u1:user)-[w:WANTS]->(c:card)<-[h:HAS]-(u2:user)
WHERE u2.uid <> 39
WITH u2.uid AS uid, (CASE WHEN w.qty < h.qty THEN w.qty ELSE h.qty END) AS have
RETURN uid, SUM(have) AS total
ORDER BY total DESC
SKIP 0
LIMIT 25
This UID has about 40k+ results that I want to be able to put a pagination to. The initial skip was around 773ms. I tried page 2 (skip 25) and the latency was around the same even up to page 500 it only rose up to 900ms so I didn't really bother. Now I tried some fast forward paging and jumped by thousands so I did 1000, then 2000, then 3000. I was hoping the ORDER BY arrangement will already have been cached by Neo4j and using SKIP will just move to that index in the result and wont have to iterate through each one again. But for each thousand skip I made the latency increased by alot. It's not just cache warming because for one I already warmed up the cache and two, I tried the same skip a couple of times for each skip and it yielded the same results:
SKIP 0: 773ms
SKIP 1000: 1369ms
SKIP 2000: 2491ms
SKIP 3000: 3899ms
SKIP 4000: 5686ms
SKIP 5000: 7424ms
Now who the hell would want to view 5000 pages of results? 40k even?! :) Good point! I will probably put a cap on the maximum results a user can view but I was just curious about this phenomenon. Will somebody please explain why Neo4j seems to be re-iterating through stuff which appears to be already known to it?
Here is my profiling for the 0 skip:
==> ColumnFilter(symKeys=["uid", " INTERNAL_AGGREGATE65c4d6a2-1930-4f32-8fd9-5e4399ce6f14"], returnItemNames=["uid", "total"], _rows=25, _db_hits=0)
==> Slice(skip="Literal(0)", _rows=25, _db_hits=0)
==> Top(orderBy=["SortItem(Cached( INTERNAL_AGGREGATE65c4d6a2-1930-4f32-8fd9-5e4399ce6f14 of type Any),false)"], limit="Add(Literal(0),Literal(25))", _rows=25, _db_hits=0)
==> EagerAggregation(keys=["uid"], aggregates=["( INTERNAL_AGGREGATE65c4d6a2-1930-4f32-8fd9-5e4399ce6f14,Sum(have))"], _rows=41659, _db_hits=0)
==> ColumnFilter(symKeys=["have", "u1", "uid", "c", "h", "w", "u2"], returnItemNames=["uid", "have"], _rows=146826, _db_hits=0)
==> Extract(symKeys=["u1", "c", "h", "w", "u2"], exprKeys=["uid", "have"], _rows=146826, _db_hits=587304)
==> Filter(pred="((NOT(Product(u2,uid(0),true) == Literal(39)) AND hasLabel(u1:user(0))) AND hasLabel(u2:user(0)))", _rows=146826, _db_hits=146826)
==> TraversalMatcher(trail="(u1)-[w:WANTS WHERE (hasLabel(NodeIdentifier():card(1)) AND hasLabel(NodeIdentifier():card(1))) AND true]->(c)<-[h:HAS WHERE (NOT(Product(NodeIdentifier(),uid(0),true) == Literal(39)) AND hasLabel(NodeIdentifier():user(0))) AND true]-(u2)", _rows=146826, _db_hits=293696)
And for the 5000 skip:
==> ColumnFilter(symKeys=["uid", " INTERNAL_AGGREGATE99329ea5-03cd-4d53-a6bc-3ad554b47872"], returnItemNames=["uid", "total"], _rows=25, _db_hits=0)
==> Slice(skip="Literal(5000)", _rows=25, _db_hits=0)
==> Top(orderBy=["SortItem(Cached( INTERNAL_AGGREGATE99329ea5-03cd-4d53-a6bc-3ad554b47872 of type Any),false)"], limit="Add(Literal(5000),Literal(25))", _rows=5025, _db_hits=0)
==> EagerAggregation(keys=["uid"], aggregates=["( INTERNAL_AGGREGATE99329ea5-03cd-4d53-a6bc-3ad554b47872,Sum(have))"], _rows=41659, _db_hits=0)
==> ColumnFilter(symKeys=["have", "u1", "uid", "c", "h", "w", "u2"], returnItemNames=["uid", "have"], _rows=146826, _db_hits=0)
==> Extract(symKeys=["u1", "c", "h", "w", "u2"], exprKeys=["uid", "have"], _rows=146826, _db_hits=587304)
==> Filter(pred="((NOT(Product(u2,uid(0),true) == Literal(39)) AND hasLabel(u1:user(0))) AND hasLabel(u2:user(0)))", _rows=146826, _db_hits=146826)
==> TraversalMatcher(trail="(u1)-[w:WANTS WHERE (hasLabel(NodeIdentifier():card(1)) AND hasLabel(NodeIdentifier():card(1))) AND true]->(c)<-[h:HAS WHERE (NOT(Product(NodeIdentifier(),uid(0),true) == Literal(39)) AND hasLabel(NodeIdentifier():user(0))) AND true]-(u2)", _rows=146826, _db_hits=293696)
The only difference is the LIMIT clause on the Top function. I hope we can make this work as intended, I really don't want to delve into doing an embedded Neo4j + my own Jetty REST API for the web app.