Override `drop` for a custom sequence
- by Bruno Reis
In short: in Clojure, is there a way to redefine a function from the standard sequence API (which is not defined on any interface like ISeq, IndexedSeq, etc) on a custom sequence type I wrote?
1. Huge data files
I have big files in the following format:
A long (8 bytes) containing the number n of entries
n entries, each one being composed of 3 longs (ie, 24 bytes)
2. Custom sequence
I want to have a sequence on these entries. Since I cannot usually hold all the data in memory at once, and I want fast sequential access on it, I wrote a class similar to the following:
(deftype DataSeq [id
^long cnt
^long i
cached-seq]
clojure.lang.IndexedSeq
(index [_] i)
(count [_] (- cnt i))
(seq [this] this)
(first [_] (first cached-seq))
(more [this] (if-let [s (next this)] s '()))
(next [_] (if (not= (inc i) cnt)
(if (next cached-seq)
(DataSeq. id cnt (inc i) (next cached-seq))
(DataSeq. id cnt (inc i)
(with-open [f (open-data-file id)]
; open a memory mapped byte array on the file
; seek to the exact position to begin reading
; decide on an optimal amount of data to read
; eagerly read and return that amount of data
))))))
The main idea is to read ahead a bunch of entries in a list and then consume from that list. Whenever the cache is completely consumed, if there are remaining entries, they are read from the file in a new cache list. Simple as that.
To create an instance of such a sequence, I use a very simple function like:
(defn ^DataSeq load-data [id]
(next (DataSeq. id (count-entries id) -1 [])))
; count-entries is a trivial "open file and read a long" memoized
As you can see, the format of the data allowed me to implement count in very simply and efficiently.
3. drop could be O(1)
In the same spirit, I'd like to reimplement drop. The format of these data files allows me to reimplement drop in O(1) (instead of the standard O(n)), as follows:
if dropping less then the remaining cached items, just drop the same amount from the cache and done;
if dropping more than cnt, then just return the empty list.
otherwise, just figure out the position in the data file, jump right into that position, and read data from there.
My difficulty is that drop is not implemented in the same way as count, first, seq, etc. The latter functions call a similarly named static method in RT which, in turn, calls my implementation above, while the former, drop, does not check if the instance of the sequence it is being called on provides a custom implementation.
Obviously, I could provide a function named anything but drop that does exactly what I want, but that would force other people (including my future self) to remember to use it instead of drop every single time, which sucks.
So, the question is: is it possible to override the default behaviour of drop?
4. A workaround (I dislike)
While writing this question, I've just figured out a possible workaround: make the reading even lazier. The custom sequence would just keep an index and postpone the reading operation, that would happen only when first was called. The problem is that I'd need some mutable state: the first call to first would cause some data to be read into a cache, all the subsequent calls would return data from this cache. There would be a similar logic on next: if there's a cache, just next it; otherwise, don't bother populating it -- it will be done when first is called again.
This would avoid unnecessary disk reads. However, this is still less than optimal -- it is still O(n), and it could easily be O(1).
Anyways, I don't like this workaround, and my question is still open. Any thoughts?
Thanks.