Lazy Processing of Streams
- by Giorgio
I have the following problem scenario:
I have a text file and I have to read it and split it into lines.
Some lines might need to be dropped (according to criteria that are not fixed).
The lines that are not dropped must be parsed into some predefined records.
Records that are not valid must be dropped.
Duplicate records may exist and, in such a case, they are consecutive. If duplicate / multiple records exist, only one item should be kept.
The remaining records should be grouped according to the value contained in one field; all records belonging to the same group appear one after another (e.g. AAAABBBBCCDEEEFF and so on).
The records of each group should be numbered (1, 2, 3, 4, ...). For each group the numbering starts from 1.
The records must then be saved somewhere / consumed in the same order as they were produced.
I have to implement this in Java or C++.
My first idea was to define functions / methods like:
One method to get all the lines from the file.
One method to filter out the unwanted lines.
One method to parse the filtered lines into valid records.
One method to remove duplicate records.
One method to group records and number them.
The problem is that the data I am going to read can be too big and might not fit into main memory: so I cannot just construct all these lists and apply my functions one after the other.
On the other hand, I think I do not need to fit all the data in main memory at once because once a record has been consumed all its underlying data (basically the lines of text between the previous record and the current record, and the record itself) can be disposed of.
With the little knowledge I have of Haskell I have immediately thought about some kind of lazy evaluation, in which instead of applying functions to lists that have been completely computed, I have different streams of data that are built on top of each other and, at each moment, only the needed portion of each stream is materialized in main memory.
But I have to implement this in Java or C++. So my question is which design pattern or other technique can allow me to implement this lazy processing of streams in one of these languages.