Multiple lines of text to a single map
- by steven
I've been trying to use Hadoop to send N amount of lines to a single mapping. I don't require for the lines to be split already.
I've tried to use NLineInputFormat, however that sends N lines of text from the data to each mapper one line at a time [giving up after the Nth line].
I have tried to set the option and it only takes N lines of input sending it at 1 line at a time to each map:
job.setInt("mapred.line.input.format.linespermap", 10);
I've found a mailing list recommending me to override LineRecordReader::next, however that is not that simple, as that the internal data members are all private.
I've just checked the source for NLineInputFormat and it hard codes LineReader, so overriding will not help.
Also, btw I'm using Hadoop 0.18 for compatibility with the Amazon EC2 MapReduce.