Sorting a very large text file in Java

Posted by Alice on Stack Overflow See other posts from Stack Overflow or by Alice
Published on 2009-12-05T19:57:23Z Indexed on 2010/03/19 16:11 UTC
Read the original article Hit count: 153

Filed under:
|
|
|

Hi, I have a large text file I need to sort in Java. The format is:

word [tab] frequency [new line]

The algorithm for sorting is:

  • Read some of the file, filtering for purlely alphabetic words.
  • Once you have X number of alphabetic words, call Collections.sort and write the result to a file.
  • Repeat until you have finished reading the file.
  • Start reading two sorted files, comparing line by line for the word with higher frequency, and writing at the same time to a new file as to not load much into your memory
  • Repeat until all files are merged into one large file

Right now I've divided the large file into smaller ones (sorted by descending frequency) with 10,000 lines each. I know I need to somehow merge these files back together, but I'm not sure how to go about this.

I've created a LinkedList to keep track of all the files created. The algorithm says to compare each line in the two files, except I've tried a case where , say file1 = 8,6,5,3,1 and file2 = 9,8,8,8,8. Then if I compare them line by line I would get file3 = 9,8,8,6,8,5,8,3,8,1 which is incorrectly sorted (they should be in decreasing order).

I think I'm misunderstanding some part of the algorithm. If someone could point out what I should do instead, I'd greatly appreciate it. Thanks.

edit: Yes this is an assignment. We aren't allowed to increase memory unfortunately :(

© Stack Overflow or respective owner

Related posts about sorting

Related posts about text-file