Sorting a very large text file in Java
Posted
by Alice
on Stack Overflow
See other posts from Stack Overflow
or by Alice
Published on 2009-12-05T19:57:23Z
Indexed on
2010/03/19
16:11 UTC
Read the original article
Hit count: 179
Hi, I have a large text file I need to sort in Java. The format is:
word [tab] frequency [new line]
The algorithm for sorting is:
- Read some of the file, filtering for purlely alphabetic words.
- Once you have X number of alphabetic words, call Collections.sort and write the result to a file.
- Repeat until you have finished reading the file.
- Start reading two sorted files, comparing line by line for the word with higher frequency, and writing at the same time to a new file as to not load much into your memory
- Repeat until all files are merged into one large file
Right now I've divided the large file into smaller ones (sorted by descending frequency) with 10,000 lines each. I know I need to somehow merge these files back together, but I'm not sure how to go about this.
I've created a LinkedList to keep track of all the files created. The algorithm says to compare each line in the two files, except I've tried a case where , say file1 = 8,6,5,3,1 and file2 = 9,8,8,8,8. Then if I compare them line by line I would get file3 = 9,8,8,6,8,5,8,3,8,1 which is incorrectly sorted (they should be in decreasing order).
I think I'm misunderstanding some part of the algorithm. If someone could point out what I should do instead, I'd greatly appreciate it. Thanks.
edit: Yes this is an assignment. We aren't allowed to increase memory unfortunately :(
© Stack Overflow or respective owner