Sorry, I can't post the code, but I have a BufferedReader with a buffer size of 50,000,000 bytes. For the first half hour it works as you would expect: the HDD light flashes every two minutes or so as the next big chunk of data is read in, then goes quiet while the CPU processes it. But after about half an hour (this is a very big file), the HDD starts thrashing as if it were reading one byte at a time. The program is still in the same loop, and I think I checked free RAM to rule out swapping (heap size is the default).
Probably won't get any helpful answers, but worth a try.
OK, I have changed the heap size to 768 MB and still nothing. There is plenty of free memory, and java.exe is only using about 300 MB.
Now I have profiled it, and the heap stays at about 200 MB, well below what is available. CPU stays at 50%. Yet the HDD starts thrashing like crazy. I have no idea. I am going to rewrite the whole thing in C#; that is my solution.
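For anyone who wants to cross-check the profiler figures, heap usage can also be logged from inside the program. This is only a minimal sketch and not part of my script: `HeapLog` and its methods are hypothetical names, and it uses nothing beyond the standard `Runtime` memory methods.

```java
// Hypothetical helper for illustration; not part of the original script.
final class HeapLog {
    private static final long MB = 1024L * 1024L;

    /** Heap currently in use by this JVM, in bytes (allocated minus free). */
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** One-line summary: used, currently allocated, and maximum heap, in MB. */
    static String summary() {
        Runtime rt = Runtime.getRuntime();
        return String.format("heap used=%d MB, allocated=%d MB, max=%d MB",
                usedBytes() / MB, rt.totalMemory() / MB, rt.maxMemory() / MB);
    }

    public static void main(String[] args) {
        System.out.println(summary());
    }
}
```

Calling something like `System.err.println(pages + " pages, " + HeapLog.summary());` every few thousand pages would show whether the heap is anywhere near its maximum at the moment the thrashing starts.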
Here is the code (it is just a throw-away script, not pretty):
// Needs java.io.* and java.util.*; a..g are precompiled java.util.regex.Pattern
// objects defined elsewhere (not shown).
BufferedReader s = null;
HashMap<String, Integer> allWords = new HashMap<String, Integer>();
HashSet<String> pageWords = new HashSet<String>();
long[] pageCount = new long[78592];
long pages = 0;

// allWords.txt holds "word index" pairs; load them into the lookup map.
Scanner wordFile = new Scanner(new BufferedReader(new FileReader("allWords.txt")));
while (wordFile.hasNext()) {
    allWords.put(wordFile.next(), Integer.parseInt(wordFile.next()));
}

// Read the dump line by line through a 50,000,000-char buffer.
s = new BufferedReader(new FileReader("wikipedia/enwiki-latest-pages-articles.xml"), 50000000);
StringBuilder words = new StringBuilder();
String nextLine = null;
while ((nextLine = s.readLine()) != null) {
    if (a.matcher(nextLine).matches()) {
        continue;
    }
    else if (b.matcher(nextLine).matches()) {
        continue;
    }
    else if (c.matcher(nextLine).matches()) {
        continue;
    }
    else if (d.matcher(nextLine).matches()) {
        nextLine = s.readLine();
        if (e.matcher(nextLine).matches()) {
            if (f.matcher(s.readLine()).matches()) {
                // End of a page: split the accumulated text into distinct words.
                pageWords.addAll(Arrays.asList(words.toString().toLowerCase().split("[^a-zA-Z]")));
                words.setLength(0);
                pages++;
                for (String word : pageWords) {
                    if (allWords.containsKey(word)) {
                        pageCount[allWords.get(word)]++;
                    }
                    // Otherwise retry with the last character dropped.
                    else if (!word.isEmpty() && allWords.containsKey(word.substring(0, word.length() - 1))) {
                        pageCount[allWords.get(word.substring(0, word.length() - 1))]++;
                    }
                }
                pageWords.clear();
            }
        }
    }
    else if (g.matcher(nextLine).matches()) {
        continue;
    }
    // Any line not skipped above is appended to the current page's text.
    words.append(nextLine);
    words.append(" ");
}
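The per-page counting step from the innermost block above can be run on its own, which makes it easier to test separately from the file reading. The following is a minimal sketch under stated assumptions: `PageWordCounter` and `countPage` are hypothetical names, the two-word map stands in for `allWords.txt`, and the single lookup with `get` is meant to behave the same as the `containsKey`/`get` pairs in the script.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-alone version of the per-page counting step.
final class PageWordCounter {

    /**
     * Adds 1 to pageCount[index] for every distinct known word on the page.
     * A word that is not in the map is retried with its last character
     * removed, as in the script above.
     */
    static void countPage(String pageText, Map<String, Integer> allWords, long[] pageCount) {
        // Lower-case the text and split on every non-letter; the set keeps
        // each distinct token once (possibly including the empty string).
        Set<String> pageWords = new HashSet<>(
                Arrays.asList(pageText.toLowerCase().split("[^a-zA-Z]")));
        for (String word : pageWords) {
            if (word.isEmpty()) {
                continue; // produced by adjacent separators; never a real word
            }
            Integer index = allWords.get(word);
            if (index == null) {
                // Fallback: same word without its final character.
                index = allWords.get(word.substring(0, word.length() - 1));
            }
            if (index != null) {
                pageCount[index]++;
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> allWords = new HashMap<>();
        allWords.put("cat", 0); // assumed sample data, not from allWords.txt
        allWords.put("dog", 1);
        long[] pageCount = new long[2];
        countPage("Cats and a cat. Dog!", allWords, pageCount);
        System.out.println(Arrays.toString(pageCount)); // prints [2, 1]
    }
}
```

Note that "cats" and "cat" both increment the same slot, so one page can add more than 1 to a word's count; the script above behaves the same way.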