How should I compress a file with multiple bytes that are the same with Huffman coding?
- by Omega
On my great quest for compressing/decompressing files with a Java implementation of Huffman coding (http://en.wikipedia.org/wiki/Huffman_coding) for a school assignment, I am now at the point of building a list of prefix codes. Such codes are used when decompressing a file. Basically, the code is made of zeroes and ones, that are used to follow a path in a Huffman tree (left or right) for, ultimately, finding a byte.
In this Wikipedia image, to reach the character m the prefix code would be 0111
The idea is that when you compress the file, you will basically convert all the bytes of the file into prefix codes instead (they tend to be smaller than 8 bits, so there's some gain).
So every time the character m appears in a file (which in binary is actually 1101101), it will be replaced by 0111 (if we used the tree above).
Therefore, 1101101110110111011011101101 becomes 0111011101110111 in the compressed file.
I'm okay with that. But what if the following happens:
In the file to be compressed there exists only one unique byte, say 1101101.
There are 1000 of such byte.
Technically, the prefix code of such byte would be... none, because there is no path to follow, right? I mean, there is only one unique byte anyway, so the tree has just one node.
Therefore, if the prefix code is none, I would not be able to write the prefix code in the compressed file, because, well, there is nothing to write.
Which brings this problem: how would I compress/decompress such file if it is impossible to write a prefix code when compressing? (using Huffman coding, due to the school assignment's rules)
This tutorial seems to explain a bit better about prefix codes: http://www.cprogramming.com/tutorial/computersciencetheory/huffman.html but doesn't seem to address this issue either.