How should I compress a file with multiple bytes that are the same with Huffman coding?
Posted
by
Omega
on Programmers
See other posts from Programmers
or by Omega
Published on 2012-11-22T01:30:07Z
Indexed on
2012/11/22
5:11 UTC
Read the original article
Hit count: 212
algorithms
|algorithm-analysis
On my great quest for compressing/decompressing files with a Java implementation of Huffman coding (http://en.wikipedia.org/wiki/Huffman_coding) for a school assignment, I am now at the point of building a list of prefix codes. Such codes are used when decompressing a file. Basically, the code is made of zeroes and ones, that are used to follow a path in a Huffman tree (left or right) for, ultimately, finding a byte.
In this Wikipedia image, to reach the character m
the prefix code would be 0111
The idea is that when you compress the file, you will basically convert all the bytes of the file into prefix codes instead (they tend to be smaller than 8 bits, so there's some gain).
So every time the character m
appears in a file (which in binary is actually 1101101
), it will be replaced by 0111
(if we used the tree above).
Therefore, 1101101110110111011011101101
becomes 0111011101110111
in the compressed file.
I'm okay with that. But what if the following happens:
- In the file to be compressed there exists only one unique byte, say
1101101
. - There are 1000 of such byte.
Technically, the prefix code of such byte would be... none, because there is no path to follow, right? I mean, there is only one unique byte anyway, so the tree has just one node.
Therefore, if the prefix code is none, I would not be able to write the prefix code in the compressed file, because, well, there is nothing to write.
Which brings this problem: how would I compress/decompress such file if it is impossible to write a prefix code when compressing? (using Huffman coding, due to the school assignment's rules)
This tutorial seems to explain a bit better about prefix codes: http://www.cprogramming.com/tutorial/computersciencetheory/huffman.html but doesn't seem to address this issue either.
© Programmers or respective owner