ZFS & Deduplicating FLAC Data
- by jasongullickson
I'm experimenting with using ZFS to deduplicate a large library of FLAC files. The purpose of this is twofold:
Reduce storage utilization
Reduce bandwidth needed to sync the library with cloud storage
Many of these files are of the same music tracks but from different physical media. This means that for the most part they are the same and usually close to the same size, which makes me think that they should benefit from block-level deduplication.
However in my testing I'm not seeing good results. When I create a pool and add three of these tracks (identical songs from different source media) zpool list reports 1.00 dedupe. If I copy all of the files (make exact duplicates of the three) dedupe climbs, so I know that it is enabled and functioning, but it's not finding any duplication in the original collection of files.
My first thought was that perhaps some of the variable header data (metadata tags, etc.) might be mis-aligning the bulk of the data in these files (the audio frames) but even making the header data consistent across the three files doesn't seem to have any impact on deduplication.
I'm considering taking alternate routes (testing other dedupe filesystems as well as some custom code) but since we're already using ZFS and I like the ZFS replication options, I'd prefer to use ZFS dedupe for this project; but perhaps it's simply not capable of working well with this sort of data.
Any feedback regarding tuning that might improve dedupe performance for this sort of dataset, or confirmation that ZFS dedupe is not the right tool for this job are appreciated.