How to deduplicate 40TB of data?
- by Michael Stauffer
I've inherited a research cluster with ~40TB of data across three filesystems. The data stretches back almost 15 years, and there are most likely a good amount of duplicates as researchers copy each others data for different reasons and then just hang on to the copies.
I know about de-duping tools like fdupes and rmlint. I'm trying to find one that will work on such a large dataset. I don't care if it takes weeks (or maybe even months) to crawl all the data - I'll probably throttle it anyway to go easy on the filesystems. But I need to find a tool that's either somehow super efficient with RAM, or can store all the intermediary data it needs in files rather than RAM. I'm assuming that my RAM (64GB) will be exhausted if I crawl through all this data as one set.
I'm experimenting with fdupes now on a 900GB tree. It's 25% of the way through and RAM usage has been slowly creeping up the whole time, now it's at 700MB.
Or, is there a way to direct a process to use disk-mapped RAM so there's much more available and it doesn't use system RAM?
I'm running CentOS 6.