How to deduplicate 40TB of data?

Posted by Michael Stauffer on Super User
Published on 2014-08-22T19:40:51Z Indexed on 2014/08/23 4:23 UTC

Filed under: centos-6 | deduplication

I've inherited a research cluster with ~40TB of data across three filesystems. The data stretches back almost 15 years, and there are most likely a good number of duplicates, as researchers copy each other's data for different reasons and then just hang on to the copies.

I know about de-duping tools like fdupes and rmlint. I'm trying to find one that will work on such a large dataset. I don't care if it takes weeks (or maybe even months) to crawl all the data - I'll probably throttle it anyway to go easy on the filesystems. But I need a tool that is either very efficient with RAM or can store all the intermediate data it needs in files rather than in RAM. I'm assuming that my RAM (64GB) will be exhausted if I crawl through all this data as one set.
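
As a rough illustration of what "intermediate data in files rather than RAM" could look like, here is a minimal Python 3 sketch. It is not how fdupes or rmlint work internally. It assumes Python 3 is available (it is not the CentOS 6 default), and the table layout and the function names (`open_db`, `index_tree`, `hash_candidates`, `duplicate_groups`) are made up for this example. The idea is to record every file's path and size in an on-disk SQLite database, hash only the files whose size is shared with at least one other file, and then group by digest. Memory use stays bounded by the batch size rather than by the number of files. The sketch does not handle hard links (they would be reported as duplicates) or file names that are not valid UTF-8.

```python
import hashlib
import os
import sqlite3


def open_db(db_path):
    """Open (or create) the on-disk index; nothing is kept in RAM between runs."""
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS files("
               "path TEXT PRIMARY KEY, size INTEGER NOT NULL, digest TEXT)")
    db.execute("CREATE INDEX IF NOT EXISTS files_by_size ON files(size)")
    db.execute("CREATE INDEX IF NOT EXISTS files_by_digest ON files(digest)")
    return db


def index_tree(db, root):
    """Pass 1: record path and size for every regular file under root."""
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            db.execute("INSERT OR REPLACE INTO files(path, size) VALUES (?, ?)",
                       (path, size))
    db.commit()


def sha256_of(path, chunk=1 << 20):
    """Hash a file in 1 MiB chunks so large files never sit in memory whole."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def hash_candidates(db, batch=1000):
    """Pass 2: hash only non-empty files that share their size with another file."""
    last = 0
    while True:
        rows = db.execute(
            "SELECT rowid, path FROM files WHERE rowid > ? AND size > 0 "
            "AND size IN (SELECT size FROM files GROUP BY size HAVING COUNT(*) > 1) "
            "ORDER BY rowid LIMIT ?", (last, batch)).fetchall()
        if not rows:
            break
        for rowid, path in rows:
            try:
                digest = sha256_of(path)
            except OSError:
                digest = None  # leave unreadable files unhashed
            db.execute("UPDATE files SET digest = ? WHERE rowid = ?", (digest, rowid))
            last = rowid
        db.commit()  # progress is saved after every batch


def duplicate_groups(db):
    """Pass 3: yield lists of paths that share the same SHA-256 digest."""
    digests = db.execute(
        "SELECT digest FROM files WHERE digest IS NOT NULL "
        "GROUP BY digest HAVING COUNT(*) > 1 ORDER BY digest").fetchall()
    for (digest,) in digests:
        yield [p for (p,) in db.execute(
            "SELECT path FROM files WHERE digest = ? ORDER BY path", (digest,))]
```

Calling `open_db`, then `index_tree` once per filesystem root, then `hash_candidates`, and finally iterating over `duplicate_groups` would produce the candidate groups. Because all state lives in the SQLite file, the three filesystems can be indexed into the same database one after another, and throttling could be added by sleeping between batches.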

I'm experimenting with fdupes now on a 900GB tree. It's 25% of the way through, and RAM usage has been slowly creeping up the whole time; it's now at 700MB.

Alternatively, is there a way to direct a process to use disk-mapped memory instead of system RAM, so that much more is available to it?

I'm running CentOS 6.

© Super User or respective owner
