First of all, thanks for reading, and sorry for asking something related to my job. I understand this is something I should solve by myself, but as you will see it is a bit difficult.
A small description:
Now
Storage = 1 PB using DDN S2A9900 storage for the OSTs, 4 OSS, 10 GigE network (Lustre 1.6)
100 compute nodes with 2x InfiniBand
1 InfiniBand switch with 36 ports
After
Storage = previous storage + another 1 PB using DDN S2A 990 or LSI E5400 (still to decide) (Lustre 2.0)
8 OSS, 10 GigE network
100 compute nodes with 2x InfiniBand
Previous experience: transferred 120 TB in less than 3 days
using the following command:
tar -C /old --record-size 2048 -b 2048 -cf - dir | tar -C /new --record-size 2048 -b 2048 -xvf - 2>&1 | tee /tmp/dir.log
So, the big problem: doing the maths, I conclude that we are going to need about a month to move the data from the old storage to the new one. During this time the researchers would have to stop working, and I'm personally not happy with this.
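(Rough arithmetic behind that estimate, using only the numbers above: 120 TB in ~3 days is about 40 TB/day, so ~1 PB at the same rate is 1000 / 40 ≈ 25 days, i.e. roughly a month.)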
I mention the InfiniBand connections because I think there may be a chance to use them for the transfer: 18 compute nodes (18 * 2 IB = 36 ports) copying the data from one storage to the other in parallel. I'm trying to figure out whether the IB switch will handle all that traffic, but even if it just burns up, it will still go faster than using 10 GigE.
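To make that idea concrete, here is a rough, untested sketch of what I have in mind, assuming both the old and the new filesystem are mounted on the compute nodes over IB; the node names, the /old and /new mount points and the one-directory-per-job split are placeholders, not our real configuration:

#!/bin/bash
# Rough sketch: fan the tar pipe out over the compute nodes, one top-level
# directory per job, so each node copies over its own InfiniBand links.
NODES=(cn01 cn02 cn03 cn04)        # ... up to 18 nodes (placeholder names)
dirs=(/old/*/)                     # one transfer job per top-level directory
for i in "${!dirs[@]}"; do
  node=${NODES[i % ${#NODES[@]}]}
  d=$(basename "${dirs[i]}")
  # same 1 MiB record size (-b 2048) as the pipe that did 120 TB in 3 days
  ssh "$node" "tar -C /old -b 2048 -cf - '$d' | tar -C /new -b 2048 -xf -" \
    > "/tmp/$d.log" 2>&1 &
done
wait   # note: with more directories than nodes, a node runs several copies at once

The split does not have to be by top-level directory, of course; anything that gives the nodes roughly equal shares of the data would do.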
Also, having Lustre 1.6 and 2.0 clients on the same server works quite well, so there is no need to go through 1.8 and upgrade the metadata servers in two steps.
Any ideas?
Many thanks
Note 1:
Zoredache, we can divide the data into two blocks: (A) 600 TB and (B) 400 TB. The idea is to move (A) to the new storage, which is formatted with Lustre 2.0, then reformat the space where (A) was with Lustre 2.0, move (B) onto it, and extend it with the space freed by (B).
This way we end up with (A) and (B) on separate filesystems of 1 PB each.
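Spelled out as phases (treating (A) and (B) as top-level directories and reusing the tar pipe is only for illustration; the mount point names are made up):

# Phase 1: copy block (A), ~600 TB, from the old 1.6 filesystem to the new 2.0 one
tar -C /oldfs -b 2048 -cf - A | tar -C /newfs -b 2048 -xf -
# Phase 2: reformat the OSTs that held (A) as a second Lustre 2.0 filesystem (/newfs2)
# Phase 3: copy block (B), ~400 TB, into it with the same pipe
tar -C /oldfs -b 2048 -cf - B | tar -C /newfs2 -b 2048 -xf -
# Phase 4: reformat the OSTs that held (B) and add them to /newfs2 as additional OSTs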