Versioning millions of files with distributed SCM
- by C. Lawrence Wenham
I'm looking into the feasibility of using off-the-shelf distributed SCMs such as Git or Mercurial to manage millions of XML files. Each file would be a commercial transaction, such as a purchase order, that would be updated perhaps 10 times during the lifecycle of the transaction until it is "done" and changes no more.
And by "manage", I mean that the SCM would be used to not just version the files, but also to replicate them to other machines for redundancy and transfer of IP.
Let's suppose, for the sake of example, that a goal is to provide good performance if it were handling the volume of orders that Amazon.com claimed to have at its peak in December 2010: about 150,000 orders per minute.
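For a rough sense of the write rate that implies (my own back-of-envelope arithmetic, assuming updates arrive at roughly the same pace as new orders):

```python
# Back-of-envelope load estimate based on the figures above.
orders_per_minute = 150_000   # peak order rate cited for December 2010
updates_per_order = 10        # roughly 10 revisions over a transaction's lifecycle

new_orders_per_second = orders_per_minute / 60                    # ~2,500 orders/s
commits_per_second = new_orders_per_second * updates_per_order    # ~25,000 commits/s if updates keep pace

print(f"{new_orders_per_second:,.0f} new orders/s, ~{commits_per_second:,.0f} commits/s at steady state")
```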
We're expecting the system to be distributed over many servers in order to get reasonable performance. We're also planning to use solid-state drives exclusively.
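One way we imagine spreading the data over those servers (purely illustrative; the shard count and naming scheme are assumptions, not anything Git or Mercurial requires) is to map each transaction to one of many smaller repositories by hashing its order ID, so no single repository ever holds the full set of millions of files:

```python
import hashlib

NUM_SHARDS = 256   # hypothetical number of repositories spread across the servers


def shard_for(order_id: str) -> str:
    """Map an order ID to one of NUM_SHARDS repositories so no single repo
    (or server) has to hold millions of files."""
    digest = hashlib.sha1(order_id.encode("utf-8")).hexdigest()
    shard = int(digest, 16) % NUM_SHARDS
    return f"orders-shard-{shard:03d}"


print(shard_for("PO-12345"))   # prints the repository name this order maps to
```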
There is a reason why we don't want to use an RDBMS for primary storage, but it's a bit beyond the scope of this question.
Does anyone have first-hand experience with the performance of distributed SCMs under such a load, and if so, what strategies did you use?
Open-source preferred, since the final product is to be FOSS, too.