What are best practices for collecting, maintaining and ensuring accuracy of a huge data set?
- by Kyle West
I am posing this question looking for practical advice on how to design a system.
Sites like amazon.com and pandora have and maintain huge data sets to run their core business. For example, amazon (and every other major e-commerce site) has millions of products for sale, images of those products, pricing, specifications, etc. etc. etc.
Ignoring the data coming in from 3rd party sellers and the user generated content all that "stuff" had to come from somewhere and is maintained by someone. It's also incredibly detailed and accurate. How? How do they do it? Is there just an army of data-entry clerks or have they devised systems to handle the grunt work?
My company is in a similar situation. We maintain a huge (10-of-millions of records) catalog of automotive parts and the cars they fit. We've been at it for a while now and have come up with a number of programs and processes to keep our catalog growing and accurate; however, it seems like to grow the catalog to x items we need to grow the team to y.
I need to figure some ways to increase the efficiency of the data team and hopefully I can learn from the work of others. Any suggestions are appreciated, more though would be links to content I could spend some serious time reading.
THANKS!
Kyle