I searched Stack Overflow pretty extensively for something similar to this set of questions.
BACKGROUND:
We are a growing 'big(ish)'-data chemical data company that is outgrowing our lab and our dedicated production workhorses.
Make no mistake, we need to do some serious query optimization. Our data comes from a certain govt. agency, so the schema and the lack of indexing are atrocious. So yes, I know AWS/EC2 is no silver bullet compared with spending the time to rework our queries/code entirely.
With that said, I would appreciate any input on the following questions:
We produce on CentOS and lab on Ubuntu LTS, which I prefer, especially
given its growing cloud/AWS integration. If we are MySQL-centric,
and our biggest problem is big Cartesian products that produce
slow queries, should we roll out what we know (Ubuntu/MySQL, after
more optimization) with the added Amazon horsepower? (A toy example
of the query pattern I mean is below.)
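To make that concrete, here is a minimal sketch of the pattern. The table names (`compounds`, `assays`) are made up; the real schema is the agency's, not this:

```sql
-- Hypothetical tables; only the join pattern matters here.
-- The slow pattern: no join predicate, so MySQL materializes the
-- full Cartesian product (rows(compounds) * rows(assays)) first.
SELECT c.cas_no, a.result
FROM   compounds c, assays a
WHERE  a.result > 0.5;

-- The rewrite: an explicit join condition plus an index to drive it.
ALTER TABLE assays ADD INDEX idx_assays_cas_no (cas_no);

SELECT c.cas_no, a.result
FROM   compounds c
JOIN   assays a ON a.cas_no = c.cas_no
WHERE  a.result > 0.5;

-- Running EXPLAIN on both versions shows the row-estimate difference.
```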
Or is there some merit to NoSQL and the other technologies they offer?
What are the key metrics I need to gather from Apache and MySQL, beyond disk I/O operations, data transfer averages/trends,
and special high-usage periods/scenarios? (The sketch below shows what I've started sampling on the MySQL side.) I've reviewed the AWS/EC2 fine
print, but want second opinions.
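For reference, here is a minimal sketch of the MySQL-side counters I'm baselining. These are all standard MySQL status variables, but the choice of which ones matter for sizing an instance is my assumption:

```sql
-- Standard MySQL status variables; the selection is my guess at
-- what matters for capacity planning on EC2.
SHOW GLOBAL STATUS LIKE 'Questions';                        -- overall query volume
SHOW GLOBAL STATUS LIKE 'Slow_queries';                     -- queries over long_query_time
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';         -- reads that missed RAM and hit disk
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests'; -- total logical reads (for hit ratio)
SHOW GLOBAL STATUS LIKE 'Created_tmp_disk_tables';          -- big joins/sorts spilling to disk
SHOW GLOBAL STATUS LIKE 'Bytes_sent';
SHOW GLOBAL STATUS LIKE 'Bytes_received';                   -- traffic from MySQL's point of view
```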
What other services aside from the basic web/database have proven valuable to you?
I know nothing of Hadoop or many of the other technologies they offer. Echoing my previous question: do you sometimes find it worth it
(it being a gamble initially, basic homework aside) to dive into a whole new environment, and do you end up finding a way of more efficiently
producing your data/site product?
Anything I should watch out for in projecting costs, or any other general advice on working with the AWS folks, particularly from anyone whose company is very
niche and very, very technical (scientifically, or in any other way)?
Thanks very much for your input - I think this thread could be valuable to others as well.