Best practice for administering a (hadoop) cluster

Posted by Alex on Server Fault
Published on 2011-03-08T07:23:32Z

Dear all,

I've recently been playing with Hadoop. I have a six-node cluster up and running with HDFS, and have run a number of MapReduce jobs on it. So far, so good. However, I'm now looking to do this more systematically and with a larger number of nodes. Our base system is Ubuntu, and the current setup has been administered using apt (to install the correct Java runtime) and ssh/scp (to propagate the various conf files). This clearly won't scale as the cluster grows.

Does anyone have experience with good systems for administering Hadoop clusters automagically, including slightly heterogeneous ones (different disk sizes, different numbers of CPUs on each node)? I would consider diskless boot, but I imagine that with a large cluster, bringing everything up could be bottlenecked on the machine serving the OS image. Would some form of distributed Debian apt be a better way to keep the machines' native environments synchronised? And how do people successfully manage the conf files across a number of (potentially heterogeneous) machines?
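
To make the conf-file part of the question concrete, here is a minimal Python sketch of one possible approach: describe each node's hardware once and generate its Hadoop site files from that description, rather than hand-editing and copying them. This is not how the current cluster is set up. The `Node` class, the helper names, the hostname, the data directories and the slot-count heuristic are all hypothetical; only the property names (`mapred.tasktracker.map.tasks.maximum`, `mapred.tasktracker.reduce.tasks.maximum`, `dfs.data.dir`) are Hadoop settings of this era.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class Node:
    """Hypothetical description of one worker's hardware."""
    hostname: str
    cpus: int
    data_dirs: list  # one directory per physical disk used for HDFS blocks


def node_properties(node):
    """Return {site file name: {property: value}} for a single node.

    The slot counts use an assumed rule of thumb (leave one core free
    for the daemons); they are illustrative, not recommended values.
    """
    map_slots = max(1, node.cpus - 1)
    reduce_slots = max(1, node.cpus // 2)
    return {
        "mapred-site.xml": {
            "mapred.tasktracker.map.tasks.maximum": str(map_slots),
            "mapred.tasktracker.reduce.tasks.maximum": str(reduce_slots),
        },
        "hdfs-site.xml": {
            "dfs.data.dir": ",".join(node.data_dirs),
        },
    }


def render_site_xml(props):
    """Render a property dict in Hadoop's <configuration> XML layout."""
    lines = ['<?xml version="1.0"?>', "<configuration>"]
    for name, value in sorted(props.items()):
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(value)}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example node; hostname and paths are made up for illustration.
    worker = Node("worker-01", cpus=4, data_dirs=["/data/1/dfs", "/data/2/dfs"])
    for filename, props in node_properties(worker).items():
        print(f"--- {filename} for {worker.hostname} ---")
        print(render_site_xml(props))
```

The generated files would still need to be pushed to each node by whatever tool ends up replacing the manual ssh/scp step; the point of the sketch is only that per-node differences live in data rather than in hand-maintained copies of the conf files.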

Thanks very much in advance,

Alex

© Server Fault or respective owner
