Hadoop streaming job on EC2 stays in "pending" state
- by liamf
I'm trying to experiment with Hadoop Streaming, using the Cloudera distribution (CDH3) on Ubuntu.
I have valid data in HDFS ready for processing.
I wrote a little streaming mapper in Python.
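The mapper itself shouldn't matter much; stripped down, it is just the usual streaming pattern of reading lines from stdin and writing tab-separated key/value pairs to stdout. A simplified stand-in (not the exact script) looks like this:

    #!/usr/bin/env python
    # Simplified stand-in for mapper.py: read lines from stdin and emit
    # tab-separated key/value pairs on stdout (the real script does a bit
    # more parsing, but follows the same pattern).
    import sys

    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        fields = line.split()
        # Emit the first field as the key and a count of 1 as the value.
        print("%s\t%d" % (fields[0], 1))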
When I launch a mapper-only job using:
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming*.jar -file /usr/src/mystuff/mapper.py -mapper /usr/src/mystuff/mapper.py -input /incoming/STBFlow/* -output testOP
Hadoop duly decides it will use 66 mappers on the cluster to process the data.
The testOP directory is created on HDFS, and a job_conf.xml file is created.
But the JobTracker UI at port 50030 never shows the job moving out of the "pending" state, and nothing else happens. CPU usage stays at zero (the job is created, though).
If I give it a single file as input instead of the entire directory, I get the same result, except that Hadoop decides it needs 2 mappers instead of 66.
I also tried launching jobs with the "dumbo" Python utility: same result, permanently pending.
So I am missing something basic. Could someone help me out with what I should look for?
The cluster is on Amazon EC2. Could it be a firewall issue? Ports are enabled explicitly, case by case, in the cluster's security group.
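In case it's relevant to the firewall angle, this is roughly the kind of check I can run from a worker node to see whether it can open a TCP connection to the JobTracker at all. The host name and port below are placeholders (whatever mapred.job.tracker is set to in mapred-site.xml on my cluster), not values from the actual setup:

    # Quick sanity check from a worker node: can we open a TCP connection
    # to the JobTracker? Substitute the host/port from mapred.job.tracker
    # in mapred-site.xml; the values below are placeholders.
    import socket

    JOBTRACKER_HOST = "jobtracker.internal.example"  # placeholder
    JOBTRACKER_PORT = 8021                           # placeholder (I believe this is the CDH3 default)

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(5)
    try:
        s.connect((JOBTRACKER_HOST, JOBTRACKER_PORT))
        print("TCP connection to the JobTracker is OK")
    except socket.error as e:
        print("Cannot reach the JobTracker: %s" % e)
    finally:
        s.close()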