Hadoop streaming job on EC2 stays in "pending" state
- by liamf
I'm trying to experiment with Hadoop Streaming, using the Cloudera distribution (CDH3) on Ubuntu.
I have valid data in HDFS ready for processing.
I wrote a little streaming mapper in Python.
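The mapper itself shouldn't matter much; stripped down, it is just the usual streaming pattern of reading lines from stdin and writing tab-separated key/value pairs to stdout. A simplified stand-in (not the exact script) looks like this:

    #!/usr/bin/env python
    # Simplified stand-in for mapper.py: read lines from stdin and emit
    # tab-separated key/value pairs on stdout (the real script does a bit
    # more parsing, but follows the same pattern).
    import sys

    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        fields = line.split()
        # Emit the first field as the key and a count of 1 as the value.
        print("%s\t%d" % (fields[0], 1))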
When I launch a mapper-only job using:
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming*.jar -file /usr/src/mystuff/mapper.py -mapper /usr/src/mystuff/mapper.py -input /incoming/STBFlow/* -output testOP
Hadoop duly decides it will use 66 mappers on the cluster to process the data.
The testOP directory is created on HDFS, and a job_conf.xml file is created.
But the JobTracker UI at port 50030 never shows the job moving out of the "pending" state, and nothing else happens. CPU usage stays at zero (the job is created, though).
If I give it a single file as input instead of the entire directory, I get the same result, except that Hadoop decides it needs 2 mappers instead of 66.
I also tried launching jobs with the "dumbo" Python utility: same result, permanently pending.
So I am missing something basic. Could someone help me out with what I should look for?
The cluster is on Amazon EC2. Could it be a firewall issue? Ports are enabled explicitly, case by case, in the cluster's security group.
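In case it's relevant to the firewall angle, this is roughly the kind of check I can run from a worker node to see whether it can open a TCP connection to the JobTracker at all. The host name and port below are placeholders (whatever mapred.job.tracker is set to in mapred-site.xml on my cluster), not values from the actual setup:

    # Quick sanity check from a worker node: can we open a TCP connection
    # to the JobTracker? Substitute the host/port from mapred.job.tracker
    # in mapred-site.xml; the values below are placeholders.
    import socket

    JOBTRACKER_HOST = "jobtracker.internal.example"  # placeholder
    JOBTRACKER_PORT = 8021                           # placeholder (I believe this is the CDH3 default)

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(5)
    try:
        s.connect((JOBTRACKER_HOST, JOBTRACKER_PORT))
        print("TCP connection to the JobTracker is OK")
    except socket.error as e:
        print("Cannot reach the JobTracker: %s" % e)
    finally:
        s.close()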