Problem copying local data onto HDFS on a Hadoop cluster using Amazon EC2/S3.
- by Deepak Konidena
Hi,
I have set up a 5-node Hadoop cluster on Amazon EC2. When I log into the master node and submit the following command:
bin/hadoop jar <program>.jar <arg1> <arg2> <path/to/input/file/on/S3>
it throws one of the following errors (not both at the same time). The first is thrown when I don't replace the slashes in my secret key with '%2F', and the second when I do:
1) java.lang.IllegalArgumentException: Invalid hostname in URI s3://<ID>:<SECRETKEY>@<BUCKET>/<path-to-inputfile>
2) org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 PUT failed for '/' XML Error Message: The request signature we calculated does not match the signature you provided. Check your key and signing method.
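To narrow down whether the problem is the URI itself or the cluster setup, my plan is to test bucket access directly from the master with the Hadoop fs shell before submitting any job. My understanding (please correct me if this is wrong) is that s3n:// is for reading ordinary files that were written to S3 by other tools, while s3:// is Hadoop's own block format. The bucket and paths below are just placeholders:

# list the bucket over the native S3 filesystem (s3n)
bin/hadoop fs -ls s3n://<ID>:<ESCAPED-SECRETKEY>@<BUCKET>/
# same check over the block-based S3 filesystem (s3)
bin/hadoop fs -ls s3://<ID>:<ESCAPED-SECRETKEY>@<BUCKET>/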
Note:
1) When I run jps to see which daemons are running on the master, it shows only:
1116 NameNode
1699 Jps
1180 JobTracker
i.e., no DataNode or TaskTracker.
2) My secret key contains two forward slashes ('/'), which I replace with '%2F' in the S3 URI (see the config question below).
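Since the encoded slashes seem to be what breaks the request signature, one workaround I would like to confirm is leaving the keys out of the URI entirely and putting them in conf/hadoop-site.xml instead. I believe the relevant property names are fs.s3.awsAccessKeyId / fs.s3.awsSecretAccessKey (with fs.s3n.* equivalents for the native filesystem); the values below are placeholders:

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>

With that in place, the input path could presumably be written as just s3://<BUCKET>/<path-to-inputfile>, with no keys (and no '%2F') in the URI.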
PS: The program runs fine on EC2 on a single node; it's only when I launch a cluster that I run into issues copying data between S3 and HDFS. Also, what exactly does distcp do (see the sketch below)? Do I need to distribute the data even after I copy it from S3 to HDFS? (I thought HDFS took care of that internally.)
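For what it's worth, the only distcp usage I have seen documented is copying data between filesystems as a MapReduce job, e.g. pulling the input from S3 into HDFS once before running the actual job. A rough sketch of what I think that looks like (paths are placeholders, assuming the credentials are configured as above and that s3n:// is the right scheme for plain files):

# copy the input from S3 into the cluster's HDFS (runs as a MapReduce job)
bin/hadoop distcp s3n://<BUCKET>/<path-to-inputfile> /user/<username>/input
# then run the job against the HDFS copy
bin/hadoop jar <program>.jar <arg1> <arg2> /user/<username>/input

Is that the intended use, or is it also needed to spread the data across the DataNodes after the copy?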
If you could direct me to a link that explains running MapReduce programs on a Hadoop cluster using Amazon EC2/S3, that would be great.
Regards,
Deepak.