How does Hadoop perform input splits?

Posted by Deepak Konidena on Stack Overflow

Hi,

This is a conceptual question involving Hadoop/HDFS. Let's say you have a file containing 1 billion lines, and for the sake of simplicity, let's assume that each line is of the form <k, v>, where k is the offset of the line from the beginning of the file and v is the content of the line.

Now, when we say that we want to run N map tasks, does the framework split the input file into N splits and run each map task on one of those splits? Or do we have to write a partitioning function that produces the N splits ourselves and run each map task on the split it generates?
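For concreteness, here is a minimal driver sketch, assuming the newer org.apache.hadoop.mapreduce API; the class names SplitDemo and LineMapper are made up for illustration. Note that the job is only handed an input path and an InputFormat; the framework calls the InputFormat's getSplits() internally, so no user-written partitioning code appears here:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SplitDemo {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "split-demo");
            job.setJarByClass(SplitDemo.class);

            // TextInputFormat (the default) computes the splits itself,
            // typically one per HDFS block; the framework invokes its
            // getSplits() behind the scenes -- no user code is involved.
            job.setInputFormatClass(TextInputFormat.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            job.setMapperClass(LineMapper.class);  // hypothetical mapper, sketched below
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            job.setNumReduceTasks(0);              // map-only job, for illustration

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }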

All I want to know is whether the splits are done internally by the framework or whether we have to split the data manually.

More specifically, each time the map() function is called, what exactly are its Key key and Value val parameters?
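For illustration, here is a minimal identity mapper, again assuming the default TextInputFormat and the newer org.apache.hadoop.mapreduce API (the class name LineMapper is made up). With TextInputFormat, the key handed to map() is the byte offset of the line within the file and the value is the line's text:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // The framework calls map() once per record; with TextInputFormat a
    // record is one line of the input file.
    public class LineMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // key   = byte offset of this line from the start of the file
            // value = the content of the line itself
            context.write(key, value);  // identity map: echo the record unchanged
        }
    }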

Thanks, Deepak

