Search Results

Search found 196 results on 8 pages for 'mapreduce'.

Page 1/8 | 1 2 3 4 5 6 7 8  | Next Page >

  • Big Data – Buzz Words: What is MapReduce – Day 7 of 21

    - by Pinal Dave
    In yesterday’s blog post we learned what is Hadoop. In this article we will take a quick look at one of the four most important buzz words which goes around Big Data – MapReduce. What is MapReduce? MapReduce was designed by Google as a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Though, MapReduce was originally Google proprietary technology, it has been quite a generalized term in the recent time. MapReduce comprises a Map() and Reduce() procedures. Procedure Map() performance filtering and sorting operation on data where as procedure Reduce() performs a summary operation of the data. This model is based on modified concepts of the map and reduce functions commonly available in functional programing. The library where procedure Map() and Reduce() belongs is written in many different languages. The most popular free implementation of MapReduce is Apache Hadoop which we will explore tomorrow. Advantages of MapReduce Procedures The MapReduce Framework usually contains distributed servers and it runs various tasks in parallel to each other. There are various components which manages the communications between various nodes of the data and provides the high availability and fault tolerance. Programs written in MapReduce functional styles are automatically parallelized and executed on commodity machines. The MapReduce Framework takes care of the details of partitioning the data and executing the processes on distributed server on run time. During this process if there is any disaster the framework provides high availability and other available modes take care of the responsibility of the failed node. As you can clearly see more this entire MapReduce Frameworks provides much more than just Map() and Reduce() procedures; it provides scalability and fault tolerance as well. A typical implementation of the MapReduce Framework processes many petabytes of data and thousands of the processing machines. How do MapReduce Framework Works? A typical MapReduce Framework contains petabytes of the data and thousands of the nodes. Here is the basic explanation of the MapReduce Procedures which uses this massive commodity of the servers. Map() Procedure There is always a master node in this infrastructure which takes an input. Right after taking input master node divides it into smaller sub-inputs or sub-problems. These sub-problems are distributed to worker nodes. A worker node later processes them and does necessary analysis. Once the worker node completes the process with this sub-problem it returns it back to master node. Reduce() Procedure All the worker nodes return the answer to the sub-problem assigned to them to master node. The master node collects the answer and once again aggregate that in the form of the answer to the original big problem which was assigned master node. The MapReduce Framework does the above Map () and Reduce () procedure in the parallel and independent to each other. All the Map() procedures can run parallel to each other and once each worker node had completed their task they can send it back to master code to compile it with a single answer. This particular procedure can be very effective when it is implemented on a very large amount of data (Big Data). The MapReduce Framework has five different steps: Preparing Map() Input Executing User Provided Map() Code Shuffle Map Output to Reduce Processor Executing User Provided Reduce Code Producing the Final Output Here is the Dataflow of MapReduce Framework: Input Reader Map Function Partition Function Compare Function Reduce Function Output Writer In a future blog post of this 31 day series we will explore various components of MapReduce in Detail. MapReduce in a Single Statement MapReduce is equivalent to SELECT and GROUP BY of a relational database for a very large database. Tomorrow In tomorrow’s blog post we will discuss Buzz Word – HDFS. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

    Read the article

  • Having difficulty in mapreduce to understand

    - by mahesh
    Hi all, I have seen the below link which is of getting started mapreduce with python http://code.google.com/p/appengine-mapreduce/wiki/GettingStartedInPython But still I am not able to understand how its working. I am executing below code but not able to understand what exactly is happening? mapreduce.yaml mapreduce: - name: Testmapper mapper: input_reader: mapreduce.input_readers.DatastoreInputReader handler: main.process params: - name: entity_kind default: main.userDetail mapreduce/main.py #some code class userDetail(db.Model): name = db.StringProperty() #some code def process(u): u.name="mahesh" yield op.db.Put(u) I am executing this and it gives me status = success in status page. But not able to understand what happend The main thing I want do with mapreduce is to search or count records from entity So anyone can please help me?? Thanks in advance

    Read the article

  • MapReduce in DryadLINQ and PLINQ

    - by JoshReuben
    MapReduce See http://en.wikipedia.org/wiki/Mapreduce The MapReduce pattern aims to handle large-scale computations across a cluster of servers, often involving massive amounts of data. "The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The developer expresses the computation as two Func delegates: Map and Reduce. Map - takes a single input pair and produces a set of intermediate key/value pairs. The MapReduce function groups results by key and passes them to the Reduce function. Reduce - accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's Reduce function via an iterator." the canonical MapReduce example: counting word frequency in a text file.     MapReduce using DryadLINQ see http://research.microsoft.com/en-us/projects/dryadlinq/ and http://connect.microsoft.com/Dryad DryadLINQ provides a simple and straightforward way to implement MapReduce operations. This The implementation has two primary components: A Pair structure, which serves as a data container. A MapReduce method, which counts word frequency and returns the top five words. The Pair Structure - Pair has two properties: Word is a string that holds a word or key. Count is an int that holds the word count. The structure also overrides ToString to simplify printing the results. The following example shows the Pair implementation. public struct Pair { private string word; private int count; public Pair(string w, int c) { word = w; count = c; } public int Count { get { return count; } } public string Word { get { return word; } } public override string ToString() { return word + ":" + count.ToString(); } } The MapReduce function  that gets the results. the input data could be partitioned and distributed across the cluster. 1. Creates a DryadTable<LineRecord> object, inputTable, to represent the lines of input text. For partitioned data, use GetPartitionedTable<T> instead of GetTable<T> and pass the method a metadata file. 2. Applies the SelectMany operator to inputTable to transform the collection of lines into collection of words. The String.Split method converts the line into a collection of words. SelectMany concatenates the collections created by Split into a single IQueryable<string> collection named words, which represents all the words in the file. 3. Performs the Map part of the operation by applying GroupBy to the words object. The GroupBy operation groups elements with the same key, which is defined by the selector delegate. This creates a higher order collection, whose elements are groups. In this case, the delegate is an identity function, so the key is the word itself and the operation creates a groups collection that consists of groups of identical words. 4. Performs the Reduce part of the operation by applying Select to groups. This operation reduces the groups of words from Step 3 to an IQueryable<Pair> collection named counts that represents the unique words in the file and how many instances there are of each word. Each key value in groups represents a unique word, so Select creates one Pair object for each unique word. IGrouping.Count returns the number of items in the group, so each Pair object's Count member is set to the number of instances of the word. 5. Applies OrderByDescending to counts. This operation sorts the input collection in descending order of frequency and creates an ordered collection named ordered. 6. Applies Take to ordered to create an IQueryable<Pair> collection named top, which contains the 100 most common words in the input file, and their frequency. Test then uses the Pair object's ToString implementation to print the top one hundred words, and their frequency.   public static IQueryable<Pair> MapReduce( string directory, string fileName, int k) { DryadDataContext ddc = new DryadDataContext("file://" + directory); DryadTable<LineRecord> inputTable = ddc.GetTable<LineRecord>(fileName); IQueryable<string> words = inputTable.SelectMany(x => x.line.Split(' ')); IQueryable<IGrouping<string, string>> groups = words.GroupBy(x => x); IQueryable<Pair> counts = groups.Select(x => new Pair(x.Key, x.Count())); IQueryable<Pair> ordered = counts.OrderByDescending(x => x.Count); IQueryable<Pair> top = ordered.Take(k);   return top; }   To Test: IQueryable<Pair> results = MapReduce(@"c:\DryadData\input", "TestFile.txt", 100); foreach (Pair words in results) Debug.Print(words.ToString());   Note: DryadLINQ applications can use a more compact way to represent the query: return inputTable         .SelectMany(x => x.line.Split(' '))         .GroupBy(x => x)         .Select(x => new Pair(x.Key, x.Count()))         .OrderByDescending(x => x.Count)         .Take(k);     MapReduce using PLINQ The pattern is relevant even for a single multi-core machine, however. We can write our own PLINQ MapReduce in a few lines. the Map function takes a single input value and returns a set of mapped values àLINQ's SelectMany operator. These are then grouped according to an intermediate key à LINQ GroupBy operator. The Reduce function takes each intermediate key and a set of values for that key, and produces any number of outputs per key à LINQ SelectMany again. We can put all of this together to implement MapReduce in PLINQ that returns a ParallelQuery<T> public static ParallelQuery<TResult> MapReduce<TSource, TMapped, TKey, TResult>( this ParallelQuery<TSource> source, Func<TSource, IEnumerable<TMapped>> map, Func<TMapped, TKey> keySelector, Func<IGrouping<TKey, TMapped>, IEnumerable<TResult>> reduce) { return source .SelectMany(map) .GroupBy(keySelector) .SelectMany(reduce); } the map function takes in an input document and outputs all of the words in that document. The grouping phase groups all of the identical words together, such that the reduce phase can then count the words in each group and output a word/count pair for each grouping: var files = Directory.EnumerateFiles(dirPath, "*.txt").AsParallel(); var counts = files.MapReduce( path => File.ReadLines(path).SelectMany(line => line.Split(delimiters)), word => word, group => new[] { new KeyValuePair<string, int>(group.Key, group.Count()) });

    Read the article

  • Hadoop: Iterative MapReduce Performance

    - by S.N
    Is it correct to say that the parallel computation with iterative MapReduce can be justified only when the training data size is too large for the non-parallel computation for the same logic? I am aware that the there is overhead for starting MapReduce jobs. This can be critical for overall execution time when a large number of iterations is required. I can imagine that the sequential computation is faster than the parallel computation with iterative MapReduce as long as the memory allows to hold a data set in many cases. Is it the only benefit to use the iterative MapReduce? If not, what are the other benefits could be?

    Read the article

  • Need help implementing this algorithm with map Hadoop MapReduce

    - by Julia
    Hi all! i have algorithm that will go through a large data set read some text files and search for specific terms in those lines. I have it implemented in Java, but I didnt want to post code so that it doesnt look i am searching for someone to implement it for me, but it is true i really need a lot of help!!! This was not planned for my project, but data set turned out to be huge, so teacher told me I have to do it like this. EDIT(i did not clarified i previos version)The data set I have is on a Hadoop cluster, and I should make its MapReduce implementation I was reading about MapReduce and thaught that i first do the standard implementation and then it will be more/less easier to do it with mapreduce. But didnt happen, since algorithm is quite stupid and nothing special, and map reduce...i cant wrap my mind around it. So here is shortly pseudo code of my algorithm LIST termList (there is method that creates this list from lucene index) FOLDER topFolder INPUT topFolder IF it is folder and not empty list files (there are 30 sub folders inside) FOR EACH sub folder GET file "CheckedFile.txt" analyze(CheckedFile) ENDFOR END IF Method ANALYZE(CheckedFile) read CheckedFile WHILE CheckedFile has next line GET line FOR(loops through termList) GET third word from line IF third word = term from list append whole line to string buffer ENDIF ENDFOR END WHILE OUTPUT string buffer to file Also, as you can see, each time when "analyze" is called, new file has to be created, i understood that map reduce is difficult to write to many outputs??? I understand mapreduce intuition, and my example seems perfectly suited for mapreduce, but when it comes to do this, obviously I do not know enough and i am STUCK! Please please help.

    Read the article

  • MapReduce programming system in java-actionscript

    - by eco_bach
    Just finished reading ch23 in the excellent 'Beautiful Code' http://oreilly.com/catalog/9780596510046 on Distributed Programming with MapReduce. I understand that MapReduce is a programming system designed for large-scale data processing problems, but I have a hard time getting my head around the basic examples given and how I might apply them in real world situations. Can someone give a simple example of MapReduce implemented using either java, javascript or actionscript?

    Read the article

  • MapReduce job is hung after 1 of 5 reducers completed on single-node environment

    - by Marboni
    I have only one Data Node on my dev environment on EC2. I ran heavy MR job and in 6 hours noticed that 100% of mappers and 20% of reducers finished (1 of reducer shows 100% competition, other ones - 0%). Looks like job is hung between 2 reducer runs. I don't see any errors in log files. What it can be? P.S. Last logs of successfully finished reducer: 2012-11-09 11:29:21,576 INFO org.apache.hadoop.mapred.Task: Task:attempt_201211090523_0004_r_000000_0 is done. And is in the process of commiting 2012-11-09 11:29:22,692 INFO org.apache.hadoop.mapred.Task: Task attempt_201211090523_0004_r_000000_0 is allowed to commit now 2012-11-09 11:29:22,719 INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved output of task 'attempt_201211090523_0004_r_000000_0' to /data/output/1352457275873/20121109-053433-common 2012-11-09 11:29:22,721 INFO org.apache.hadoop.mapred.Task: Task 'attempt_201211090523_0004_r_000000_0' done. 2012-11-09 11:29:22,725 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1

    Read the article

  • Help converting java program to MapReduce job

    - by BigJoe714
    Hello, I would like to convert the following Java program to a MapReduce job. I have read about MapReduce and feel like this would be a good problem to solve using it, but I cannot figure out what to do. This basically loops through a directory of html files and parses them into a CSV file. http://www.joeharrison.com/bucket/2010/3/Main.java.txt http://www.joeharrison.com/bucket/2010/3/StringParser.java.txt Note: I am not trying to get someone else to do my work for me. I am using this as a learning experience. This will not be used for $$$profit$$$. If I have something that I created converted to a MapReduce job I feel it will provide a much easier way to learn then trying to follow an arbitrary tutorial. I posted this on Rentacoder several days ago, because I wanted to pay someone to convert it for me. But no one on rentacoder seems to know mapreudce!

    Read the article

  • restrict documents for mapreduce with mongoid

    - by theBernd
    I implemented the pearson product correlation via map / reduce / finalize. The missing part is to restrict the documents (representing users) to be processed via a filter query. For a simple query like mapreduce(mapper, reducer, :finalize => finalizer, :query => { :name => 'Bernd' }) I get this to work. But my filter criteria is a little bit more complicated: I have one set of preferences which need to have at least one common element and another set of preferences which may not have a common element. In a later step I also want to restrict this to documents (users) within a certain geographical distance. Currently I have this code working in my map function, but I would prefer to separate this into either query params as supported by mongoid or a javascript function. All my attempts to solve this failed since the code is either ignored or raises an error. I did a couple of tests. A regular find like User.where(:name.in => ['Arno', 'Bernd', 'Claudia']) works and returns #<Mongoid::Criteria:0x00000101f0ea40 @selector={:name=>{"$in"=>["Arno", "Bernd", "Claudia"]}}, @options={}, @klass=User, @documents=[]> Trying the same with mapreduce User.collection. mapreduce(mapper, reducer, :finalize => finalizer, :query => { :name.in => ['Arno', 'Bernd', 'Claudia'] }) fails with `serialize': keys must be strings or symbols (TypeError) in bson-1.1.5 The intermediate query parameter looks like this :query=>{#<Mongoid::Criterion::Complex:0x00000101a209e8 @key=:name, @operator="in">=>["Arno", "Bernd", "Claudia"]} and at least @operator looks a bit weird to me. I'm also uncertain if the class name can be omitted. BTW - I'm using mongodb 1.6.5-x86_64, and the mongoid 2.0.0.beta.20, mongo 1.1.5 and bson 1.1.5 gems on MacOS. What am I doing wrong? Thanks in advance.

    Read the article

  • Chaining multiple MapReduce jobs in Hadoop.

    - by Niels Basjes
    In many real-life situations where you apply MapReduce, the final algorithms end up being several MapReduce steps. I.e. Map1 , Reduce1 , Map2 , Reduce2 , etc. So you have the output from the last reduce that is needed as the input for the next map. The intermediate data is something you (in general) do not want to keep once the pipeline has been successfully completed. Also because this intermediate data is in general some data structure (like a 'map' or a 'set') you don't want to put too much effort in writing and reading these key-value pairs. What is the recommended way of doing that in Hadoop? Is there a (simple) example that shows how to handle this intermediate data in the correct way, including the cleanup afterward? Thanks.

    Read the article

  • Disco/MapReduce: Using results of previous iteration as input to new iteration

    - by muckabout
    Currently am implementing PageRank on Disco. As an iterative algorithm, the results of one iteration are used as input to the next iteration. I have a large file which represents all the links, with each row representing a page and the values in the row representing the pages to which it links. For Disco, I break this file into N chunks, then run MapReduce for one round. As a result, I get a set of (page, rank) tuples. I'd like to feed this rank to the next iteration. However, now my mapper needs two inputs: the graph file, and the pageranks. I would like to "zip" together the graph file and the page ranks, such that each line represents a page, it's rank, and it's out links. Since this graph file is separated into N chunks, I need to split the pagerank vector into N parallel chunks, and zip the regions of the pagerank vectors to the graph chunks This all seems more complicated than necessary, and as a pretty straightforward operation (with the quintessential mapreduce algorithm), it seems I'm missing something about Disco that could really simplify the approach. Any thoughts?

    Read the article

  • MongoDB complex MapReduce of video logs

    - by Justin Hourigan
    I have a dataset from video streaming logs. Each video is identified by a FileGUID. The log entries record the FileGUID, the fragment of the video watched and the bandwidth it was watched at. I would like to create a mapreduce outputting, for each video, a count for fragments both total and for each bandwidth. Ideally it would look like; {"FileGUID":"50acb3a5796634df0e073285", { "1":{"total":76, "0832":34, "1028":42}, "2":{"total":42, "0832":28, "1028":14}, ... } } Is this possible with one mapreduce or is it a multi-step process, or should I use a different method? Here is a sample of the data. { "_id": ObjectId("50acb3a5796634df0e073285"), "IP": "46.7.1.88", "DateTime": ISODate("2012-10-24T22:59:57.0Z"), "FileGUID": "8cdde821fb934a6da7c125a012a26612", "Bandwidth": NumberInt(1028), "Segment": NumberInt(1), "Fragment": NumberInt(237), "Status": NumberInt(200), "Size": NumberInt(576790), "UserAgent": "Mozilla\/5.0 (Windows NT 6.1; WOW64; rv:16.0) Gecko\/20100101 Firefox\/16.0" } { "_id": ObjectId("50acb3a5796634df0e073284"), "IP": "46.7.1.88", "DateTime": ISODate("2012-10-24T22:59:52.0Z"), "FileGUID": "8cdde821fb934a6da7c125a012a26612", "Bandwidth": NumberInt(1028), "Segment": NumberInt(1), "Fragment": NumberInt(236), "Status": NumberInt(200), "Size": NumberInt(577100), "UserAgent": "Mozilla\/5.0 (Windows NT 6.1; WOW64; rv:16.0) Gecko\/20100101 Firefox\/16.0" } { "_id": ObjectId("50acb3a5796634df0e073283"), "IP": "46.7.1.88", "DateTime": ISODate("2012-10-24T22:59:47.0Z"), "FileGUID": "8cdde821fb934a6da7c125a012a26612", "Bandwidth": NumberInt(0832), "Segment": NumberInt(1), "Fragment": NumberInt(234), "Status": NumberInt(200), "Size": NumberInt(576664), "UserAgent": "Mozilla\/5.0 (Windows NT 6.1; WOW64; rv:16.0) Gecko\/20100101 Firefox\/16.0" } { "_id": ObjectId("50acb3a5796634df0e073282"), "IP": "46.7.1.88", "DateTime": ISODate("2012-10-24T22:59:42.0Z"), "FileGUID": "8cdde821fb934a6da7c125a012a26612", "Bandwidth": NumberInt(0832), "Segment": NumberInt(1), "Fragment": NumberInt(233), "Status": NumberInt(200), "Size": NumberInt(575692), "UserAgent": "Mozilla\/5.0 (Windows NT 6.1; WOW64; rv:16.0) Gecko\/20100101 Firefox\/16.0" }

    Read the article

  • Amazon Elastic MapReduce: the number of launched map task

    - by S.N
    Hi, In the "syslog" for a MapReduce job flow step, I see the following: Job Counters Launched reduce tasks=4 Launched map tasks=39 Does the number of launched map tasks include failed tasks? I am using NLineInputFormat class as input format to manage the number of map tasks. However, I get slightly different numbers for exact same input occasionally, or depending on the number of instances (10, 15, and 20). Can anyone tell me why I am seeing different number of tasks launched?

    Read the article

  • Error in using Hadoop MapReduce in Eclipse

    - by Shweta
    When I executed a MapReduce program in Eclipse using Hadoop, I got the below error. It has to be some change in path, but I'm not able to figure it out. Any idea? 16:35:39 INFO mapred.JobClient: Task Id : attempt_201001151609_0001_m_000006_0, Status : FAILED java.io.FileNotFoundException: File C:/tmp/hadoop-Shwe/mapred/local/taskTracker/jobcache/job_201001151609_0001/attempt_201001151609_0001_m_000006_0/work/tmp does not exist. at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245) at org.apache.hadoop.mapred.TaskRunner.setupWorkDir(TaskRunner.java:519) at org.apache.hadoop.mapred.Child.main(Child.java:155)

    Read the article

  • Implementing PageRank using MapReduce

    - by Nick D.
    Hello, I'm trying to get my head around an issue with the theory of implementing the PageRank with MapReduce. I have the following simple scenario with three nodes: A B C. The adjacency matrix is here: A { B, C } B { A } The PageRank for B for example is equal to: (1-d)/N + d ( PR(A) / C(A) ) N = number of incoming links to B PR(A) = PageRank of incoming link A C(A) = number of outgoing links from page A I am fine with all the schematics and how the mapper and reducer would work but I cannot get my head around how at the time of calculation by the reducer, C(A) would be known. How will the reducer, when calculating the PageRank of B by aggregating the incoming links to B will know the number of outgoing links from each page. Does this require a lookup in some external data source?

    Read the article

  • Hadoop or Hadoop Streaming for MapReduce on AWS

    - by aeolist
    I'm about to start a mapreduce project which will run on AWS and I am presented with a choice, to either use Java or C++. I understand that writing the project in Java would make more functionality available to me, however C++ could pull it off too, through Hadoop Streaming. Mind you, I have little background in either language. A similar project has been done in C++ and the code is available to me. So my question: is this extra functionality available through AWS or is it only relevant if you have more control over the cloud? Is there anything else I should bear in mind in order to make a decision, like availability of plugins for hadoop that work better with one language or the other? Thanks in advance

    Read the article

  • Amazon Elastic MapReduce: Exception from FileSystem

    - by S.N
    Hi, I run my application using ruby client: ruby elastic-mapreduce -j j-20PEKMT9BRSUC --jar s3n://sakae55/lib/edu.cit.som.jar --main-class edu.cit.som.hadoop.SOMDriver --arg s3n://sakae55/repository/input/ecoli/ --arg s3n://sakae55/repository/output/ecoli/pl/ --arg s3n://sakae55/repository/data/ecoli/som.txt Then, I am seeing the following error: java.lang.IllegalArgumentException: This file system object (file:///) does not support access to the request path 'hdfs://i -10-195-207-230.ec2.internal:9000/mnt/var/lib/hadoop/tmp/mapred/system/job_201004221221_0017/job.jar' You possibly called Fi eSystem.get(conf) when you should of called FileSystem.get(uri, conf) to obtain a file system supporting your path. at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320) at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:52) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:416) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:259) at org.apache.hadoop.fs.FileSystem.isDirectory(FileSystem.java:676) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:200) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1184) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1160) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1132) at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:662) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:729) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1026) at edu.cit.som.hadoop.SOMDriver.runIteration(SOMDriver.java:106) at edu.cit.som.hadoop.SOMDriver.train(SOMDriver.java:69) at edu.cit.som.hadoop.SOMDriver.run(SOMDriver.java:52) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at edu.cit.som.hadoop.SOMDriver.main(SOMDriver.java:36) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:155) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68) I am not sure why the error references to "file:///" even though all the arguments I pass do not use the schema.

    Read the article

  • MongoDB: What's the point of using MapReduce without parallelism?

    - by netvope
    Quoting http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-Parallelism As of right now, MapReduce jobs on a single mongod process are single threaded Without parallelism, what are the benefits of MapReduce compared to simpler or more traditional methods for queries and data aggregation? To avoid confusion: the question is NOT "what are the benefits of document-oriented DB over traditional relational DB"

    Read the article

  • Project Idea with Hadoop MapReduce

    - by Aditya Andhalikar
    Hello, I learnt Hadoop a few months back and managed to do a very introductory programming project on it. I want to do a small - medium sized project or series of small programming assignments with Hadoop. I have seen lot of ideas around but I dont see anything that can be finished in about 60-70 hours of work so a pretty small scale project as I want to do that in my spare time along with other studies. Most project ideas I have seen sort of large to go on for 2-3 months. My main objective out of this exercise to develop good expertise in programming with Hadoop environment not to do any research or solve specific problems. I see Hadoop being used lot of with webservices maybe that would be an interesting track for small projects. Thank you in advance. Regards, Aditya

    Read the article

  • MapReduce for counting parameter values

    - by cnkt
    I have document like this: { "_id": ObjectId("4d17c7963ffcf60c1100002f"), "title": "Text", "params": { "brand": "BMW", "model": "i3" } } { "_id": ObjectId("4d17c7963ffcf60c1100002f"), "title": "Text", "params": { "brand": "BMW", "model": "i5" } } What i need is the count of every params values. like: brand --------- BMW (2) model --------- i3 (1) i5 (1) I think i have to write map/reduce functions. How can i do this? Thanks.

    Read the article

  • Hadoop/MapReduce: Reading and writing classes generated from DDL

    - by Dave
    Hi, Can someone walk me though the basic work-flow of reading and writing data with classes generated from DDL? I have defined some struct-like records using DDL. For example: class Customer { ustring FirstName; ustring LastName; ustring CardNo; long LastPurchase; } I've compiled this to get a Customer class and included it into my project. I can easily see how to use this as input and output for mappers and reducers (the generated class implements Writable), but not how to read and write it to file. The JavaDoc for the org.apache.hadoop.record package talks about serializing these records in Binary, CSV or XML format. How do I actually do that? Say my reducer produces IntWritable keys and Customer values. What OutputFormat do I use to write the result in CSV format? What InputFormat would I use to read the resulting files in later, if I wanted to perform analysis over them?

    Read the article

  • Java MapReduce read data

    - by Tatiana
    Hi I am having following map-reduce code by which I am trying to read records from my database. There's code: import java.io.*; import java.util.ArrayList; import java.util.List; import org.apache.hadoop.fs.*; import org.apache.hadoop.io.*; import org.apache.hadoop.mapred.*; import org.apache.hadoop.mapred.lib.db.DBConfiguration; import org.apache.hadoop.mapred.lib.db.DBInputFormat; import org.apache.hadoop.mapred.lib.db.DBWritable; import org.apache.hadoop.util.*; import org.apache.hadoop.conf.*; public class Connection extends Configured implements Tool { public int run(String[] args) throws IOException { JobConf conf = new JobConf(getConf(), Connection.class); conf.setInputFormat(DBInputFormat.class); DBConfiguration.configureDB(conf, "com.sun.java.util.jar.pack.Driver", "jdbc:postgresql://localhost:5432/polyclinic", "postgres", "12345"); String[] fields = { "name" }; DBInputFormat.setInput(conf, MyRecord.class, "doctors", null, null, fields); conf.setMapOutputKeyClass(LongWritable.class); conf.setMapOutputValueClass(MyRecord.class); conf.setOutputKeyClass(LongWritable.class); conf.setOutputValueClass(TextOutputFormat.class); TextOutputFormat.setOutputPath(conf, new Path(args[0])); JobClient.runJob(conf); return 0; } public static void main(String[] args) throws Exception { int exitCode = ToolRunner.run(new Connection(), args); System.exit(exitCode); } } Class Mapper: import java.io.IOException; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapred.MapReduceBase; import org.apache.hadoop.mapred.Mapper; import org.apache.hadoop.mapred.OutputCollector; import org.apache.hadoop.mapred.Reporter; public class MyMapper extends MapReduceBase implements Mapper<LongWritable, MyRecord, Text, IntWritable> { public void map(LongWritable key, MyRecord val, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { output.collect(new Text(val.name), new IntWritable(1)); } } Class Record: import java.io.DataInput; import java.io.DataOutput; import java.io.IOException; import java.sql.PreparedStatement; import java.sql.ResultSet; import java.sql.SQLException; import org.apache.hadoop.io.Text; import org.apache.hadoop.io.Writable; import org.apache.hadoop.mapred.lib.db.DBWritable; class MyRecord implements Writable, DBWritable { String name; public void readFields(DataInput in) throws IOException { this.name = Text.readString(in); } public void readFields(ResultSet resultSet) throws SQLException { this.name = resultSet.getString(1); } public void write(DataOutput out) throws IOException { } public void write(PreparedStatement stmt) throws SQLException { } } After this I got error: WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String). Can you give me any suggestion how to solve this problem?

    Read the article

1 2 3 4 5 6 7 8  | Next Page >