Search Results

Search found 196 results on 8 pages for 'mapreduce'.

Page 3/8 | < Previous Page | 1 2 3 4 5 6 7 8 | Next Page >

Hadoop WordCount example stuck at map 100% reduce 0%

- by Abhinav Sharma

[hadoop-1.0.2] ? hadoop jar hadoop-examples-1.0.2.jar wordcount /user/abhinav/input /user/abhinav/output Warning: $HADOOP_HOME is deprecated. ****hdfs://localhost:54310/user/abhinav/input 12/04/15 15:52:31 INFO input.FileInputFormat: Total input paths to process : 1 12/04/15 15:52:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 12/04/15 15:52:31 WARN snappy.LoadSnappy: Snappy native library not loaded 12/04/15 15:52:31 INFO mapred.JobClient: Running job: job_201204151241_0010 12/04/15 15:52:32 INFO mapred.JobClient: map 0% reduce 0% 12/04/15 15:52:46 INFO mapred.JobClient: map 100% reduce 0% I've set up hadoop on a single node using this guide (http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/#run-the-mapreduce-job) and I'm trying to run a provided example but I'm getting stuck at map 100% reduce 0%. What could be causing this?

Read the article
How to use Cassandra's Map Reduce with or w/o Pig?

- by UltimateBrent

Can someone explain how MapReduce works with Cassandra .6? I've read through the word count example, but I don't quite follow what's happening on the Cassandra end vs. the "client" end. https://svn.apache.org/repos/asf/cassandra/trunk/contrib/word_count/ For instance, let's say I'm using Python and Pycassa, how would I load in a new map reduce function, and then call it? Does my map reduce function have to be java that's installed on the cassandra server? If so, how do I call it from Pycassa? There's also mention of Pig making this all easier, but I'm a complete Hadoop noob, so that didn't really help. Your answer can use Thrift or whatever, I just mentioned Pycassa to denote the client side. I'm just trying to understand the difference between what runs in the Cassandra cluster vs. the actual server making the requests.

Read the article
Running Awk command on a cluster

- by alex

How do you execute a Unix shell command (awk script, a pipe etc) on a cluster in parallel (step 1) and collect the results back to a central node (step 2) Hadoop seems to be a huge overkill with its 600k LOC and its performance is terrible (takes minutes just to initialize the job) i don't need shared memory, or - something like MPI/openMP as i dont need to synchronize or share anything, don't need a distributed VM or anything as complex Google's SawZall seems to work only with Google proprietary MapReduce API some distributed shell packages i found failed to compile, but there must be a simple way to run a data-centric batch job on a cluster, something as close as possible to native OS, may be using unix RPC calls i liked rsync simplicity but it seem to update remote notes sequentially, and you cant use it for executing scripts as afar as i know switching to Plan 9 or some other network oriented OS looks like another overkill i'm looking for a simple, distributed way to run awk scripts or similar - as close as possible to data with a minimal initialization overhead, in a nothing-shared, nothing-synchronized fashion Thanks Alex

Read the article
Multiple lines of text to a single map

- by steven

I've been trying to use Hadoop to send N amount of lines to a single mapping. I don't require for the lines to be split already. I've tried to use NLineInputFormat, however that sends N lines of text from the data to each mapper one line at a time [giving up after the Nth line]. I have tried to set the option and it only takes N lines of input sending it at 1 line at a time to each map: job.setInt("mapred.line.input.format.linespermap", 10); I've found a mailing list recommending me to override LineRecordReader::next, however that is not that simple, as that the internal data members are all private. I've just checked the source for NLineInputFormat and it hard codes LineReader, so overriding will not help. Also, btw I'm using Hadoop 0.18 for compatibility with the Amazon EC2 MapReduce.

Read the article
What's the best way to count unique visitors with Hadoop?

- by beagleguy

hey all, just getting started on hadoop and curious what the best way in mapreduce would be to count unique visitors if your logfiles looked like this... DATE siteID action username 05-05-2010 siteA pageview jim 05-05-2010 siteB pageview tom 05-05-2010 siteA pageview jim 05-05-2010 siteB pageview bob 05-05-2010 siteA pageview mike and for each site you wanted to find out the unique visitors for each site? I was thinking the mapper would emit siteID \t username and the reducer would keep a set() of the unique usersnames per key and then emit the length of that set. However that would be potentially storing millions of usernames in memory which doesn't seem right. Anyone have a better way? I'm using python streaming by the way thanks

Read the article
Configuring Hadoop logging to avoid too many log files

- by Eric Wendelin

I'm having a problem with Hadoop producing too many log files in $HADOOP_LOG_DIR/userlogs (the Ext3 filesystem allows only 32000 subdirectories) which looks like the same problem in this question: http://stackoverflow.com/questions/2091287/error-in-hadoop-mapreduce My question is: does anyone know how to configure Hadoop to roll the log dir or otherwise prevent this? I'm trying to avoid just setting the "mapred.userlog.retain.hours" and/or "mapred.userlog.limit.kb" properties because I want to actually keep the log files. I was also hoping to configure this in log4j.properties, but looking at the Hadoop 0.20.2 source, it writes directly to logfiles instead of actually using log4j. Perhaps I don't understand how it's using log4j fully. Any suggestions or clarifications would be greatly appreciated.

Read the article
MongoDB map/reduce counts

- by ibz

The output from MongoDB's map/reduce includes something like 'counts': {'input': I, 'emit': E, 'output': O}. I thought I clearly understand what those mean, until I hit a weird case which I can't explain. According to my understanding, counts.input is the number of rows that match the condition (as specified in query). If so, how is it possible that the following two queries have different results? db.mycollection.find({MY_CONDITION}).count() db.mycollection.mapReduce(SOME_MAP, SOME_REDUCE, {'query': {MY_CONDITION}}).counts.input I thought the two should always give the same result, independent of the map and reduce functions, as long as the same condition is used.

Read the article
HBase as a multimap

- by Ibrahim

Hi guys, I'm doing some large scale text processing work and I'm trying to get started with Hadoop and HBase. One of the things I need to do is build a multimap of some stuff, which I later use to look up things and get all items with a certain key (in a M/R job). Would it be OK to use HBase and insert many rows with the same key and rely on versions/timestamps to achieve a multimap-like setup or is this a bad idea? The multimap is built up in the reduce phase of a Mapreduce task by the way, or at least in the way I've formulated it on paper. Thanks! If more information is needed, I'd be happy to provide it. Not sure whether this question is clear.

Read the article
0.20.2 API hadoop version with java 5

- by abdeslam

I have started a maven project trying to implement the MapReduce algorithm in java 1.5.0_14. I have chosen the 0.20.2 API hadoop version. In the pom.xml i'm using thus the following dependency: < dependency < groupId>org.apache.hadoop< /groupId> < artifactId>hadoop-core< /artifactId> < version>0.20.2< /version> < /dependency But when I'm using an import to the org.apache.hadoop classes, I get the following error: bad class file: ${HOME_DIR}\repository\org\apache\hadoop\hadoop-core\0.20.2\hadoop-core-0.20.2.jar(org/apache/hadoop/fs/Path.class) class file has wrong version 50.0, should be 49.0. Does someone know how can I solve this issue. Thanks.

Read the article
Architecture for analysing search result impressions/clicks to improve future searches

- by Hais

We have a large database of items (10m+) stored in MySQL and intend to implement search on metadata on these items, taking advantage of something like Sphinx. The dataset will be changing slightly on a daily basis so Sphinx will be re-indexing daily. However we want the algorithm to self-learn and improve search results by analysing impression and click data so that we provide better results for our customers on that search term, and possibly other similar search terms too. I've been reading up on Hadoop and it seems like it has the potential to crunch all this data, although I'm still unsure how to approach it. Amazon has tutorials for compiling impression vs click data using MapReduce but I can't see how to get this data in a useable format. My idea is that when a search term comes in I query Sphinx to get all the matching items from the dataset, then query the analytics (compiled on an hourly basis or similar) so that we know the most popular items for that search term, then cache the final results using something like Memcached, Membase or similar. Am I along the right lines here?

Read the article
file to map/reduce program

- by vana

Hi , I am working on extracting Parts Of speech (POS) using xml documents and I have a englishPCFG.ser.gz file which is used in extracting POS on xml files. I cannot send this .gz file as input in HDFS directory, but my program uses it for parsing xml files. The file is in my local directory. I am getting "File Not Found" error when I run my mapreduce program. How can i make it available to mapper? I tried placing the file in HDFS where my xml files are present. I also tried adding it in .jar along with class files but not luck. I tried to change the hdfs-default.xml with entry to local directory, still doesnt work. Please let me know how to make mapper read this file? Thank you,

Read the article
hadoop - large database query

- by Mastergeek

Situation: I have a Postgres DB that contains a table with several million rows and I'm trying to query all of those rows for a MapReduce job. From the research I've done on DBInputFormat, Hadoop might try and use the same query again for a new mapper and since these queries take a considerable amount of time I'd like to prevent this in one of two ways that I've thought up: 1) Limit the job to only run 1 mapper that queries the whole table and call it good. or 2) Somehow incorporate an offset in the query so that if Hadoop does try to use a new mapper it won't grab the same stuff. I feel like option (1) seems more promising, but I don't know if such a configuration is possible. Option(2) sounds nice in theory but I have no idea how I would keep track of the mappers being made and if it is at all possible to detect that and reconfigure. Help is appreciated and I'm namely looking for a way to pull all of the DB table data and not have several of the same query running because that would be a waste of time.

Read the article
Getting started with massive data

- by Max

I'm a math guy and occasionally do some statistics/machine learning analysis consulting projects on the side. The data I have access to are usually on the smaller side, at most a couple hundred of megabytes (and almost always far less), but I want to learn more about handling and analyzing data on the gigabyte/terabyte scale. What do I need to know and what are some good resources to learn from? Hadoop/MapReduce is one obvious start. Is there a particular programming language I should pick up? (I primarily work now in Python, Ruby, R, and occasionally Java, but it seems like C and Clojure are often used for large-scale data analysis?) I'm not really familiar with the whole NoSQL movement, except that it's associated with big data. What's a good place to learn about it, and is there a particular implementation (Cassandra, CouchDB, etc.) I should get familiar with? Where can I learn about applying machine learning algorithms to huge amounts of data? My math background is mostly on the theory side, definitely not on the numerical or approximation side, and I'm guessing most of the standard ML algorithms don't really scale. Any other suggestions on things to learn would be great!

Read the article
Using map/reduce for mapping the properties in a collection

- by And

Update: follow-up to MongoDB Get names of all keys in collection. As pointed out by Kristina, one can use Mongodb 's map/reduce to list the keys in a collection: db.things.insert( { type : ['dog', 'cat'] } ); db.things.insert( { egg : ['cat'] } ); db.things.insert( { type : [] }); db.things.insert( { hello : [] } ); mr = db.runCommand({"mapreduce" : "things", "map" : function() { for (var key in this) { emit(key, null); } }, "reduce" : function(key, stuff) { return null; }}) db[mr.result].distinct("_id") //output: [ "_id", "egg", "hello", "type" ] As long as we want to get only the keys located at the first level of depth, this works fine. However, it will fail retrieving those keys that are located at deeper levels. If we add a new record: db.things.insert({foo: {bar: {baaar: true}}}) And we run again the map-reduce +distinct snippet above, we will get: [ "_id", "egg", "foo", "hello", "type" ] But we will not get the bar and the baaar keys, which are nested down in the data structure. The question is: how do I retrieve all keys, no matter their level of depth? Ideally, I would actually like the script to walk down to all level of depth, producing an output such as: ["_id","egg","foo","foo.bar","foo.bar.baaar","hello","type"] Thank you in advance!

Read the article
Using MongoDB's map/reduce to "group by" two fields

- by ibz

I need something slightly more complex than the examples in the MongoDB docs and I can't seem to be able to wrap my head around it. Say I have a collection of objects of the form {date: "2010-10-10", type: "EVENT_TYPE_1", user_id: 123, ...} Now I want to get something similar to a SQL GROUP BY query, grouping over both date and type. That is, I want the number of events of each type in each day. Also, I'd like to make it unique by user_id, ie. if a user has more events in the same day, count it only once. I'm trying to do this with map/reduce. I do db.logs.mapReduce(function() { emit(this.type, 1); }, function(k, vals) { var total = 0; for (var i = 0; i < vals.length; i++) total += vals[i]; return total; }}) which nicely groups by type, but now, how can I group by date at the same time? Seems the key in emit() can't be an array (I thought about doing emit([this.date, this.type], 1)). Also, how can I ensure the per-user uniqueness? I'm just starting with MongoDB and I'm still having trouble grasping the basic concepts. Also, there is not much documentation available out there. Any help from more experienced users is appreciated. Thanks!

Read the article
Database solution for 200million writes/day, monthly summarization queries

- by sb

Hello. I'm looking for help deciding on which database system to use. (I've been googling and reading for the past few hours; it now seems worthwhile to ask for help from someone with firsthand knowledge.) I need to log around 200 million rows (or more) per 8 hour workday to a database, then perform weekly/monthly/yearly summary queries on that data. The summary queries would be for collecting data for things like billing statements, eg. "How many transactions of type A did each user run this month?" (could be more complex, but that's the general idea). I can spread the database amongst several machines, as necessary, but I don't think I can take old data offline. I'll definitely need to be able to query a month's worth of data, maybe a year. These queries would be for my own use, and wouldn't need to be generated in real-time for an end-user (they could run overnight, if needed). Does anyone have any suggestions as to which databases would be a good fit? P.S. Cassandra looks like it would have no problem handling the writes, but what about the huge monthly table scans? Is anyone familiar with Cassandra/Hadoop MapReduce performance?

Read the article
Strange results - I obtain same value for all keys

- by Pietro Luciani

I have a problem with mapreduce. Giving as input a list of song ("Songname"#"UserID"#"boolean") i must have as result a song list in which is specified how many time different useres listen them... so a output ("Songname","timelistening"). I used hashtable to allow only one couple . With short files it works well but when I put as input a list about 1000000 of records it returns me the same value (20) for all records. This is my mapper: public static class CanzoniMapper extends Mapper<Object, Text, Text, IntWritable>{ private IntWritable userID = new IntWritable(0); private Text song = new Text(); public void map(Object key, Text value, Context context) throws IOException, InterruptedException { /*StringTokenizer itr = new StringTokenizer(value.toString()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); }*/ String[] caratteri = value.toString().split("#"); if(caratteri[2].equals("1")){ song.set(caratteri[0]); userID.set(Integer.parseInt(caratteri[1])); context.write(song,userID); } } } This is my reducer: public static class CanzoniReducer extends Reducer<Text,IntWritable,Text,IntWritable> { private IntWritable result = new IntWritable(); public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { Hashtable<IntWritable,Text> doppioni = new Hashtable<IntWritable,Text>(); for (IntWritable val : values) { doppioni.put(val,key); } result.set(doppioni.size()); //doppioni.clear(); context.write(key,result); } } and main: Configuration conf = new Configuration(); Job job = new Job(conf, "word count"); job.setJarByClass(Canzoni.class); job.setMapperClass(CanzoniMapper.class); //job.setCombinerClass(CanzoniReducer.class); //job.setNumReduceTasks(2); job.setReducerClass(CanzoniReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); Any idea???

Read the article
Strange Map Reduce Behavior in CouchDB. Rereduce?

- by Tony

I have a mapreduce issue with couchdb (both functions shown below): when I run it with grouplevel = 2 (exact) I get accurate output: {"rows":[ {"key":["2011-01-11","staff-1"],"value":{"total":895.72,"count":2,"services":6,"services_ignored":6,"services_liked":0,"services_disliked":0,"services_disliked_avg":0,"Revise":{"total":275.72,"count":1},"Review":{"total":620,"count":1}}}, {"key":["2011-01-11","staff-2"],"value":{"total":8461.689999999999,"count":2,"services":41,"services_ignored":37,"services_liked":4,"services_disliked":0,"services_disliked_avg":0,"Revise":{"total":4432.4,"count":1},"Review":{"total":4029.29,"count":1}}}, {"key":["2011-01-11","staff-3"],"value":{"total":2100.72,"count":1,"services":10,"services_ignored":4,"services_liked":3,"services_disliked":3,"services_disliked_avg":2.3333333333333335,"Revise":{"total":2100.72,"count":1}}}, However, changing to grouplevel=1 so the values for all the different staff keys should be all grouped by date no longer gives accurate output (notice the total is currect but all others are wrong): {"rows":[ {"key":["2011-01-11"],"value":{"total":11458.130000000001,"count":2,"services":0,"services_ignored":0,"services_liked":0,"services_disliked":0,"services_disliked_avg":0,"None":{"total":11458.130000000001,"count":2}}}, My only theory is this has something to do with rereduce, which I have not yet learned. Should I explore that option or am I missing something else here? This is the Map function: function(doc) { if(doc.doc_type == 'Feedback') { emit([doc.date.split('T')[0], doc.staff_id], doc); } } And this is the Reduce: function(keys, vals) { // sum all key points by status: total, count, services (liked, rejected, ignored) var ret = { 'total':0, 'count':0, 'services': 0, 'services_ignored': 0, 'services_liked': 0, 'services_disliked': 0, 'services_disliked_avg': 0, }; var total_disliked_score = 0; // handle status function handle_status(doc) { if(!doc.status || doc.status == '' || doc.status == undefined) { status = 'None'; } else if (doc.status == 'Declined') { status = 'Rejected'; } else { status = doc.status; } if(!ret[status]) ret[status] = {'total':0, 'count':0}; ret[status]['total'] += doc.total; ret[status]['count'] += 1; }; // handle likes / dislikes function handle_services(services) { ret.services += services.length; for(var a in services) { if (services[a].user_likes == 10) { ret.services_liked += 1; } else if (services[a].user_likes >= 1) { ret.services_disliked += 1; total_disliked_score += services[a].user_likes; if (total_disliked_score >= ret.services_disliked) { ret.services_disliked_avg = total_disliked_score / ret.services_disliked; } } else { ret.services_ignored += 1; } } } // loop thru docs for(var i in vals) { // increment the total $ ret.total += vals[i].total; ret.count += 1; // update totals and sums for the status of this route handle_status(vals[i]); // do the likes / dislikes stats if(vals[i].groups) { for(var ii in vals[i].groups) { if(vals[i].groups[ii].services) { handle_services(vals[i].groups[ii].services); } } } // handle deleted services if(vals[i].hidden_services) { if (vals[i].hidden_services) { handle_services(vals[i].hidden_services); } } } return ret; }

Read the article
Sybase IQ 15.4 annoncé : Sybase parie sur Hadoop et MapReduce, et défie sa maison mère ?

Sybase IQ 15.4 annoncé pour fin novembre Sybase veut repousser les limites du Big Data avec Hadoop et MapReduce Alors que la grand messe annuelle de SAP, le SAPPHIRE NOW, battait son plein, la nouvelle filiale de l'éditeur allemand Sybase a annoncé en totale indépendance la sortie de Sybase IQ 15.4, son serveur analytique haute performance structuré en colonnes pour gérer les "big data". Alors que de son côté SAP met en avant HANA, sa nouvelle technologie de mise en cache des données (ou "In-Memory Computing") pour accélérer la vitesse de traite...

Read the article
Mongo Map Reduce first time

- by James

Hello guys, First time Map/Reduce user here, and using MongoDB. I have a lot of page visit data which I'd like to make some sense of by using Map/Reduce. Below is basically what I want to do, but as a total beginner a Map/Reduce, I think this is above my knowledge! Find all visits to current page where external = true within the last 30 days (unix timestamp, I deal with the date ranges in PHP and then the array, not mongo date) Group all visits by referral location For each referral location, calculate how many then went to visit a page which has a certain word in the [tags]. I'm using the normal Mongo PHP extension if that has an impact.

Read the article
solve a classic map-reduce problem with opencl?

- by liuliu

I am trying to parallel a classic map-reduce problem (which can parallel well with MPI) with OpenCL, namely, the AMD implementation. But the result bothers me. Let me brief about the problem first. There are two type of data that flow into the system: the feature set (30 parameters for each) and the sample set (9000+ dimensions for each). It is a classic map-reduce problem in the sense that I need to calculate the score of every feature on every sample (Map). And then, sum up the overall score for every feature (Reduce). There are around 10k features and 30k samples. I tried different ways to solve the problem. First, I tried to decompose the problem by features. The problem is that the score calculation consists of random memory access (pick some of the 9000+ dimensions and do plus/subtraction calculations). Since I cannot coalesce memory access, it costs. Then, I tried to decompose the problem by samples. The problem is that to sum up overall score, all threads are competing for few score variables. It keeps overwriting the score which turns out to be incorrect. (I cannot carry out individual score first and sum up later because it requires 10k * 30k * 4 bytes). The first method I tried gives me the same performance on i7 860 CPU with 8 threads. However, I don't think the problem is unsolvable: it is remarkably similar to ray tracing problem (for which you carry out calculation that millions of rays against millions of triangles). Any ideas?

Read the article
Map/Reduce on an array of hashes in CouchDB

- by sebastiangeiger

Hello everyone, I am looking for a map/reduce function to calculate the status in a Design Document. Below you can see an example document from my current database. { "_id": "0238f1414f2f95a47266ca43709a6591", "_rev": "22-24a741981b4de71f33cc70c7e5744442", "status": "retrieved image urls", "term": "Lucas Winter", "urls": [ { "status": "retrieved", "url": "http://...." }, { "status": "retrieved", "url": "http://..." } ], "search_depth": 1, "possible_labels": { "gender": "male" }, "couchrest-type": "SearchTerm" } I'd like to get rid of the status key and rather calculate it from the statuses of the urls. My current by_status view looks like the following: function(doc) { if (doc['status']) { emit(doc['status'], null); } } I tried some things but nothing actually works. Right now my Map Function looks like this: function(doc) { if(doc.urls){ emit(doc._id, doc.urls) } } And my Reduce Function function(key, value, rereduce){ var reduced_status = "retrieved" for(var url in value){ if(url.status=="new"){ reduced_status = "new"; } } return reduced_status; } The result is that I get retrieved everywhere which is definitely not right. I tried to narrow down the problem and it seems to be that value is no array, when I use the following Reduce Function I get length 1 everywhere, which is impossible because I have 12 documents in my database, each containing between 20 to 200 urls function(key, value, rereduce){ return value.length; } What am I doing wrong? (I know I want you to write code for me and I'm feeling guilty, but right now I do the calculation of the statuses in ruby after getting the data from the database. It would be nice to already get the right data from the database)

Read the article
Hadoop streaming job : stuck

- by Algorist

Hi, I am running a hadoop streaming job. It got stuck due to no reason. I am not sure how to cancel the task, so that hadoop schedules another task for the same job. I tried killing the job, but it still doesn't work. Anyone know, how to do this? Thank you Bala

Read the article
Where do I start with distributed computing?

- by folone

I'm interested in learning techniques for distributed computing. As a Java developer, I'm probably willing to start with Hadoop. Could you please recommend some books/tutorials/articles to begin with?

Read the article
Unable to run MR on cluster

- by RAVITEJA SATYAVADA

I have an Map reduce program that is running successfully in standalone(Ecllipse) mode but while trying to run the same MR by exporting the jar in cluster. It is showing null pointer exception like this, 13/06/26 05:46:22 ERROR mypackage.HHDriver: Error while configuring run method. java.lang.NullPointerException I double checked the run method parameters those are not null and it is running in standalone mode as well..

Read the article

< Previous Page | 1 2 3 4 5 6 7 8 | Next Page >