Search Results

Search found 196 results on 8 pages for 'mapreduce'.

  • How to pick random (small) data samples using Map/Reduce?

    - by Andrei Savu
    I want to write a map/reduce job to select a number of random samples from a large dataset based on a row-level condition. I want to minimize the number of intermediate keys. Pseudocode:

        for each row
            if row matches condition
                put the row.id in the bucket if the bucket is not already large enough

    Have you done something like this? Is there any well-known algorithm? A sample containing sequential rows is also good enough. Thanks.
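
    One common way to approach this kind of sampling is to tag every matching row with a random number and keep only the k smallest tags. The sketch below assumes rows arrive as `(row_id, row)` pairs; the function names `map_sample` and `reduce_sample` and the `matches` predicate are hypothetical, not part of any framework. Each mapper emits at most k pairs, which keeps the intermediate data small.

```python
import heapq
import random


def map_sample(rows, k, matches, rng):
    """Mapper side: tag each matching row id with a random number and
    keep only the k smallest tags seen in this input split."""
    tagged = ((rng.random(), row_id) for row_id, row in rows if matches(row))
    return heapq.nsmallest(k, tagged)


def reduce_sample(partials, k):
    """Reducer side: merge the per-split candidates and keep the k
    smallest tags overall, which gives a uniform sample of size k."""
    merged = (pair for partial in partials for pair in partial)
    return [row_id for _, row_id in heapq.nsmallest(k, merged)]
```

    All mappers can emit under a single intermediate key, so one reducer sees at most k candidates per split.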

  • How to calculate Centered Moving Average of a set of data in Hadoop Map-Reduce?

    - by 100gods
    I want to calculate the centered moving average of a set of data. Example input format:

        quarter | sales
        Q1'11   | 9
        Q2'11   | 8
        Q3'11   | 9
        Q4'11   | 12
        Q1'12   | 9
        Q2'12   | 12
        Q3'12   | 9
        Q4'12   | 10

    Mathematical representation of the data and the calculation of the moving average (MA) and then the centered moving average:

        Period  Value  MA     Centered
        1       9
        1.5
        2       8
        2.5            9.5
        3       9             9.5
        3.5            9.5
        4       12            10.0
        4.5            10.5
        5       9             10.750
        5.5            11.0
        6       12
        6.5
        7       9

    I am stuck with the implementation of a RecordReader that will provide the mapper with the sales values of a year, i.e. of four quarters. See: The RecordReader Problem Question Thread. Thanks
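
    As a plain, single-machine reference for the arithmetic (not a Hadoop RecordReader), the calculation can be sketched as below. It assumes an even window of four quarters, so each centered value is the mean of two adjacent moving averages; the function names are illustrative only.

```python
def moving_average(values, window):
    """Simple moving average over each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def centered_moving_average(values, window=4):
    """For an even window, center by averaging adjacent moving averages."""
    ma = moving_average(values, window)
    return [(a + b) / 2 for a, b in zip(ma, ma[1:])]


# Quarterly sales from the example input above.
sales = [9, 8, 9, 12, 9, 12, 9, 10]
centered = centered_moving_average(sales)
```

    A distributed version needs each window's values to reach the same task, which is what the custom RecordReader in the question is meant to arrange.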

  • Parallelizing a serial algorithm

    - by user643813
    Hi folks, I am working on porting a text mining / natural language application from single-core to a Map-Reduce style system. One of the steps involves a while loop similar to this:

        Queue<Element> queue;
        while (!queue.empty()) {
            Element e = queue.next();
            Set<Element> result = calculateResultSet(e);
            if (!result.empty()) {
                queue.addAll(result);
            }
        }

    Each iteration depends on the result of the one before (kind of). There is no way of determining in advance how many iterations this loop will have to perform. Is there a way of parallelizing a serial algorithm such as this one? I am trying to think of a feedback mechanism that is able to provide its own input, but how would one go about parallelizing it? Thanks for any help/remarks
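
    One possible shape for such a feedback loop is to process the queue in rounds: everything currently queued is handled in parallel, and the union of the results becomes the input of the next round. The sketch below uses a thread pool in place of a cluster. It assumes elements are hashable and that skipping already-seen elements is acceptable, which the original loop does not do; `process_in_rounds` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor


def process_in_rounds(seeds, calculate_result_set, max_workers=4):
    """Run the worklist in rounds; each round maps over the whole frontier."""
    seen = set(seeds)
    frontier = list(seeds)
    rounds = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier:
            # "Map" step: every element of the frontier is independent.
            results = pool.map(calculate_result_set, frontier)
            # "Reduce" step: merge the result sets into the next frontier.
            next_frontier = []
            for result in results:
                for element in result:
                    if element not in seen:
                        seen.add(element)
                        next_frontier.append(element)
            frontier = next_frontier
            rounds += 1
    return seen, rounds
```

    In a Map-Reduce setting each round would be one job, with a driver that resubmits until a round produces no new elements.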

  • Can you use MongoDB map/reduce to migrate data?

    - by Brian Armstrong
    I have a large collection where I want to modify all the documents by populating a field. A simple example might be caching the comment count on each post:

        class Post
          field :comment_count, type: Integer
          has_many :comments
        end

        class Comment
          belongs_to :post
        end

    I can run it in serial with something like:

        Post.all.each do |p|
          p.update_attribute :comment_count, p.comments.count
        end

    But it's taking 24 hours to run (large collection). I was wondering if Mongo's map/reduce could be used for this, but I haven't seen a great example yet. I imagine you would map off the comments collection and then store the reduced results in the posts collection. Am I on the right track?
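
    The idea in the last two sentences can be illustrated without a database. The sketch below simulates the map and reduce steps over in-memory dictionaries; it assumes each comment carries a `post_id` field and is not MongoDB driver code.

```python
from collections import defaultdict


def map_comment(comment):
    """Map step: emit (post_id, 1) for every comment."""
    yield comment["post_id"], 1


def reduce_counts(post_id, counts):
    """Reduce step: sum the emitted ones for a single post."""
    return sum(counts)


def comment_counts(comments):
    """Group the mapped pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for comment in comments:
        for key, value in map_comment(comment):
            grouped[key].append(value)
    return {key: reduce_counts(key, values) for key, values in grouped.items()}
```

    The resulting mapping of post id to count is what would then be written back to each post's `comment_count` field.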

  • Error when running a basic Hadoop code

    - by Abhishek Shivkumar
    I am running Hadoop code that has a partitioner class inside the job. But when I run the command

        hadoop jar Sort.jar SecondarySort inputdir outputdir

    I get a runtime error that says

        class KeyPartitioner not org.apache.hadoop.mapred.Partitioner

    I have ensured that the KeyPartitioner class extends the Partitioner class, so why am I getting this error? Here is the driver code:

        JobConf conf = new JobConf(getConf(), SecondarySort.class);
        conf.setJobName(SecondarySort.class.getName());
        conf.setJarByClass(SecondarySort.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        conf.setMapOutputKeyClass(StockKey.class);
        conf.setMapOutputValueClass(Text.class);
        conf.setPartitionerClass((Class<? extends Partitioner<StockKey, DoubleWritable>>) KeyPartitioner.class);
        conf.setMapperClass((Class<? extends Mapper<LongWritable, Text, StockKey, DoubleWritable>>) StockMapper.class);
        conf.setReducerClass((Class<? extends Reducer<StockKey, DoubleWritable, Text, Text>>) StockReducer.class);

    and here is the code of the partitioner class:

        public class KeyPartitioner extends Partitioner<StockKey, Text> {
            @Override
            public int getPartition(StockKey arg0, Text arg1, int arg2) {
                int partition = arg0.name.hashCode() % arg2;
                return partition;
            }
        }

  • Best way to do one-to-many "JOIN" in CouchDB

    - by mit
    There are CouchDB documents that are list elements:

        { "type" : "el", "id" : "1", "content" : "first" }
        { "type" : "el", "id" : "2", "content" : "second" }
        { "type" : "el", "id" : "3", "content" : "third" }

    There is one document that defines the list:

        { "type" : "list", "elements" : ["2","1"] , "id" : "abc123" }

    As you can see, the third element was deleted; it is no longer part of the list, so it must not be part of the result. Now I want a view that returns the content elements in the right order. The result could be:

        { "content" : ["second", "first"] }

    In this case the order of the elements is already as it should be. Another possible result:

        { "content" : [{"content" : "first", "order" : 2},{"content" : "second", "order" : 1}] }

    I started writing the map function:

        map = function (doc) {
          if (doc.type === 'el') {
            emit(doc.id, {"content" : doc.content}); //emit the id and the content
            exit;
          }
          if (doc.type === 'list') {
            for ( var i=0, l=doc.elements.length; i<l; ++i ){
              emit(doc.elements[i], { "order" : i }); //emit the id and the order
            }
          }
        }

    This is as far as I can get. Can you correct my mistakes and write a reduce function? Remember that the third document must not be part of the result. Of course you can write a different map function as well. But the structure of the documents (one defining list document and an element document for each entry) cannot be changed.
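
    To pin down the expected result independently of CouchDB, the join can be sketched in plain Python over the example documents. This is only a reference for the desired output, not a view definition; `build_list_content` is a hypothetical helper.

```python
def build_list_content(docs, list_id):
    """Return the contents of a list's elements, in the list's own order.

    Elements that the list document does not reference (such as the
    deleted third element) never appear in the result.
    """
    elements = {d["id"]: d["content"] for d in docs if d["type"] == "el"}
    list_doc = next(
        d for d in docs if d["type"] == "list" and d["id"] == list_id
    )
    return {
        "content": [elements[e] for e in list_doc["elements"] if e in elements]
    }
```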

  • How do I control output files name and content of an Hadoop streaming job?

    - by Eran Kampf
    Is there a way to control the output filenames of a Hadoop Streaming job? Specifically, I would like my job's output files' content and names to be organized by the key the reducer outputs: each file would only contain values for one key, and its name would be the key. Update: Just found the answer. Using a Java class that derives from MultipleOutputFormat as the job's output format allows control of the output file names. http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.html I haven't seen any samples for this out there... Can anyone point to a Hadoop Streaming sample that makes use of a custom output format Java class?

  • Hadoop Map Reduce job never finishes

    - by rohanbk
    I am running a Hadoop Map Reduce job using Python mapper and reducer scripts and Hadoop Streaming. Both my Map and Reduce stages run until they reach 100%, but the job doesn't end. I know that when things go sour, Hadoop will terminate the job, but in this case both stages reach 100% and just never end. Has anyone else encountered anything similar? Also, how do I debug my program to figure out where things are going wrong? If I use a smaller input file and just run something like:

        $> cat input_file | mapper.py | sort | reduce.py >> output_file

    everything works perfectly fine. However, when I use Hadoop, things don't work out.
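
    For reference when debugging, a streaming reducer generally reads sorted, tab-separated key/value lines from standard input until end of file and writes its results to standard output. The sketch below shows that shape for a summing reducer. The tab-separated integer format is an assumption, and this is not the asker's `reduce.py`.

```python
import sys
from itertools import groupby


def parse(lines):
    """Split each 'key<TAB>value' line into a (key, value) pair."""
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, value


def reduce_stream(lines):
    """Sum the integer values of each run of identical keys.

    Relies on the input being sorted by key, as Hadoop Streaming
    (or `sort` in the local pipeline) provides.
    """
    for key, group in groupby(parse(lines), key=lambda kv: kv[0]):
        yield key, sum(int(value) for _, value in group)


def run(stdin=sys.stdin, stdout=sys.stdout):
    """Entry point a reducer script would call."""
    for key, total in reduce_stream(stdin):
        stdout.write(f"{key}\t{total}\n")
```

    Because it processes one key group at a time, a reducer of this shape does not need to hold the whole input in memory.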

  • How to easily apply a function to a collection in C++

    - by Jesse Beder
    I'm storing images as arrays, templated based on the type of their elements, like Image<unsigned> or Image<float>, etc. Frequently, I need to perform operations on these images; for example, I might need to add two images, or square an image (elementwise), and so on. All of the operations are elementwise. I'd like to get as close as possible to writing things like:

        float Add(float a, float b) { return a+b; }
        Image<float> result = Add(img1, img2);

    and even better, things like

        complex ComplexCombine(float a, float b) { return complex(a, b); }
        Image<complex> result = ComplexCombine(img1, img2);

    or

        struct FindMax {
            unsigned currentMax;
            FindMax(): currentMax(0) {}
            void operator()(unsigned a) { if(a > currentMax) currentMax = a; }
        };

        FindMax findMax;
        findMax(img);
        findMax.currentMax; // now contains the maximum value of 'img'

    Now, I obviously can't exactly do that; I've written something so that I can call:

        Image<float> result = Apply(img1, img2, Add);

    but I can't seem to figure out a generic way for it to detect the return type of the function/function object passed, so my ComplexCombine example above is out; also, I have to write a new one for each number of arguments I'd like to pass (which seems inevitable). Any thoughts on how to achieve this (with as little boilerplate code as possible)?

  • CouchDB- basic grouping question

    - by dnolen
    I have a user document which has a groups field. This field is an array of group ids. I would like to write a view that returns (groupid as key) - (array of user docs as val). This mapping operation seems like a good beginning:

        function(doc) {
          var type = doc.type;
          var groups = doc.groups;
          if(type == "user" && groups.length > 0) {
            for(var i = 0; i < groups.length; i++) {
              emit(groups[i], doc);
            }
          }
        }

    But there's obviously something very wrong with my attempt at a reduce:

        function(key, values, rereduce) {
          var set = [];
          var seen = [];
          for(var i = 0; i < values.length; i++) {
            var _id = values[i]._id;
            if(seen.indexOf(_id) == -1) {
              seen.push(_id);
              set.push(values[i]);
            }
          }
          return set;
        }

    I'm running CouchDB 0.10dev. Any help appreciated.
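
    The grouping that the view is meant to produce can be written down in plain Python as a reference. The sketch simulates the map output as (group id, user id) rows and collects them per key, skipping duplicates; it is an illustration of the target result, not CouchDB code, and the function names are hypothetical.

```python
from collections import defaultdict


def map_user(doc):
    """Emit one (group_id, user_id) row per group a user belongs to."""
    if doc.get("type") == "user":
        for group_id in doc.get("groups", []):
            yield group_id, doc["_id"]


def users_by_group(docs):
    """Collect the emitted rows per key, keeping each user id once."""
    grouped = defaultdict(list)
    for doc in docs:
        for group_id, user_id in map_user(doc):
            if user_id not in grouped[group_id]:
                grouped[group_id].append(user_id)
    return dict(grouped)
```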

  • What are optimal strategies for using mapreduce and other applications on the same server?

    - by user45532
    I have two applications that I need to run continuously to process data: 1.) an app that processes and aggregates information from sources, and 2.) a mapreduce workflow that processes the above info. I've thought about either getting VPS hosting or getting my own inexpensive server and using Xen to split the resources of the server. Getting a quad-core box with 2 GB of RAM seems a lot cheaper than the grid options I've seen at Slicehost, Rackspace and others...

  • Oracle Big Data Software Downloads

    - by Mike.Hallett(at)Oracle-BI&EPM
    Companies have been making business decisions for decades based on transactional data stored in relational databases. Beyond that critical data is a potential treasure trove of less structured data: weblogs, social media, email, sensors, and photographs that can be mined for useful information. Oracle offers a broad integrated portfolio of products to help you acquire and organize these diverse data sources and analyze them alongside your existing data to find new insights and capitalize on hidden relationships.

    Oracle Big Data Connectors Downloads include:

        Oracle SQL Connector for Hadoop Distributed File System Release 2.1.0
        Oracle Loader for Hadoop Release 2.1.0
        Oracle Data Integrator Companion 11g
        Oracle R Connector for Hadoop v 2.1

    Oracle Big Data Documentation

    The Oracle Big Data solution offers an integrated portfolio of products to help you organize and analyze your diverse data sources alongside your existing data to find new insights and capitalize on hidden relationships.

        Oracle Big Data, Release 2.2.0 - E41604_01 zip (27.4 MB)
        Integrated Software and Big Data Connectors User's Guide HTML PDF

    Oracle Data Integrator (ODI) Application Adapter for Hadoop

    Apache Hadoop is designed to handle and process data that typically comes from non-relational data sources, in volumes beyond what relational databases handle. Typical processing in Hadoop includes data validation and transformations that are programmed as MapReduce jobs. Designing and implementing a MapReduce job usually requires expert programming knowledge. However, when you use Oracle Data Integrator with the Application Adapter for Hadoop, you do not need to write MapReduce jobs. Oracle Data Integrator uses Hive and the Hive Query Language (HiveQL), a SQL-like language for implementing MapReduce jobs. Employing familiar and easy-to-use tools and pre-configured knowledge modules (KMs), the application adapter provides the following capabilities:

        Loading data into Hadoop from the local file system and HDFS
        Performing validation and transformation of data within Hadoop
        Loading processed data from Hadoop to an Oracle database for further processing and generating reports

    Oracle Database Loader for Hadoop

    Oracle Loader for Hadoop is an efficient and high-performance loader for fast movement of data from a Hadoop cluster into a table in an Oracle database. It pre-partitions the data if necessary and transforms it into a database-ready format. Oracle Loader for Hadoop is a Java MapReduce application that balances the data across reducers to help maximize performance.

    Oracle R Connector for Hadoop

    Oracle R Connector for Hadoop is a collection of R packages that provide:

        Interfaces to work with Hive tables, the Apache Hadoop compute infrastructure, the local R environment, and Oracle database tables
        Predictive analytic techniques, written in R or Java as Hadoop MapReduce jobs, that can be applied to data in HDFS files

    You install and load this package as you would any other R package. Using simple R functions, you can perform tasks such as:

        Access and transform HDFS data using a Hive-enabled transparency layer
        Use the R language for writing mappers and reducers
        Copy data between R memory, the local file system, HDFS, Hive, and Oracle databases
        Schedule R programs to execute as Hadoop MapReduce jobs and return the results to any of those locations

    Oracle SQL Connector for Hadoop Distributed File System

    Using Oracle SQL Connector for HDFS, you can use an Oracle Database to access and analyze data residing in Hadoop in these formats:

        Data Pump files in HDFS
        Delimited text files in HDFS
        Hive tables

    For other file formats, such as JSON files, you can stage the input in Hive tables before using Oracle SQL Connector for HDFS. Oracle SQL Connector for HDFS uses external tables to provide Oracle Database with read access to Hive tables, and to delimited text files and Data Pump files in HDFS.

    Related Documentation

        Cloudera's Distribution Including Apache Hadoop Library HTML
        Oracle R Enterprise HTML
        Oracle NoSQL Database HTML

    Recent Blog Posts

        Big Data Appliance vs. DIY Price Comparison
        Big Data: Architecture Overview
        Big Data: Achieve the Impossible in Real-Time
        Big Data: Vertical Behavioral Analytics
        Big Data: In-Memory MapReduce
        Flume and Hive for Log Analytics
        Building Workflows in Oozie

  • How to setup Hadoop cluster so that it accepts mapreduce jobs from remote computers?

    - by drasto
    There is a computer I use for Hadoop map/reduce testing. This computer runs 4 Linux virtual machines (using Oracle VirtualBox). Each of them has Cloudera with Hadoop (distribution c3u4) installed and serves as a node of the Hadoop cluster. One of those 4 nodes is the master node running the namenode and jobtracker; the others are slave nodes. Normally I use this cluster from the local network for testing. However, when I try to access it from another network, I cannot send any jobs to it. The computer running the Hadoop cluster has a public IP and can be reached over the internet for other services. For example, I am able to reach the HDFS (namenode) administration site and the map/reduce (jobtracker) administration site (on ports 50070 and 50030 respectively) from a remote network. It is also possible to use Hue. Ports 8020 and 8021 are both allowed. What is blocking my map/reduce job submissions from reaching the cluster? Is there some setting that I must change first in order to be able to submit map/reduce jobs remotely? Here is my mapred-site.xml file:

        <configuration>
          <property>
            <name>mapred.job.tracker</name>
            <value>master:8021</value>
          </property>
          <!-- Enable Hue plugins -->
          <property>
            <name>mapred.jobtracker.plugins</name>
            <value>org.apache.hadoop.thriftfs.ThriftJobTrackerPlugin</value>
            <description>Comma-separated list of jobtracker plug-ins to be activated.</description>
          </property>
          <property>
            <name>jobtracker.thrift.address</name>
            <value>0.0.0.0:9290</value>
          </property>
        </configuration>

    And this is in the /etc/hosts file:

        192.168.1.15 master
        192.168.1.14 slave1
        192.168.1.13 slave2
        192.168.1.9 slave3

  • Live from ODTUG - Big Data and SQL session #2

    - by Jean-Pierre Dijcks
    Sitting in Dominic Delmolino's session at ODTUG (KScope 12). If the session count at conferences is any indication, then we will see more and more people start to deploy MapReduce in the database. And yes, that would be with SQL and PL/SQL first and foremost. Both Dominic and our own Bryn Llewellyn are doing MapReduce-in-the-database presentations. Since I have seen both, I would advise people to first look through Dominic's session to get a good grasp of what mappers do and what reducers do, then dive into Bryn's for a bunch of PL/SQL examples. The thing I like about Dominic's is the last slide (a recursive WITH statement) to do this in SQL... Now I am hoping that next year we will see tools vendors show off how they work with Hadoop and MapReduce (at least talking about the concepts!!).

  • Most useful parallel programming algorithm?

    - by Zubair
    I recently asked a question about parallel programming algorithms which was closed quite fast because I communicated my intent badly: http://stackoverflow.com/questions/2407631/what-is-the-most-useful-parallel-programming-algorithm-closed I had also recently asked another question: http://stackoverflow.com/questions/2407493/is-mapreduce-such-a-generalisation-of-another-programming-principle/2407570#2407570 That other question was specifically about map reduce, to see whether mapreduce is a more specific version of some other concept in parallel programming. This question (about a useful parallel programming algorithm) is more about the whole series of algorithms for parallel programming. You will have to excuse me though, as I am quite new to parallel programming, so maybe MapReduce, or something that is a more general form of mapreduce, is the "only" parallel programming construct available, in which case I apologise for my ignorance.

  • How to Set Up a Hadoop Cluster Using Oracle Solaris (Hands-On Lab)

    - by Orgad Kimchi
    Oracle Technology Network (OTN) published the "How to Set Up a Hadoop Cluster Using Oracle Solaris" OOW 2013 Hands-On Lab. This hands-on lab presents exercises that demonstrate how to set up an Apache Hadoop cluster using Oracle Solaris 11 technologies such as Oracle Solaris Zones, ZFS, and network virtualization. Key topics include the Hadoop Distributed File System (HDFS) and the Hadoop MapReduce programming model. We will also cover the Hadoop installation process and the cluster building blocks: NameNode, a secondary NameNode, and DataNodes. In addition, you will see how you can combine the Oracle Solaris 11 technologies for better scalability and data security, and you will learn how to load data into the Hadoop cluster and run a MapReduce job.

    Summary of Lab Exercises

    This hands-on lab consists of 13 exercises covering various Oracle Solaris and Apache Hadoop technologies:

        Install Hadoop.
        Edit the Hadoop configuration files.
        Configure the Network Time Protocol.
        Create the virtual network interfaces (VNICs).
        Create the NameNode and the secondary NameNode zones.
        Set up the DataNode zones.
        Configure the NameNode.
        Set up SSH.
        Format HDFS from the NameNode.
        Start the Hadoop cluster.
        Run a MapReduce job.
        Secure data at rest using ZFS encryption.
        Use Oracle Solaris DTrace for performance monitoring.

  • SQL analytical mash-ups deliver real-time WOW! for big data

    - by KLaker
    One of the overlooked capabilities of SQL as an analysis engine, because we all just take it for granted, is that you can mix and match analytical features to create some amazing mash-ups. As we move into the exciting world of big data these mash-ups can really deliver those "wow, I never knew that" moments. While Java is an incredibly flexible and powerful framework for managing big data, there are some significant challenges in using Java and MapReduce to drive your analysis to create these "wow" discoveries. One of these "wow" moments was demonstrated at this year's OpenWorld during Andy Mendelsohn's general keynote session.

    Here is the scenario: we are looking for fraudulent activities in our big data stream, and in this case we are identifying potentially fraudulent activities by looking for specific patterns. We are using geospatial tagging of each transaction so we can create a real-time fraud map for our business users. Where we start to move towards a "wow" moment is when we extend this basic use of spatial and pattern matching to incorporate spatial analytics within the SQL pattern matching clause. This allows us to compute the distance between transactions: we have extended our SQL pattern matching clause to use the location of each transaction and to calculate the distance between each transaction. This allows us to compare the time of the last transaction with the time of the current transaction and see if the distance between the two points is possible given the time frame. Obviously if I buy something in Florida from my favourite bike store (maybe a new carbon saddle for my Trek) and then 5 minutes later the system sees my credit card details being used in Arizona, there is a high probability that this transaction in Arizona is actually fraudulent (I am fast on my Trek but not that fast!) and we can flag this up in real-time on our dashboard.

    In this post I have used the term "real-time" a couple of times, and this is an important point and one of the key reasons why SQL really is the only language to use if you want to analyse big data. One of the most important questions that comes up in every big data project is: how do we do analysis? Many enlightened customers are now realising that using Java-MapReduce to deliver analysis does not result in "wow" moments. These "wow" moments only come with SQL because it offers a much richer environment, it is simpler to use and it is faster, which makes it possible to deliver real-time "Wow!". Andy's session included a slide showing the results of a comparison of Java-MapReduce vs. SQL pattern matching to deliver our "wow" moment during our live demo. The analytical mash-up "Wow" demo compares the power of 12c SQL pattern matching + spatial analytics vs. Java-MapReduce.

    You can get more information about SQL Pattern Matching on our SQL Analytics home page on OTN, see here: http://www.oracle.com/technetwork/database/bi-datawarehousing/sql-analytics-index-1984365.html. You can get more information about our spatial analytics here: http://www.oracle.com/technetwork/database-options/spatialandgraph/overview/index.html If you would like to watch the full Database 12c OOW presentation see here: http://medianetwork.oracle.com/video/player/2686974264001

  • MapRedux - PowerShell and Big Data

    - by Dittenhafer Solutions
    MapRedux – #PowerShell and #Big Data

    Have you been hearing about “big data”, “map reduce” and other large-scale computing terms over the past couple of years and been curious to dig into more detail? Have you read some of the Apache Hadoop online documentation and unfortunately concluded that it wasn't feasible to set up a “test” Hadoop environment on your machine? More recently, I have read about some of Microsoft’s work to enable Hadoop on the Azure cloud. Being a "Microsoft"-leaning technologist, I am more inclined to be successful with experimentation when on the Windows platform. Of course, it is not that I am "religious" about one set of technologies over another, but rather more experienced. Anyway, within the past couple of weeks I have been thinking about PowerShell a bit more as the 2012 PowerShell Scripting Games approach, and it occurred to me that PowerShell's support for Windows Remote Management (WinRM), and some other inherent features of PowerShell, might lend themselves particularly well to a simple implementation of the MapReduce framework. I fired up my PowerShell ISE and started writing just to see where it would take me. Quite simply, the ScriptBlock feature combined with the ability of Invoke-Command to create remote jobs on networked servers provides much of the plumbing of a distributed computing environment. There are some limiting factors of course. Microsoft provided some default settings which prevent PowerShell from taking over a network without administrative approval first. But even with just one adjustment, a given Windows-based machine can become a node in a MapReduce-style distributed computing environment. Ok, so enough introduction. Let's talk about the code. First, any machine that will participate as a remote "node" will need WinRM enabled for remote access, as shown below. This is not exactly practical for hundreds of intended nodes, but for one (or five) machines in a test environment it does just fine.
        C:\> winrm quickconfig
        WinRM is not set up to receive requests on this machine.
        The following changes must be made:

        Set the WinRM service type to auto start.
        Start the WinRM service.

        Make these changes [y/n]? y

    Alternatively, you could take the approach described in the Remotely enable PSRemoting post from the TechNet forum and use PowerShell to create remote scheduled tasks that will call Enable-PSRemoting on each intended node.

    Invoke-MapRedux

    Moving on, now that you have one or more remote "nodes" enabled, you can consider the actual Map and Reduce algorithms. Consider the following snippet:

        $MyMrResults = Invoke-MapRedux -MapReduceItem $Mr -ComputerName $MyNodes -DataSet $dataset -Verbose

    Invoke-MapRedux takes an instance of a MapReduceItem which references the Map and Reduce scriptblocks, an array of computer names which are the remote nodes, and the initial data set to be processed. As simple as that, you can start working with concepts of big data and the MapReduce paradigm. Now, how did we get there? I have published the initial version of my PsMapRedux PowerShell Module on GitHub. The PsMapRedux module provides the Invoke-MapRedux function described above. Feel free to browse the underlying code and even contribute to the project! In a later post, I plan to show some of the inner workings of the module, but for now let's move on to how the Map and Reduce functions are defined.

    Map

    Both the Map and Reduce functions need to follow a prescribed prototype. The prototype for a Map function in the MapRedux module is as follows: a simple scriptblock that takes one PsObject parameter and returns a hashtable. It is important to note that the PsObject $dataset parameter is a MapRedux custom object that has a "Data" property which offers an array of data to be processed by the Map function.

        $aMap = {
            Param ( [PsObject] $dataset )

            # Indicate the job is running on the remote node.
            Write-Host ($env:computername + "::Map");

            # The hashtable to return
            $list = @{};

            # ... Perform the mapping work and prepare the $list hashtable result with your custom PSObject...
            # ... The $dataset has a single 'Data' property which contains an array of data rows
            #     which is a subset of the originally submitted data set.

            # Return the hashtable (Key, PSObject)
            Write-Output $list;
        }

    Reduce

    Likewise, the Reduce function must follow a simple prototype, which takes a $key and a result $dataset from MapRedux's partitioning function (which joins the Map results by key). Again, the $dataset is a MapRedux custom object that has a "Data" property as described in the Map section.

        $aReduce = {
            Param ( [object] $key, [PSObject] $dataset )

            Write-Host ($env:computername + "::Reduce - Count: " + $dataset.Data.Count)

            # The hashtable to return
            $redux = @{};

            # Return
            Write-Output $redux;
        }

    All Together Now

    When everything is put together in a short example script, you implement your Map and Reduce functions, query for some starting data, build the MapReduxItem via New-MapReduxItem and call Invoke-MapRedux to get the process started:

        # Import the MapRedux and SQL Server providers
        Import-Module "MapRedux"
        Import-Module "sqlps" -DisableNameChecking

        # Query the database for a dataset
        Set-Location SQLSERVER:\sql\dbserver1\default\databases\myDb
        $query = "SELECT MyKey, Date, Value1 FROM BigData ORDER BY MyKey";
        Write-Host "Query: $query"
        $dataset = Invoke-SqlCmd -query $query

        # Build the Map function
        $MyMap = {
            Param ( [PsObject] $dataset )

            Write-Host ($env:computername + "::Map");
            $list = @{};
            foreach($row in $dataset.Data) {
                # Write-Host ("Key: " + $row.MyKey.ToString());
                if($list.ContainsKey($row.MyKey) -eq $true) {
                    $s = $list.Item($row.MyKey);
                    $s.Sum += $row.Value1;
                    $s.Count++;
                } else {
                    $s = New-Object PSObject;
                    $s | Add-Member -Type NoteProperty -Name MyKey -Value $row.MyKey;
                    $s | Add-Member -Type NoteProperty -Name Sum -Value $row.Value1;
                    # Count must exist before the $s.Count++ above can increment it.
                    $s | Add-Member -Type NoteProperty -Name Count -Value 1;
                    $list.Add($row.MyKey, $s);
                }
            }
            Write-Output $list;
        }

        $MyReduce = {
            Param ( [object] $key, [PSObject] $dataset )

            Write-Host ($env:computername + "::Reduce - Count: " + $dataset.Data.Count)
            $redux = @{};
            $sum = 0;
            $count = 0;
            foreach($s in $dataset.Data) {
                $sum += $s.Sum;
                # Add the rows behind each partial sum so the result is a true average.
                $count += $s.Count;
            }
            # Reduce
            $redux.Add($s.MyKey, $sum / $count);

            # Return
            Write-Output $redux;
        }

        # Create the item data
        $Mr = New-MapReduxItem "My Test MapReduce Job" $MyMap $MyReduce

        # Array of processing nodes...
        $MyNodes = ("node1", "node2", "node3", "node4", "localhost")

        # Run the Map Reduce routine...
        $MyMrResults = Invoke-MapRedux -MapReduceItem $Mr -ComputerName $MyNodes -DataSet $dataset -Verbose

        # Show the results
        Set-Location C:\
        $MyMrResults | Out-GridView

    Conclusion

    I hope you have seen through this article that PowerShell has a significant infrastructure available for distributed computing. While it does take some code to expose a MapReduce-style framework, much of the work is already done, and PowerShell could prove to be the easiest platform to develop and run big data jobs in your corporate data center, potentially in the Azure cloud, or certainly as an academic exercise at home or school. Follow me on Twitter to stay up to date on the continuing progress of my PowerShell MapRedux module, and thanks for reading! Daniel
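
    The averaging logic of the example can be checked outside PowerShell. The Python sketch below is an illustration under assumptions, not part of the PsMapRedux module: each node produces a (sum, count) partial per key, and the reduce step divides the total sum by the total count. The function names are hypothetical.

```python
from collections import defaultdict


def map_partial(rows):
    """Map step on one node: accumulate (sum, count) per key."""
    partial = {}
    for key, value in rows:
        total, count = partial.get(key, (0, 0))
        partial[key] = (total + value, count + 1)
    return partial


def reduce_average(partials_for_key):
    """Reduce step: combine the partials of one key into an average."""
    total = sum(s for s, _ in partials_for_key)
    count = sum(c for _, c in partials_for_key)
    return total / count


def run(rows_per_node):
    """Partition the per-node map results by key, then reduce each key."""
    by_key = defaultdict(list)
    for rows in rows_per_node:
        for key, partial in map_partial(rows).items():
            by_key[key].append(partial)
    return {key: reduce_average(parts) for key, parts in by_key.items()}
```

    Carrying the count alongside the sum matters: dividing by the number of partials instead of the number of rows would weight every node equally regardless of how many rows it processed.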

  • Big Data – Buzz Words: What is Hadoop – Day 6 of 21

    - by Pinal Dave
    In yesterday’s blog post we learned what NoSQL is. In this article we will take a quick look at one of the four most important buzz words that go around Big Data: Hadoop.

    What is Hadoop?

    Apache Hadoop is an open-source, free and Java-based software framework that offers a powerful distributed platform to store and manage Big Data. It is licensed under the Apache V2 license. It runs applications on large clusters of commodity hardware, and it processes thousands of terabytes of data on thousands of nodes. Hadoop is inspired by Google’s MapReduce and Google File System (GFS) papers. The major advantage of the Hadoop framework is that it provides reliability and high availability.

    What are the core components of Hadoop?

    There are two major components of the Hadoop framework, and each of them performs one of its important tasks. Hadoop MapReduce is the method of splitting a larger data problem into smaller chunks and distributing them to many different commodity servers. Each server has its own set of resources and processes its chunk locally. Once the commodity servers have processed the data, they send the results back collectively to the main server. This is effectively a process that lets us handle large data efficiently. (We will understand this in tomorrow’s blog post.) Hadoop Distributed File System (HDFS) is a virtual file system. There is a big difference between any other file system and Hadoop. When we move a file onto HDFS, it is automatically split into many small pieces. These small chunks of the file are replicated and stored on other servers (usually 3) for fault tolerance and high availability. (We will understand this in the day after tomorrow’s blog post.) Besides the above two core components, the Hadoop project also contains the following modules.
Hadoop Common: Common utilities for the other Hadoop modules Hadoop Yarn: A framework for job scheduling and cluster resource management There are a few other projects (like Pig, Hive) related to above Hadoop as well which we will gradually explore in later blog posts. A Multi-node Hadoop Cluster Architecture Now let us quickly see the architecture of the a multi-node Hadoop cluster. A small Hadoop cluster includes a single master node and multiple worker or slave node. As discussed earlier, the entire cluster contains two layers. One of the layer of MapReduce Layer and another is of HDFC Layer. Each of these layer have its own relevant component. The master node consists of a JobTracker, TaskTracker, NameNode and DataNode. A slave or worker node consists of a DataNode and TaskTracker. It is also possible that slave node or worker node is only data or compute node. The matter of the fact that is the key feature of the Hadoop. In this introductory blog post we will stop here while describing the architecture of Hadoop. In a future blog post of this 31 day series we will explore various components of Hadoop Architecture in Detail. Why Use Hadoop? There are many advantages of using Hadoop. Let me quickly list them over here: Robust and Scalable – We can add new nodes as needed as well modify them. Affordable and Cost Effective – We do not need any special hardware for running Hadoop. We can just use commodity server. Adaptive and Flexible – Hadoop is built keeping in mind that it will handle structured and unstructured data. Highly Available and Fault Tolerant – When a node fails, the Hadoop framework automatically fails over to another node. Why Hadoop is named as Hadoop? In year 2005 Hadoop was created by Doug Cutting and Mike Cafarella while working at Yahoo. Doug Cutting named Hadoop after his son’s toy elephant. Tomorrow In tomorrow’s blog post we will discuss Buzz Word – MapReduce. 
Reference: Pinal Dave (http://blog.sqlauthority.com)
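The split, process locally, and combine flow described above can be illustrated with a small single-process sketch. This is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen for illustration, and the "chunks" stand in for the pieces of a file that would live on different servers.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    """Run map on every chunk, group by key, then reduce each group."""
    mapped = chain.from_iterable(map_phase(c) for c in chunks)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    # Each string plays the role of one chunk stored on a different server.
    chunks = ["big data big cluster", "data node", "big node"]
    print(word_count(chunks))
```

In a real cluster the map calls would run on the servers holding each chunk and only the grouped intermediate pairs would travel over the network; here everything runs in one process so the three phases are easy to follow.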

    Read the article

  • Hadoop:Only master node does the work

    - by user287722
    I've set up a Hadoop 2.2 cluster with 1 master node (namenode and secondary namenode) and 3 slave nodes (datanode and namenode on each one). All of the machines use Linux Mint 64-bit. When I run my MapReduce program, written in Java, I can see that only the master node is using extra CPU and RAM. The slave nodes are not doing a thing. I've checked the logs from all of the namenodes and there is nothing wrong with the namenodes on the slave nodes. The Resource Manager is running and all of the slave nodes can see it. I used this tutorial, http://n0where.net/hadoop-2-2-multi-node-cluster-setup/, to configure my nodes. The datanodes are working in terms of distributed data storage, but I can't see any indication of distributed data processing. Do I have to configure the XML configuration files in some other way so that all of the machines process data while I'm running my MapReduce job?

    Read the article

  • Can someone clarify what this Joel On Software quote means?

    - by Bob
    I was reading Joel On Software today and ran across this quote: Without understanding functional programming, you can't invent MapReduce, the algorithm that makes Google so massively scalable. The terms Map and Reduce come from Lisp and functional programming. MapReduce is, in retrospect, obvious to anyone who remembers from their 6.001-equivalent programming class that purely functional programs have no side effects and are thus trivially parallelizable. What does he mean when he says functional programs have no side effects? And how does this make parallelizing trivial?
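One way to see the point of the quote is a minimal sketch like the one below. The names (`square`, `add`, `serial_sum_of_squares`, `parallel_sum_of_squares`) are made up for illustration. Because `square` reads nothing but its argument and changes nothing outside itself, the calls can run in any order, or at the same time, and the answer cannot change. A thread pool is used here only to keep the example simple; it shows that the result is unaffected, not that it runs faster.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def square(x):
    """Pure function: the result depends only on x; nothing else is touched."""
    return x * x


def add(a, b):
    """Pure and associative, so partial sums can be combined in any grouping."""
    return a + b


def serial_sum_of_squares(values):
    """Map then reduce, one element at a time."""
    return reduce(add, map(square, values), 0)


def parallel_sum_of_squares(values, workers=4):
    """Same computation, with the map step spread over several workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        squares = list(pool.map(square, values))
    return reduce(add, squares, 0)


if __name__ == "__main__":
    data = list(range(10))
    print(serial_sum_of_squares(data), parallel_sum_of_squares(data))
```

If `square` instead appended to a shared list or updated a global counter, the workers would interfere with each other and the program would need locks or some other coordination; that is the side effect the quote refers to.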

    Read the article

  • Can someone clarify what this Joel On Software quote means: (functional programs have no side effect

    - by Bob
    I was reading Joel On Software today and ran across this quote: Without understanding functional programming, you can't invent MapReduce, the algorithm that makes Google so massively scalable. The terms Map and Reduce come from Lisp and functional programming. MapReduce is, in retrospect, obvious to anyone who remembers from their 6.001-equivalent programming class that purely functional programs have no side effects and are thus trivially parallelizable. What does he mean when he says functional programs have no side effects? And how does this make parallelizing trivial?

    Read the article

  • Very basic question about Hadoop and compressed input files

    - by Luis Sisamon
    I have started to look into Hadoop. If my understanding is right, I could process a very big file and it would get split over different nodes. However, if the file is compressed, then the file could not be split and would need to be processed by a single node (effectively destroying the advantage of running a MapReduce job over a cluster of parallel machines). My question is: assuming the above is correct, is it possible to split a large file manually into fixed-size chunks, or daily chunks, compress them, and then pass a list of compressed input files to a MapReduce job?
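The manual approach asked about here could look roughly like the sketch below, which cuts a text file into pieces of a fixed number of lines and gzips each piece separately. The function name `split_and_compress`, the `part-00000.gz` naming, and the choice to split by line count rather than by bytes or by date are all assumptions for illustration. Splitting on line boundaries keeps every record whole inside one file, and since a gzip file is not splittable, each output file would typically be read by a single map task.

```python
import gzip
import os
import tempfile


def split_and_compress(src_path, out_dir, lines_per_chunk):
    """Write src_path as a series of gzip files holding whole lines only.

    Returns the list of chunk paths in order.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    out = None
    with open(src_path, "rb") as src:
        for i, line in enumerate(src):
            # Start a new compressed chunk every lines_per_chunk lines.
            if i % lines_per_chunk == 0:
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, f"part-{len(paths):05d}.gz")
                out = gzip.open(path, "wb")
                paths.append(path)
            out.write(line)
    if out is not None:
        out.close()
    return paths


if __name__ == "__main__":
    work_dir = tempfile.mkdtemp()
    source = os.path.join(work_dir, "input.txt")
    with open(source, "wb") as f:
        f.writelines(b"row %d\n" % n for n in range(10))
    for chunk_path in split_and_compress(source, os.path.join(work_dir, "out"), 4):
        print(chunk_path)
```

The resulting directory of chunk files can then be given to the job as its input path, so the number of files, rather than the size of one large archive, determines how much of the work can run in parallel.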

    Read the article
