Search Results

Search found 17 results on 1 pages for 'mahout'.

Page 1/1 | 1 

  • Mahout - Error when try out wikipedia exmaples

    - by Li'
    Note this post is similar to Caused by: java.lang.ClassNotFoundException: classpath but different error message. When I try to run Wikipedia Bayes Example from https://cwiki.apache.org/confluence/display/MAHOUT/Wikipedia+Bayes+Example When I ran the following command : lis-macbook-pro:mahout-distribution-0.8 Li$ mahout wikipediaXMLSplitter -d examples/temp/enwiki-latest-pages-articles10.xml -o wikipedia/chunks -c 64 I got error message: MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath. MAHOUT_LOCAL is set, running locally SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/Users/Li/File/Java/mahout-distribution-0.8/examples/target/mahout-examples-0.8-job.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/Users/Li/File/Java/mahout-distribution-0.8/examples/target/dependency/slf4j-jcl-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.JCLLoggerFactory] Oct 21, 2013 4:25:47 PM org.slf4j.impl.JCLLoggerAdapter warn WARNING: Unable to add class: wikipediaXMLSplitter java.lang.ClassNotFoundException: wikipediaXMLSplitter at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:171) at org.apache.mahout.driver.MahoutDriver.addClass(MahoutDriver.java:236) at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:127) I am using Hadoop 1.2 and Mahout 0.8. mahout-distribution-0.8/bin has been added to $PATH. $MAHOUT_LOCAL is set to "True", so it runs locally. I dont know why I got "Unable to add class: wikipediaXMLSplitter"

    Read the article

  • Is it worth purchasing Mahout in Action to get up to speed with Mahout, or are there other better sources?

    - by gab
    I'm currently a very casual user of Apache Mahout, and I'm considering purchasing the book Mahout in Action. Unfortunately, I'm having a really hard time getting an idea of how worth it this book is -- and seeing as it's a Manning Early Access Program book (and therefore only currently available as a beta-version e-book), I can't take a look myself in a bookstore. Can anyone recommend this as a good (or less good) guide to getting up to speed with Mahout, and/or other sources that can supplement the Mahout website?

    Read the article

  • Mahout - Clustering - "naming" the cluster elements

    - by Mark Bramnik
    I'm doing some research and I'm playing with Apache Mahout 0.6 My purpose is to build a system which will name different categories of documents based on user input. The documents are not known in advance and I don't know also which categories do I have while collecting these documents. But I do know, that all the documents in the model should belong to one of the predefined categories. For example: Lets say I've collected a N documents, that belong to 3 different groups : Politics Madonna (pop-star) Science fiction I don't know what document belongs to what category, but I know that each one of my N documents belongs to one of those categories (e.g. there are no documents about, say basketball among these N docs) So, I came up with the following idea: Apply mahout clustering (for example k-mean with k=3 on these documents) This should divide the N documents to 3 groups. This should be kind of my model to learn with. I still don't know which document really belongs to which group, but at least the documents are clustered now by group Ask the user to find any document in the web that should be about 'Madonna' (I can't show to the user none of my N documents, its a restriction). Then I want to measure 'similarity' of this document and each one of 3 groups. I expect to see that the measurement for similarity between user_doc and documents in Madonna group in the model will be higher than the similarity between the user_doc and documents about politics. I've managed to produce the cluster of documents using 'Mahout in Action' book. But I don't understand how should I use Mahout to measure similarity between the 'ready' cluster group of document and one given document. I thought about rerunning the cluster with k=3 for N+1 documents with the same centroids (in terms of k-mean clustering) and see whether where the new document falls, but maybe there is any other way to do that? Is it possible to do with Mahout or my idea is conceptually wrong? (example in terms of Mahout API would be really good) Thanks a lot and sorry for a long question (couldn't describe it better) Any help is highly appreciated P.S. This is not a home-work project :)

    Read the article

  • ClassNotFoundException error in implementing Bayesian algorithm in Apache Mahout on Hadoop

    - by Shweta
    Hi, I have a problem in executing the Bayesian algorithm in Mahout. I built it with Maven and the job file is in target directory. When run from terminal using hadoop, I'm getting the ClassNotFoundException error. What should be done? $HADOOP_HOME/bin/hadoop jar mahout-core-0.3-SNAPSHOT.job org.apache.mahout.classifier.bayes.mapreduce.bayes.bayesdriver -i test -o output Exception in thread "main" java.lang.ClassNotFoundException: org.apache.mahout.classifier.bayes.mapreduce.bayes.bayesdriver at java.net.URLClassLoader$1.run(URLClassLoader.java:200) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:188) at java.lang.ClassLoader.loadClass(ClassLoader.java:307) at java.lang.ClassLoader.loadClass(ClassLoader.java:252) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:247) at org.apache.hadoop.util.RunJar.main(RunJar.java:149)

    Read the article

  • identify documents from results of mahout clustering

    - by Tejas
    I am using mahout to cluster text documents indexed using solr. I have used the "text" field in the document to form vectors. Then I used the k-means driver in mahout for clustering and then the clusterdumper utility to dump the results. I am having difficulty in understanding the output results from the dumper. I could see the clusters formed with term vectors in those clusters. But how do I extract the documents from these clusters. I want the result to be the input documents appearing in different clusters.

    Read the article

  • How to use Mahout in a Windows environment?

    - by oopdemo
    I am trying to use Mahout in an application running on Windows. I want to build clusters from a lucene index using k-means. As soon as I have to create sequence files (creating vectors from a lucene index), I get a Hadoop-Exception, since Hadoop makes command line calls to programs unknown in a Windows environment (e.g. chmod). Running in Cygwin is not an option, since I want to be able to run the App from eclipse. So my question is is there a way to avoid having to create sequence files to retrieve my vectors from a lucene index? or is there a way to create sequence files in a Windows environment?

    Read the article

  • Mahout Naive Bayes Classifier for Items

    - by Nimesh Parikh
    Team, I am working on a project where i need to classify Items into certain category. I have a single file as input; which contains target variable and space separated features. My training data will look like Category Name [Tab] DataString Plumbing [Tab] Pipe Tap Plastic Pipe PVC Pipe Cold Water Line Hot Water Line Tee outlet up Elbow turned up Elbow turned down Gate valve Globe valve Paint [Tab] Ivory Black Burnt Umber Caput Mortuum Violet Earth Red Yellow Ochre Titanium White Cadmium Yellow Light Cadmium Yellow Deep Cloths [Tab] Shirt T-Shirt Pent Jeans Tee Cargo Well, I have really big set of Category. I have couple of question here am i using correct data for Training? If no then what should i use? Once I train and Test my model, what is next step? How can i use output? Please help me with this Thanks, Nimesh

    Read the article

  • Mahout Recommendations on Binary data

    - by Pranay Kumar
    Hi, I'm a newbie to mahout.My aim is to produce recommendations on binary user purchased data.So i applied item-item similarity model in computing top N recommendations for movie lens data assuming 1-3 ratings as a 0 and 4-5 ratings as a 1.Then i tried evaluating my recommendations with the ratings in the test-data but hardly there have been two or three matches from my top 20 recommendations to the top rated items in test data and no match for most users. So are my recommendations totally bad by nature or do i need to go for a different measure for evaluating my recommendations ? Please help me ! Thanks in advance. Pranay, 2nd yr ,UG student.

    Read the article

  • Colloborative filtering

    - by Pranay Kumar
    How can i use SVD algorithm in mahout for producing recommendations on explicit binary data-set (eg. a user purchased or not but no specific ratings ) in an e-commerce domain ? Also what algorithms aim at producing recommendations on such binary data-sets ? Thanks in advance. Pranay Kumar, 2nd yr,cse

    Read the article

  • How to remove large number of files/folders in linux

    - by user1745713
    We are using hadoop to split a table into smaller files to feed to mahout, but in the process, we created a huge amount of _temporary logs. we have an nfs mount for the hadoop volume so we can use all the linux commands to delete folders files, but we just can't get them to be deleted, here's what I've tried so far: hadoop fs -rmr /.../_temporary : hangs for hours and does nothing on nfs mount: rmr -rf /.../_temporary :hangs for hours and does nothing find . -name '*.*' -type f -delete : same as above the folders look like this (38 of these folders inside _temporary): drwxr-xr-x 319324 user user 319322 Oct 24 12:12 _attempt_201310221525_0404_r_000000_0 the content of these are actually folders, not files. each one of those 319322 folders has exactly one file inside. not sure why the do the logging this way. Any help is appreciated.

    Read the article

  • What's the best C# recommendation engine or framework?

    - by cDima
    Is there anyway to use the examples for the "My Media" Microsoft research project? My Media is a "dynamic personalization and recommendation software framework toolkit" ( http://www.mymediaproject.org ), but out of the box it doesn't provide a sample database (only a LINQ-to-SQL .dbml schema), I don't believe it will be easy to re-create by hand. I was hoping to understand recommendation engines and machine learning with this C#/.Net as a testbed, but without a simple quick start or db it seems impractical. Any suggestions? (I guess it's time to switch to Java with Apache's Mahout, Weka or something similar?)

    Read the article

  • KMeans clustering for more than 5 million vectors

    - by Wajih
    I have hit a real problem. I need to do some Kmeans clustering for 5 million vectors, each containing about 32 cols. I tried out Mahout which requires linux and I am on windows, I am restrained from using a Linux OS and any sort of simulator. Can anyone suggest a KMeans clustering algorithm that is scalable upto 5M vectors and can converge quickly? I have tested a few but they wont scale. Which means they are slow and take forever to complete. Thanks

    Read the article

  • Windows Azure Recipe: Big Data

    - by Clint Edmonson
    As the name implies, what we’re talking about here is the explosion of electronic data that comes from huge volumes of transactions, devices, and sensors being captured by businesses today. This data often comes in unstructured formats and/or too fast for us to effectively process in real time. Collectively, we call these the 4 big data V’s: Volume, Velocity, Variety, and Variability. These qualities make this type of data best managed by NoSQL systems like Hadoop, rather than by conventional Relational Database Management System (RDBMS). We know that there are patterns hidden inside this data that might provide competitive insight into market trends.  The key is knowing when and how to leverage these “No SQL” tools combined with traditional business such as SQL-based relational databases and warehouses and other business intelligence tools. Drivers Petabyte scale data collection and storage Business intelligence and insight Solution The sketch below shows one of many big data solutions using Hadoop’s unique highly scalable storage and parallel processing capabilities combined with Microsoft Office’s Business Intelligence Components to access the data in the cluster. Ingredients Hadoop – this big data industry heavyweight provides both large scale data storage infrastructure and a highly parallelized map-reduce processing engine to crunch through the data efficiently. Here are the key pieces of the environment: Pig - a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Mahout - a machine learning library with algorithms for clustering, classification and batch based collaborative filtering that are implemented on top of Apache Hadoop using the map/reduce paradigm. Hive - data warehouse software built on top of Apache Hadoop that facilitates querying and managing large datasets residing in distributed storage. Directly accessible to Microsoft Office and other consumers via add-ins and the Hive ODBC data driver. Pegasus - a Peta-scale graph mining system that runs in parallel, distributed manner on top of Hadoop and that provides algorithms for important graph mining tasks such as Degree, PageRank, Random Walk with Restart (RWR), Radius, and Connected Components. Sqoop - a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases. Flume - a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large log data amounts to HDFS. Database – directly accessible to Hadoop via the Sqoop based Microsoft SQL Server Connector for Apache Hadoop, data can be efficiently transferred to traditional relational data stores for replication, reporting, or other needs. Reporting – provides easily consumable reporting when combined with a database being fed from the Hadoop environment. Training These links point to online Windows Azure training labs where you can learn more about the individual ingredients described above. Hadoop Learning Resources (20+ tutorials and labs) Huge collection of resources for learning about all aspects of Apache Hadoop-based development on Windows Azure and the Hadoop and Windows Azure Ecosystems SQL Azure (7 labs) Microsoft SQL Azure delivers on the Microsoft Data Platform vision of extending the SQL Server capabilities to the cloud as web-based services, enabling you to store structured, semi-structured, and unstructured data. See my Windows Azure Resource Guide for more guidance on how to get started, including links web portals, training kits, samples, and blogs related to Windows Azure.

    Read the article

  • The Buzz at the JavaOne Bookstore

    - by Janice J. Heiss
    I found my way to the JavaOne bookstore, a hub of activity. Who says brick and mortar bookstores are dead? I asked what was hot and got two answers: Hadoop in Practice by Alex Holmes was doing well. And Scala for the Impatient by noted Java Champion Cay Horstmann also seemed to be a fast seller. Hadoop in PracticeHadoop is a framework that organizes large clusters of computers around a problem. It is touted as especially effective for large amounts of data, and is use such companies as  Facebook, Yahoo, Apple, eBay and LinkedIn. Hadoop in Practice collects nearly 100 Hadoop examples and presents them in a problem/solution format with step by step explanations of solutions and designs. It’s very much a participatory book intended to make developers more at home with Hadoop.The author, Alex Holmes, is a senior software engineer with more than 15 years of experience developing large-scale distributed Java systems. For the last four years, he has gained expertise in Hadoop solving Big Data problems across a number of projects. He has presented at JavaOne and Jazoon and is currently a technical lead at VeriSign.At this year’s JavaOne, he is presenting a session with VeriSign colleague, Karthik Shyamsunder called “Java: A Perfect Platform for Data Science” where they will explain how the Java platform has emerged as a perfect platform for practicing data science, and also talk about such technologies as Hadoop, Hive, Pig, HBase, Cassandra, and Mahout. Scala for the ImpatientSan Jose State University computer science professor and Java Champion Cay Horstmann is the principal author of the highly regarded Core Java. Scala for the Impatient is a basic, practical introduction to Scala for experienced programmers. Horstmann has a presentation summarizing the themes of his book on at his website. On the final page he offers an enticing summary of his conclusions:* Widespread dissatisfaction with Java + XML + IDEs               --Don't make me eat Elephant again * A separate language for every problem domain is not efficient               --It takes time to master the idioms* ”JavaScript Everywhere” isn't going to scale* Trend is towards languages with more expressive power, less boilerplate* Will Scala be the “one ring to rule them”?* Maybe              --If it succeeds in industry             --If student-friendly subsets and tools are created The popularity of both books echoed comments by IBM Distinguished Engineer Jason McGee who closed his part of the Sunday JavaOne keynote by pointing out that the use of Java in complex applications is increasingly being augmented by a host of other languages with strong communities around them – JavaScript, JRuby, Scala, Python and so forth. Java developers increasingly must know the strengths and weaknesses of such languages going forward.

    Read the article

  • Big Data – Various Learning Resources – How to Start with Big Data? – Day 20 of 21

    - by Pinal Dave
    In yesterday’s blog post we learned how to become a Data Scientist for Big Data. In this article we will go over various learning resources related to Big Data. In this series we have covered many of the most essential details about Big Data. At the beginning of this series, I have encouraged readers to send me questions. One of the most popular questions is - “I want to learn more about Big Data. Where can I learn it?” This is indeed a great question as there are plenty of resources out to learn about Big Data and it is indeed difficult to select on one resource to learn Big Data. Hence I decided to write here a few of the very important resources which are related to Big Data. Learn from Pluralsight Pluralsight is a global leader in high-quality online training for hardcore developers.  It has fantastic Big Data Courses and I started to learn about Big Data with the help of Pluralsight. Here are few of the courses which are directly related to Big Data. Big Data: The Big Picture Big Data Analytics with Tableau NoSQL: The Big Picture Understanding NoSQL Data Analysis Fundamentals with Tableau I encourage all of you start with this video course as they are fantastic fundamentals to learn Big Data. Learn from Apache Resources at Apache are single point the most authentic learning resources. If you want to learn fundamentals and go deep about every aspect of the Big Data, I believe you must understand various concepts in Apache’s library. I am pretty impressed with the documentation and I am personally referencing it every single day when I work with Big Data. I strongly encourage all of you to bookmark following all the links for authentic big data learning. Haddop - The Apache Hadoop® project develops open-source software for reliable, scalable, distributed computing. Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which include support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health such as heat maps and ability to view MapReduce, Pig and Hive applications visually along with features to diagnose their performance characteristics in a user-friendly manner. Avro: A data serialization system. Cassandra: A scalable multi-master database with no single points of failure. Chukwa: A data collection system for managing large distributed systems. HBase: A scalable, distributed database that supports structured data storage for large tables. Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying. Mahout: A Scalable machine learning and data mining library. Pig: A high-level data-flow language and execution framework for parallel computation. ZooKeeper: A high-performance coordination service for distributed applications. Learn from Vendors One of the biggest issues with about learning Big Data is setting up the environment. Every Big Data vendor has different environment request and there are lots of things require to set up Big Data framework. Many of the users do not start with Big Data as they are afraid about the resources required to set up framework as well as a time commitment. Here Hortonworks have created fantastic learning environment. They have created Sandbox with everything one person needs to learn Big Data and also have provided excellent tutoring along with it. Sandbox comes with a dozen hands-on tutorial that will guide you through the basics of Hadoop as well it contains the Hortonworks Data Platform. I think Hortonworks did a fantastic job building this Sandbox and Tutorial. Though there are plenty of different Big Data Vendors I have decided to list only Hortonworks due to their unique setup. Please leave a comment if there are any other such platform to learn Big Data. I will include them over here as well. Learn from Books There are indeed few good books out there which one can refer to learn Big Data. Here are few good books which I have read. I will update the list as I will learn more. Ethics of Big Data Balancing Risk and Innovation Big Data for Dummies Head First Data Analysis: A Learner’s Guide to Big Numbers, Statistics, and Good Decisions If you search on Amazon there are millions of the books but I think above three books are a great set of books and it will give you great ideas about Big Data. Once you go through above books, you will have a clear idea about what is the next step you should follow in this series. You will be capable enough to make the right decision for yourself. Tomorrow In tomorrow’s blog post we will wrap up this series of Big Data. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

    Read the article

1