hdfs - Page 3 - Developer IT

How much power supply do I need for my server, and could a shortage be causing my odd crashing?

- by dolan

I have 5 servers, all with similar hardware (i7, four 2tb 7200rpm drives, two 4tb 5400rpm drives, 430 watt power supply), and lately the machines have been freezing up. This has gotten worse in the last day or so, and I can't pinpoint any explanation. One recent change was adding the two 4tb hard drives. The crashes happen most often while running a large Hadoop job, so I was originally thinking the load was causing some issues, but last night one server just froze without any heavy load on the box (or so I think), other than HDFS (Hadoop's distributed file system) was probably rebalancing itself since two of the five nodes were offline. If I plugin a monitor and keyboard to one of these frozen machines, I can't get any response or feedback on the screen. Any ideas on possible points of failure and/or different logs I can look at to investigate? Thanks Edit: The systems are running Ubuntu 10.04 Edit 2: More on hardware: intel core i7-930 bloomfield 2.8ghz processor (quad core) 12gb (6 x 2gb) kingston ddr3 1333 ram antec earthwatts green 430 power supply msi x58m lga 1366 motherboard

Read the article

Hadoop Rolling Small files

- by Arenstar

I am running Hadoop on a project and need a suggestion. Generally by default Hadoop has a "block size" of around 64mb.. There is also a suggestion to not use many/small files.. I am currently having very very very small files being put into HDFS due to the application design of flume.. The problem is, that Hadoop <= 0.20 cannot append to files, whereby i have too many files for my map-reduce to function efficiently.. There must be a correct way to simply roll/merge roughly 100 files into one.. Therefore Hadoop is effectively reading 1 large file instead of 10 Any Suggestions??

Read the article

Best practice for administering a (hadoop) cluster

- by Alex

Dear all, I've recently been playing with Hadoop. I have a six node cluster up and running - with HDFS, and having run a number of MapRed jobs. So far, so good. However I'm now looking to do this more systematically and with a larger number of nodes. Our base system is Ubuntu and the current setup has been administered using apt (to install the correct java runtime) and ssh/scp (to propagate out the various conf files). This is clearly not scalable over time. Does anyone have any experience of good systems for administering (possibly slightly heterogenous: different disk sizes, different numbers of cpus on each node) hadoop clusters automagically? I would consider diskless boot - but imagine that with a large cluster, getting the cluster up and running might be bottle-necked on the machine serving the OS. Or some form of distributed debian apt to keep the machines native environment synchronised? And how do people successfully manage the conf files over a number of (potentially heterogenous) machines? Thanks very much in advance, Alex

Read the article

Stream tar.gz file from FTP server

- by linker

Here is the situation: I have a tar.gz file on a FTP server which can contain an arbitrary number of files. Now what I'm trying to accomplish is have this file streamed and uploaded to HDFS through a Hadoop job. The fact that it's Hadoop is not important, in the end what I need to do is write some shell script that would take this file form ftp with wget and write the output to a stream. The reason why I really need to use streams is that there will be a large number of these files, and each file will be huge. It's fairly easy to do if I have a gzipped file and I'm doing something like this: wget -O - "ftp://${user}:${pass}@${host}/$file" | zcat But I'm not even sure if this is possible for a tar.gz file, especially since there are mutliple files in the archive. I'm a bit confused on what direction to take for this, any help would be greatly appreciated.

Read the article

Big Data – Learning Basics of Big Data in 21 Days – Bookmark

- by Pinal Dave

Earlier this month I had a great time to write Bascis of Big Data series. This series received great response and lots of good comments I have received, I am going to follow up this basics series with further in-depth series in near future. Here is the consolidated blog post where you can find all the 21 days blog posts together. Bookmark this page for future reference. Big Data – Beginning Big Data – Day 1 of 21 Big Data – What is Big Data – 3 Vs of Big Data – Volume, Velocity and Variety – Day 2 of 21 Big Data – Evolution of Big Data – Day 3 of 21 Big Data – Basics of Big Data Architecture – Day 4 of 21 Big Data – Buzz Words: What is NoSQL – Day 5 of 21 Big Data – Buzz Words: What is Hadoop – Day 6 of 21 Big Data – Buzz Words: What is MapReduce – Day 7 of 21 Big Data – Buzz Words: What is HDFS – Day 8 of 21 Big Data – Buzz Words: Importance of Relational Database in Big Data World – Day 9 of 21 Big Data – Buzz Words: What is NewSQL – Day 10 of 21 Big Data – Role of Cloud Computing in Big Data – Day 11 of 21 Big Data – Operational Databases Supporting Big Data – RDBMS and NoSQL – Day 12 of 21 Big Data – Operational Databases Supporting Big Data – Key-Value Pair Databases and Document Databases – Day 13 of 21 Big Data – Operational Databases Supporting Big Data – Columnar, Graph and Spatial Database – Day 14 of 21 Big Data – Data Mining with Hive – What is Hive? – What is HiveQL (HQL)? – Day 15 of 21 Big Data – Interacting with Hadoop – What is PIG? – What is PIG Latin? – Day 16 of 21 Big Data – Interacting with Hadoop – What is Sqoop? – What is Zookeeper? – Day 17 of 21 Big Data – Basics of Big Data Analytics – Day 18 of 21 Big Data – How to become a Data Scientist and Learn Data Science? – Day 19 of 21 Big Data – Various Learning Resources – How to Start with Big Data? – Day 20 of 21 Big Data – Final Wrap and What Next – Day 21 of 21 Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

Read the article

Oracle Big Data Learning Library - Click on LEARN BY PRODUCT to Open Page

- by chberger

Oracle Big Data Learning Library... Learn about Oracle Big Data, Data Science, Learning Analytics, Oracle NoSQL Database, and more! Oracle Big Data Essentials Attend this Oracle University Course! Using Oracle NoSQL Database Attend this Oracle University class! Oracle and Big Data on OTN See the latest resource on OTN. Search Welcome Get Started Learn by Role Learn by Product Latest Additions Additional Resources Oracle Big Data Appliance Oracle Big Data and Data Science Basics Meeting the Challenge of Big Data Oracle Big Data Tutorial Video Series Oracle MoviePlex - a Big Data End-to-End Series of Demonstrations Oracle Big Data Overview Oracle Big Data Essentials Data Mining Oracle NoSQL Database Tutorial Videos Oracle NoSQL Database Tutorial Series Oracle NoSQL Database Release 2 New Features Using Oracle NoSQL Database Exalytics Enterprise Manager 12c R3: Manage Exalytics Setting Up and Running Summary Advisor on an E s Oracle R Enterprise Oracle R Enterprise Tutorial Series Oracle Big Data Connectors Integrate All Your Data with Oracle Big Data Connectors Using Oracle Direct Connector for HDFS to Read the Data from HDSF Using Oracle R Connector for Hadoop to Analyze Data Oracle NoSQL Database Oracle NoSQL Database Tutorial Videos Oracle NoSQL Database Tutorial Series Oracle NoSQL Database Release 2 New Features Using Oracle NoSQL Database eries Oracle Business Intelligence Enterprise Edition Oracle Business Intelligence Oracle BI 11g R1: Create Analyses and Dashboards - 4 day class Oracle BI Publisher 11g R1: Fundamentals - 3 day class Oracle BI 11g R1: Build Repositories - 5 day class

Read the article

Why Oracle Data Integrator for Big Data?

- by Mala Narasimharajan

Big Data is everywhere these days - but what exactly is it? It’s data that comes from a multitude of sources – not only structured data, but unstructured data as well. The sheer volume of data is mindboggling – here are a few examples of big data: climate information collected from sensors, social media information, digital pictures, log files, online video files, medical records or online transaction records. These are just a few examples of what constitutes big data. Embedded in big data is tremendous value and being able to manipulate, load, transform and analyze big data is key to enhancing productivity and competitiveness. The value of big data lies in its propensity for greater in-depth analysis and data segmentation -- in turn giving companies detailed information on product performance, customer preferences and inventory. Furthermore, by being able to store and create more data in digital form, “big data can unlock significant value by making information transparent and usable at much higher frequency." (McKinsey Global Institute, May 2011) Oracle's flagship product for bulk data movement and transformation, Oracle Data Integrator, is a critical component of Oracle’s Big Data strategy. ODI provides automation, bulk loading, and validation and transformation capabilities for Big Data while minimizing the complexities of using Hadoop. Specifically, the advantages of ODI in a Big Data scenario are due to pre-built Knowledge Modules that drive processing in Hadoop. This leverages the graphical UI to load and unload data from Hadoop, perform data validations and create mapping expressions for transformations. The Knowledge Modules provide a key jump-start and eliminate a significant amount of Hadoop development. Using Oracle Data Integrator together with Oracle Big Data Connectors, you can simplify the complexities of mapping, accessing, and loading big data (via NoSQL or HDFS) but also correlating your enterprise data – this correlation may require integrating across heterogeneous and standards-based environments, connecting to Oracle Exadata, or sourcing via a big data platform such as Oracle Big Data Appliance. To learn more about Oracle Data Integration and Big Data, download our resource kit to see the latest in whitepapers, webinars, downloads, and more… or go to our website on www.oracle.com/bigdata

Read the article

Hadoop WordCount example stuck at map 100% reduce 0%

- by Abhinav Sharma

[hadoop-1.0.2] ? hadoop jar hadoop-examples-1.0.2.jar wordcount /user/abhinav/input /user/abhinav/output Warning: $HADOOP_HOME is deprecated. ****hdfs://localhost:54310/user/abhinav/input 12/04/15 15:52:31 INFO input.FileInputFormat: Total input paths to process : 1 12/04/15 15:52:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 12/04/15 15:52:31 WARN snappy.LoadSnappy: Snappy native library not loaded 12/04/15 15:52:31 INFO mapred.JobClient: Running job: job_201204151241_0010 12/04/15 15:52:32 INFO mapred.JobClient: map 0% reduce 0% 12/04/15 15:52:46 INFO mapred.JobClient: map 100% reduce 0% I've set up hadoop on a single node using this guide (http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/#run-the-mapreduce-job) and I'm trying to run a provided example but I'm getting stuck at map 100% reduce 0%. What could be causing this?

Read the article

OpenStreetMap and Hadoop

- by portoalet

Hi, I need some ideas for a weekend project about Hadoop and OpenStreetMap. I have access to AWS EC2 instance with OpenStreetMap snapshot in my EBS volume. The OpenStreetMap data is in a PostgreSQL database. What kind of MapReduce function can be run on the OpenStreetMap data, assuming I can export them into xml format, and then place into HDFS ? In other words, I am having a brain cramp at the moment, and cannot think what kind of MapReduce operation that can extract valuable insight from the OpenStreetMap xml? (i.e. extract all the places designated as park or golf course. But this needs to be done once only, not continuously) Many Thanks

Read the article

How does Hadoop perform input splits?

- by Deepak Konidena

Hi, This is a conceptual question involving Hadoop/HDFS. Lets say you have a file containing 1 billion lines. And for the sake of simplicity, lets consider that each line is of the form <k,v> where k is the offset of the line from the beginning and value is the content of the line. Now, when we say that we want to run N map tasks, does the framework split the input file into N splits and run each map task on that split? or do we have to write a partitioning function that does the N splits and run each map task on the split generated? All i want to know is, whether the splits are done internally or do we have to split the data manually? More specifically, each time the map() function is called what are its Key key and Value val parameters? Thanks, Deepak

Read the article

Hadoop on windows server

- by Luca Martinetti

Hello, I'm thinking about using hadoop to process large text files on my existing windows 2003 servers (about 10 quad core machines with 16gb of RAM) The questions are: Is there any good tutorial on how to configure an hadoop cluster on windows? What are the requirements? java + cygwin + sshd ? Anything else? HDFS, does it play nice on windows? I'd like to use hadoop in streaming mode. Any advice, tool or trick to develop my own mapper / reducers in c#? What do you use for submitting and monitoring the jobs? Thanks

Read the article

Hadoop on Amazon EC2 : Job tracker not starting properly

- by Algorist

Hi, We are running Hadoop on Amazon EC2 cluster. We start the master, slaves and attach the ebs volumes and finally waiting for hadoop jobtracker, tasktracker etc to start and we have timeout of 3600 seconds. We are noticing 50% of the time that job tracker is not able to start before the timeout. Reason being, hdfs is not initialized properly and still in safemode and job tracker is unable to start. I noticed few connectivity issues between nodes on EC2 as I tried manually pinging slaves. Did anyone face similar issue and know how to solve this? Thank you Bala

Read the article

MySQL and Hadoop Integration - Unlocking New Insight

- by Mat Keep

“Big Data” offers the potential for organizations to revolutionize their operations. With the volume of business data doubling every 1.2 years, analysts and business users are discovering very real benefits when integrating and analyzing data from multiple sources, enabling deeper insight into their customers, partners, and business processes. As the world’s most popular open source database, and the most deployed database in the web and cloud, MySQL is a key component of many big data platforms, with Hadoop vendors estimating 80% of deployments are integrated with MySQL. The new Guide to MySQL and Hadoop presents the tools enabling integration between the two data platforms, supporting the data lifecycle from acquisition and organisation to analysis and visualisation / decision, as shown in the figure below The Guide details each of these stages and the technologies supporting them: Acquire: Through new NoSQL APIs, MySQL is able to ingest high volume, high velocity data, without sacrificing ACID guarantees, thereby ensuring data quality. Real-time analytics can also be run against newly acquired data, enabling immediate business insight, before data is loaded into Hadoop. In addition, sensitive data can be pre-processed, for example healthcare or financial services records can be anonymized, before transfer to Hadoop. Organize: Data is transferred from MySQL tables to Hadoop using Apache Sqoop. With the MySQL Binlog (Binary Log) API, users can also invoke real-time change data capture processes to stream updates to HDFS. Analyze: Multi-structured data ingested from multiple sources is consolidated and processed within the Hadoop platform. Decide: The results of the analysis are loaded back to MySQL via Apache Sqoop where they inform real-time operational processes or provide source data for BI analytics tools. So how are companies taking advantage of this today? As an example, on-line retailers can use big data from their web properties to better understand site visitors’ activities, such as paths through the site, pages viewed, and comments posted. This knowledge can be combined with user profiles and purchasing history to gain a better understanding of customers, and the delivery of highly targeted offers. Of course, it is not just in the web that big data can make a difference. Every business activity can benefit, with other common use cases including: - Sentiment analysis; - Marketing campaign analysis; - Customer churn modeling; - Fraud detection; - Research and Development; - Risk Modeling; - And more. As the guide discusses, Big Data is promising a significant transformation of the way organizations leverage data to run their businesses. MySQL can be seamlessly integrated within a Big Data lifecycle, enabling the unification of multi-structured data into common data platforms, taking advantage of all new data sources and yielding more insight than was ever previously imaginable. Download the guide to MySQL and Hadoop integration to learn more. I'd also be interested in hearing about how you are integrating MySQL with Hadoop today, and your requirements for the future, so please use the comments on this blog to share your insights.

Read the article

Fast Data: Go Big. Go Fast.

- by J Swaroop

Cross-posting Dain Hansen's excellent recap of the Big Data/Fast Data announcement during OOW: For those of you who may have missed it, today’s second full day of Oracle OpenWorld 2012 started with a rumpus. Joe Tucci, from EMC outlined the human face of big data with real examples of how big data is transforming our world. And no not the usual tried-and-true weblog examples, but real stories about taxi cab drivers in Singapore using big data to better optimize their routes as well as folks just trying to get a better hair cut. Next we heard from Thomas Kurian who talked at length about the important platform characteristics of Oracle’s Cloud and more specifically Oracle’s expanded Cloud Services portfolio. Especially interesting to our integration customers are the messaging support for Oracle’s Cloud applications. What this means is that now Oracle’s Cloud applications have a lightweight integration fabric that on-premise applications can communicate to it via REST-APIs using Oracle SOA Suite. It’s an important element to our strategy at Oracle that supports this idea that whether your requirements are for private or public, Oracle has a solution in the Cloud for all of your applications and we give you more deployment choice than any vendor. If this wasn’t enough to get the juices flowing, later that morning we heard from Hasan Rizvi who outlined in his Fusion Middleware session the four most important enterprise imperatives: Social, Mobile, Cloud, and a brand new one: Fast Data. Today, Rizvi made an important step in the definition of this term to explain that he believes it’s a convergence of four essential technology elements: Event Processing for event filtering, business rules – with Oracle Event Processing Data Transformation and Loading - with Oracle Data Integrator Real-time replication and integration – with Oracle GoldenGate Analytics and data discovery – with Oracle Business Intelligence Each of these four elements can be considered (and architect-ed) together on a single integrated platform that can help customers integrate any type of data (structured, semi-structured) leveraging new styles of big data technologies (MapReduce, HDFS, Hive, NoSQL) to process more volume and variety of data at a faster velocity with greater results. Fast data processing (and especially real-time) has always been our credo at Oracle with each one of these products in Fusion Middleware. For example, Oracle GoldenGate continues to be made even faster with the recent 11g R2 Release of Oracle GoldenGate which gives us some even greater optimization to Oracle Database with Integrated Capture, as well as some new heterogeneity capabilities. With Oracle Data Integrator with Big Data Connectors, we’re seeing much improved performance by running MapReduce transformations natively on Hadoop systems. And with Oracle Event Processing we’re seeing some remarkable performance with customers like NTT Docomo. Check out their upcoming session at Oracle OpenWorld on Wednesday to hear more how this customer is using Event processing and Big Data together. If you missed any of these sessions and keynotes, not to worry. There's on-demand versions available on the Oracle OpenWorld website. You can also checkout our upcoming webcast where we will outline some of these new breakthroughs in Data Integration technologies for Big Data, Cloud, and Real-time in more details.

Read the article

Windows Azure Recipe: Big Data

- by Clint Edmonson

As the name implies, what we’re talking about here is the explosion of electronic data that comes from huge volumes of transactions, devices, and sensors being captured by businesses today. This data often comes in unstructured formats and/or too fast for us to effectively process in real time. Collectively, we call these the 4 big data V’s: Volume, Velocity, Variety, and Variability. These qualities make this type of data best managed by NoSQL systems like Hadoop, rather than by conventional Relational Database Management System (RDBMS). We know that there are patterns hidden inside this data that might provide competitive insight into market trends. The key is knowing when and how to leverage these “No SQL” tools combined with traditional business such as SQL-based relational databases and warehouses and other business intelligence tools. Drivers Petabyte scale data collection and storage Business intelligence and insight Solution The sketch below shows one of many big data solutions using Hadoop’s unique highly scalable storage and parallel processing capabilities combined with Microsoft Office’s Business Intelligence Components to access the data in the cluster. Ingredients Hadoop – this big data industry heavyweight provides both large scale data storage infrastructure and a highly parallelized map-reduce processing engine to crunch through the data efficiently. Here are the key pieces of the environment: Pig - a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Mahout - a machine learning library with algorithms for clustering, classification and batch based collaborative filtering that are implemented on top of Apache Hadoop using the map/reduce paradigm. Hive - data warehouse software built on top of Apache Hadoop that facilitates querying and managing large datasets residing in distributed storage. Directly accessible to Microsoft Office and other consumers via add-ins and the Hive ODBC data driver. Pegasus - a Peta-scale graph mining system that runs in parallel, distributed manner on top of Hadoop and that provides algorithms for important graph mining tasks such as Degree, PageRank, Random Walk with Restart (RWR), Radius, and Connected Components. Sqoop - a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases. Flume - a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large log data amounts to HDFS. Database – directly accessible to Hadoop via the Sqoop based Microsoft SQL Server Connector for Apache Hadoop, data can be efficiently transferred to traditional relational data stores for replication, reporting, or other needs. Reporting – provides easily consumable reporting when combined with a database being fed from the Hadoop environment. Training These links point to online Windows Azure training labs where you can learn more about the individual ingredients described above. Hadoop Learning Resources (20+ tutorials and labs) Huge collection of resources for learning about all aspects of Apache Hadoop-based development on Windows Azure and the Hadoop and Windows Azure Ecosystems SQL Azure (7 labs) Microsoft SQL Azure delivers on the Microsoft Data Platform vision of extending the SQL Server capabilities to the cloud as web-based services, enabling you to store structured, semi-structured, and unstructured data. See my Windows Azure Resource Guide for more guidance on how to get started, including links web portals, training kits, samples, and blogs related to Windows Azure.

Read the article

Tackling Big Data Analytics with Oracle Data Integrator

- by Irem Radzik

v\:* {behavior:url(#default#VML);} o\:* {behavior:url(#default#VML);} w\:* {behavior:url(#default#VML);} .shape {behavior:url(#default#VML);} Normal 0 false false false EN-US X-NONE X-NONE MicrosoftInternetExplorer4 /* Style Definitions */ table.MsoNormalTable {mso-style-name:"Table Normal"; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-priority:99; mso-style-qformat:yes; mso-style-parent:""; mso-padding-alt:0in 5.4pt 0in 5.4pt; mso-para-margin:0in; mso-para-margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:10.0pt; font-family:"Times New Roman","serif"; mso-fareast-font-family:"Times New Roman";} By Mike Eisterer The term big data draws a lot of attention, but behind the hype there's a simple story. For decades, companies have been making business decisions based on transactional data stored in relational databases. Beyond that critical data, however, is a potential treasure trove of less structured data: weblogs, social media, email, sensors, and documents that can be mined for useful information. Companies are facing emerging technologies, increasing data volumes, numerous data varieties and the processing power needed to efficiently analyze data which changes with high velocity. Oracle offers the broadest and most integrated portfolio of products to help you acquire and organize these diverse data sources and analyze them alongside your existing data to find new insights and capitalize on hidden relationships Oracle Data Integrator Enterprise Edition(ODI) is critical to any enterprise big data strategy. ODI and the Oracle Data Connectors provide native access to Hadoop, leveraging such technologies as MapReduce, HDFS and Hive. Alongside with ODI’s metadata driven approach for extracting, loading and transforming data; companies may now integrate their existing data with big data technologies and deliver timely and trusted data to their analytic and decision support platforms. In this session, you’ll learn about ODI and Oracle Big Data Connectors and how, coupled together, they provide the critical integration with multiple big data platforms. Tackling Big Data Analytics with Oracle Data Integrator October 1, 2012 12:15 PM at MOSCONE WEST – 3005 For other data integration sessions at OpenWorld, please check our Focus-On document. If you are not able to attend OpenWorld, please check out our latest resources for Data Integration.

Read the article

Hadoop initscript askes password

- by Ramesh

I have installed hadoop on my ubuntu 12.04 single node .I am trying to execute an init script to make the hadoop run on start up but it asks password every time i execute. #!/bin/sh ### BEGIN INIT INFO # Provides: hadoop services # Required-Start: $network # Required-Stop: $network # Default-Start: 2 3 4 5 # Default-Stop: 0 1 6 # Description: Hadoop services # Short-Description: Enable Hadoop services including hdfs ### END INIT INFO PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin HADOOP_BIN=/home/naveen/softwares/hadoop-1.0.3/bin NAME=hadoop DESC=hadoop USER=naveen ROTATE_SUFFIX= test -x $HADOOP_BIN || exit 0 RETVAL=0 set -e cd / start_hadoop () { set +e su $USER -s /bin/sh -c $HADOOP_BIN/start-all.sh > /var/log/hadoop/startup_log case "$?" in 0) echo SUCCESS RETVAL=0 ;; 1) echo TIMEOUT - check /var/log/hadoop/startup_log RETVAL=1 ;; *) echo FAILED - check /var/log/hadoop/startup_log RETVAL=1 ;; esac set -e } stop_hadoop () { set +e if [ $RETVAL = 0 ] ; then su $USER -s /bin/sh -c $HADOOP_BIN/stop-all.sh > /var/log/hadoop/shutdown_log RETVAL=$? if [ $RETVAL != 0 ] ; then echo FAILED - check /var/log/hadoop/shutdown_log fi else echo No nodes running RETVAL=0 fi set -e } restart_hadoop() { stop_hadoop start_hadoop } case "$1" in start) echo -n "Starting $DESC: " start_hadoop echo "$NAME." ;; stop) echo -n "Stopping $DESC: " stop_hadoop echo "$NAME." ;; force-reload|restart) echo -n "Restarting $DESC: " restart_hadoop echo "$NAME." ;; *) echo "Usage: $0 {start|stop|restart|force-reload}" >&2 RETVAL=1 ;; esac exit $RETVAL Please tell me how to run hadoop without entering password.

Read the article

Distributed storage and computing

- by Tim van Elteren

Dear Serverfault community, After researching a number of distributed file systems for deployment in a production environment with the main purpose of performing both batch and real-time distributed computing I've identified the following list as potential candidates, mainly on maturity, license and support: Ceph Lustre GlusterFS HDFS FhGFS MooseFS XtreemFS The key properties that our system should exhibit: an open source, liberally licensed, yet production ready, e.g. a mature, reliable, community and commercially supported solution; ability to run on commodity hardware, preferably be designed for it; provide high availability of the data with the most focus on reads; high scalability, so operation over multiple data centres, possibly on a global scale; removal of single points of failure with the use of replication and distribution of (meta-)data, e.g. provide fault-tolerance. The sensitivity points that were identified, and resulted in the following questions, are: transparency to the processing layer / application with respect to data locality, e.g. know where data is physically located on a server level, mainly for resource allocation and fast processing, high performance, how can this be accomplished? Do you from experience know what solutions provide this transparency and to what extent? posix compliance, or conformance, is mentioned on the wiki pages of most of the above listed solutions. The question here mainly is, how relevant is support for the posix standard? Hadoop for example isn't posix compliant by design, what are the pro's and con's? what about the difference between synchronous and asynchronous opeartion of a distributed file system. Though a synchronous distributed file system has the preference because of reliability it also imposes certain limitations with respect to scalability. What would be, from your expertise, the way to go on this? I'm looking forward to your replies. Thanks in advance! :) With kind regards, Tim van Elteren

Read the article

Big Data – Buzz Words: What is MapReduce – Day 7 of 21

- by Pinal Dave

In yesterday’s blog post we learned what is Hadoop. In this article we will take a quick look at one of the four most important buzz words which goes around Big Data – MapReduce. What is MapReduce? MapReduce was designed by Google as a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Though, MapReduce was originally Google proprietary technology, it has been quite a generalized term in the recent time. MapReduce comprises a Map() and Reduce() procedures. Procedure Map() performance filtering and sorting operation on data where as procedure Reduce() performs a summary operation of the data. This model is based on modified concepts of the map and reduce functions commonly available in functional programing. The library where procedure Map() and Reduce() belongs is written in many different languages. The most popular free implementation of MapReduce is Apache Hadoop which we will explore tomorrow. Advantages of MapReduce Procedures The MapReduce Framework usually contains distributed servers and it runs various tasks in parallel to each other. There are various components which manages the communications between various nodes of the data and provides the high availability and fault tolerance. Programs written in MapReduce functional styles are automatically parallelized and executed on commodity machines. The MapReduce Framework takes care of the details of partitioning the data and executing the processes on distributed server on run time. During this process if there is any disaster the framework provides high availability and other available modes take care of the responsibility of the failed node. As you can clearly see more this entire MapReduce Frameworks provides much more than just Map() and Reduce() procedures; it provides scalability and fault tolerance as well. A typical implementation of the MapReduce Framework processes many petabytes of data and thousands of the processing machines. How do MapReduce Framework Works? A typical MapReduce Framework contains petabytes of the data and thousands of the nodes. Here is the basic explanation of the MapReduce Procedures which uses this massive commodity of the servers. Map() Procedure There is always a master node in this infrastructure which takes an input. Right after taking input master node divides it into smaller sub-inputs or sub-problems. These sub-problems are distributed to worker nodes. A worker node later processes them and does necessary analysis. Once the worker node completes the process with this sub-problem it returns it back to master node. Reduce() Procedure All the worker nodes return the answer to the sub-problem assigned to them to master node. The master node collects the answer and once again aggregate that in the form of the answer to the original big problem which was assigned master node. The MapReduce Framework does the above Map () and Reduce () procedure in the parallel and independent to each other. All the Map() procedures can run parallel to each other and once each worker node had completed their task they can send it back to master code to compile it with a single answer. This particular procedure can be very effective when it is implemented on a very large amount of data (Big Data). The MapReduce Framework has five different steps: Preparing Map() Input Executing User Provided Map() Code Shuffle Map Output to Reduce Processor Executing User Provided Reduce Code Producing the Final Output Here is the Dataflow of MapReduce Framework: Input Reader Map Function Partition Function Compare Function Reduce Function Output Writer In a future blog post of this 31 day series we will explore various components of MapReduce in Detail. MapReduce in a Single Statement MapReduce is equivalent to SELECT and GROUP BY of a relational database for a very large database. Tomorrow In tomorrow’s blog post we will discuss Buzz Word – HDFS. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

Read the article

Oracle Data Integration 12c: Simplified, Future-Ready, High-Performance Solutions

- by Thanos Terentes Printzios

In today’s data-driven business environment, organizations need to cost-effectively manage the ever-growing streams of information originating both inside and outside the firewall and address emerging deployment styles like cloud, big data analytics, and real-time replication. Oracle Data Integration delivers pervasive and continuous access to timely and trusted data across heterogeneous systems. Oracle is enhancing its data integration offering announcing the general availability of 12c release for the key data integration products: Oracle Data Integrator 12c and Oracle GoldenGate 12c, delivering Simplified and High-Performance Solutions for Cloud, Big Data Analytics, and Real-Time Replication. The new release delivers extreme performance, increase IT productivity, and simplify deployment, while helping IT organizations to keep pace with new data-oriented technology trends including cloud computing, big data analytics, real-time business intelligence. With the 12c release Oracle becomes the new leader in the data integration and replication technologies as no other vendor offers such a complete set of data integration capabilities for pervasive, continuous access to trusted data across Oracle platforms as well as third-party systems and applications. Oracle Data Integration 12c release addresses data-driven organizations’ critical and evolving data integration requirements under 3 key themes: Future-Ready Solutions : Supporting Current and Emerging Initiatives Extreme Performance : Even higher performance than ever before Fast Time-to-Value : Higher IT Productivity and Simplified Solutions With the new capabilities in Oracle Data Integrator 12c, customers can benefit from: Superior developer productivity, ease of use, and rapid time-to-market with the new flow-based mapping model, reusable mappings, and step-by-step debugger. Increased performance when executing data integration processes due to improved parallelism. Improved productivity and monitoring via tighter integration with Oracle GoldenGate 12c and Oracle Enterprise Manager 12c. Improved interoperability with Oracle Warehouse Builder which enables faster and easier migration to Oracle Data Integrator’s strategic data integration offering. Faster implementation of business analytics through Oracle Data Integrator pre-integrated with Oracle BI Applications’ latest release. Oracle Data Integrator also integrates simply and easily with Oracle Business Analytics tools, including OBI-EE and Oracle Hyperion. Support for loading and transforming big and fast data, enabled by integration with big data technologies: Hadoop, Hive, HDFS, and Oracle Big Data Appliance. Only Oracle GoldenGate provides the best-of-breed real-time replication of data in heterogeneous data environments. With the new capabilities in Oracle GoldenGate 12c, customers can benefit from: Simplified setup and management of Oracle GoldenGate 12c when using multiple database delivery processes via a new Coordinated Delivery feature for non-Oracle databases. Expanded heterogeneity through added support for the latest versions of major databases such as Sybase ASE v 15.7, MySQL NDB Clusters 7.2, and MySQL 5.6., as well as integration with Oracle Coherence. Enhanced high availability and data protection via integration with Oracle Data Guard and Fast-Start Failover integration. Enhanced security for credentials and encryption keys using Oracle Wallet. Real-time replication for databases hosted on public cloud environments supported by third-party clouds. Tight integration between Oracle Data Integrator 12c and Oracle GoldenGate 12c and other Oracle technologies, such as Oracle Database 12c and Oracle Applications, provides a number of benefits for organizations: Tight integration between Oracle Data Integrator 12c and Oracle GoldenGate 12c enables developers to leverage Oracle GoldenGate’s low overhead, real-time change data capture completely within the Oracle Data Integrator Studio without additional training. Integration with Oracle Database 12c provides a strong foundation for seamless private cloud deployments. Delivers real-time data for reporting, zero downtime migration, and improved performance and availability for Oracle Applications, such as Oracle E-Business Suite and ATG Web Commerce . Oracle’s data integration offering is optimized for Oracle Engineered Systems and is an integral part of Oracle’s fast data, real-time analytics strategy on Oracle Exadata Database Machine and Oracle Exalytics In-Memory Machine. Oracle Data Integrator 12c and Oracle GoldenGate 12c differentiate the new offering on data integration with these many new features. This is just a quick glimpse into Oracle Data Integrator 12c and Oracle GoldenGate 12c. Find out much more about the new release in the video webcast "Introducing 12c for Oracle Data Integration", where customer and partner speakers, including SolarWorld, BT, Rittman Mead will join us in launching the new release. Resource Kits Meet Oracle Data Integration 12c Discover what's new with Oracle Goldengate 12c Oracle EMEA DIS (Data Integration Solutions) Partner Community is available for all your questions, while additional partner focused webcasts will be made available through our blog here, so stay connected. For any questions please contact us at partner.imc-AT-beehiveonline.oracle-DOT-com Stay Connected Oracle Newsletters

Read the article

Fast Data: Go Big. Go Fast.

- by Dain C. Hansen

Normal 0 false false false EN-US X-NONE X-NONE MicrosoftInternetExplorer4 For those of you who may have missed it, today’s second full day of Oracle OpenWorld 2012 started with a rumpus. Joe Tucci, from EMC outlined the human face of big data with real examples of how big data is transforming our world. And no not the usual tried-and-true weblog examples, but real stories about taxi cab drivers in Singapore using big data to better optimize their routes as well as folks just trying to get a better hair cut. Next we heard from Thomas Kurian who talked at length about the important platform characteristics of Oracle’s Cloud and more specifically Oracle’s expanded Cloud Services portfolio. Especially interesting to our integration customers are the messaging support for Oracle’s Cloud applications. What this means is that now Oracle’s Cloud applications have a lightweight integration fabric that on-premise applications can communicate to it via REST-APIs using Oracle SOA Suite. It’s an important element to our strategy at Oracle that supports this idea that whether your requirements are for private or public, Oracle has a solution in the Cloud for all of your applications and we give you more deployment choice than any vendor. If this wasn’t enough to get the juices flowing, later that morning we heard from Hasan Rizvi who outlined in his Fusion Middleware session the four most important enterprise imperatives: Social, Mobile, Cloud, and a brand new one: Fast Data. Today, Rizvi made an important step in the definition of this term to explain that he believes it’s a convergence of four essential technology elements: Event Processing for event filtering, business rules – with Oracle Event Processing Data Transformation and Loading - with Oracle Data Integrator Real-time replication and integration – with Oracle GoldenGate Analytics and data discovery – with Oracle Business Intelligence Each of these four elements can be considered (and architect-ed) together on a single integrated platform that can help customers integrate any type of data (structured, semi-structured) leveraging new styles of big data technologies (MapReduce, HDFS, Hive, NoSQL) to process more volume and variety of data at a faster velocity with greater results. Fast data processing (and especially real-time) has always been our credo at Oracle with each one of these products in Fusion Middleware. For example, Oracle GoldenGate continues to be made even faster with the recent 11g R2 Release of Oracle GoldenGate which gives us some even greater optimization to Oracle Database with Integrated Capture, as well as some new heterogeneity capabilities. With Oracle Data Integrator with Big Data Connectors, we’re seeing much improved performance by running MapReduce transformations natively on Hadoop systems. And with Oracle Event Processing we’re seeing some remarkable performance with customers like NTT Docomo. Check out their upcoming session at Oracle OpenWorld on Wednesday to hear more how this customer is using Event processing and Big Data together. If you missed any of these sessions and keynotes, not to worry. There's on-demand versions available on the Oracle OpenWorld website. You can also checkout our upcoming webcast where we will outline some of these new breakthroughs in Data Integration technologies for Big Data, Cloud, and Real-time in more details. /* Style Definitions */ table.MsoNormalTable {mso-style-name:"Table Normal"; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-priority:99; mso-style-qformat:yes; mso-style-parent:""; mso-padding-alt:0in 5.4pt 0in 5.4pt; mso-para-margin:0in; mso-para-margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:10.0pt; font-family:"Calibri","sans-serif"; mso-bidi-font-family:"Times New Roman";}

Read the article

Why your Netapp is so slow...

- by Darius Zanganeh

Have you ever wondered why your Netapp FAS box is slow and doesn't perform well at large block workloads? In this blog entry I will give you a little bit of information that will probably help you understand why it’s so slow, why you shouldn't use it for applications that read and write in large blocks like 64k, 128k, 256k ++ etc.. Of course since I work for Oracle at this time, I will show you why the ZS3 storage boxes are excellent choices for these types of workloads. Netapp’s Fundamental Problem The fundamental problem you have running these workloads on Netapp is the backend block size of their WAFL file system. Every application block on a Netapp FAS ends up in a 4k chunk on a disk. Reference: Netapp TR-3001 Whitepaper Netapp has proven this lacking large block performance fact in at least two different ways. They have NEVER posted an SPC-2 Benchmark yet they have posted SPC-1 and SPECSFS, both recently. In 2011 they purchased Engenio to try and fill this GAP in their portfolio. Block Size Matters So why does block size matter anyways? Many applications use large block chunks of data especially in the Big Data movement. Some examples are SAS Business Analytics, Microsoft SQL, Hadoop HDFS is even 64MB! Now let me boil this down for you. If an application such MS SQL is writing data in a 64k chunk then before Netapp actually writes it on disk it will have to split it into 16 different 4k writes and 16 different disk IOPS. When the application later goes to read that 64k chunk the Netapp will have to again do 16 different disk IOPS. In comparison the ZS3 Storage Appliance can write in variable block sizes ranging from 512b to 1MB. So if you put the same MSSQL database on a ZS3 you can set the specific LUNs for this database to 64k and then when you do an application read/write it requires only a single disk IO. That is 16x faster! But, back to the problem with your Netapp, you will VERY quickly run out of disk IO and hit a wall. Now all arrays will have some fancy pre fetch algorithm and some nice cache and maybe even flash based cache such as a PAM card in your Netapp but with large block workloads you will usually blow through the cache and still need significant disk IO. Also because these datasets are usually very large and usually not dedupable they are usually not good candidates for an all flash system. You can do some simple math in excel and very quickly you will see why it matters. Here are a couple of READ examples using SAS and MSSQL. Assume these are the READ IOPS the application needs even after all the fancy cache and algorithms. Here is an example with 128k blocks. Notice the numbers of drives on the Netapp! Here is an example with 64k blocks You can easily see that the Oracle ZS3 can do dramatically more work with dramatically less drives. This doesn't even take into account that the ONTAP system will likely run out of CPU way before you get to these drive numbers so you be buying many more controllers. So with all that said, lets look at the ZS3 and why you should consider it for any workload your running on Netapp today. ZS3 World Record Price/Performance in the SPC-2 benchmark ZS3-2 is #1 in Price Performance $12.08ZS3-2 is #3 in Overall Performance 16,212 MBPS Note: The number one overall spot in the world is held by an AFA 33,477 MBPS but at a Price Performance of $29.79. A customer could purchase 2 x ZS3-2 systems in the benchmark with relatively the same performance and walk away with $600,000 in their pocket.

Read the article

Amazon Elastic MapReduce: Exception from FileSystem

- by S.N

Hi, I run my application using ruby client: ruby elastic-mapreduce -j j-20PEKMT9BRSUC --jar s3n://sakae55/lib/edu.cit.som.jar --main-class edu.cit.som.hadoop.SOMDriver --arg s3n://sakae55/repository/input/ecoli/ --arg s3n://sakae55/repository/output/ecoli/pl/ --arg s3n://sakae55/repository/data/ecoli/som.txt Then, I am seeing the following error: java.lang.IllegalArgumentException: This file system object (file:///) does not support access to the request path 'hdfs://i -10-195-207-230.ec2.internal:9000/mnt/var/lib/hadoop/tmp/mapred/system/job_201004221221_0017/job.jar' You possibly called Fi eSystem.get(conf) when you should of called FileSystem.get(uri, conf) to obtain a file system supporting your path. at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320) at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:52) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:416) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:259) at org.apache.hadoop.fs.FileSystem.isDirectory(FileSystem.java:676) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:200) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1184) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1160) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1132) at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:662) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:729) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1026) at edu.cit.som.hadoop.SOMDriver.runIteration(SOMDriver.java:106) at edu.cit.som.hadoop.SOMDriver.train(SOMDriver.java:69) at edu.cit.som.hadoop.SOMDriver.run(SOMDriver.java:52) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at edu.cit.som.hadoop.SOMDriver.main(SOMDriver.java:36) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:155) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68) I am not sure why the error references to "file:///" even though all the arguments I pass do not use the schema.

Read the article

Hadoop File Read

- by user3684584

Hadoop Distributed Cache Wordcount example in hadoop 2.2.0. Copied file into hdfs filesystem to be used inside setup of mapper class. protected void setup(Context context) throws IOException,InterruptedException { Path[] uris = DistributedCache.getLocalCacheFiles(context.getConfiguration()); cacheData=new HashMap(); for(Path urifile: uris) { try { BufferedReader readBuffer1 = new BufferedReader(new FileReader(urifile.toString())); String line; while ((line=readBuffer1.readLine())!=null) { System.out.println("**************"+line); cacheData.put(line,line); } readBuffer1.close(); } catch (Exception e) { System.out.println(e.toString()); } } } Inside Driver Main class Configuration conf = new Configuration(); String[] otherArgs = new GenericOptionsParser(conf,args).getRemainingArgs(); if (otherArgs.length != 3) { System.err.println("Usage: wordcount <in> <out>"); System.exit(2); } Job job = new Job(conf, "word_count"); job.setJarByClass(WordCount.class); job.setMapperClass(Map.class); job.setReducerClass(Reduce.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(otherArgs[0])); Path outputpath=new Path(otherArgs[1]); outputpath.getFileSystem(conf).delete(outputpath,true); FileOutputFormat.setOutputPath(job,outputpath); System.out.println("CachePath****************"+otherArgs[2]); DistributedCache.addCacheFile(new URI(otherArgs[2]),job.getConfiguration()); System.exit(job.waitForCompletion(true) ? 0 : 1); But getting exception java.io.FileNotFoundException: file:/home/user12/tmp/mapred/local/1408960542382/cache (No such file or directory) So Cache functionality not working properly. Any Idea ?

Read the article

Hive performance increase

- by Sagar Nikam

I am dealing with a database (2.5 GB) having some tables only 40 row to some having 9 million rows data. when I am doing any query for large table it takes more time. I want results in less time small query on table which have 90 rows only-- hive> select count(*) from cidade; Time taken: 50.172 seconds hdfs-site.xml <configuration> <property> <name>dfs.replication</name> <value>3</value> <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time. </description> </property> <property> <name>dfs.block.size</name> <value>131072</value> <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time. </description> </property> </configuration> does these setting affects performance of hive? dfs.replication=3 dfs.block.size=131072 can i set it from hive prompt as hive>set dfs.replication=5 Is this value remains for a perticular session only ? or Is it better to change it in .xml file ?

Search Results

Search found 80 results on 4 pages for 'hdfs'.

Page 3/4 | < Previous Page | 1 2 3 4 | Next Page >

- by dolan

- by Arenstar

- by Alex

- by linker

- by Pinal Dave

- by chberger

- by Mala Narasimharajan

- by Abhinav Sharma

- by portoalet

- by Deepak Konidena

- by Luca Martinetti

- by Algorist

- by Mat Keep

- by J Swaroop

- by Clint Edmonson

- by Irem Radzik

- by Ramesh

- by Tim van Elteren

- by Pinal Dave

- by Thanos Terentes Printzios

- by Dain C. Hansen

- by Darius Zanganeh

- by S.N

- by user3684584

- by Sagar Nikam

< Previous Page | 1 2 3 4 | Next Page >