Search Results

Search found 80 results on 4 pages for 'hdfs'.

Page 1/4 | 1 2 3 4  | Next Page >

  • Combining HBase and HDFS results in Exception in makeDirOnFileSystem

    - by utrecht
    Introduction An attempt to combine HBase and HDFS results in the following: 2014-06-09 00:15:14,777 WARN org.apache.hadoop.hbase.HBaseFileSystem: Create Dir ectory, retries exhausted 2014-06-09 00:15:14,780 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown. java.io.IOException: Exception in makeDirOnFileSystem at org.apache.hadoop.hbase.HBaseFileSystem.makeDirOnFileSystem(HBaseFile System.java:136) at org.apache.hadoop.hbase.master.MasterFileSystem.checkRootDir(MasterFi leSystem.java:428) at org.apache.hadoop.hbase.master.MasterFileSystem.createInitialFileSyst emLayout(MasterFileSystem.java:148) at org.apache.hadoop.hbase.master.MasterFileSystem.<init>(MasterFileSyst em.java:133) at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.j ava:572) at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:432) at java.lang.Thread.run(Thread.java:744) Caused by: org.apache.hadoop.security.AccessControlException: Permission denied: user=hbase, access=WRITE, inode="/":vagrant:supergroup:drwxr-xr-x at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPe rmissionChecker.java:224) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPe rmissionChecker.java:204) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermi ssion(FSPermissionChecker.java:149) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(F SNamesystem.java:4891) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(F SNamesystem.java:4873) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAcce ss(FSNamesystem.java:4847) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FS Namesystem.java:3192) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNames ystem.java:3156) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesyst em.java:3137) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameN odeRpcServer.java:669) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTra nslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:419) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$Cl ientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:4497 0) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.cal l(ProtobufRpcEngine.java:453) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1752) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1748) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInforma tion.java:1438) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1746) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstruct orAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingC onstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:408) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteExce ption.java:90) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteExc eption.java:57) at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2153) at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2122) at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSy stem.java:545) at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1915) at org.apache.hadoop.hbase.HBaseFileSystem.makeDirOnFileSystem(HBaseFile System.java:129) ... 6 more while configuration and system settings are as follows: [vagrant@localhost hadoop-hdfs]$ hadoop fs -ls hdfs://localhost/ Found 1 items -rw-r--r-- 3 vagrant supergroup 1010827264 2014-06-08 19:01 hdfs://localhost/u buntu-14.04-desktop-amd64.iso [vagrant@localhost hadoop-hdfs]$ /etc/hadoop/conf/core-site.xml <configuration> <property> <name>fs.defaultFS</name> <value>hdfs://localhost:8020</value> </property> </configuration> /etc/hbase/conf/hbase-site.xml <configuration> <property> <name>hbase.rootdir</name> <value>hdfs://localhost:8020/hbase</value> </property> <property> <name>hbase.cluster.distributed</name> <value>true</value> </property> </configuration> /etc/hadoop/conf/hdfs-site.xml <configuration> <property> <name>dfs.name.dir</name> <value>/var/lib/hadoop-hdfs/cache</value> </property> <property> <name>dfs.data.dir</name> <value>/tmp/hellodatanode</value> </property> </configuration> NameNode directory permissions [vagrant@localhost hadoop-hdfs]$ ls -ltr /var/lib/hadoop-hdfs/cache total 8 -rwxrwxrwx. 1 hbase hdfs 15 Jun 8 23:43 in_use.lock drwxrwxrwx. 2 hbase hdfs 4096 Jun 8 23:43 current [vagrant@localhost hadoop-hdfs]$ HMaster is able to start if fs.defaultFS property has been commented in core-site.xml NameNode is listening [vagrant@localhost hadoop-hdfs]$ netstat -nato | grep 50070 tcp 0 0 0.0.0.0:50070 0.0.0.0:* LIST EN off (0.00/0/0) tcp 0 0 33.33.33.33:50070 33.33.33.1:57493 ESTA BLISHED off (0.00/0/0) and accessible by navigating to http://33.33.33.33:50070/dfshealth.jsp. Question How to solve makeDirOnFileSystem exception and let HBase connect to HDFS?

    Read the article

  • Hadoop hdfs namenode is throwing an error

    - by KarmicDice
    Full list of error: hb@localhost:/etc/hadoop/conf$ sudo service hadoop-hdfs-namenode start * Starting Hadoop namenode: starting namenode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-namenode-localhost.out 12/09/10 14:41:09 INFO namenode.NameNode: STARTUP_MSG: /************************************************************ STARTUP_MSG: Starting NameNode STARTUP_MSG: host = localhost/127.0.0.1 STARTUP_MSG: args = [] STARTUP_MSG: version = 2.0.0-cdh4.0.1 STARTUP_MSG: classpath = /etc/hadoop/conf:/usr/lib/hadoop/lib/xmlenc-0.52.jar:/usr/lib/hadoop/lib/protobuf-java-2.4.0a.jar:/usr/lib/hadoop/lib/kfs-0.3.jar:/usr/lib/hadoop/lib/asm-3.2.jar:/usr/lib/hadoop/lib/commons-logging-api-1.1.jar:/usr/lib/hadoop/lib/jasper-compiler-5.5.23.jar:/usr/lib/hadoop/lib/stax-api-1.0.1.jar:/usr/lib/hadoop/lib/commons-configuration-1.6.jar:/usr/lib/hadoop/lib/jets3t-0.6.1.jar:/usr/lib/hadoop/lib/jersey-server-1.8.jar:/usr/lib/hadoop/lib/oro-2.0.8.jar:/usr/lib/hadoop/lib/aspectjrt-1.6.5.jar:/usr/lib/hadoop/lib/json-simple-1.1.jar:/usr/lib/hadoop/lib/snappy-java-1.0.3.2.jar:/usr/lib/hadoop/lib/commons-httpclient-3.1.jar:/usr/lib/hadoop/lib/log4j-1.2.15.jar:/usr/lib/hadoop/lib/servlet-api-2.5.jar:/usr/lib/hadoop/lib/jackson-xc-1.8.8.jar:/usr/lib/hadoop/lib/jersey-json-1.8.jar:/usr/lib/hadoop/lib/jackson-mapper-asl-1.8.8.jar:/usr/lib/hadoop/lib/commons-el-1.0.jar:/usr/lib/hadoop/lib/slf4j-api-1.6.1.jar:/usr/lib/hadoop/lib/commons-collections-3.2.1.jar:/usr/lib/hadoop/lib/commons-logging-1.1.1.jar:/usr/lib/hadoop/lib/jackson-core-asl-1.8.8.jar:/usr/lib/hadoop/lib/jersey-core-1.8.jar:/usr/lib/hadoop/lib/commons-codec-1.4.jar:/usr/lib/hadoop/lib/jsr305-1.3.9.jar:/usr/lib/hadoop/lib/commons-cli-1.2.jar:/usr/lib/hadoop/lib/activation-1.1.jar:/usr/lib/hadoop/lib/jaxb-impl-2.2.3-1.jar:/usr/lib/hadoop/lib/jetty-util-6.1.26.cloudera.1.jar:/usr/lib/hadoop/lib/jasper-runtime-5.5.23.jar:/usr/lib/hadoop/lib/commons-beanutils-1.7.0.jar:/usr/lib/hadoop/lib/commons-lang-2.5.jar:/usr/lib/hadoop/lib/commons-digester-1.8.jar:/usr/lib/hadoop/lib/commons-io-2.1.jar:/usr/lib/hadoop/lib/jsp-api-2.1.jar:/usr/lib/hadoop/lib/guava-11.0.2.jar:/usr/lib/hadoop/lib/jetty-6.1.26.cloudera.1.jar:/usr/lib/hadoop/lib/jsch-0.1.42.jar:/usr/lib/hadoop/lib/zookeeper-3.4.3-cdh4.0.1.jar:/usr/lib/hadoop/lib/avro-1.5.4.jar:/usr/lib/hadoop/lib/core-3.1.1.jar:/usr/lib/hadoop/lib/paranamer-2.3.jar:/usr/lib/hadoop/lib/jettison-1.1.jar:/usr/lib/hadoop/lib/jackson-jaxrs-1.8.8.jar:/usr/lib/hadoop/lib/slf4j-log4j12-1.6.1.jar:/usr/lib/hadoop/lib/commons-beanutils-core-1.8.0.jar:/usr/lib/hadoop/lib/commons-net-3.1.jar:/usr/lib/hadoop/lib/jaxb-api-2.2.2.jar:/usr/lib/hadoop/lib/commons-math-2.1.jar:/usr/lib/hadoop/lib/jline-0.9.94.jar:/usr/lib/hadoop/.//hadoop-annotations.jar:/usr/lib/hadoop/.//hadoop-annotations-2.0.0-cdh4.0.1.jar:/usr/lib/hadoop/.//hadoop-common-2.0.0-cdh4.0.1.jar:/usr/lib/hadoop/.//hadoop-auth-2.0.0-cdh4.0.1.jar:/usr/lib/hadoop/.//hadoop-common.jar:/usr/lib/hadoop/.//hadoop-auth.jar:/usr/lib/hadoop/.//hadoop-common-2.0.0-cdh4.0.1-tests.jar:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/protobuf-java-2.4.0a.jar:/usr/lib/hadoop-hdfs/lib/snappy-java-1.0.3.2.jar:/usr/lib/hadoop-hdfs/lib/log4j-1.2.15.jar:/usr/lib/hadoop-hdfs/lib/jackson-mapper-asl-1.8.8.jar:/usr/lib/hadoop-hdfs/lib/slf4j-api-1.6.1.jar:/usr/lib/hadoop-hdfs/lib/commons-logging-1.1.1.jar:/usr/lib/hadoop-hdfs/lib/jackson-core-asl-1.8.8.jar:/usr/lib/hadoop-hdfs/lib/commons-daemon-1.0.3.jar:/usr/lib/hadoop-hdfs/lib/zookeeper-3.4.3-cdh4.0.1.jar:/usr/lib/hadoop-hdfs/lib/avro-1.5.4.jar:/usr/lib/hadoop-hdfs/lib/paranamer-2.3.jar:/usr/lib/hadoop-hdfs/lib/jline-0.9.94.jar:/usr/lib/hadoop-hdfs/.//hadoop-hdfs-2.0.0-cdh4.0.1-tests.jar:/usr/lib/hadoop-hdfs/.//hadoop-hdfs-2.0.0-cdh4.0.1.jar:/usr/lib/hadoop-hdfs/.//hadoop-hdfs.jar:/usr/lib/hadoop-yarn/lib/protobuf-java-2.4.0a.jar:/usr/lib/hadoop-yarn/lib/asm-3.2.jar:/usr/lib/hadoop-yarn/lib/netty-3.2.3.Final.jar:/usr/lib/hadoop-yarn/lib/javax.inject-1.jar:/usr/lib/hadoop-yarn/lib/jersey-server-1.8.jar:/usr/lib/hadoop-yarn/lib/jersey-guice-1.8.jar:/usr/lib/hadoop-yarn/lib/snappy-java-1.0.3.2.jar:/usr/lib/hadoop-yarn/lib/log4j-1.2.15.jar:/usr/lib/hadoop-yarn/lib/guice-3.0.jar:/usr/lib/hadoop-yarn/lib/jackson-mapper-asl-1.8.8.jar:/usr/lib/hadoop-yarn/lib/junit-4.8.2.jar:/usr/lib/hadoop-yarn/lib/jackson-core-asl-1.8.8.jar:/usr/lib/hadoop-yarn/lib/jersey-core-1.8.jar:/usr/lib/hadoop-yarn/lib/jdiff-1.0.9.jar:/usr/lib/hadoop-yarn/lib/guice-servlet-3.0.jar:/usr/lib/hadoop-yarn/lib/aopalliance-1.0.jar:/usr/lib/hadoop-yarn/lib/commons-io-2.1.jar:/usr/lib/hadoop-yarn/lib/avro-1.5.4.jar:/usr/lib/hadoop-yarn/lib/paranamer-2.3.jar:/usr/lib/hadoop-yarn/.//hadoop-yarn-server-web-proxy.jar:/usr/lib/hadoop-yarn/.//hadoop-yarn-server-nodemanager.jar:/usr/lib/hadoop-yarn/.//hadoop-yarn-server-resourcemanager-2.0.0-cdh4.0.1.jar:/usr/lib/hadoop-yarn/.//hadoop-yarn-server-common.jar:/usr/lib/hadoop-yarn/.//hadoop-yarn-common.jar:/usr/lib/hadoop-yarn/.//hadoop-yarn-applications-distributedshell-2.0.0-cdh4.0.1.jar:/usr/lib/hadoop-yarn/.//hadoop-yarn-server-web-proxy-2.0.0-cdh4.0.1.jar:/usr/lib/hadoop-yarn/.//hadoop-yarn-api.jar:/usr/lib/hadoop-yarn/.//hadoop-yarn-server-resourcemanager.jar:/usr/lib/hadoop-yarn/.//hadoop-yarn-server-common-2.0.0-cdh4.0.1.jar:/usr/lib/hadoop-yarn/.//hadoop-yarn-server-nodemanager-2.0.0-cdh4.0.1.jar:/usr/lib/hadoop-yarn/.//hadoop-yarn-site.jar:/usr/lib/hadoop-yarn/.//hadoop-yarn-api-2.0.0-cdh4.0.1.jar:/usr/lib/hadoop-yarn/.//hadoop-yarn-common-2.0.0-cdh4.0.1.jar:/usr/lib/hadoop-yarn/.//hadoop-yarn-applications-distributedshell.jar:/usr/lib/hadoop-yarn/.//hadoop-yarn-site-2.0.0-cdh4.0.1.jar:/usr/lib/hadoop-mapreduce/.//* STARTUP_MSG: build = file:///var/lib/jenkins/workspace/generic-package-ubuntu64-12-04/CDH4.0.1-Packaging-Hadoop-2012-06-28_17-01-57/hadoop-2.0.0+91-1.cdh4.0.1.p0.1~precise/src/hadoop-common-project/hadoop-common -r 4d98eb718ec0cce78a00f292928c5ab6e1b84695; compiled by 'jenkins' on Thu Jun 28 17:39:19 PDT 2012 ************************************************************/ 12/09/10 14:41:10 WARN impl.MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-namenode.properties,hadoop-metrics2.properties hdfs-site.xml: hb@localhost:/etc/hadoop/conf$ cat hdfs-site.xml <?xml version="1.0" encoding="UTF-8"?> <!--Autogenerated by Cloudera CM on 2012-09-03T10:13:30.628Z--> <configuration> <property> <name>dfs.https.address</name> <value>localhost:50470</value> </property> <property> <name>dfs.https.port</name> <value>50470</value> </property> <property> <name>dfs.namenode.http-address</name> <value>localhost:50070</value> </property> <property> <name>dfs.replication</name> <value>1</value> </property> <property> <name>dfs.blocksize</name> <value>134217728</value> </property> <property> <name>dfs.client.use.datanode.hostname</name> <value>false</value> </property> </configuration>

    Read the article

  • Running a simple integration scenario using the Oracle Big Data Connectors on Hadoop/HDFS cluster

    - by hamsun
    Between the elephant ( the tradional image of the Hadoop framework) and the Oracle Iron Man (Big Data..) an english setter could be seen as the link to the right data Data, Data, Data, we are living in a world where data technology based on popular applications , search engines, Webservers, rich sms messages, email clients, weather forecasts and so on, have a predominant role in our life. More and more technologies are used to analyze/track our behavior, try to detect patterns, to propose us "the best/right user experience" from the Google Ad services, to Telco companies or large consumer sites (like Amazon:) ). The more we use all these technologies, the more we generate data, and thus there is a need of huge data marts and specific hardware/software servers (as the Exadata servers) in order to treat/analyze/understand the trends and offer new services to the users. Some of these "data feeds" are raw, unstructured data, and cannot be processed effectively by normal SQL queries. Large scale distributed processing was an emerging infrastructure need and the solution seemed to be the "collocation of compute nodes with the data", which in turn leaded to MapReduce parallel patterns and the development of the Hadoop framework, which is based on MapReduce and a distributed file system (HDFS) that runs on larger clusters of rather inexpensive servers. Several Oracle products are using the distributed / aggregation pattern for data calculation ( Coherence, NoSql, times ten ) so once that you are familiar with one of these technologies, lets says with coherence aggregators, you will find the whole Hadoop, MapReduce concept very similar. Oracle Big Data Appliance is based on the Cloudera Distribution (CDH), and the Oracle Big Data Connectors can be plugged on a Hadoop cluster running the CDH distribution or equivalent Hadoop clusters. In this paper, a "lab like" implementation of this concept is done on a single Linux X64 server, running an Oracle Database 11g Enterprise Edition Release 11.2.0.4.0, and a single node Apache hadoop-1.2.1 HDFS cluster, using the SQL connector for HDFS. The whole setup is fairly simple: Install on a Linux x64 server ( or virtual box appliance) an Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 server Get the Apache Hadoop distribution from: http://mir2.ovh.net/ftp.apache.org/dist/hadoop/common/hadoop-1.2.1. Get the Oracle Big Data Connectors from: http://www.oracle.com/technetwork/bdc/big-data-connectors/downloads/index.html?ssSourceSiteId=ocomen. Check the java version of your Linux server with the command: java -version java version "1.7.0_40" Java(TM) SE Runtime Environment (build 1.7.0_40-b43) Java HotSpot(TM) 64-Bit Server VM (build 24.0-b56, mixed mode) Decompress the hadoop hadoop-1.2.1.tar.gz file to /u01/hadoop-1.2.1 Modify your .bash_profile export HADOOP_HOME=/u01/hadoop-1.2.1 export PATH=$PATH:$HADOOP_HOME/bin export HIVE_HOME=/u01/hive-0.11.0 export PATH=$PATH:$HADOOP_HOME/bin:$HIVE_HOME/bin (also see my sample .bash_profile) Set up ssh trust for Hadoop process, this is a mandatory step, in our case we have to establish a "local trust" as will are using a single node configuration copy the new public keys to the list of authorized keys connect and test the ssh setup to your localhost: We will run a "pseudo-Hadoop cluster", in what is called "local standalone mode", all the Hadoop java components are running in one Java process, this is enough for our demo purposes. We need to "fine tune" some Hadoop configuration files, we have to go at our $HADOOP_HOME/conf, and modify the files: core-site.xml hdfs-site.xml mapred-site.xml check that the hadoop binaries are referenced correctly from the command line by executing: hadoop -version As Hadoop is managing our "clustered HDFS" file system we have to create "the mount point" and format it , the mount point will be declared to core-site.xml as: The layout under the /u01/hadoop-1.2.1/data will be created and used by other hadoop components (MapReduce = /mapred/...) HDFS is using the /dfs/... layout structure format the HDFS hadoop file system: Start the java components for the HDFS system As an additional check, you can use the GUI Hadoop browsers to check the content of your HDFS configurations: Once our HDFS Hadoop setup is done you can use the HDFS file system to store data ( big data : )), and plug them back and forth to Oracle Databases by the means of the Big Data Connectors ( which is the next configuration step). You can create / use a Hive db, but in our case we will make a simple integration of "raw data" , through the creation of an External Table to a local Oracle instance ( on the same Linux box, we run the Hadoop HDFS one node cluster and one Oracle DB). Download some public "big data", I use the site: http://france.meteofrance.com/france/observations, from where I can get *.csv files for my big data simulations :). Here is the data layout of my example file: Download the Big Data Connector from the OTN (oraosch-2.2.0.zip), unzip it to your local file system (see picture below) Modify your environment in order to access the connector libraries , and make the following test: [oracle@dg1 bin]$./hdfs_stream Usage: hdfs_stream locationFile [oracle@dg1 bin]$ Load the data to the Hadoop hdfs file system: hadoop fs -mkdir bgtest_data hadoop fs -put obsFrance.txt bgtest_data/obsFrance.txt hadoop fs -ls /user/oracle/bgtest_data/obsFrance.txt [oracle@dg1 bg-data-raw]$ hadoop fs -ls /user/oracle/bgtest_data/obsFrance.txt Found 1 items -rw-r--r-- 1 oracle supergroup 54103 2013-10-22 06:10 /user/oracle/bgtest_data/obsFrance.txt [oracle@dg1 bg-data-raw]$hadoop fs -ls hdfs:///user/oracle/bgtest_data/obsFrance.txt Found 1 items -rw-r--r-- 1 oracle supergroup 54103 2013-10-22 06:10 /user/oracle/bgtest_data/obsFrance.txt Check the content of the HDFS with the browser UI: Start the Oracle database, and run the following script in order to create the Oracle database user, the Oracle directories for the Oracle Big Data Connector (dg1 it’s my own db id replace accordingly yours): #!/bin/bash export ORAENV_ASK=NO export ORACLE_SID=dg1 . oraenv sqlplus /nolog <<EOF CONNECT / AS sysdba; CREATE OR REPLACE DIRECTORY osch_bin_path AS '/u01/orahdfs-2.2.0/bin'; CREATE USER BGUSER IDENTIFIED BY oracle; GRANT CREATE SESSION, CREATE TABLE TO BGUSER; GRANT EXECUTE ON sys.utl_file TO BGUSER; GRANT READ, EXECUTE ON DIRECTORY osch_bin_path TO BGUSER; CREATE OR REPLACE DIRECTORY BGT_LOG_DIR as '/u01/BG_TEST/logs'; GRANT READ, WRITE ON DIRECTORY BGT_LOG_DIR to BGUSER; CREATE OR REPLACE DIRECTORY BGT_DATA_DIR as '/u01/BG_TEST/data'; GRANT READ, WRITE ON DIRECTORY BGT_DATA_DIR to BGUSER; EOF Put the following in a file named t3.sh and make it executable, hadoop jar $OSCH_HOME/jlib/orahdfs.jar \ oracle.hadoop.exttab.ExternalTable \ -D oracle.hadoop.exttab.tableName=BGTEST_DP_XTAB \ -D oracle.hadoop.exttab.defaultDirectory=BGT_DATA_DIR \ -D oracle.hadoop.exttab.dataPaths="hdfs:///user/oracle/bgtest_data/obsFrance.txt" \ -D oracle.hadoop.exttab.columnCount=7 \ -D oracle.hadoop.connection.url=jdbc:oracle:thin:@//localhost:1521/dg1 \ -D oracle.hadoop.connection.user=BGUSER \ -D oracle.hadoop.exttab.printStackTrace=true \ -createTable --noexecute then test the creation fo the external table with it: [oracle@dg1 samples]$ ./t3.sh ./t3.sh: line 2: /u01/orahdfs-2.2.0: Is a directory Oracle SQL Connector for HDFS Release 2.2.0 - Production Copyright (c) 2011, 2013, Oracle and/or its affiliates. All rights reserved. Enter Database Password:] The create table command was not executed. The following table would be created. CREATE TABLE "BGUSER"."BGTEST_DP_XTAB" ( "C1" VARCHAR2(4000), "C2" VARCHAR2(4000), "C3" VARCHAR2(4000), "C4" VARCHAR2(4000), "C5" VARCHAR2(4000), "C6" VARCHAR2(4000), "C7" VARCHAR2(4000) ) ORGANIZATION EXTERNAL ( TYPE ORACLE_LOADER DEFAULT DIRECTORY "BGT_DATA_DIR" ACCESS PARAMETERS ( RECORDS DELIMITED BY 0X'0A' CHARACTERSET AL32UTF8 STRING SIZES ARE IN CHARACTERS PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream' FIELDS TERMINATED BY 0X'2C' MISSING FIELD VALUES ARE NULL ( "C1" CHAR(4000), "C2" CHAR(4000), "C3" CHAR(4000), "C4" CHAR(4000), "C5" CHAR(4000), "C6" CHAR(4000), "C7" CHAR(4000) ) ) LOCATION ( 'osch-20131022081035-74-1' ) ) PARALLEL REJECT LIMIT UNLIMITED; The following location files would be created. osch-20131022081035-74-1 contains 1 URI, 54103 bytes 54103 hdfs://localhost:19000/user/oracle/bgtest_data/obsFrance.txt Then remove the --noexecute flag and create the external Oracle table for the Hadoop data. Check the results: The create table command succeeded. CREATE TABLE "BGUSER"."BGTEST_DP_XTAB" ( "C1" VARCHAR2(4000), "C2" VARCHAR2(4000), "C3" VARCHAR2(4000), "C4" VARCHAR2(4000), "C5" VARCHAR2(4000), "C6" VARCHAR2(4000), "C7" VARCHAR2(4000) ) ORGANIZATION EXTERNAL ( TYPE ORACLE_LOADER DEFAULT DIRECTORY "BGT_DATA_DIR" ACCESS PARAMETERS ( RECORDS DELIMITED BY 0X'0A' CHARACTERSET AL32UTF8 STRING SIZES ARE IN CHARACTERS PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream' FIELDS TERMINATED BY 0X'2C' MISSING FIELD VALUES ARE NULL ( "C1" CHAR(4000), "C2" CHAR(4000), "C3" CHAR(4000), "C4" CHAR(4000), "C5" CHAR(4000), "C6" CHAR(4000), "C7" CHAR(4000) ) ) LOCATION ( 'osch-20131022081719-3239-1' ) ) PARALLEL REJECT LIMIT UNLIMITED; The following location files were created. osch-20131022081719-3239-1 contains 1 URI, 54103 bytes 54103 hdfs://localhost:19000/user/oracle/bgtest_data/obsFrance.txt This is the view from the SQL Developer: and finally the number of lines in the oracle table, imported from our Hadoop HDFS cluster SQL select count(*) from "BGUSER"."BGTEST_DP_XTAB"; COUNT(*) ---------- 1151 In a next post we will integrate data from a Hive database, and try some ODI integrations with the ODI Big Data connector. Our simplistic approach is just a step to show you how these unstructured data world can be integrated to Oracle infrastructure. Hadoop, BigData, NoSql are great technologies, they are widely used and Oracle is offering a large integration infrastructure based on these services. Oracle University presents a complete curriculum on all the Oracle related technologies: NoSQL: Introduction to Oracle NoSQL Database Using Oracle NoSQL Database Big Data: Introduction to Big Data Oracle Big Data Essentials Oracle Big Data Overview Oracle Data Integrator: Oracle Data Integrator 12c: New Features Oracle Data Integrator 11g: Integration and Administration Oracle Data Integrator: Administration and Development Oracle Data Integrator 11g: Advanced Integration and Development Oracle Coherence 12c: Oracle Coherence 12c: New Features Oracle Coherence 12c: Share and Manage Data in Clusters Oracle Coherence 12c: Oracle GoldenGate 11g: Fundamentals for Oracle Oracle GoldenGate 11g: Fundamentals for SQL Server Oracle GoldenGate 11g Fundamentals for Oracle Oracle GoldenGate 11g Fundamentals for DB2 Oracle GoldenGate 11g Fundamentals for Teradata Oracle GoldenGate 11g Fundamentals for HP NonStop Oracle GoldenGate 11g Management Pack: Overview Oracle GoldenGate 11g Troubleshooting and Tuning Oracle GoldenGate 11g: Advanced Configuration for Oracle Other Resources: Apache Hadoop : http://hadoop.apache.org/ is the homepage for these technologies. "Hadoop Definitive Guide 3rdEdition" by Tom White is a classical lecture for people who want to know more about Hadoop , and some active "googling " will also give you some more references. About the author: Eugene Simos is based in France and joined Oracle through the BEA-Weblogic Acquisition, where he worked for the Professional Service, Support, end Education for major accounts across the EMEA Region. He worked in the banking sector, ATT, Telco companies giving him extensive experience on production environments. Eugen currently specializes in Oracle Fusion Middleware teaching an array of courses on Weblogic/Webcenter, Content,BPM /SOA/Identity-Security/GoldenGate/Virtualisation/Unified Comm Suite) throughout the EMEA region.

    Read the article

  • Big Data – Buzz Words: What is HDFS – Day 8 of 21

    - by Pinal Dave
    In yesterday’s blog post we learned what is MapReduce. In this article we will take a quick look at one of the four most important buzz words which goes around Big Data – HDFS. What is HDFS ? HDFS stands for Hadoop Distributed File System and it is a primary storage system used by Hadoop. It provides high performance access to data across Hadoop clusters. It is usually deployed on low-cost commodity hardware. In commodity hardware deployment server failures are very common. Due to the same reason HDFS is built to have high fault tolerance. The data transfer rate between compute nodes in HDFS is very high, which leads to reduced risk of failure. HDFS creates smaller pieces of the big data and distributes it on different nodes. It also copies each smaller piece to multiple times on different nodes. Hence when any node with the data crashes the system is automatically able to use the data from a different node and continue the process. This is the key feature of the HDFS system. Architecture of HDFS The architecture of the HDFS is master/slave architecture. An HDFS cluster always consists of single NameNode. This single NameNode is a master server and it manages the file system as well regulates access to various files. In additional to NameNode there are multiple DataNodes. There is always one DataNode for each data server. In HDFS a big file is split into one or more blocks and those blocks are stored in a set of DataNodes. The primary task of the NameNode is to open, close or rename files and directory and regulate access to the file system, whereas the primary task of the DataNode is read and write to the file systems. DataNode is also responsible for the creation, deletion or replication of the data based on the instruction from NameNode. In reality, NameNode and DataNode are software designed to run on commodity machine build in Java language. Visual Representation of HDFS Architecture Let us understand how HDFS works with the help of the diagram. Client APP or HDFS Client connects to NameSpace as well as DataNode. Client App access to the DataNode is regulated by NameSpace Node. NameSpace Node allows Client App to connect to the DataNode based by allowing the connection to the DataNode directly. A big data file is divided into multiple data blocks (let us assume that those data chunks are A,B,C and D. Client App will later on write data blocks directly to the DataNode. Client App does not have to directly write to all the node. It just has to write to any one of the node and NameNode will decide on which other DataNode it will have to replicate the data. In our example Client App directly writes to DataNode 1 and detained 3. However, data chunks are automatically replicated to other nodes. All the information like in which DataNode which data block is placed is written back to NameNode. High Availability During Disaster Now as multiple DataNode have same data blocks in the case of any DataNode which faces the disaster, the entire process will continue as other DataNode will assume the role to serve the specific data block which was on the failed node. This system provides very high tolerance to disaster and provides high availability. If you notice there is only single NameNode in our architecture. If that node fails our entire Hadoop Application will stop performing as it is a single node where we store all the metadata. As this node is very critical, it is usually replicated on another clustered as well as on another data rack. Though, that replicated node is not operational in architecture, it has all the necessary data to perform the task of the NameNode in the case of the NameNode fails. The entire Hadoop architecture is built to function smoothly even there are node failures or hardware malfunction. It is built on the simple concept that data is so big it is impossible to have come up with a single piece of the hardware which can manage it properly. We need lots of commodity (cheap) hardware to manage our big data and hardware failure is part of the commodity servers. To reduce the impact of hardware failure Hadoop architecture is built to overcome the limitation of the non-functioning hardware. Tomorrow In tomorrow’s blog post we will discuss the importance of the relational database in Big Data. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

    Read the article

  • HDFS some datanodes of cluster are suddenly disconnected while reducers are running

    - by user1429825
    I have 8 slave computers and 1 master computer for running Hadoop (ver 0.21) some datanodes of cluster are suddenly disconnected while I was running MapReduce code on 10GB data After all mappers finished and around 80% of reducers was processed, randomly one or more datanode disconned from network. and then the other datanodes start to disappear from network even if I killed the MapReduce job when I found some datanode was disconnected. I've tried to change dfs.datanode.max.xcievers to 4096, turned off fire-walls of all computing node, disabled selinux and increased the number of file open limit to 20000 but they didn't work at all... anyone have a idea to solve this problem? followings are error log from mapreduce 12/06/01 12:31:29 INFO mapreduce.Job: Task Id : attempt_201206011227_0001_r_000006_0, Status : FAILED java.io.IOException: Bad connect ack with firstBadLink as ***.***.***.148:20010 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:889) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427) and followings are logs from datanode 2012-06-01 13:01:01,118 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-5549263231281364844_3453 src: /*.*.*.147:56205 dest: /*.*.*.142:20010 2012-06-01 13:01:01,136 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(*.*.*.142:20010, storageID=DS-1534489105-*.*.*.142-20010-1337757934836, infoPort=20075, ipcPort=20020) Starting thread to transfer block blk_-3849519151985279385_5906 to *.*.*.147:20010 2012-06-01 13:01:19,135 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(*.*.*.142:20010, storageID=DS-1534489105-*.*.*.142-20010-1337757934836, infoPort=20075, ipcPort=20020):Failed to transfer blk_-5797481564121417802_3453 to *.*.*.146:20010 got java.net.ConnectException: > Connection timed out at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:373) at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:1257) at java.lang.Thread.run(Thread.java:722) 2012-06-01 13:06:20,342 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_6674438989226364081_3453 2012-06-01 13:09:01,781 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(*.*.*.142:20010, storageID=DS-1534489105-*.*.*.142-20010-1337757934836, infoPort=20075, ipcPort=20020):Failed to transfer blk_-3849519151985279385_5906 to *.*.*.147:20010 got java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/*.*.*.142:60057 remote=/*.*.*.147:20010] at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246) at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:164) at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:203) at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:388) at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:476) at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:1284) at java.lang.Thread.run(Thread.java:722) hdfs-site.xml <configuration> <property> <name>dfs.name.dir</name> <value>/home/hadoop/data/name</value> </property> <property> <name>dfs.data.dir</name> <value>/home/hadoop/data/hdfs1,/home/hadoop/data/hdfs2,/home/hadoop/data/hdfs3,/home/hadoop/data/hdfs4,/home/hadoop/data/hdfs5</value> </property> <property> <name>dfs.replication</name> <value>3</value> </property> <property> <name>dfs.datanode.max.xcievers</name> <value>4096</value> </property> <property> <name>dfs.http.address</name> <value>0.0.0.0:20070</value> <description>50070 The address and the base port where the dfs namenode web ui will listen on. If the port is 0 then the server will start on a free port. </description> </property> <property> <name>dfs.datanode.http.address</name> <value>0.0.0.0:20075</value> <description>50075 The datanode http server address and port. If the port is 0 then the server will start on a free port. </description> </property> <property> <name>dfs.secondary.http.address</name> <value>0.0.0.0:20090</value> <description>50090 The secondary namenode http server address and port. If the port is 0 then the server will start on a free port. </description> </property> <property> <name>dfs.datanode.address</name> <value>0.0.0.0:20010</value> <description>50010 The address where the datanode server will listen to. If the port is 0 then the server will start on a free port. </description> <property> <name>dfs.datanode.ipc.address</name> <value>0.0.0.0:20020</value> <description>50020 The datanode ipc server address and port. If the port is 0 then the server will start on a free port. </description> </property> <property> <name>dfs.datanode.https.address</name> <value>0.0.0.0:20475</value> </property> <property> <name>dfs.https.address</name> <value>0.0.0.0:20470</value> </property> </configuration> mapred-site.xml <configuration> <property> <name>mapred.job.tracker</name> <value>masternode:29001</value> </property> <property> <name>mapred.system.dir</name> <value>/home/hadoop/data/mapreduce/system</value> </property> <property> <name>mapred.local.dir</name> <value>/home/hadoop/data/mapreduce/local</value> </property> <property> <name>mapred.map.tasks</name> <value>32</value> <description> default number of map tasks per job.</description> </property> <property> <name>mapred.tasktracker.map.tasks.maximum</name> <value>4</value> </property> <property> <name>mapred.reduce.tasks</name> <value>8</value> <description> default number of reduce tasks per job.</description> </property> <property> <name>mapred.map.child.java.opts</name> <value>-Xmx2048M</value> </property> <property> <name>io.sort.mb</name> <value>500</value> </property> <property> <name>mapred.task.timeout</name> <value>1800000</value> <!-- 30 minutes --> </property> <property> <name>mapred.job.tracker.http.address</name> <value>0.0.0.0:20030</value> <description> 50030 The job tracker http server address and port the server will listen on. If the port is 0 then the server will start on a free port. </description> </property> <property> <name>mapred.task.tracker.http.address</name> <value>0.0.0.0:20060</value> <description> 50060 </property> </configuration>

    Read the article

  • HDFS datanode startup fails when disks are full

    - by mbac
    Our HDFS cluster is only 90% full but some datanodes have some disks that are 100% full. That means when we mass reboot the entire cluster some datanodes completely fail to start with a message like this: 2013-10-26 03:58:27,295 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Mkdirs failed to create /mnt/local/sda1/hadoop/dfsdata/blocksBeingWritten Only three have to fail this way before we start experiencing real data loss. Currently we workaround it by decreasing the amount of space reserved for the root user but we'll eventually run out. We also run the re-balancer pretty much constantly, but some disks stay stuck at 100% anyway. Changing the dfs.datanode.failed.volumes.tolerated setting is not the solution as the volume has not failed. Any ideas?

    Read the article

  • hdfs configuration

    - by Ananymous
    I am a newbie. Trying to setup a hdfs system to serve my data (I don't plan to use mapreduce) at my lab. So far I have read, cluster setup in but I am still confused. Several questions: Do I need to have a secondary namenode? There are 2 files, masters and slaves. Do I really need these 2 files eventhough I just want hdfs? If I need them, what should go in there? I assume my namenode in masters and datanodes as slaves? Do I need slaves nodes What configuration files are needed for namenode, secondary namenode, datanode and client? (I assume core-site.xml is needed for all 4)? In addition, can someone suggest a good configuration model? sample configuration for namenode, secondary namenode, datanode, and the client would be very helpful. I am getting confused because it seems most of the documentation assumes I want to use map-reduce which isn't the case.

    Read the article

  • no namenode error in pseudo-mode

    - by Anshu Basia
    I'm new to hadoop and is in learning phase. As per Hadoop Definitve guide, i have set up my hadoop in pseudo distributed mode and everything was working fine. I was even able to execute all the examples from chapter 3 yesterday. Today, when i rebooted my unix and tried to run start-dfs.sh and then tried http://localhost/50070....it is showing error and when i try to stop dfs (stop-dfs.sh) it says no namenode to stop. I have been googling the issue but no result. Also, when i format my namenode again...everything starts working fine and i'm able to connect to the localhost/50070 and even replicate files and directories in hdfs but as soon as i restart my linux and try to connect to hdfs the same problem comes up. Below is the error log: ************************************************************/ 2011-06-22 15:45:55,249 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG: /************************************************************ STARTUP_MSG: Starting NameNode STARTUP_MSG: host = ubuntu/127.0.1.1 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.20.203.0 STARTUP_MSG: build = http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-203 -r 1099333; compiled by 'oom' on Wed May 4 07:57:50 PDT 2011 ************************************************************/ 2011-06-22 15:45:56,383 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties 2011-06-22 15:45:56,455 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source MetricsSystem,sub=Stats registered. 2011-06-22 15:45:56,494 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s). 2011-06-22 15:45:56,494 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system started 2011-06-22 15:45:57,007 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi registered. 2011-06-22 15:45:57,031 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists! 2011-06-22 15:45:57,059 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source jvm registered. 2011-06-22 15:45:57,070 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source NameNode registered. 2011-06-22 15:45:57,374 INFO org.apache.hadoop.hdfs.util.GSet: VM type = 32-bit 2011-06-22 15:45:57,374 INFO org.apache.hadoop.hdfs.util.GSet: 2% max memory = 19.33375 MB 2011-06-22 15:45:57,374 INFO org.apache.hadoop.hdfs.util.GSet: capacity = 2^22 = 4194304 entries 2011-06-22 15:45:57,374 INFO org.apache.hadoop.hdfs.util.GSet: recommended=4194304, actual=4194304 2011-06-22 15:45:57,854 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=anshu 2011-06-22 15:45:57,854 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup 2011-06-22 15:45:57,854 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true 2011-06-22 15:45:57,868 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: dfs.block.invalidate.limit=100 2011-06-22 15:45:57,869 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s) 2011-06-22 15:45:58,769 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStateMBean and NameNodeMXBean 2011-06-22 15:45:58,809 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names occuring more than 10 times **2011-06-22 15:45:58,825 INFO org.apache.hadoop.hdfs.server.common.Storage: Storage directory /tmp/hadoop-anshu/dfs/name does not exist. 2011-06-22 15:45:58,827 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed. org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/hadoop-anshu/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible. at org.apache.h**adoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:291) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:97) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:379) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:353) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:254) at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:434) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1153) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1162) 2011-06-22 15:45:58,828 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/hadoop-anshu/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible. at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:291) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:97) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:379) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:353) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:254) at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:434) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1153) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1162) 2011-06-22 15:45:58,829 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG: /************************************************************ SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1 ************************************************************/ Any help is appreciated Thank-you

    Read the article

  • What is meant by "streaming data access" in HDFS?

    - by Van Gale
    According to the HDFS Architecture page HDFS was designed for "streaming data access". I'm not sure what that means exactly, but would guess it means an operation like seek is either disabled or has sub-optimal performance. Would this be correct? I'm interested in using HDFS for storing audio/video files that need to be streamed to browser clients. Most of the streams will be start to finish, but some could have a high number of seeks. Maybe there is another file system that could do this better?

    Read the article

  • Does changing the default HDFS replication factor from 3 affect mapper performance?

    - by liamf
    Have a HDFS/Hadoop cluster setup and am looking into tuning. I wonder if changing the default HDFS replication factor (default:3) to something bigger will improve mapper performance, at the obvious expense of increasing disk storage used? My reasoning being that if the data is already replicated to more nodes, mapper jobs can be run on more nodes in parallel without any data streaming/copying? Anyone got any opinions?

    Read the article

  • Running Hadoop example in psuedo-distributed mode on vm

    - by manas
    I have set-up Hadoop on a OpenSuse 11.2 VM using Virtualbox.I have made the prerequisite configs. I ran this example in the Standalone mode successfully. But in psuedo-distributed mode I get the following error: $./bin/hadoop fs -put conf input 10/04/13 15:56:25 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Protocol not available 10/04/13 15:56:25 INFO hdfs.DFSClient: Abandoning block blk_-8490915989783733314_1003 10/04/13 15:56:31 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Protocol not available 10/04/13 15:56:31 INFO hdfs.DFSClient: Abandoning block blk_-1740343312313498323_1003 10/04/13 15:56:37 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Protocol not available 10/04/13 15:56:37 INFO hdfs.DFSClient: Abandoning block blk_-3566235190507929459_1003 10/04/13 15:56:43 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketException: Protocol not available 10/04/13 15:56:43 INFO hdfs.DFSClient: Abandoning block blk_-1746222418910980888_1003 10/04/13 15:56:49 WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block. at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2845) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288) 10/04/13 15:56:49 WARN hdfs.DFSClient: Error Recovery for block blk_-1746222418910980888_1003 bad datanode[0] nodes == null 10/04/13 15:56:49 WARN hdfs.DFSClient: Could not get block locations. Source file "/user/max/input/core-site.xml" - Aborting... put: Protocol not available 10/04/13 15:56:49 ERROR hdfs.DFSClient: Exception closing file /user/max/input/core-site.xml : java.net.SocketException: Protocol not available java.net.SocketException: Protocol not available at sun.nio.ch.Net.getIntOption0(Native Method) at sun.nio.ch.Net.getIntOption(Net.java:178) at sun.nio.ch.SocketChannelImpl$1.getInt(SocketChannelImpl.java:419) at sun.nio.ch.SocketOptsImpl.getInt(SocketOptsImpl.java:60) at sun.nio.ch.SocketOptsImpl.sendBufferSize(SocketOptsImpl.java:156) at sun.nio.ch.SocketOptsImpl$IP$TCP.sendBufferSize(SocketOptsImpl.java:286) at sun.nio.ch.OptionAdaptor.getSendBufferSize(OptionAdaptor.java:129) at sun.nio.ch.SocketAdaptor.getSendBufferSize(SocketAdaptor.java:328) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2873) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2826) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288) Any leads will be highly appreciated.

    Read the article

  • Problem with copying local data onto HDFS on a Hadoop cluster using Amazon EC2/ S3.

    - by Deepak Konidena
    Hi, I have setup a Hadoop cluster containing 5 nodes on Amazon EC2. Now, when i login into the Master node and submit the following command bin/hadoop jar <program>.jar <arg1> <arg2> <path/to/input/file/on/S3> It throws the following errors (not at the same time.) The first error is thrown when i don't replace the slashes with '%2F' and the second is thrown when i replace them with '%2F': 1) Java.lang.IllegalArgumentException: Invalid hostname in URI S3://<ID>:<SECRETKEY>@<BUCKET>/<path-to-inputfile> 2) org.apache.hadoop.fs.S3.S3Exception: org.jets3t.service.S3ServiceException: S3 PUT failed for '/' XML Error Message: The request signature we calculated does not match the signature you provided. check your key and signing method. Note: 1)when i submitted jps to see what tasks were running on the Master, it just showed 1116 NameNode 1699 Jps 1180 JobTracker leaving DataNode and TaskTracker. 2)My Secret key contains two '/' (forward slashes). And i replace them with '%2F' in the S3 URI. PS: The program runs fine on EC2 when run on a single node. Its only when i launch a cluster, i run into issues related to copying data to/from S3 from/to HDFS. And, what does distcp do? Do i need to distribute the data even after i copy the data from S3 to HDFS?(I thought, HDFS took care of that internally) IF you could direct me to a link that explains running Map/reduce programs on a hadoop cluster using Amazon EC2/S3. That would be great. Regards, Deepak.

    Read the article

  • CouchDB, HDFS, HBase or which is right for my situation?

    - by Lucas
    Hello all, This question is regarding data storage systems such as CouchDB, HDFS and HBase, specifically, which is right. I am looking at making a simple and customized Document Management System for my organization. Basically, we need the ability to store some Word Documents, PDFs and other similar files. I also want to store metadata about these files (e.g., Author, Dates, etc). Usage permissions would also be handy, but that can probably be built using meta-data. I would also need the ability to full-text index. The ability to version, while not required would be extremely useful. I would like the ability to simply add hardware to expand the resources of the system and the system must support Network Attached Storage over the CIFS or NFS protocol(s). I have read about CouchDB, HDFS and HBase. My preferred programming language is C# as all of my end-users will be running Windows machines and I will want to make both web and winforms client implementations. My question is which solution best fits my needs? Based on my research it appears that CouchDB (utilizing the CouchDB-Lounge and CouchDB-Lucene) perfectly fits my needs. However, I am worried that since I have worked with CouchDB that I might be overlooking something useful for my needs in HDFS or HBase or something similar due to a bias. Any and all opinions are welcome as I am looking for the community input as I really do not want to make the wrong choice at the start of my project. Please ask if you need more information. I thank you all for your time, input and assistance.

    Read the article

  • How to Load Oracle Tables From Hadoop Tutorial (Part 5 - Leveraging Parallelism in OSCH)

    - by Bob Hanckel
    Normal 0 false false false EN-US X-NONE X-NONE MicrosoftInternetExplorer4 Using OSCH: Beyond Hello World In the previous post we discussed a “Hello World” example for OSCH focusing on the mechanics of getting a toy end-to-end example working. In this post we are going to talk about how to make it work for big data loads. We will explain how to optimize an OSCH external table for load, paying particular attention to Oracle’s DOP (degree of parallelism), the number of external table location files we use, and the number of HDFS files that make up the payload. We will provide some rules that serve as best practices when using OSCH. The assumption is that you have read the previous post and have some end to end OSCH external tables working and now you want to ramp up the size of the loads. Using OSCH External Tables for Access and Loading OSCH external tables are no different from any other Oracle external tables.  They can be used to access HDFS content using Oracle SQL: SELECT * FROM my_hdfs_external_table; or use the same SQL access to load a table in Oracle. INSERT INTO my_oracle_table SELECT * FROM my_hdfs_external_table; To speed up the load time, you will want to control the degree of parallelism (i.e. DOP) and add two SQL hints. ALTER SESSION FORCE PARALLEL DML PARALLEL  8; ALTER SESSION FORCE PARALLEL QUERY PARALLEL 8; INSERT /*+ append pq_distribute(my_oracle_table, none) */ INTO my_oracle_table SELECT * FROM my_hdfs_external_table; There are various ways of either hinting at what level of DOP you want to use.  The ALTER SESSION statements above force the issue assuming you (the user of the session) are allowed to assert the DOP (more on that in the next section).  Alternatively you could embed additional parallel hints directly into the INSERT and SELECT clause respectively. /*+ parallel(my_oracle_table,8) *//*+ parallel(my_hdfs_external_table,8) */ Note that the "append" hint lets you load a target table by reserving space above a given "high watermark" in storage and uses Direct Path load.  In other doesn't try to fill blocks that are already allocated and partially filled. It uses unallocated blocks.  It is an optimized way of loading a table without incurring the typical resource overhead associated with run-of-the-mill inserts.  The "pq_distribute" hint in this context unifies the INSERT and SELECT operators to make data flow during a load more efficient. Finally your target Oracle table should be defined with "NOLOGGING" and "PARALLEL" attributes.   The combination of the "NOLOGGING" and use of the "append" hint disables REDO logging, and its overhead.  The "PARALLEL" clause tells Oracle to try to use parallel execution when operating on the target table. Determine Your DOP It might feel natural to build your datasets in Hadoop, then afterwards figure out how to tune the OSCH external table definition, but you should start backwards. You should focus on Oracle database, specifically the DOP you want to use when loading (or accessing) HDFS content using external tables. The DOP in Oracle controls how many PQ slaves are launched in parallel when executing an external table. Typically the DOP is something you want to Oracle to control transparently, but for loading content from Hadoop with OSCH, it's something that you will want to control. Oracle computes the maximum DOP that can be used by an Oracle user. The maximum value that can be assigned is an integer value typically equal to the number of CPUs on your Oracle instances, times the number of cores per CPU, times the number of Oracle instances. For example, suppose you have a RAC environment with 2 Oracle instances. And suppose that each system has 2 CPUs with 32 cores. The maximum DOP would be 128 (i.e. 2*2*32). In point of fact if you are running on a production system, the maximum DOP you are allowed to use will be restricted by the Oracle DBA. This is because using a system maximum DOP can subsume all system resources on Oracle and starve anything else that is executing. Obviously on a production system where resources need to be shared 24x7, this can’t be allowed to happen. The use cases for being able to run OSCH with a maximum DOP are when you have exclusive access to all the resources on an Oracle system. This can be in situations when your are first seeding tables in a new Oracle database, or there is a time where normal activity in the production database can be safely taken off-line for a few hours to free up resources for a big incremental load. Using OSCH on high end machines (specifically Oracle Exadata and Oracle BDA cabled with Infiniband), this mode of operation can load up to 15TB per hour. The bottom line is that you should first figure out what DOP you will be allowed to run with by talking to the DBAs who manage the production system. You then use that number to derive the number of location files, and (optionally) the number of HDFS data files that you want to generate, assuming that is flexible. Rule 1: Find out the maximum DOP you will be allowed to use with OSCH on the target Oracle system Determining the Number of Location Files Let’s assume that the DBA told you that your maximum DOP was 8. You want the number of location files in your external table to be big enough to utilize all 8 PQ slaves, and you want them to represent equally balanced workloads. Remember location files in OSCH are metadata lists of HDFS files and are created using OSCH’s External Table tool. They also represent the workload size given to an individual Oracle PQ slave (i.e. a PQ slave is given one location file to process at a time, and only it will process the contents of the location file.) Rule 2: The size of the workload of a single location file (and the PQ slave that processes it) is the sum of the content size of the HDFS files it lists For example, if a location file lists 5 HDFS files which are each 100GB in size, the workload size for that location file is 500GB. The number of location files that you generate is something you control by providing a number as input to OSCH’s External Table tool. Rule 3: The number of location files chosen should be a small multiple of the DOP Each location file represents one workload for one PQ slave. So the goal is to keep all slaves busy and try to give them equivalent workloads. Obviously if you run with a DOP of 8 but have 5 location files, only five PQ slaves will have something to do and the other three will have nothing to do and will quietly exit. If you run with 9 location files, then the PQ slaves will pick up the first 8 location files, and assuming they have equal work loads, will finish up about the same time. But the first PQ slave to finish its job will then be rescheduled to process the ninth location file, potentially doubling the end to end processing time. So for this DOP using 8, 16, or 32 location files would be a good idea. Determining the Number of HDFS Files Let’s start with the next rule and then explain it: Rule 4: The number of HDFS files should try to be a multiple of the number of location files and try to be relatively the same size In our running example, the DOP is 8. This means that the number of location files should be a small multiple of 8. Remember that each location file represents a list of unique HDFS files to load, and that the sum of the files listed in each location file is a workload for one Oracle PQ slave. The OSCH External Table tool will look in an HDFS directory for a set of HDFS files to load.  It will generate N number of location files (where N is the value you gave to the tool). It will then try to divvy up the HDFS files and do its best to make sure the workload across location files is as balanced as possible. (The tool uses a greedy algorithm that grabs the biggest HDFS file and delegates it to a particular location file. It then looks for the next biggest file and puts in some other location file, and so on). The tools ability to balance is reduced if HDFS file sizes are grossly out of balance or are too few. For example suppose my DOP is 8 and the number of location files is 8. Suppose I have only 8 HDFS files, where one file is 900GB and the others are 100GB. When the tool tries to balance the load it will be forced to put the singleton 900GB into one location file, and put each of the 100GB files in the 7 remaining location files. The load balance skew is 9 to 1. One PQ slave will be working overtime, while the slacker PQ slaves are off enjoying happy hour. If however the total payload (1600 GB) were broken up into smaller HDFS files, the OSCH External Table tool would have an easier time generating a list where each workload for each location file is relatively the same.  Applying Rule 4 above to our DOP of 8, we could divide the workload into160 files that were approximately 10 GB in size.  For this scenario the OSCH External Table tool would populate each location file with 20 HDFS file references, and all location files would have similar workloads (approximately 200GB per location file.) As a rule, when the OSCH External Table tool has to deal with more and smaller files it will be able to create more balanced loads. How small should HDFS files get? Not so small that the HDFS open and close file overhead starts having a substantial impact. For our performance test system (Exadata/BDA with Infiniband), I compared three OSCH loads of 1 TiB. One load had 128 HDFS files living in 64 location files where each HDFS file was about 8GB. I then did the same load with 12800 files where each HDFS file was about 80MB size. The end to end load time was virtually the same. However when I got ridiculously small (i.e. 128000 files at about 8MB per file), it started to make an impact and slow down the load time. What happens if you break rules 3 or 4 above? Nothing draconian, everything will still function. You just won’t be taking full advantage of the generous DOP that was allocated to you by your friendly DBA. The key point of the rules articulated above is this: if you know that HDFS content is ultimately going to be loaded into Oracle using OSCH, it makes sense to chop them up into the right number of files roughly the same size, derived from the DOP that you expect to use for loading. Next Steps So far we have talked about OLH and OSCH as alternative models for loading. That’s not quite the whole story. They can be used together in a way that provides for more efficient OSCH loads and allows one to be more flexible about scheduling on a Hadoop cluster and an Oracle Database to perform load operations. The next lesson will talk about Oracle Data Pump files generated by OLH, and loaded using OSCH. It will also outline the pros and cons of using various load methods.  This will be followed up with a final tutorial lesson focusing on how to optimize OLH and OSCH for use on Oracle's engineered systems: specifically Exadata and the BDA. /* Style Definitions */ table.MsoNormalTable {mso-style-name:"Table Normal"; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-priority:99; mso-style-qformat:yes; mso-style-parent:""; mso-padding-alt:0in 5.4pt 0in 5.4pt; mso-para-margin-top:0in; mso-para-margin-right:0in; mso-para-margin-bottom:10.0pt; mso-para-margin-left:0in; line-height:115%; mso-pagination:widow-orphan; font-size:11.0pt; font-family:"Calibri","sans-serif"; mso-ascii-font-family:Calibri; mso-ascii-theme-font:minor-latin; mso-fareast-font-family:"Times New Roman"; mso-fareast-theme-font:minor-fareast; mso-hansi-font-family:Calibri; mso-hansi-theme-font:minor-latin;}

    Read the article

  • Reading a simple Avro file from HDFS

    - by John Galt... who
    I am trying to do a simple read of an Avro file stored in HDFS. I found out how to read it when it is on the local file system.... FileReader reader = DataFileReader.openReader(new File(filename), new GenericDatumReader()); for (GenericRecord datum : fileReader) { String value = datum.get(1).toString(); System.out.println("value = " value); } reader.close(); My file is in HDFS, however. I cannot give the openReader a Path or an FSDataInputStream. How can I simply read an Avro file in HDFS? EDIT: I got this to work by creating a custom class (SeekableHadoopInput) that implements SeekableInput. I "stole" this from "Ganglion" on github. Still, seems like there would be a Hadoop/Avro integration path for this. Thanks

    Read the article

  • FileInputStream for a generic file System

    - by Akhil
    I have a file that contains java serialized objects like "Vector". I have stored this file over Hadoop Distributed File System(HDFS). Now I intend to read this file (using method readObject) in one of the map task. I suppose FileInputStream in = new FileInputStream("hdfs/path/to/file"); wont' work as the file is stored over HDFS. So I thought of using org.apache.hadoop.fs.FileSystem class. But Unfortunately it does not have any method that returns FileInputStream. All it has is a method that returns FSDataInputStream but I want a inputstream that can read serialized java objects like vector from a file rather than just primitive data types that FSDataInputStream would do. Please help!

    Read the article

  • Hadoop safemode recovery - taking lot of time

    - by Algorist
    Hi, We are running our cluster on Amazon EC2. we are using cloudera scripts to setup hadoop. On the master node, we start below services. 609 $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start namenode' 610 $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start secondarynamenode' 611 $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start jobtracker' 612 613 $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop dfsadmin -safemode wait' On the slave machine, we run the below services. 625 $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start datanode' 626 $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start tasktracker' The main problem we are facing is, hdfs safemode recovery is taking more than an hour and this is causing delays in our job completion. Below are the main log messages. 1. domU-12-31-39-0A-34-61.compute-1.internal 10/05/05 20:44:19 INFO ipc.Client: Retrying connect to server: ec2-184-73-64-64.compute-1.amazonaws.com/10.192.11.240:8020. Already tried 21 time(s). 2. The reported blocks 283634 needs additional 322258 blocks to reach the threshold 0.9990 of total blocks 606499. Safe mode will be turned off automatically. The first message is thrown in task trackers log because, job tracker is not started. job tracker didn't start because of hdfs safemode recovery. The second message is thrown during the recovery process. Is there something I am doing wrong? How much time does normal hdfs safemode recovery takes? Will there be any speedup, by not starting task trackers till job tracker is started? Are there any known hadoop problems on amazon cluster? Thanks for your help. Regards Bala Mudiam

    Read the article

  • Why do HDFS clusters have only a single NameNode?

    - by grautur
    I'm trying to understand better how Hadoop works, and I'm reading The NameNode is a Single Point of Failure for the HDFS Cluster. HDFS is not currently a High Availability system. When the NameNode goes down, the file system goes offline. There is an optional SecondaryNameNode that can be hosted on a separate machine. It only creates checkpoints of the namespace by merging the edits file into the fsimage file and does not provide any real redundancy. Hadoop 0.21+ has a BackupNameNode that is part of a plan to have an HA name service, but it needs active contributions from the people who want it (i.e. you) to make it Highly Available. from http://wiki.apache.org/hadoop/NameNode So why is the NameNode a single point of failure? What is bad or difficult about having a complete duplicate of the NameNode running as well?

    Read the article

  • Are there any existing batch log file aggregation solutions?

    - by Mohan Gulati
    I wish to export from multiple nodes log files (in my case apache access and error logs) and aggregate that data in batch, as a scheduled job. I have seen multiple solutions that work with streaming data (i.e think scribe). I would like a tool that gives me the flexibility to define the destination. This requirement comes from the fact that I want to use HDFS as the destination. I have not been able to find a tool that supports this in batch. Before re-creating the wheel I wanted to ask the StackOverflow community for their input. If a solution exists already in python that would be even better.

    Read the article

  • Cascading S3 Sink Tap not being deleted with SinkMode.REPLACE

    - by Eric Charles
    We are running Cascading with a Sink Tap being configured to store in Amazon S3 and were facing some FileAlreadyExistsException (see [1]). This was only from time to time (1 time on around 100) and was not reproducable. Digging into the Cascading codem, we discovered the Hfs.deleteResource() is called (among others) by the BaseFlow.deleteSinksIfNotUpdate(). Btw, we were quite intrigued with the silent NPE (with comment "hack to get around npe thrown when fs reaches root directory"). From there, we extended the Hfs tap with our own Tap to add more action in the deleteResource() method (see [2]) with a retry mechanism calling directly the getFileSystem(conf).delete. The retry mechanism seemed to bring improvement, but we are still sometimes facing failures (see example in [3]): it sounds like HDFS returns isDeleted=true, but asking directly after if the folder exists, we receive exists=true, which should not happen. Logs also shows randomly isDeleted true or false when the flow succeeds, which sounds like the returned value is irrelevant or not to be trusted. Can anybody bring his own S3 experience with such a behavior: "folder should be deleted, but it is not"? We suspect a S3 issue, but could it also be in Cascading or HDFS? We run on Hadoop Cloudera-cdh3u5 and Cascading 2.0.1-wip-dev. [1] org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3n://... already exists at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132) at com.twitter.elephantbird.mapred.output.DeprecatedOutputFormatWrapper.checkOutputSpecs(DeprecatedOutputFormatWrapper.java:75) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:923) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:882) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:882) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:856) at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:104) at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:174) at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:137) at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:122) at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:42) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.j [2] @Override public boolean deleteResource(JobConf conf) throws IOException { LOGGER.info("Deleting resource {}", getIdentifier()); boolean isDeleted = super.deleteResource(conf); LOGGER.info("Hfs Sink Tap isDeleted is {} for {}", isDeleted, getIdentifier()); Path path = new Path(getIdentifier()); int retryCount = 0; int cumulativeSleepTime = 0; int sleepTime = 1000; while (getFileSystem(conf).exists(path)) { LOGGER .info( "Resource {} still exists, it should not... - I will continue to wait patiently...", getIdentifier()); try { LOGGER.info("Now I will sleep " + sleepTime / 1000 + " seconds while trying to delete {} - attempt: {}", getIdentifier(), retryCount + 1); Thread.sleep(sleepTime); cumulativeSleepTime += sleepTime; sleepTime *= 2; } catch (InterruptedException e) { e.printStackTrace(); LOGGER .error( "Interrupted while sleeping trying to delete {} with message {}...", getIdentifier(), e.getMessage()); throw new RuntimeException(e); } if (retryCount == 0) { getFileSystem(conf).delete(getPath(), true); } retryCount++; if (cumulativeSleepTime > MAXIMUM_TIME_TO_WAIT_TO_DELETE_MS) { break; } } if (getFileSystem(conf).exists(path)) { LOGGER .error( "We didn't succeed to delete the resource {}. Throwing now a runtime exception.", getIdentifier()); throw new RuntimeException( "Although we waited to delete the resource for " + getIdentifier() + ' ' + retryCount + " iterations, it still exists - This must be an issue in the underlying storage system."); } return isDeleted; } [3] INFO [pool-2-thread-15] (BaseFlow.java:1287) - [...] at least one sink is marked for delete INFO [pool-2-thread-15] (BaseFlow.java:1287) - [...] sink oldest modified date: Wed Dec 31 23:59:59 UTC 1969 INFO [pool-2-thread-15] (HiveSinkTap.java:148) - Now I will sleep 1 seconds while trying to delete s3n://... - attempt: 1 INFO [pool-2-thread-15] (HiveSinkTap.java:130) - Deleting resource s3n://... INFO [pool-2-thread-15] (HiveSinkTap.java:133) - Hfs Sink Tap isDeleted is true for s3n://... ERROR [pool-2-thread-15] (HiveSinkTap.java:175) - We didn't succeed to delete the resource s3n://... Throwing now a runtime exception. WARN [pool-2-thread-15] (Cascade.java:706) - [...] flow failed: ... java.lang.RuntimeException: Although we waited to delete the resource for s3n://... 0 iterations, it still exists - This must be an issue in the underlying storage system. at com.qubit.hive.tap.HiveSinkTap.deleteResource(HiveSinkTap.java:179) at com.qubit.hive.tap.HiveSinkTap.deleteResource(HiveSinkTap.java:40) at cascading.flow.BaseFlow.deleteSinksIfNotUpdate(BaseFlow.java:971) at cascading.flow.BaseFlow.prepare(BaseFlow.java:733) at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:761) at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:710) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619)

    Read the article

  • Oracle Big Data Software Downloads

    - by Mike.Hallett(at)Oracle-BI&EPM
    Companies have been making business decisions for decades based on transactional data stored in relational databases. Beyond that critical data, is a potential treasure trove of less structured data: weblogs, social media, email, sensors, and photographs that can be mined for useful information. Oracle offers a broad integrated portfolio of products to help you acquire and organize these diverse data sources and analyze them alongside your existing data to find new insights and capitalize on hidden relationships. Oracle Big Data Connectors Downloads here, includes: Oracle SQL Connector for Hadoop Distributed File System Release 2.1.0 Oracle Loader for Hadoop Release 2.1.0 Oracle Data Integrator Companion 11g Oracle R Connector for Hadoop v 2.1 Oracle Big Data Documentation The Oracle Big Data solution offers an integrated portfolio of products to help you organize and analyze your diverse data sources alongside your existing data to find new insights and capitalize on hidden relationships. Oracle Big Data, Release 2.2.0 - E41604_01 zip (27.4 MB) Integrated Software and Big Data Connectors User's Guide HTML PDF Oracle Data Integrator (ODI) Application Adapter for Hadoop Apache Hadoop is designed to handle and process data that is typically from data sources that are non-relational and data volumes that are beyond what is handled by relational databases. Typical processing in Hadoop includes data validation and transformations that are programmed as MapReduce jobs. Designing and implementing a MapReduce job usually requires expert programming knowledge. However, when you use Oracle Data Integrator with the Application Adapter for Hadoop, you do not need to write MapReduce jobs. Oracle Data Integrator uses Hive and the Hive Query Language (HiveQL), a SQL-like language for implementing MapReduce jobs. Employing familiar and easy-to-use tools and pre-configured knowledge modules (KMs), the application adapter provides the following capabilities: Loading data into Hadoop from the local file system and HDFS Performing validation and transformation of data within Hadoop Loading processed data from Hadoop to an Oracle database for further processing and generating reports Oracle Database Loader for Hadoop Oracle Loader for Hadoop is an efficient and high-performance loader for fast movement of data from a Hadoop cluster into a table in an Oracle database. It pre-partitions the data if necessary and transforms it into a database-ready format. Oracle Loader for Hadoop is a Java MapReduce application that balances the data across reducers to help maximize performance. Oracle R Connector for Hadoop Oracle R Connector for Hadoop is a collection of R packages that provide: Interfaces to work with Hive tables, the Apache Hadoop compute infrastructure, the local R environment, and Oracle database tables Predictive analytic techniques, written in R or Java as Hadoop MapReduce jobs, that can be applied to data in HDFS files You install and load this package as you would any other R package. Using simple R functions, you can perform tasks such as: Access and transform HDFS data using a Hive-enabled transparency layer Use the R language for writing mappers and reducers Copy data between R memory, the local file system, HDFS, Hive, and Oracle databases Schedule R programs to execute as Hadoop MapReduce jobs and return the results to any of those locations Oracle SQL Connector for Hadoop Distributed File System Using Oracle SQL Connector for HDFS, you can use an Oracle Database to access and analyze data residing in Hadoop in these formats: Data Pump files in HDFS Delimited text files in HDFS Hive tables For other file formats, such as JSON files, you can stage the input in Hive tables before using Oracle SQL Connector for HDFS. Oracle SQL Connector for HDFS uses external tables to provide Oracle Database with read access to Hive tables, and to delimited text files and Data Pump files in HDFS. Related Documentation Cloudera's Distribution Including Apache Hadoop Library HTML Oracle R Enterprise HTML Oracle NoSQL Database HTML Recent Blog Posts Big Data Appliance vs. DIY Price Comparison Big Data: Architecture Overview Big Data: Achieve the Impossible in Real-Time Big Data: Vertical Behavioral Analytics Big Data: In-Memory MapReduce Flume and Hive for Log Analytics Building Workflows in Oozie

    Read the article

  • Hadoop in a RESTful Java Web Application - Conflicting URI templates

    - by user1231583
    I have a small Java Web Application in which I am using Jersey 1.12 and the Hadoop 1.0.0 JAR file (hadoop-core-1.0.0.jar). When I deploy my application to my JBoss 5.0 server, the log file records the following error: SEVERE: Conflicting URI templates. The URI template / for root resource class org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods and the URI template / transform to the same regular expression (/.*)? To make sure my code is not the problem, I have created a fresh web application that contains nothing but the Jersey and Hadoop JAR files along with a small stub. My web.xml is as follows: <?xml version="1.0" encoding="UTF-8"?> <web-app version="2.5" xmlns="http://java.sun.com/xml/ns/javaee" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd"> <servlet> <servlet-name>ServletAdaptor</servlet-name> <servlet-class>com.sun.jersey.spi.container.servlet.ServletContainer</servlet- class> <load-on-startup>1</load-on-startup> </servlet> <servlet-mapping> <servlet-name>ServletAdaptor</servlet-name> <url-pattern>/mytest/*</url-pattern> </servlet-mapping> <session-config> <session-timeout> 30 </session-timeout> </session-config> <welcome-file-list> <welcome-file>index.jsp</welcome-file> </welcome-file-list> </web-app> My simple RESTful stub is as follows: import javax.ws.rs.core.Context; import javax.ws.rs.core.UriInfo; import javax.ws.rs.Path; @Path("/mytest") public class MyRest { @Context private UriInfo context; public MyRest() { } } In my regular application, when I remove the Hadoop JAR files (and the code that is using Hadoop), everything works as I would expect. The deployment is successful and the remaining RESTful services work. I have also tried the Hadoop 1.0.1 JAR files and have had the same problems with the conflicting URL template in the NamenodeWebHdfsMethods class. Any suggestions or tips in solving this problem would be greatly appreciated.

    Read the article

  • Reducer getting fewer records than expected

    - by sathishs
    We have a scenario of generating unique key for every single row in a file. we have a timestamp column but the are multiple rows available for a same timestamp in few scenarios. We decided unique values to be timestamp appended with their respective count as mentioned in the below program. Mapper will just emit the timestamp as key and the entire row as its value, and in reducer the key is generated. Problem is Map outputs about 236 rows, of which only 230 records are fed as an input for reducer which outputs the same 230 records. public class UniqueKeyGenerator extends Configured implements Tool { private static final String SEPERATOR = "\t"; private static final int TIME_INDEX = 10; private static final String COUNT_FORMAT_DIGITS = "%010d"; public static class Map extends Mapper<LongWritable, Text, Text, Text> { @Override protected void map(LongWritable key, Text row, Context context) throws IOException, InterruptedException { String input = row.toString(); String[] vals = input.split(SEPERATOR); if (vals != null && vals.length >= TIME_INDEX) { context.write(new Text(vals[TIME_INDEX - 1]), row); } } } public static class Reduce extends Reducer<Text, Text, NullWritable, Text> { @Override protected void reduce(Text eventTimeKey, Iterable<Text> timeGroupedRows, Context context) throws IOException, InterruptedException { int cnt = 1; final String eventTime = eventTimeKey.toString(); for (Text val : timeGroupedRows) { final String res = SEPERATOR.concat(getDate( Long.valueOf(eventTime)).concat( String.format(COUNT_FORMAT_DIGITS, cnt))); val.append(res.getBytes(), 0, res.length()); cnt++; context.write(NullWritable.get(), val); } } } public static String getDate(long time) { SimpleDateFormat utcSdf = new SimpleDateFormat("yyyyMMddhhmmss"); utcSdf.setTimeZone(TimeZone.getTimeZone("America/Los_Angeles")); return utcSdf.format(new Date(time)); } public int run(String[] args) throws Exception { conf(args); return 0; } public static void main(String[] args) throws Exception { conf(args); } private static void conf(String[] args) throws IOException, InterruptedException, ClassNotFoundException { Configuration conf = new Configuration(); Job job = new Job(conf, "uniquekeygen"); job.setJarByClass(UniqueKeyGenerator.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(Text.class); job.setMapperClass(Map.class); job.setReducerClass(Reduce.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); // job.setNumReduceTasks(400); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.waitForCompletion(true); } } It is consistent for higher no of lines and the difference is as huge as 208969 records for an input of 20855982 lines. what might be the reason for reduced inputs to reducer?

    Read the article

  • Error when running a basic Hadoop code

    - by Abhishek Shivkumar
    I am running a hadoop code that has a partitioner class inside the job. But, when I run the command hadoop jar Sort.jar SecondarySort inputdir outputdir I am getting a runtime error that says class KeyPartitioner not org.apache.hadoop.mapred.Partitioner. I have ensured that the KeyPartitioner class has extended the Partitioner class, but why am I getting this error? Here is the driver code: JobConf conf = new JobConf(getConf(), SecondarySort.class); conf.setJobName(SecondarySort.class.getName()); conf.setJarByClass(SecondarySort.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(TextOutputFormat.class); conf.setMapOutputKeyClass(StockKey.class); conf.setMapOutputValueClass(Text.class); conf.setPartitionerClass((Class<? extends Partitioner<StockKey, DoubleWritable>>) KeyPartitioner.class); conf.setMapperClass((Class<? extends Mapper<LongWritable, Text, StockKey, DoubleWritable>>) StockMapper.class); conf.setReducerClass((Class<? extends Reducer<StockKey, DoubleWritable, Text, Text>>) StockReducer.class); and here is the code of the partitioner class: public class KeyPartitioner extends Partitioner<StockKey, Text> { @Override public int getPartition(StockKey arg0, Text arg1, int arg2) { int partition = arg0.name.hashCode() % arg2; return partition; } }

    Read the article

  • What is Best storage servers infrastructure ? DAS/NAS/SAN or installing GlusterFS/LUSTER/HDFS/RBDB

    - by TORr0t
    I am trying to design an infrastucture for the project I am working on. It would be somehow a file-sharing/downloading project (like rapidshare) and I would need high storage sizes and good scability, and I would add new storage nodes after my project grows up. I have come up with 3 solutions for my project which are using Luster, GlusterFS, HDFS, RDBD. For start, i would have 2 servers, one server is for glusterfs client + webserver + db server+ a streaming server, and the other server is gluster storage node. (After sometime, i would be adding more node servers, and client servers (dont know how many new client new servers to add, will see later) So, i am thinking to work with glusterfs. But i really wonder that if i have to use high performance servers with high sotrage sizes or avarage/slow servers with high storage sizes? Or nas/das/san solutions are better for glusterfs storage nodes? I might buy a nas and install glusterfs onto it. I would be happy to listen to your recommendations for the server properties (for each clients and nodes) . I really dont know if I really need high amount of ram and good cpus to for the nodes. I am sure i need it for client servers. The files would be streamed as well, so the Automatic file replication is important, thus, my system should work like a cloud, when needed, according to high traffic, the storage nodes should copy the most demanded file to be streamed and would help me to get rid of scability problems and my visitors would able to stream/download those files. Also, i am open to your experiences/thoughts about any good solution. Luster, hdfs, rbdb are the other options and i would be happy to listen to your thoughts here. I would be very very happy to hear back from anyone commented of any words I have used here. Thanks

    Read the article

1 2 3 4  | Next Page >