Search Results

Search found 196 results on 8 pages for 'mapreduce'.

Page 7/8 | < Previous Page | 3 4 5 6 7 8  | Next Page >

  • Building Simple Workflows in Oozie

    - by dan.mcclary
    Introduction More often than not, data doesn't come packaged exactly as we'd like it for analysis. Transformation, match-merge operations, and a host of data munging tasks are usually needed before we can extract insights from our Big Data sources. Few people find data munging exciting, but it has to be done. Once we've suffered that boredom, we should take steps to automate the process. We want codify our work into repeatable units and create workflows which we can leverage over and over again without having to write new code. In this article, we'll look at how to use Oozie to create a workflow for the parallel machine learning task I described on Cloudera's site. Hive Actions: Prepping for Pig In my parallel machine learning article, I use data from the National Climatic Data Center to build weather models on a state-by-state basis. NCDC makes the data freely available as gzipped files of day-over-day observations stretching from the 1930s to today. In reading that post, one might get the impression that the data came in a handy, ready-to-model files with convenient delimiters. The truth of it is that I need to perform some parsing and projection on the dataset before it can be modeled. If I get more observations, I'll want to retrain and test those models, which will require more parsing and projection. This is a good opportunity to start building up a workflow with Oozie. I store the data from the NCDC in HDFS and create an external Hive table partitioned by year. This gives me flexibility of Hive's query language when I want it, but let's me put the dataset in a directory of my choosing in case I want to treat the same data with Pig or MapReduce code. CREATE EXTERNAL TABLE IF NOT EXISTS historic_weather(column 1, column2) PARTITIONED BY (yr string) STORED AS ... LOCATION '/user/oracle/weather/historic'; As new weather data comes in from NCDC, I'll need to add partitions to my table. That's an action I should put in the workflow. Similarly, the weather data requires parsing in order to be useful as a set of columns. Because of their long history, the weather data is broken up into fields of specific byte lengths: x bytes for the station ID, y bytes for the dew point, and so on. The delimiting is consistent from year to year, so writing SerDe or a parser for transformation is simple. Once that's done, I want to select columns on which to train, classify certain features, and place the training data in an HDFS directory for my Pig script to access. ALTER TABLE historic_weather ADD IF NOT EXISTS PARTITION (yr='2010') LOCATION '/user/oracle/weather/historic/yr=2011'; INSERT OVERWRITE DIRECTORY '/user/oracle/weather/cleaned_history' SELECT w.stn, w.wban, w.weather_year, w.weather_month, w.weather_day, w.temp, w.dewp, w.weather FROM ( FROM historic_weather SELECT TRANSFORM(...) USING '/path/to/hive/filters/ncdc_parser.py' as stn, wban, weather_year, weather_month, weather_day, temp, dewp, weather ) w; Since I'm going to prepare training directories with at least the same frequency that I add partitions, I should also add that to my workflow. Oozie is going to invoke these Hive actions using what's somewhat obviously referred to as a Hive action. Hive actions amount to Oozie running a script file containing our query language statements, so we can place them in a file called weather_train.hql. Starting Our Workflow Oozie offers two types of jobs: workflows and coordinator jobs. Workflows are straightforward: they define a set of actions to perform as a sequence or directed acyclic graph. Coordinator jobs can take all the same actions of Workflow jobs, but they can be automatically started either periodically or when new data arrives in a specified location. To keep things simple we'll make a workflow job; coordinator jobs simply require another XML file for scheduling. The bare minimum for workflow XML defines a name, a starting point, and an end point: <workflow-app name="WeatherMan" xmlns="uri:oozie:workflow:0.1"> <start to="ParseNCDCData"/> <end name="end"/> </workflow-app> To this we need to add an action, and within that we'll specify the hive parameters Also, keep in mind that actions require <ok> and <error> tags to direct the next action on success or failure. <action name="ParseNCDCData"> <hive xmlns="uri:oozie:hive-action:0.2"> <job-tracker>localhost:8021</job-tracker> <name-node>localhost:8020</name-node> <configuration> <property> <name>oozie.hive.defaults</name> <value>/user/oracle/weather_ooze/hive-default.xml</value> </property> </configuration> <script>ncdc_parse.hql</script> </hive> <ok to="WeatherMan"/> <error to="end"/> </action> There are a couple of things to note here: I have to give the FQDN (or IP) and port of my JobTracker and NameNode. I have to include a hive-default.xml file. I have to include a script file. The hive-default.xml and script file must be stored in HDFS That last point is particularly important. Oozie doesn't make assumptions about where a given workflow is being run. You might submit workflows against different clusters, or have different hive-defaults.xml on different clusters (e.g. MySQL or Postgres-backed metastores). A quick way to ensure that all the assets end up in the right place in HDFS is just to make a working directory locally, build your workflow.xml in it, and copy the assets you'll need to it as you add actions to workflow.xml. At this point, our local directory should contain: workflow.xml hive-defaults.xml (make sure this file contains your metastore connection data) ncdc_parse.hql Adding Pig to the Ooze Adding our Pig script as an action is slightly simpler from an XML standpoint. All we do is add an action to workflow.xml as follows: <action name="WeatherMan"> <pig> <job-tracker>localhost:8021</job-tracker> <name-node>localhost:8020</name-node> <script>weather_train.pig</script> </pig> <ok to="end"/> <error to="end"/> </action> Once we've done this, we'll copy weather_train.pig to our working directory. However, there's a bit of a "gotcha" here. My pig script registers the Weka Jar and a chunk of jython. If those aren't also in HDFS, our action will fail from the outset -- but where do we put them? The Jython script goes into the working directory at the same level as the pig script, because pig attempts to load Jython files in the directory from which the script executes. However, that's not where our Weka jar goes. While Oozie doesn't assume much, it does make an assumption about the Pig classpath. Anything under working_directory/lib gets automatically added to the Pig classpath and no longer requires a REGISTER statement in the script. Anything that uses a REGISTER statement cannot be in the working_directory/lib directory. Instead, it needs to be in a different HDFS directory and attached to the pig action with an <archive> tag. Yes, that's as confusing as you think it is. You can get the exact rules for adding Jars to the distributed cache from Oozie's Pig Cookbook. Making the Workflow Work We've got a workflow defined and have collected all the components we'll need to run. But we can't run anything yet, because we still have to define some properties about the job and submit it to Oozie. We need to start with the job properties, as this is essentially the "request" we'll submit to the Oozie server. In the same working directory, we'll make a file called job.properties as follows: nameNode=hdfs://localhost:8020 jobTracker=localhost:8021 queueName=default weatherRoot=weather_ooze mapreduce.jobtracker.kerberos.principal=foo dfs.namenode.kerberos.principal=foo oozie.libpath=${nameNode}/user/oozie/share/lib oozie.wf.application.path=${nameNode}/user/${user.name}/${weatherRoot} outputDir=weather-ooze While some of the pieces of the properties file are familiar (e.g., JobTracker address), others take a bit of explaining. The first is weatherRoot: this is essentially an environment variable for the script (as are jobTracker and queueName). We're simply using them to simplify the directives for the Oozie job. The oozie.libpath pieces is extremely important. This is a directory in HDFS which holds Oozie's shared libraries: a collection of Jars necessary for invoking Hive, Pig, and other actions. It's a good idea to make sure this has been installed and copied up to HDFS. The last two lines are straightforward: run the application defined by workflow.xml at the application path listed and write the output to the output directory. We're finally ready to submit our job! After all that work we only need to do a few more things: Validate our workflow.xml Copy our working directory to HDFS Submit our job to the Oozie server Run our workflow Let's do them in order. First validate the workflow: oozie validate workflow.xml Next, copy the working directory up to HDFS: hadoop fs -put working_dir /user/oracle/working_dir Now we submit the job to the Oozie server. We need to ensure that we've got the correct URL for the Oozie server, and we need to specify our job.properties file as an argument. oozie job -oozie http://url.to.oozie.server:port_number/ -config /path/to/working_dir/job.properties -submit We've submitted the job, but we don't see any activity on the JobTracker? All I got was this funny bit of output: 14-20120525161321-oozie-oracle This is because submitting a job to Oozie creates an entry for the job and places it in PREP status. What we got back, in essence, is a ticket for our workflow to ride the Oozie train. We're responsible for redeeming our ticket and running the job. oozie -oozie http://url.to.oozie.server:port_number/ -start 14-20120525161321-oozie-oracle Of course, if we really want to run the job from the outset, we can change the "-submit" argument above to "-run." This will prep and run the workflow immediately. Takeaway So, there you have it: the somewhat laborious process of building an Oozie workflow. It's a bit tedious the first time out, but it does present a pair of real benefits to those of us who spend a great deal of time data munging. First, when new data arrives that requires the same processing, we already have the workflow defined and ready to run. Second, as we build up a set of useful action definitions over time, creating new workflows becomes quicker and quicker.

    Read the article

  • New Big Data Appliance Security Features

    - by mgubar
    The Oracle Big Data Appliance (BDA) is an engineered system for big data processing.  It greatly simplifies the deployment of an optimized Hadoop Cluster – whether that cluster is used for batch or real-time processing.  The vast majority of BDA customers are integrating the appliance with their Oracle Databases and they have certain expectations – especially around security.  Oracle Database customers have benefited from a rich set of security features:  encryption, redaction, data masking, database firewall, label based access control – and much, much more.  They want similar capabilities with their Hadoop cluster.    Unfortunately, Hadoop wasn’t developed with security in mind.  By default, a Hadoop cluster is insecure – the antithesis of an Oracle Database.  Some critical security features have been implemented – but even those capabilities are arduous to setup and configure.  Oracle believes that a key element of an optimized appliance is that its data should be secure.  Therefore, by default the BDA delivers the “AAA of security”: authentication, authorization and auditing. Security Starts at Authentication A successful security strategy is predicated on strong authentication – for both users and software services.  Consider the default configuration for a newly installed Oracle Database; it’s been a long time since you had a legitimate chance at accessing the database using the credentials “system/manager” or “scott/tiger”.  The default Oracle Database policy is to lock accounts thereby restricting access; administrators must consciously grant access to users. Default Authentication in Hadoop By default, a Hadoop cluster fails the authentication test. For example, it is easy for a malicious user to masquerade as any other user on the system.  Consider the following scenario that illustrates how a user can access any data on a Hadoop cluster by masquerading as a more privileged user.  In our scenario, the Hadoop cluster contains sensitive salary information in the file /user/hrdata/salaries.txt.  When logged in as the hr user, you can see the following files.  Notice, we’re using the Hadoop command line utilities for accessing the data: $ hadoop fs -ls /user/hrdataFound 1 items-rw-r--r--   1 oracle supergroup         70 2013-10-31 10:38 /user/hrdata/salaries.txt$ hadoop fs -cat /user/hrdata/salaries.txtTom Brady,11000000Tom Hanks,5000000Bob Smith,250000Oprah,300000000 User DrEvil has access to the cluster – and can see that there is an interesting folder called “hrdata”.  $ hadoop fs -ls /user Found 1 items drwx------   - hr supergroup          0 2013-10-31 10:38 /user/hrdata However, DrEvil cannot view the contents of the folder due to lack of access privileges: $ hadoop fs -ls /user/hrdata ls: Permission denied: user=drevil, access=READ_EXECUTE, inode="/user/hrdata":oracle:supergroup:drwx------ Accessing this data will not be a problem for DrEvil. He knows that the hr user owns the data by looking at the folder’s ACLs. To overcome this challenge, he will simply masquerade as the hr user. On his local machine, he adds the hr user, assigns that user a password, and then accesses the data on the Hadoop cluster: $ sudo useradd hr $ sudo passwd $ su hr $ hadoop fs -cat /user/hrdata/salaries.txt Tom Brady,11000000 Tom Hanks,5000000 Bob Smith,250000 Oprah,300000000 Hadoop has not authenticated the user; it trusts that the identity that has been presented is indeed the hr user. Therefore, sensitive data has been easily compromised. Clearly, the default security policy is inappropriate and dangerous to many organizations storing critical data in HDFS. Big Data Appliance Provides Secure Authentication The BDA provides secure authentication to the Hadoop cluster by default – preventing the type of masquerading described above. It accomplishes this thru Kerberos integration. Figure 1: Kerberos Integration The Key Distribution Center (KDC) is a server that has two components: an authentication server and a ticket granting service. The authentication server validates the identity of the user and service. Once authenticated, a client must request a ticket from the ticket granting service – allowing it to access the BDA’s NameNode, JobTracker, etc. At installation, you simply point the BDA to an external KDC or automatically install a highly available KDC on the BDA itself. Kerberos will then provide strong authentication for not just the end user – but also for important Hadoop services running on the appliance. You can now guarantee that users are who they claim to be – and rogue services (like fake data nodes) are not added to the system. It is common for organizations to want to leverage existing LDAP servers for common user and group management. Kerberos integrates with LDAP servers – allowing the principals and encryption keys to be stored in the common repository. This simplifies the deployment and administration of the secure environment. Authorize Access to Sensitive Data Kerberos-based authentication ensures secure access to the system and the establishment of a trusted identity – a prerequisite for any authorization scheme. Once this identity is established, you need to authorize access to the data. HDFS will authorize access to files using ACLs with the authorization specification applied using classic Linux-style commands like chmod and chown (e.g. hadoop fs -chown oracle:oracle /user/hrdata changes the ownership of the /user/hrdata folder to oracle). Authorization is applied at the user or group level – utilizing group membership found in the Linux environment (i.e. /etc/group) or in the LDAP server. For SQL-based data stores – like Hive and Impala – finer grained access control is required. Access to databases, tables, columns, etc. must be controlled. And, you want to leverage roles to facilitate administration. Apache Sentry is a new project that delivers fine grained access control; both Cloudera and Oracle are the project’s founding members. Sentry satisfies the following three authorization requirements: Secure Authorization:  the ability to control access to data and/or privileges on data for authenticated users. Fine-Grained Authorization:  the ability to give users access to a subset of the data (e.g. column) in a database Role-Based Authorization:  the ability to create/apply template-based privileges based on functional roles. With Sentry, “all”, “select” or “insert” privileges are granted to an object. The descendants of that object automatically inherit that privilege. A collection of privileges across many objects may be aggregated into a role – and users/groups are then assigned that role. This leads to simplified administration of security across the system. Figure 2: Object Hierarchy – granting a privilege on the database object will be inherited by its tables and views. Sentry is currently used by both Hive and Impala – but it is a framework that other data sources can leverage when offering fine-grained authorization. For example, one can expect Sentry to deliver authorization capabilities to Cloudera Search in the near future. Audit Hadoop Cluster Activity Auditing is a critical component to a secure system and is oftentimes required for SOX, PCI and other regulations. The BDA integrates with Oracle Audit Vault and Database Firewall – tracking different types of activity taking place on the cluster: Figure 3: Monitored Hadoop services. At the lowest level, every operation that accesses data in HDFS is captured. The HDFS audit log identifies the user who accessed the file, the time that file was accessed, the type of access (read, write, delete, list, etc.) and whether or not that file access was successful. The other auditing features include: MapReduce:  correlate the MapReduce job that accessed the file Oozie:  describes who ran what as part of a workflow Hive:  captures changes were made to the Hive metadata The audit data is captured in the Audit Vault Server – which integrates audit activity from a variety of sources, adding databases (Oracle, DB2, SQL Server) and operating systems to activity from the BDA. Figure 4: Consolidated audit data across the enterprise.  Once the data is in the Audit Vault server, you can leverage a rich set of prebuilt and custom reports to monitor all the activity in the enterprise. In addition, alerts may be defined to trigger violations of audit policies. Conclusion Security cannot be considered an afterthought in big data deployments. Across most organizations, Hadoop is managing sensitive data that must be protected; it is not simply crunching publicly available information used for search applications. The BDA provides a strong security foundation – ensuring users are only allowed to view authorized data and that data access is audited in a consolidated framework.

    Read the article

  • Functional Methods on Collections

    - by GlenPeterson
    I'm learning Scala and am a little bewildered by all the methods (higher-order functions) available on the collections. Which ones produce more results than the original collection, which ones produce less, and which are most appropriate for a given problem? Though I'm studying Scala, I think this would pertain to most modern functional languages (Clojure, Haskell) and also to Java 8 which introduces these methods on Java collections. Specifically, right now I'm wondering about map with filter vs. fold/reduce. I was delighted that using foldRight() can yield the same result as a map(...).filter(...) with only one traversal of the underlying collection. But a friend pointed out that foldRight() may force sequential processing while map() is friendlier to being processed by multiple processors in parallel. Maybe this is why mapReduce() is so popular? More generally, I'm still sometimes surprised when I chain several of these methods together to get back a List(List()) or to pass a List(List()) and get back just a List(). For instance, when would I use: collection.map(a => a.map(b => ...)) vs. collection.map(a => ...).map(b => ...) The for/yield command does nothing to help this confusion. Am I asking about the difference between a "fold" and "unfold" operation? Am I trying to jam too many questions into one? I think there may be an underlying concept that, if I understood it, might answer all these questions, or at least tie the answers together.

    Read the article

  • C++ Numerical Recipes &ndash; A New Adventure!

    - by JoshReuben
    I am about to embark on a great journey – over the next 6 weeks I plan to read through C++ Numerical Recipes 3rd edition http://amzn.to/YtdpkS I'll be reading this with an eye to C++ AMP, thinking about implementing the suitable subset (non-recursive, additive, commutative) to run on the GPU. APIs supporting HPC, GPGPU or MapReduce are all useful – providing you have the ability to choose the correct algorithm to leverage on them. I really think this is the most fascinating area of programming – a lot more exciting than LOB CRUD !!! When you think about it , everything is a function – we categorize & we extrapolate. As abstractions get higher & less leaky, sooner or later information systems programming will become a non-programmer task – you will be using WYSIWYG designers to build: GUIs MVVM service mapping & virtualization workflows ORM Entity relations In the data source SharePoint / LightSwitch are not there yet, but every iteration gets closer. For information workers, managed code is a race to the bottom. As MS futures are a bit shaky right now, the provider agnostic nature & higher barriers of entry of both C++ & Numerical Analysis seem like a rational choice to me. Its also fascinating – stepping outside the box. This is not the first time I've delved into numerical analysis. 6 months ago I read Numerical methods with Applications, which can be found for free online: http://nm.mathforcollege.com/ 2 years ago I learned the .NET Extreme Optimization library www.extremeoptimization.com – not bad 2.5 years ago I read Schaums Numerical Analysis book http://amzn.to/V5yuLI - not an easy read, as topics jump back & forth across chapters: 3 years ago I read Practical Numerical Methods with C# http://amzn.to/V5yCL9 (which is a toy learning language for this kind of stuff) I also read through AI a Modern Approach 3rd edition END to END http://amzn.to/V5yQSp - this took me a few years but was the most rewarding experience. I'll post progress updates – see you on the other side !

    Read the article

  • Should CV contain cases where certain party fools me and not pay salary? Is it "holiday" time or not?

    - by otto
    Suppose I am fooled to work in a start-up or a company that has been a very small over a long time, let say 10 years. I work them 3 months and I really enjoy the work -- I learn a lot of new skills such as Haskell, MapReduce, CouchDB and many other little things. Now the firm did not pay any salary: A) I may be unskilled, B) I did not meet some deadline (I don't know because I am not allowed to speak to the boss but I know that I am not getting any payment) or C) I was fooled. Some detail about C I heard that the firm have had similar cases from my friend, "The guy X was there and he said he does not trust the firm at all so he went to other firm". I don't know what the term "trust" mean here, anyway the firm consists of ignorant drop-outs that hires academic people, a bit irony. They hire people from student-organizations and let them work and promise ok -compensation but -- when you start working the co-employer starts all kind of instructions "Do not work so hard, do not work so long, do not work so much" -- it is like he is making sure you do not feel sad when he does not pay any salary (co-employer is an owner in the firm). Anyway, I learnt a ton in the company but it was very inefficient working. I worked only alone, not really working in a "company". Now should by resume contain references to the firm and the guy who did not pay me anything? Or should my resume read that I worked in XYZ -technologies -- but 1 year's NDA -- what can write here? Now I fear that if I put the firm to my resume: they will lie about my input to my next employer. I feel they are very dishonest. On the other hand, I want to make it sure that I have worked over the time. So: Should my CV contain the not-so-good or even awful employers that may be fooling people to work there? I am pretty sure everyone knows the firm and its habbits, circles are small but people are afraid to speak.

    Read the article

  • Welcome!

    - by mannamal
    Welcome to the Oracle Big Data Connectors blog, which will focus on posts related to integrating data on a Hadoop cluster with Oracle Database. In particular the blog will focus on best practices, usage notes, and performance tips for using Oracle Loader for Hadoop and Oracle Direct Connector for HDFS, which are part of Oracle Big Data Connectors. Oracle Big Data Connectors 1.0 also includes Oracle R Connector for Hadoop and Oracle Data Integrator Application Adapters for Hadoop. Oracle Loader for Hadoop: Oracle Loader for Hadoop loads data from Hadoop to Oracle Database. It runs as a MapReduce job on Hadoop to partition, sort, and convert the data into an Oracle-ready format, offloading to Hadoop the processing that is typically done using database CPUs. The data is thenloaded to the database by the Oracle Loader for Hadoop job (online load) or written out as Oracle Data Pump files for load and access later (offline load) with Oracle Direct Connector for HDFS. Oracle Direct Connector for HDFS: Oracle Direct Connector for HDFS is a connector for high speed access of data on HDFS from Oracle Database. With this connector Oracle SQL can be used to directly query data on HDFS. The data can be Oracle Data Pump files generated by Oracle Loader for Hadoop or delimited text files. The connector can also be used to load data into the database using SQL.

    Read the article

  • ???: Oracle NoSQL Database??

    - by zhangqm
    ?????????Oracle?????Oracle NoSQL Database,?????NoSQL Database ??????????Oracle NoSQL Database??2???,Community Edition ?Enterprise Edition?????????NoSQL Database 11g R2 (11gR2.1.2.123). ?????????????????: Oracle NoSQL Database OTN portal (includes download facility) Oracle NoSQL Database OTN documentation Oracle NoSQL Database license information ??Oracle NoSQL Database ???????????,????,?????(key-value)???TB????,????????????(???)????,??????????????????????????,????,??????????? ?Oracle NoSQL Database?,???????????key-value???,??key???????:??????????key?(?????string),????????(??????????bytes)??????key-value ??primary key?hash?,????????????????????????????????,???????,????????????????????????? ???????????????????Java API??????Oracle NoSQL Database driver ????????,?????key-value????????????????Oracle NoSQL Database ?????Create, Read, Update and Delete (CRUD)??,???????durability??????????????????????:?????web console???command line??? Oracle Berkeley DB Java Edition Oracle NoSQL Database?? Oracle Berkeley DB Java Edition ????????,??????????????????????????????,?????????????????? ????????????Oracle NoSQL Database Driver?????key-value????????????Oracle NoSQL Database Driver??:?????????hash??????????????????,?????????????????????? ????????Oracle NoSQL Database Oracle NoSQL Database????????????????????????????????????????????????????????????????: ???? ???? ???? ?????? ???? ?????? ????,??,?? ???? ???? ??? (sub-millisecond) ???????? ????? ??????? ????????  ?????Oracle?????? ???? (Oracle Big Data Appliance) ???? ?????????????????????????????????,???“??”???????????,Oracle NoSQL Database???????????Oracle NoSQL Database?????(Cloud)??,????????(TB?PB??)???Oracle NoSQL Database ??????ETL??(??MapReduce, Hadoop)??,??acquire-organize-analyze ?????????? ???????Oracle NoSQL Database?????: • Large schema-less data repositories• Web?? (click-through capture)• ????• ????• ?????????? • Sensor/statistics/network capture (?????, ?????)• ?????????• ???? (MMS, SMS, routing)• ???? Oracle NoSQL Database (Community Edition ??)??????????? Oracle Big Data Appliance???

    Read the article

  • facing problems while updating rows in hbase

    - by sammy
    Hello i've just started exploring hbase i've run samples : SampleUploader,PerformanceEvaluation and rowcount as given in hadoop wiki: http://wiki.apache.org/hadoop/Hbase/MapReduce The problem im facing is : table1 is my table with the column family column create 'table1','column' put 'table1','row1','column:address','SanFrancisco' hbase(main):020:0 scan 'table1' ROW COLUMN+CELL row1 column=column:address, timestamp=1276351974560, value=SanFrancisco put 'table1','row1','column:name','Hannah' hbase(main):020:0 scan 'table1' ROW COLUMN+CELL row1 column=column:address,timestamp=1276351974560,value=SanFrancisco row1 column=column:name, timestamp=1276351899573, value=Hannah i want both the columns to appear in the same row as a different version similary, if i change the name column to sarah, it shows the updated row.... but i want both the old row and the changed row to appear as 2 different versions so that i could make analysis on the data........ whatis the mistake im making???? thank u a lot sammy

    Read the article

  • Searching for cluster computation framework

    - by petkov_d
    I have a library, written in C#, containing one method: Response CalculateSomething(Request); The execution time of this method is relatively large, and there are a lot of responses that should be processed. I want to use a "cluster", spread this DLL to different machines (nodes) in this "cluster" and write some controller that will distribute responses to the nodes. There should be mechanism that perevent losing task because of node crush, load balancing. Can someone suggest framework that addresses this issue? P.S. There is a framework Qizmt written in C# but I think MapReduce is not good for the above scenario

    Read the article

  • 'e-Commerce' scalable database model

    - by Ruben Trancoso
    I would like to understand database scalability so I've just heard a talk about Habits of Highly Scalable Web Applications http://techportal.ibuildings.com/2010/03/02/habits-of-highly-scalable-web-applications/ On it, the presenter mainly talk about relational database scalability. I also have read something about MapReduce and Column oriented tables, big tables, hypertable etc... trying to understand which are the most up to date methods to scale web application data. But the second group, to me, is being hard to understand where it fits. It serves as transactional, reliable data store? or not, its just for large access and processing and to handle fine graned operations we will ever need to rely on RDBMSs? Could someone give a comprehensive landscape for those new technologies and how to use it?

    Read the article

  • Pig: Count number of keys in a map

    - by Donald Miner
    I'd like to count the number of keys in a map in Pig. I could write a UDF to do this, but I was hoping there would be an easier way. data = LOAD 'hbase://MARS1' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage( 'A:*', '-loadKey true -caching=100000') AS (id:bytearray, A_map:map[]); In the code above, I want to basically build a histogram of id and how many items in column family A that key has. In hoping, I tried c = FOREACH data GENERATE id, COUNT(A_map); but that unsurprisingly didn't work. Or, perhaps someone can suggest a better way to do this entirely. If I can't figure this out soon I'll just write a Java MapReduce job or a Pig UDF.

    Read the article

  • Reverse massive text file in Java

    - by DanJanson
    What would be the best approach to reverse a large text file that is uploaded asynchronously to a servlet that reverses this file in a scalable and efficient way? text file can be massive (gigabytes long) can assume mulitple server/clustered environment to do this in a distributed manner. open source libraries are encouraged to consider I was thinking of using Java NIO to treat file as an array on disk (so that I don't have to treat the file as a string buffer in memory). Also, I am thinking of using MapReduce to break up the file and process it in separate machines. Any input is appreciated. Thanks. Daniel

    Read the article

  • How do I use Amazon's new RRS for S3?

    - by pbarney
    Reduced Redundancy Storage (RRS) is a new service from Amazon that is a bit cheaper than S3 because there is less redundancy. However, I can not find any information on how to specify that my data should use RRS rather than standard S3. In fact, there doesn't seem to be any website interface for an S3 services. If I log into AWS, there are only options for EC2, Elastic MapReduce, CloudFront and RDS, none of which I use. Any insight? Thanks.

    Read the article

  • I have an Errno 13 Permission denied with subprocess in python

    - by wDroter
    The line with the issue is ret=subprocess.call(shlex.split(cmd)) cmd = /usr/share/java -cp pig-hadoop-conf-Simpsons:lib/pig-0.8.1-cdh3u1-core.jar:lib/hadoop-core-0.20.2-cdh3u1.jar org.apache.pig.Main -param func=cat -param from =foo.txt -x mapreduce fsFunc.pig The error is. File "./run_pig.py", line 157, in process ret=subprocess.call(shlex.split(cmd)) File "/usr/lib/python2.7/subprocess.py", line 493, in call return Popen(*popenargs, **kwargs).wait() File "/usr/lib/python2.7/subprocess.py", line 679, in __init__ errread, errwrite) File "/usr/lib/python2.7/subprocess.py", line 1249, in _execute_child raise child_exception OSError: [Errno 13] Permission denied Let me know if any more info is needed. Any help is appreciated. Thanks.

    Read the article

  • Why are functional languages considered a boon for multi threaded environments?

    - by Billy ONeal
    I hear a lot about functional languages, and how they scale well because there is no state around a function; and therefore that function can be massively parallelized. However, this makes little sense to me because almost all real-world practical programs need/have state to take care of. I also find it interesting that most major scaling libraries, i.e. MapReduce, are typically written in imperative languages like C or C++. I'd like to hear from the functional camp where this hype I'm hearing is coming from....

    Read the article

  • Big Data – Interacting with Hadoop – What is PIG? – What is PIG Latin? – Day 16 of 21

    - by Pinal Dave
    In yesterday’s blog post we learned the importance of the HIVE in Big Data Story. In this article we will understand what is PIG and PIG Latin in Big Data Story. Yahoo started working on Pig for their application deployment on Hadoop. The goal of Yahoo to manage their unstructured data. What is Pig and What is Pig Latin? Pig is a high level platform for creating MapReduce programs used with Hadoop and the language we use for this platform is called PIG Latin. The pig was designed to make Hadoop more user-friendly and approachable by power-users and nondevelopers. PIG is an interactive execution environment supporting Pig Latin language. The language Pig Latin has supported loading and processing of input data with series of transforming to produce desired results. PIG has two different execution environments 1) Local Mode – In this case all the scripts run on a single machine. 2) Hadoop – In this case all the scripts run on Hadoop Cluster. Pig Latin vs SQL Pig essentially creates set of map and reduce jobs under the hoods. Due to same users does not have to now write, compile and build solution for Big Data. The pig is very similar to SQL in many ways. The Ping Latin language provide an abstraction layer over the data. It focuses on the data and not the structure under the hood. Pig Latin is a very powerful language and it can do various operations like loading and storing data, streaming data, filtering data as well various data operations related to strings. The major difference between SQL and Pig Latin is that PIG is procedural and SQL is declarative. In simpler words, Pig Latin is very similar to SQ Lexecution plan and that makes it much easier for programmers to build various processes. Whereas SQL handles trees naturally, Pig Latin follows directed acyclic graph (DAG). DAGs is used to model several different kinds of structures in mathematics and computer science. DAG Tomorrow In tomorrow’s blog post we will discuss about very important components of the Big Data Ecosystem – Zookeeper. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

    Read the article

  • Tackling Big Data Analytics with Oracle Data Integrator

    - by Irem Radzik
    v\:* {behavior:url(#default#VML);} o\:* {behavior:url(#default#VML);} w\:* {behavior:url(#default#VML);} .shape {behavior:url(#default#VML);} Normal 0 false false false EN-US X-NONE X-NONE MicrosoftInternetExplorer4 /* Style Definitions */ table.MsoNormalTable {mso-style-name:"Table Normal"; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-priority:99; mso-style-qformat:yes; mso-style-parent:""; mso-padding-alt:0in 5.4pt 0in 5.4pt; mso-para-margin:0in; mso-para-margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:10.0pt; font-family:"Times New Roman","serif"; mso-fareast-font-family:"Times New Roman";}  By Mike Eisterer  The term big data draws a lot of attention, but behind the hype there's a simple story. For decades, companies have been making business decisions based on transactional data stored in relational databases. Beyond that critical data, however, is a potential treasure trove of less structured data: weblogs, social media, email, sensors, and documents that can be mined for useful information.  Companies are facing emerging technologies, increasing data volumes, numerous data varieties and the processing power needed to efficiently analyze data which changes with high velocity. Oracle offers the broadest and most integrated portfolio of products to help you acquire and organize these diverse data sources and analyze them alongside your existing data to find new insights and capitalize on hidden relationships Oracle Data Integrator Enterprise Edition(ODI) is critical to any enterprise big data strategy. ODI and the Oracle Data Connectors provide native access to Hadoop, leveraging such technologies as MapReduce, HDFS and Hive. Alongside with ODI’s metadata driven approach for extracting, loading and transforming data; companies may now integrate their existing data with big data technologies and deliver timely and trusted data to their analytic and decision support platforms. In this session, you’ll learn about ODI and Oracle Big Data Connectors and how, coupled together, they provide the critical integration with multiple big data platforms. Tackling Big Data Analytics with Oracle Data Integrator October 1, 2012 12:15 PM at MOSCONE WEST – 3005 For other data integration sessions at OpenWorld, please check our Focus-On document.  If you are not able to attend OpenWorld, please check out our latest resources for Data Integration.

    Read the article

  • Java: How to manage UDP client-server state

    - by user92947
    I am trying to write a Java application that works similar to MapReduce. There is a server and several workers. Workers may come and go as they please and the membership to the group has a soft-state. To become a part of the group, the worker must send a UDP datagram to the server, but to continue to be part of the group, the worker must send the UDP datagram to the server every 5 minutes. In order to accommodate temporary errors, a worker is allowed to miss as many as two consecutive periodic UDP datagrams. So, the server must keep track of the current set of workers as well as the last time each worker had sent a UDP datagram. I've implemented a class called WorkerListener that implements Runnable and listens to UDP datagrams on a particular UDP port. Now, to keep track of active workers, this class may maintain a HashSet (or HashMap). When a datagram is received, the server may query the HashSet to check if it is a new member. If so, it can add the new worker to the group by adding an entry into the HashSet. If not, it must reset a "timer" for the worker, noting that it has just heard from the corresponding worker. I'm using the word timer in a generic sense. It doesn't have to be a clock of sorts. Perhaps this could also be implemented using int or long variables. Also, the server must run a thread that continuously monitors the timers for the workers to see that a client that times out on two consecutive datagram intervals, it is removed from the HashSet. I don't want to do this in the WorkerListener thread because it would be blocking on the UDP datagram receive() function. If I create a separate thread to monitor the worker HashSet, it would need to be a different class, perhaps WorkerRegistrar. I must share the HashSet with that thread. Mutual exclusion must also be implemented, then. My question is, what is the best way to do this? Pointers to some sample implementation would be great. I want to use the barebones JDK implementation, and not some fancy state maintenance API that takes care of everything, because I want this to be a useful demonstration for a class that I am teaching. Thanks

    Read the article

  • OTN Virtual Technology Summit - July 9 - Middleware Track

    - by OTN ArchBeat
    The Architecture of Analytics: Big Time Big Data and Business Intelligence This four-session track, part of the free OTN Virtual Technology Summit on July 9, will present a solution architect's perspective on how business intelligence products in Oracle's Fusion Middleware family and beyond fit into an effective big data architecture, offering insight and expertise from Oracle ACE Directors and product team experts specializing in business Intelligence to help you meet your big data business intelligence challenges. Register now! Sessions Oracle Big Data Appliance Case Study: Using Big Data to Analyze Cancer-Genome Relationships Tom Plunkett, Lead Author of the Oracle Big Data Handbook What does it take to build an award winning Big Data solution? This presentation takes a deep technical dive into the use of the Oracle Big Data Appliance in a project for the National Cancer Institute's Frederick National Laboratory for Cancer Research. The Frederick National Laboratory and the Oracle team won several awards for analyzing relationships between genomes and cancer subtypes with big data, including the 2012 Government Big Data Solutions Award, the 2013 Excellence.Gov Finalist for Innovation, and the 2013 ComputerWorld Honors Laureate for Innovation. [30 mins] Getting Value from Big Data Variety Richard Tomlinson, Director, Product Management, Oracle Big data variety implies big data complexity. Performing analytics on diverse data typically involves mashing up structured, semi-structured and unstructured content. So how can we do this effectively to get real value? How do we relate diverse content so we can start to analyze it? This session looks at how we approach this tricky problem using Endeca Information Discovery. [30 mins] How To Leverage Your Investment In Oracle Business Intelligence Enterprise Edition Within a Big Data Architecture Oracle ACE Director Kevin McGinley More and more organizations are realizing the value Big Data technologies contribute to the return on investment in Analytics. But as an increasing variety of data types reside in different data stores, organizations are finding that a unified Analytics layer can help bridge the divide in modern data architectures. This session will examine how you can enable Oracle Business Intelligence Enterprise Edition (OBIEE) to play a role in a unified Analytics layer and the benefits and use cases for doing so. [30 mins] Oracle Data Integrator 12c As Your Big Data Data Integration Hub Oracle ACE Director Mark Rittman Oracle Data Integrator 12c (ODI12c), as well as being able to integrate and transform data from application and database data sources, also has the ability to load, transform and orchestrate data loads to and from Big Data sources. In this session, we'll look at ODI12c's ability to load data from Hadoop, Hive, NoSQL and file sources, transform that data using Hive and MapReduce processing across the Hadoop cluster, and then bulk-load that data into an Oracle Data Warehouse using Oracle Big Data Connectors. We will also look at how ODI12c enables ETL-offloading to a Hadoop cluster, with some tips and techniques on real-time capture into a Hadoop data reservoir and techniques and limitations when performing ETL on big data sources. [90 mins] Register now!

    Read the article

  • Enterprise with eyes on NoSQL

    - by thegreeneman
    Since joining Oracle a few months back, I have had the fortune of being able to interact with a number of large enterprise organizations and discuss their current state of adoption for NoSQL database technology.   It is worth noting that a large percentage of these organizations do have some NoSQL use and have been steadily increasing their understanding of its applicability for certain data management workloads.   Thru those discussions I’ve learned that it seems one of the biggest issues confronting enterprise adoption of NoSQL databases is the lack of standards for access, administration and monitoring.    This was not so much of an issue with the early adopters of NoSQL technology because they employed a highly DevOps centric approach to application deployment leaving a select few highly qualified developers with the task of managing the production of the system that they designed and implemented. However, as NoSQL technology moves out of the startup and into the hands of larger corporate entities, developers with a broad skill set that are capable of both development and I.T. type production management are in short supply and quickly get moved on to do new projects, often moving to different roles within the company.  This difference in the way smaller more agile startups operate as compared to more established companies is revealing a gap in the NoSQL technology segment that needs to get addressed.    This is one of places that a company such as Oracle has a leg up in the NoSQL Database front.  A combination of having gone thru a past database maturization process,  combined with a vast set of corporate relationships that have grown hand in hand to solve these types of issues, Oracle is in a great place to lead the way in closing the requirements gap for NoSQL technology.  Oracle's understanding of the needs specific to mature organizations have already made their way into the Oracle’s NoSQL Database offering with features such as:  One click cluster deployment with visual topology planning,  standards based monitoring protocols such as SNMP, support for data access for reporting via standard SQL  and integration with emerging standards for data access such as MapReduce.  Given the exciting developments we’re driving in the Oracle NoSQL Database group, I will have a lot more to say about this topic as we move into the second half of the year.

    Read the article

  • hdfs configuration

    - by Ananymous
    I am a newbie. Trying to setup a hdfs system to serve my data (I don't plan to use mapreduce) at my lab. So far I have read, cluster setup in but I am still confused. Several questions: Do I need to have a secondary namenode? There are 2 files, masters and slaves. Do I really need these 2 files eventhough I just want hdfs? If I need them, what should go in there? I assume my namenode in masters and datanodes as slaves? Do I need slaves nodes What configuration files are needed for namenode, secondary namenode, datanode and client? (I assume core-site.xml is needed for all 4)? In addition, can someone suggest a good configuration model? sample configuration for namenode, secondary namenode, datanode, and the client would be very helpful. I am getting confused because it seems most of the documentation assumes I want to use map-reduce which isn't the case.

    Read the article

  • Big Data – Beginning Big Data – Day 1 of 21

    - by Pinal Dave
    What is Big Data? I want to learn Big Data. I have no clue where and how to start learning about it. Does Big Data really means data is big? What are the tools and software I need to know to learn Big Data? I often receive questions which I mentioned above. They are good questions and honestly when we search online, it is hard to find authoritative and authentic answers. I have been working with Big Data and NoSQL for a while and I have decided that I will attempt to discuss this subject over here in the blog. In the next 21 days we will understand what is so big about Big Data. Big Data – Big Thing! Big Data is becoming one of the most talked about technology trends nowadays. The real challenge with the big organization is to get maximum out of the data already available and predict what kind of data to collect in the future. How to take the existing data and make it meaningful that it provides us accurate insight in the past data is one of the key discussion points in many of the executive meetings in organizations. With the explosion of the data the challenge has gone to the next level and now a Big Data is becoming the reality in many organizations. Big Data – A Rubik’s Cube I like to compare big data with the Rubik’s cube. I believe they have many similarities. Just like a Rubik’s cube it has many different solutions. Let us visualize a Rubik’s cube solving challenge where there are many experts participating. If you take five Rubik’s cube and mix up the same way and give it to five different expert to solve it. It is quite possible that all the five people will solve the Rubik’s cube in fractions of the seconds but if you pay attention to the same closely, you will notice that even though the final outcome is the same, the route taken to solve the Rubik’s cube is not the same. Every expert will start at a different place and will try to resolve it with different methods. Some will solve one color first and others will solve another color first. Even though they follow the same kind of algorithm to solve the puzzle they will start and end at a different place and their moves will be different at many occasions. It is  nearly impossible to have a exact same route taken by two experts. Big Market and Multiple Solutions Big Data is exactly like a Rubik’s cube – even though the goal of every organization and expert is same to get maximum out of the data, the route and the starting point are different for each organization and expert. As organizations are evaluating and architecting big data solutions they are also learning the ways and opportunities which are related to Big Data. There is not a single solution to big data as well there is not a single vendor which can claim to know all about Big Data. Honestly, Big Data is too big a concept and there are many players – different architectures, different vendors and different technology. What is Next? In this 31 days series we will be exploring many essential topics related to big data. I do not claim that you will be master of the subject after 31 days but I claim that I will be covering following topics in easy to understand language. Architecture of Big Data Big Data a Management and Implementation Different Technologies – Hadoop, Mapreduce Real World Conversations Best Practices Tomorrow In tomorrow’s blog post we will try to answer one of the very essential questions – What is Big Data? Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

    Read the article

  • Big Data – Data Mining with Hive – What is Hive? – What is HiveQL (HQL)? – Day 15 of 21

    - by Pinal Dave
    In yesterday’s blog post we learned the importance of the operational database in Big Data Story. In this article we will understand what is Hive and HQL in Big Data Story. Yahoo started working on PIG (we will understand that in the next blog post) for their application deployment on Hadoop. The goal of Yahoo to manage their unstructured data. Similarly Facebook started deploying their warehouse solutions on Hadoop which has resulted in HIVE. The reason for going with HIVE is because the traditional warehousing solutions are getting very expensive. What is HIVE? Hive is a datawarehouseing infrastructure for Hadoop. The primary responsibility is to provide data summarization, query and analysis. It  supports analysis of large datasets stored in Hadoop’s HDFS as well as on the Amazon S3 filesystem. The best part of HIVE is that it supports SQL-Like access to structured data which is known as HiveQL (or HQL) as well as big data analysis with the help of MapReduce. Hive is not built to get a quick response to queries but it it is built for data mining applications. Data mining applications can take from several minutes to several hours to analysis the data and HIVE is primarily used there. HIVE Organization The data are organized in three different formats in HIVE. Tables: They are very similar to RDBMS tables and contains rows and tables. Hive is just layered over the Hadoop File System (HDFS), hence tables are directly mapped to directories of the filesystems. It also supports tables stored in other native file systems. Partitions: Hive tables can have more than one partition. They are mapped to subdirectories and file systems as well. Buckets: In Hive data may be divided into buckets. Buckets are stored as files in partition in the underlying file system. Hive also has metastore which stores all the metadata. It is a relational database containing various information related to Hive Schema (column types, owners, key-value data, statistics etc.). We can use MySQL database over here. What is HiveSQL (HQL)? Hive query language provides the basic SQL like operations. Here are few of the tasks which HQL can do easily. Create and manage tables and partitions Support various Relational, Arithmetic and Logical Operators Evaluate functions Download the contents of a table to a local directory or result of queries to HDFS directory Here is the example of the HQL Query: SELECT upper(name), salesprice FROM sales; SELECT category, count(1) FROM products GROUP BY category; When you look at the above query, you can see they are very similar to SQL like queries. Tomorrow In tomorrow’s blog post we will discuss about very important components of the Big Data Ecosystem – Pig. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

    Read the article

  • Big Data – Operational Databases Supporting Big Data – Key-Value Pair Databases and Document Databases – Day 13 of 21

    - by Pinal Dave
    In yesterday’s blog post we learned the importance of the Relational Database and NoSQL database in the Big Data Story. In this article we will understand the role of Key-Value Pair Databases and Document Databases Supporting Big Data Story. Now we will see a few of the examples of the operational databases. Relational Databases (Yesterday’s post) NoSQL Databases (Yesterday’s post) Key-Value Pair Databases (This post) Document Databases (This post) Columnar Databases (Tomorrow’s post) Graph Databases (Tomorrow’s post) Spatial Databases (Tomorrow’s post) Key Value Pair Databases Key Value Pair Databases are also known as KVP databases. A key is a field name and attribute, an identifier. The content of that field is its value, the data that is being identified and stored. They have a very simple implementation of NoSQL database concepts. They do not have schema hence they are very flexible as well as scalable. The disadvantages of Key Value Pair (KVP) database are that they do not follow ACID (Atomicity, Consistency, Isolation, Durability) properties. Additionally, it will require data architects to plan for data placement, replication as well as high availability. In KVP databases the data is stored as strings. Here is a simple example of how Key Value Database will look like: Key Value Name Pinal Dave Color Blue Twitter @pinaldave Name Nupur Dave Movie The Hero As the number of users grow in Key Value Pair databases it starts getting difficult to manage the entire database. As there is no specific schema or rules associated with the database, there are chances that database grows exponentially as well. It is very crucial to select the right Key Value Pair Database which offers an additional set of tools to manage the data and provides finer control over various business aspects of the same. Riak Rick is one of the most popular Key Value Database. It is known for its scalability and performance in high volume and velocity database. Additionally, it implements a mechanism for collection key and values which further helps to build manageable system. We will further discuss Riak in future blog posts. Key Value Databases are a good choice for social media, communities, caching layers for connecting other databases. In simpler words, whenever we required flexibility of the data storage keeping scalability in mind – KVP databases are good options to consider. Document Database There are two different kinds of document databases. 1) Full document Content (web pages, word docs etc) and 2) Storing Document Components for storage. The second types of the document database we are talking about over here. They use Javascript Object Notation (JSON) and Binary JSON for the structure of the documents. JSON is very easy to understand language and it is very easy to write for applications. There are two major structures of JSON used for Document Database – 1) Name Value Pairs and 2) Ordered List. MongoDB and CouchDB are two of the most popular Open Source NonRelational Document Database. MongoDB MongoDB databases are called collections. Each collection is build of documents and each document is composed of fields. MongoDB collections can be indexed for optimal performance. MongoDB ecosystem is highly available, supports query services as well as MapReduce. It is often used in high volume content management system. CouchDB CouchDB databases are composed of documents which consists fields and attachments (known as description). It supports ACID properties. The main attraction points of CouchDB are that it will continue to operate even though network connectivity is sketchy. Due to this nature CouchDB prefers local data storage. Document Database is a good choice of the database when users have to generate dynamic reports from elements which are changing very frequently. A good example of document usages is in real time analytics in social networking or content management system. Tomorrow In tomorrow’s blog post we will discuss about various other Operational Databases supporting Big Data. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

    Read the article

  • Big Data – Operational Databases Supporting Big Data – Columnar, Graph and Spatial Database – Day 14 of 21

    - by Pinal Dave
    In yesterday’s blog post we learned the importance of the Key-Value Pair Databases and Document Databases in the Big Data Story. In this article we will understand the role of Columnar, Graph and Spatial Database supporting Big Data Story. Now we will see a few of the examples of the operational databases. Relational Databases (The day before yesterday’s post) NoSQL Databases (The day before yesterday’s post) Key-Value Pair Databases (Yesterday’s post) Document Databases (Yesterday’s post) Columnar Databases (Tomorrow’s post) Graph Databases (Today’s post) Spatial Databases (Today’s post) Columnar Databases  Relational Database is a row store database or a row oriented database. Columnar databases are column oriented or column store databases. As we discussed earlier in Big Data we have different kinds of data and we need to store different kinds of data in the database. When we have columnar database it is very easy to do so as we can just add a new column to the columnar database. HBase is one of the most popular columnar databases. It uses Hadoop file system and MapReduce for its core data storage. However, remember this is not a good solution for every application. This is particularly good for the database where there is high volume incremental data is gathered and processed. Graph Databases For a highly interconnected data it is suitable to use Graph Database. This database has node relationship structure. Nodes and relationships contain a Key Value Pair where data is stored. The major advantage of this database is that it supports faster navigation among various relationships. For example, Facebook uses a graph database to list and demonstrate various relationships between users. Neo4J is one of the most popular open source graph database. One of the major dis-advantage of the Graph Database is that it is not possible to self-reference (self joins in the RDBMS terms) and there might be real world scenarios where this might be required and graph database does not support it. Spatial Databases  We all use Foursquare, Google+ as well Facebook Check-ins for location aware check-ins. All the location aware applications figure out the position of the phone with the help of Global Positioning System (GPS). Think about it, so many different users at different location in the world and checking-in all together. Additionally, the applications now feature reach and users are demanding more and more information from them, for example like movies, coffee shop or places see. They are all running with the help of Spatial Databases. Spatial data are standardize by the Open Geospatial Consortium known as OGC. Spatial data helps answering many interesting questions like “Distance between two locations, area of interesting places etc.” When we think of it, it is very clear that handing spatial data and returning meaningful result is one big task when there are millions of users moving dynamically from one place to another place & requesting various spatial information. PostGIS/OpenGIS suite is very popular spatial database. It runs as a layer implementation on the RDBMS PostgreSQL. This makes it totally unique as it offers best from both the worlds. Courtesy: mushroom network Tomorrow In tomorrow’s blog post we will discuss about very important components of the Big Data Ecosystem – Hive. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

    Read the article

< Previous Page | 3 4 5 6 7 8  | Next Page >