Search Results

Search found 66013 results on 2641 pages for 'big data analytics'.

Page 4/2641 | < Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12  | Next Page >

  • Podcast Show Notes: The Big Deal About Big Data

    - by Bob Rhubart
    This week the OTN ArchBeat kicks off a three-part series that looks at Big Data: what it is, its affect on enterprise IT, and what architects need to do to stay ahead of the big data curve. My guests for this conversation are Jean-Pierre Dijks and Andrew Bond . Jean-Pierre, based at Oracle HQ in Redwood Shores, CA, is product manager for Oracle Big Data Appliance and Oracle's big data strategy. Andrew Bond  is Head of Transformation Architecture for Oracle, where he specialzes in Data Warehousing, Business Intelligence, and Big Data. Andrew is based in the UK, but for this conversation he dialed in from a car somewhere on the streets of Amsterdam. Listen to Part 1What is Big Data, really, and why does it matter? Listen to Part 2 (Oct 10)What new challenges does Big Data present for Architects? What do architects need to do to prepare themselves and their environments? Listen to Part 3 (Oct 17)Who is driving the adoption of Big Data strategies in organizations, and why? Additional Resources http://blogs.oracle.com/datawarehousing http://www.facebook.com/pages/OracleBigData https://twitter.com/#!/OracleBigData Coming Soon A conversation about how the rapidly evolving enterprise IT landscape is transforming the roles, responsibilities, and skill requirements for architects and developers. Stay tuned: RSS

    Read the article

  • Unstructured Data - The future of Data Administration

    Some have claimed that there is a problem with the way data is currently managed using the relational paradigm do to the rise of unstructured data in modern business. PCMag.com defines unstructured data as data that does not reside in a fixed location. They further explain that unstructured data refers to data in a free text form that is not bound to any specific structure. With the rise of unstructured data in the form of emails, spread sheets, images and documents the critics have a right to argue that the relational paradigm is not as effective as the object oriented data paradigm in managing this type of data. The relational paradigm relies heavily on structure and relationships in and between items of data. This type of paradigm works best in a relation database management system like Microsoft SQL, MySQL, and Oracle because data is forced to conform to a structure in the form of tables and relations can be derived from the existence of one or more tables. These critics also claim that database administrators have not kept up with reality because their primary focus in regards to data administration deals with structured data and the relational paradigm. The relational paradigm was developed in the 1970’s as a way to improve data management when compared to standard flat files. Little has changed since then, and modern database administrators need to know more than just how to handle structured data. That is why critics claim that today’s data professionals do not have the proper skills in order to store and maintain data for modern systems when compared to the skills of system designers, programmers , software engineers, and data designers  due to the industry trend of object oriented design and development. I think that they are wrong. I do not disagree that the industry is moving toward an object oriented approach to development with the potential to use more of an object oriented approach to data.   However, I think that it is business itself that is limiting database administrators from changing how data is stored because of the potential costs, and impact that might occur by altering any part of stored data. Furthermore, database administrators like all technology workers constantly are trying to improve their technical skills in order to excel in their job, so I think that accusing data professional is not just when the root cause of the lack of innovation is controlled by business, and it is business that will suffer for their inability to keep up with technology. One way for database professionals to better prepare for the future of database management is start working with data in the form of objects and so that they can extract data from the objects so that the stored information within objects can be used in relation to the data stored in a using the relational paradigm. Furthermore, I think the use of pattern matching will increase with the increased use of unstructured data because object can be selected, filtered and altered based on the existence of a pattern found within an object.

    Read the article

  • Google Analytics - Showing multiple site stats at once

    - by John
    Is there a way in google analytics to add multiple sites to and show all the stats together? So like the graphs and total visits/unique hits all combined for all the sites added to the google analytics account? For example if I have: site1.com site2.com site3.com Under one google analytics account, is there a way in google analytics tool to merge them together so I can see a sum of all traffic in one report?

    Read the article

  • Reading data from an Entity Framework data model through a WCF Data Service

    - by nikolaosk
    This is going to be the fourth post of a series of posts regarding ASP.Net and the Entity Framework and how we can use Entity Framework to access our datastore. You can find the first one here , the second one here and the third one here . I have a post regarding ASP.Net and EntityDataSource. You can read it here .I have 3 more posts on Profiling Entity Framework applications. You can have a look at them here , here and here . Microsoft with .Net 3.0 Framework, introduced WCF. WCF is Microsoft's...(read more)

    Read the article

  • Accelerate your SOA with Data Integration - Live Webinar Tuesday!

    - by dain.hansen
    Need to put wind in your SOA sails? Organizations are turning more and more to Real-time data integration to complement their Service Oriented Architecture. The benefit? Lowering costs through consolidating legacy systems, reducing risk of bad data polluting their applications, and shortening the time to deliver new service offerings. Join us on Tuesday April 13th, 11AM PST for our live webinar on the value of combining SOA and Data Integration together. In this webcast you'll learn how to innovate across your applications swiftly and at a lower cost using Oracle Data Integration technologies: Oracle Data Integrator Enterprise Edition, Oracle GoldenGate, and Oracle Data Quality. You'll also hear: Best practices for building re-usable data services that are high performing and scalable across the enterprise How real-time data integration can maximize SOA returns while providing continuous availability for your mission critical applications Architectural approaches to speed service implementation and delivery times, with pre-integrations to CRM, ERP, BI, and other packaged applications Register now for this live webinar!

    Read the article

  • Big Data Videos

    - by Jean-Pierre Dijcks
    You can view them all on YouTube using the following links: Overview for the Boss: http://youtu.be/ikJyrmKdJWc Hadoop: http://youtu.be/acWtid-OOWM Acquiring Big Data: http://youtu.be/TfuhuA_uaho Organizing Big Data: http://youtu.be/IC6jVRO2Hq4 Analyzing Big Data: http://youtu.be/2yf_jrBhz5w These videos are a great place to start learning about big data, the value it can bring to your organisation and how Oracle can help you start working with big data today.

    Read the article

  • Connecting Google Analytics with Custom Search Engine AdSense

    - by Yochai Timmer
    I have a Custom Search Engine that I've created with AdSense. I've put that search engine as a site search in my Google Sites page. I've connected both the Custom Search Engine and the Google Site to my Analytics page via their settings pages. Now, I'm trying to get Analytics to show me the AdSense for Search statistics. I've managed to connect the Google Sites page, to the Analytics, and I can see the search statistics in the Analytics as well. But I can't get it to show the actual AdSense for Search statistics from the Custom Search Engine. How can I configure everything so I can get the AdSense for Search statistics of my Custom Search Engine in my Analytics page?

    Read the article

  • Google Analytics Goal Tracking for Sub-Domains?

    - by Hasan Khan
    I am trying to track goals in Google Analytics for a website that has the goal URL on a sub-domain. The main domain for example is: domain.com and the sub-domain is my.domain.com. I have Google Analytics configured to track domains and all sub-domains and I've eve set up an advanced filter so I can see traffic to my sub-domains in Analytics. However, in goal tracking, you're supposed to put in the website URL after the front (so if it were domain.com/conversions/ you'd put in just /conversions/). However, since for me it would be my.domain.com/conversions/, how would I input that URL into Analytics to track? Would Analytics automatically determine the URL to be on the sub-domain? Thanks!

    Read the article

  • Is it possible to use Google Analytics to track file downloads?

    - by Eric Falsken
    It's always bothered me that Google Analytics (and similar embedded web traffic monitoring services) can only see a reflection of the traffic going to my server and can only see page visits since it depends on the browser executing a Javascript snippet. If I want to track real downloads of a software package (ZIP file), there's no way Google Analytics can possibly tell me that because its javascript can't be attached to a ZIP file. Is there a way I can upload my log files to Google so that the pointy-haired boss can see downloads of our ZIP/PDF/BIN files and not just visits to the download page?

    Read the article

  • Google Analytics setting cookies on static content despite being on entirely separate domain

    - by Donald Jenkins
    I recently decided to comply with the YSlow recommendation that static content is hosted on a cookieless domain. As I already use the root of my domain (donaldjenkins.com) to host my website—on which Google Analytics sets a few cookies—that meant I had to move the CNAME URL for the CDN serving the static files from cdn.donaldjenkins.com to an entirely separate, dedicated domain. I purchased cdn.dj (yes, it's a real Djibouti domain name), hosted the files on the root (which contains nothing else, other than a robots.txt file) and set a CNAME of e.cdn.dj for the CDN. This setup works, but I was rather surprised to find that YSlow was still flagging the static files for not being cookie-free: here's a screenshot: The cdn.djdomain was new, and was never used for anything other than hosting these static files. Running httpfox on the site shows the _utma and _utmz Google Analytics cookies are being set on the static files listed above—despite their being hosted on an entirely separate, dedicated domain. Here's my Google Analytics code: //Google Analytics tracking code var _gaq=[['_setAccount','UA-5245947-5'],['_trackPageview']]; (function(d,t){var g=d.createElement(t),s=d.getElementsByTagName(t)[0]; g.src=('https:'==location.protocol?'//ssl':'//www')+'.google-analytics.com/ga.js'; s.parentNode.insertBefore(g,s)}(document,'script')); // [END] Google Analytics tracking code I'm not obsessing about this issue—I know it's not really affecting server performance—but I'd like to just understand what is causing it not to go away...

    Read the article

  • Referral traffic not appearing properly in Google Analytics

    - by Crashalot
    We have a partnership arrangement with another site where we pay them for users sent to us. However, they claim our referral numbers for them are lower than theirs by 50%. They are tracking clicks in Google Analytics (using events) while we are using visits in Google Analytics. Are we doing something wrong with our Google Analytics installation? <!-- Google Analytics BEGIN --> <script type="text/javascript"> var _gaq = _gaq || []; _gaq.push(['_setAccount', 'UA-12345678-1']); _gaq.push(['_setDomainName', 'example.com']); _gaq.push(['_trackPageview']); (function() { var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true; ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js'; var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s); })(); </script> <!-- Google Analytics END -->

    Read the article

  • The Data Scientist

    - by BuckWoody
    A new term - well, perhaps not that new - has come up and I’m actually very excited about it. The term is Data Scientist, and since it’s new, it’s fairly undefined. I’ll explain what I think it means, and why I’m excited about it. In general, I’ve found the term deals at its most basic with analyzing data. Of course, we all do that, and the term itself in that definition is redundant. There is no science that I know of that does not work with analyzing lots of data. But the term seems to refer to more than the common practices of looking at data visually, putting it in a spreadsheet or report, or even using simple coding to examine data sets. The term Data Scientist (as far as I can make out this early in it’s use) is someone who has a strong understanding of data sources, relevance (statistical and otherwise) and processing methods as well as front-end displays of large sets of complicated data. Some - but not all - Business Intelligence professionals have these skills. In other cases, senior developers, database architects or others fill these needs, but in my experience, many lack the strong mathematical skills needed to make these choices properly. I’ve divided the knowledge base for someone that would wear this title into three large segments. It remains to be seen if a given Data Scientist would be responsible for knowing all these areas or would specialize. There are pretty high requirements on the math side, specifically in graduate-degree level statistics, but in my experience a company will only have a few of these folks, so they are expected to know quite a bit in each of these areas. Persistence The first area is finding, cleaning and storing the data. In some cases, no cleaning is done prior to storage - it’s just identified and the cleansing is done in a later step. This area is where the professional would be able to tell if a particular data set should be stored in a Relational Database Management System (RDBMS), across a set of key/value pair storage (NoSQL) or in a file system like HDFS (part of the Hadoop landscape) or other methods. Or do you examine the stream of data without storing it in another system at all? This is an important decision - it’s a foundation choice that deals not only with a lot of expense of purchasing systems or even using Cloud Computing (PaaS, SaaS or IaaS) to source it, but also the skillsets and other resources needed to care and feed the system for a long time. The Data Scientist sets something into motion that will probably outlast his or her career at a company or organization. Often these choices are made by senior developers, database administrators or architects in a company. But sometimes each of these has a certain bias towards making a decision one way or another. The Data Scientist would examine these choices in light of the data itself, starting perhaps even before the business requirements are created. The business may not even be aware of all the strategic and tactical data sources that they have access to. Processing Once the decision is made to store the data, the next set of decisions are based around how to process the data. An RDBMS scales well to a certain level, and provides a high degree of ACID compliance as well as offering a well-known set-based language to work with this data. In other cases, scale should be spread among multiple nodes (as in the case of Hadoop landscapes or NoSQL offerings) or even across a Cloud provider like Windows Azure Table Storage. In fact, in many cases - most of the ones I’m dealing with lately - the data should be split among multiple types of processing environments. This is a newer idea. Many data professionals simply pick a methodology (RDBMS with Star Schemas, NoSQL, etc.) and put all data there, regardless of its shape, processing needs and so on. A Data Scientist is familiar not only with the various processing methods, but how they work, so that they can choose the right one for a given need. This is a huge time commitment, hence the need for a dedicated title like this one. Presentation This is where the need for a Data Scientist is most often already being filled, sometimes with more or less success. The latest Business Intelligence systems are quite good at allowing you to create amazing graphics - but it’s the data behind the graphics that are the most important component of truly effective displays. This is where the mathematics requirement of the Data Scientist title is the most unforgiving. In fact, someone without a good foundation in statistics is not a good candidate for creating reports. Even a basic level of statistics can be dangerous. Anyone who works in analyzing data will tell you that there are multiple errors possible when data just seems right - and basic statistics bears out that you’re on the right track - that are only solvable when you understanding why the statistical formula works the way it does. And there are lots of ways of presenting data. Sometimes all you need is a “yes” or “no” answer that can only come after heavy analysis work. In that case, a simple e-mail might be all the reporting you need. In others, complex relationships and multiple components require a deep understanding of the various graphical methods of presenting data. Knowing which kind of chart, color, graphic or shape conveys a particular datum best is essential knowledge for the Data Scientist. Why I’m excited I love this area of study. I like math, stats, and computing technologies, but it goes beyond that. I love what data can do - how it can help an organization. I’ve been fortunate enough in my professional career these past two decades to work with lots of folks who perform this role at companies from aerospace to medical firms, from manufacturing to retail. Interestingly, the size of the company really isn’t germane here. I worked with one very small bio-tech (cryogenics) company that worked deeply with analysis of complex interrelated data. So  watch this space. No, I’m not leaving Azure or distributed computing or Microsoft. In fact, I think I’m perfectly situated to investigate this role further. We have a huge set of tools, from RDBMS to Hadoop to allow me to explore. And I’m happy to share what I learn along the way.

    Read the article

  • Fraud Detection with the SQL Server Suite Part 2

    - by Dejan Sarka
    This is the second part of the fraud detection whitepaper. You can find the first part in my previous blog post about this topic. My Approach to Data Mining Projects It is impossible to evaluate the time and money needed for a complete fraud detection infrastructure in advance. Personally, I do not know the customer’s data in advance. I don’t know whether there is already an existing infrastructure, like a data warehouse, in place, or whether we would need to build one from scratch. Therefore, I always suggest to start with a proof-of-concept (POC) project. A POC takes something between 5 and 10 working days, and involves personnel from the customer’s site – either employees or outsourced consultants. The team should include a subject matter expert (SME) and at least one information technology (IT) expert. The SME must be familiar with both the domain in question as well as the meaning of data at hand, while the IT expert should be familiar with the structure of data, how to access it, and have some programming (preferably Transact-SQL) knowledge. With more than one IT expert the most time consuming work, namely data preparation and overview, can be completed sooner. I assume that the relevant data is already extracted and available at the very beginning of the POC project. If a customer wants to have their people involved in the project directly and requests the transfer of knowledge, the project begins with training. I strongly advise this approach as it offers the establishment of a common background for all people involved, the understanding of how the algorithms work and the understanding of how the results should be interpreted, a way of becoming familiar with the SQL Server suite, and more. Once the data has been extracted, the customer’s SME (i.e. the analyst), and the IT expert assigned to the project will learn how to prepare the data in an efficient manner. Together with me, knowledge and expertise allow us to focus immediately on the most interesting attributes and identify any additional, calculated, ones soon after. By employing our programming knowledge, we can, for example, prepare tens of derived variables, detect outliers, identify the relationships between pairs of input variables, and more, in only two or three days, depending on the quantity and the quality of input data. I favor the customer’s decision of assigning additional personnel to the project. For example, I actually prefer to work with two teams simultaneously. I demonstrate and explain the subject matter by applying techniques directly on the data managed by each team, and then both teams continue to work on the data overview and data preparation under our supervision. I explain to the teams what kind of results we expect, the reasons why they are needed, and how to achieve them. Afterwards we review and explain the results, and continue with new instructions, until we resolve all known problems. Simultaneously with the data preparation the data overview is performed. The logic behind this task is the same – again I show to the teams involved the expected results, how to achieve them and what they mean. This is also done in multiple cycles as is the case with data preparation, because, quite frankly, both tasks are completely interleaved. A specific objective of the data overview is of principal importance – it is represented by a simple star schema and a simple OLAP cube that will first of all simplify data discovery and interpretation of the results, and will also prove useful in the following tasks. The presence of the customer’s SME is the key to resolving possible issues with the actual meaning of the data. We can always replace the IT part of the team with another database developer; however, we cannot conduct this kind of a project without the customer’s SME. After the data preparation and when the data overview is available, we begin the scientific part of the project. I assist the team in developing a variety of models, and in interpreting the results. The results are presented graphically, in an intuitive way. While it is possible to interpret the results on the fly, a much more appropriate alternative is possible if the initial training was also performed, because it allows the customer’s personnel to interpret the results by themselves, with only some guidance from me. The models are evaluated immediately by using several different techniques. One of the techniques includes evaluation over time, where we use an OLAP cube. After evaluating the models, we select the most appropriate model to be deployed for a production test; this allows the team to understand the deployment process. There are many possibilities of deploying data mining models into production; at the POC stage, we select the one that can be completed quickly. Typically, this means that we add the mining model as an additional dimension to an existing DW or OLAP cube, or to the OLAP cube developed during the data overview phase. Finally, we spend some time presenting the results of the POC project to the stakeholders and managers. Even from a POC, the customer will receive lots of benefits, all at the sole risk of spending money and time for a single 5 to 10 day project: The customer learns the basic patterns of frauds and fraud detection The customer learns how to do the entire cycle with their own people, only relying on me for the most complex problems The customer’s analysts learn how to perform much more in-depth analyses than they ever thought possible The customer’s IT experts learn how to perform data extraction and preparation much more efficiently than they did before All of the attendees of this training learn how to use their own creativity to implement further improvements of the process and procedures, even after the solution has been deployed to production The POC output for a smaller company or for a subsidiary of a larger company can actually be considered a finished, production-ready solution It is possible to utilize the results of the POC project at subsidiary level, as a finished POC project for the entire enterprise Typically, the project results in several important “side effects” Improved data quality Improved employee job satisfaction, as they are able to proactively contribute to the central knowledge about fraud patterns in the organization Because eventually more minds get to be involved in the enterprise, the company should expect more and better fraud detection patterns After the POC project is completed as described above, the actual project would not need months of engagement from my side. This is possible due to our preference to transfer the knowledge onto the customer’s employees: typically, the customer will use the results of the POC project for some time, and only engage me again to complete the project, or to ask for additional expertise if the complexity of the problem increases significantly. I usually expect to perform the following tasks: Establish the final infrastructure to measure the efficiency of the deployed models Deploy the models in additional scenarios Through reports By including Data Mining Extensions (DMX) queries in OLTP applications to support real-time early warnings Include data mining models as dimensions in OLAP cubes, if this was not done already during the POC project Create smart ETL applications that divert suspicious data for immediate or later inspection I would also offer to investigate how the outcome could be transferred automatically to the central system; for instance, if the POC project was performed in a subsidiary whereas a central system is available as well Of course, for the actual project, I would repeat the data and model preparation as needed It is virtually impossible to tell in advance how much time the deployment would take, before we decide together with customer what exactly the deployment process should cover. Without considering the deployment part, and with the POC project conducted as suggested above (including the transfer of knowledge), the actual project should still only take additional 5 to 10 days. The approximate timeline for the POC project is, as follows: 1-2 days of training 2-3 days for data preparation and data overview 2 days for creating and evaluating the models 1 day for initial preparation of the continuous learning infrastructure 1 day for presentation of the results and discussion of further actions Quite frequently I receive the following question: are we going to find the best possible model during the POC project, or during the actual project? My answer is always quite simple: I do not know. Maybe, if we would spend just one hour more for data preparation, or create just one more model, we could get better patterns and predictions. However, we simply must stop somewhere, and the best possible way to do this, according to my experience, is to restrict the time spent on the project in advance, after an agreement with the customer. You must also never forget that, because we build the complete learning infrastructure and transfer the knowledge, the customer will be capable of doing further investigations independently and improve the models and predictions over time without the need for a constant engagement with me.

    Read the article

  • Big Data – Buzz Words: What is HDFS – Day 8 of 21

    - by Pinal Dave
    In yesterday’s blog post we learned what is MapReduce. In this article we will take a quick look at one of the four most important buzz words which goes around Big Data – HDFS. What is HDFS ? HDFS stands for Hadoop Distributed File System and it is a primary storage system used by Hadoop. It provides high performance access to data across Hadoop clusters. It is usually deployed on low-cost commodity hardware. In commodity hardware deployment server failures are very common. Due to the same reason HDFS is built to have high fault tolerance. The data transfer rate between compute nodes in HDFS is very high, which leads to reduced risk of failure. HDFS creates smaller pieces of the big data and distributes it on different nodes. It also copies each smaller piece to multiple times on different nodes. Hence when any node with the data crashes the system is automatically able to use the data from a different node and continue the process. This is the key feature of the HDFS system. Architecture of HDFS The architecture of the HDFS is master/slave architecture. An HDFS cluster always consists of single NameNode. This single NameNode is a master server and it manages the file system as well regulates access to various files. In additional to NameNode there are multiple DataNodes. There is always one DataNode for each data server. In HDFS a big file is split into one or more blocks and those blocks are stored in a set of DataNodes. The primary task of the NameNode is to open, close or rename files and directory and regulate access to the file system, whereas the primary task of the DataNode is read and write to the file systems. DataNode is also responsible for the creation, deletion or replication of the data based on the instruction from NameNode. In reality, NameNode and DataNode are software designed to run on commodity machine build in Java language. Visual Representation of HDFS Architecture Let us understand how HDFS works with the help of the diagram. Client APP or HDFS Client connects to NameSpace as well as DataNode. Client App access to the DataNode is regulated by NameSpace Node. NameSpace Node allows Client App to connect to the DataNode based by allowing the connection to the DataNode directly. A big data file is divided into multiple data blocks (let us assume that those data chunks are A,B,C and D. Client App will later on write data blocks directly to the DataNode. Client App does not have to directly write to all the node. It just has to write to any one of the node and NameNode will decide on which other DataNode it will have to replicate the data. In our example Client App directly writes to DataNode 1 and detained 3. However, data chunks are automatically replicated to other nodes. All the information like in which DataNode which data block is placed is written back to NameNode. High Availability During Disaster Now as multiple DataNode have same data blocks in the case of any DataNode which faces the disaster, the entire process will continue as other DataNode will assume the role to serve the specific data block which was on the failed node. This system provides very high tolerance to disaster and provides high availability. If you notice there is only single NameNode in our architecture. If that node fails our entire Hadoop Application will stop performing as it is a single node where we store all the metadata. As this node is very critical, it is usually replicated on another clustered as well as on another data rack. Though, that replicated node is not operational in architecture, it has all the necessary data to perform the task of the NameNode in the case of the NameNode fails. The entire Hadoop architecture is built to function smoothly even there are node failures or hardware malfunction. It is built on the simple concept that data is so big it is impossible to have come up with a single piece of the hardware which can manage it properly. We need lots of commodity (cheap) hardware to manage our big data and hardware failure is part of the commodity servers. To reduce the impact of hardware failure Hadoop architecture is built to overcome the limitation of the non-functioning hardware. Tomorrow In tomorrow’s blog post we will discuss the importance of the relational database in Big Data. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

    Read the article

  • New Big Data Appliance Security Features

    - by mgubar
    The Oracle Big Data Appliance (BDA) is an engineered system for big data processing.  It greatly simplifies the deployment of an optimized Hadoop Cluster – whether that cluster is used for batch or real-time processing.  The vast majority of BDA customers are integrating the appliance with their Oracle Databases and they have certain expectations – especially around security.  Oracle Database customers have benefited from a rich set of security features:  encryption, redaction, data masking, database firewall, label based access control – and much, much more.  They want similar capabilities with their Hadoop cluster.    Unfortunately, Hadoop wasn’t developed with security in mind.  By default, a Hadoop cluster is insecure – the antithesis of an Oracle Database.  Some critical security features have been implemented – but even those capabilities are arduous to setup and configure.  Oracle believes that a key element of an optimized appliance is that its data should be secure.  Therefore, by default the BDA delivers the “AAA of security”: authentication, authorization and auditing. Security Starts at Authentication A successful security strategy is predicated on strong authentication – for both users and software services.  Consider the default configuration for a newly installed Oracle Database; it’s been a long time since you had a legitimate chance at accessing the database using the credentials “system/manager” or “scott/tiger”.  The default Oracle Database policy is to lock accounts thereby restricting access; administrators must consciously grant access to users. Default Authentication in Hadoop By default, a Hadoop cluster fails the authentication test. For example, it is easy for a malicious user to masquerade as any other user on the system.  Consider the following scenario that illustrates how a user can access any data on a Hadoop cluster by masquerading as a more privileged user.  In our scenario, the Hadoop cluster contains sensitive salary information in the file /user/hrdata/salaries.txt.  When logged in as the hr user, you can see the following files.  Notice, we’re using the Hadoop command line utilities for accessing the data: $ hadoop fs -ls /user/hrdataFound 1 items-rw-r--r--   1 oracle supergroup         70 2013-10-31 10:38 /user/hrdata/salaries.txt$ hadoop fs -cat /user/hrdata/salaries.txtTom Brady,11000000Tom Hanks,5000000Bob Smith,250000Oprah,300000000 User DrEvil has access to the cluster – and can see that there is an interesting folder called “hrdata”.  $ hadoop fs -ls /user Found 1 items drwx------   - hr supergroup          0 2013-10-31 10:38 /user/hrdata However, DrEvil cannot view the contents of the folder due to lack of access privileges: $ hadoop fs -ls /user/hrdata ls: Permission denied: user=drevil, access=READ_EXECUTE, inode="/user/hrdata":oracle:supergroup:drwx------ Accessing this data will not be a problem for DrEvil. He knows that the hr user owns the data by looking at the folder’s ACLs. To overcome this challenge, he will simply masquerade as the hr user. On his local machine, he adds the hr user, assigns that user a password, and then accesses the data on the Hadoop cluster: $ sudo useradd hr $ sudo passwd $ su hr $ hadoop fs -cat /user/hrdata/salaries.txt Tom Brady,11000000 Tom Hanks,5000000 Bob Smith,250000 Oprah,300000000 Hadoop has not authenticated the user; it trusts that the identity that has been presented is indeed the hr user. Therefore, sensitive data has been easily compromised. Clearly, the default security policy is inappropriate and dangerous to many organizations storing critical data in HDFS. Big Data Appliance Provides Secure Authentication The BDA provides secure authentication to the Hadoop cluster by default – preventing the type of masquerading described above. It accomplishes this thru Kerberos integration. Figure 1: Kerberos Integration The Key Distribution Center (KDC) is a server that has two components: an authentication server and a ticket granting service. The authentication server validates the identity of the user and service. Once authenticated, a client must request a ticket from the ticket granting service – allowing it to access the BDA’s NameNode, JobTracker, etc. At installation, you simply point the BDA to an external KDC or automatically install a highly available KDC on the BDA itself. Kerberos will then provide strong authentication for not just the end user – but also for important Hadoop services running on the appliance. You can now guarantee that users are who they claim to be – and rogue services (like fake data nodes) are not added to the system. It is common for organizations to want to leverage existing LDAP servers for common user and group management. Kerberos integrates with LDAP servers – allowing the principals and encryption keys to be stored in the common repository. This simplifies the deployment and administration of the secure environment. Authorize Access to Sensitive Data Kerberos-based authentication ensures secure access to the system and the establishment of a trusted identity – a prerequisite for any authorization scheme. Once this identity is established, you need to authorize access to the data. HDFS will authorize access to files using ACLs with the authorization specification applied using classic Linux-style commands like chmod and chown (e.g. hadoop fs -chown oracle:oracle /user/hrdata changes the ownership of the /user/hrdata folder to oracle). Authorization is applied at the user or group level – utilizing group membership found in the Linux environment (i.e. /etc/group) or in the LDAP server. For SQL-based data stores – like Hive and Impala – finer grained access control is required. Access to databases, tables, columns, etc. must be controlled. And, you want to leverage roles to facilitate administration. Apache Sentry is a new project that delivers fine grained access control; both Cloudera and Oracle are the project’s founding members. Sentry satisfies the following three authorization requirements: Secure Authorization:  the ability to control access to data and/or privileges on data for authenticated users. Fine-Grained Authorization:  the ability to give users access to a subset of the data (e.g. column) in a database Role-Based Authorization:  the ability to create/apply template-based privileges based on functional roles. With Sentry, “all”, “select” or “insert” privileges are granted to an object. The descendants of that object automatically inherit that privilege. A collection of privileges across many objects may be aggregated into a role – and users/groups are then assigned that role. This leads to simplified administration of security across the system. Figure 2: Object Hierarchy – granting a privilege on the database object will be inherited by its tables and views. Sentry is currently used by both Hive and Impala – but it is a framework that other data sources can leverage when offering fine-grained authorization. For example, one can expect Sentry to deliver authorization capabilities to Cloudera Search in the near future. Audit Hadoop Cluster Activity Auditing is a critical component to a secure system and is oftentimes required for SOX, PCI and other regulations. The BDA integrates with Oracle Audit Vault and Database Firewall – tracking different types of activity taking place on the cluster: Figure 3: Monitored Hadoop services. At the lowest level, every operation that accesses data in HDFS is captured. The HDFS audit log identifies the user who accessed the file, the time that file was accessed, the type of access (read, write, delete, list, etc.) and whether or not that file access was successful. The other auditing features include: MapReduce:  correlate the MapReduce job that accessed the file Oozie:  describes who ran what as part of a workflow Hive:  captures changes were made to the Hive metadata The audit data is captured in the Audit Vault Server – which integrates audit activity from a variety of sources, adding databases (Oracle, DB2, SQL Server) and operating systems to activity from the BDA. Figure 4: Consolidated audit data across the enterprise.  Once the data is in the Audit Vault server, you can leverage a rich set of prebuilt and custom reports to monitor all the activity in the enterprise. In addition, alerts may be defined to trigger violations of audit policies. Conclusion Security cannot be considered an afterthought in big data deployments. Across most organizations, Hadoop is managing sensitive data that must be protected; it is not simply crunching publicly available information used for search applications. The BDA provides a strong security foundation – ensuring users are only allowed to view authorized data and that data access is audited in a consolidated framework.

    Read the article

  • How do you track display impressions in Google Analytics on non Google networks?

    - by dee
    Google Analytics has a Multi-Channel funnel analysis feature that we’d like to use to understand assisted conversions and how each channel has impacted on conversion beyond just last interaction attribution. My current understanding is that the impression tracking part of this feature works really well when playing within Google’s search and display networks. Outside of Google’s network I suspect that impression tracking will no longer “just work” and feed back into GA appropriately. What our options are for tracking display impressions on other advertising networks so that we can be attributing value correctly with GA?

    Read the article

  • Google Analytics: Why is Avg Time on Site lower than Avg time on Page?

    - by Melanie
    I have the following Custom Report set up in Google Analytics: Metrics: Avg Time on Page Avg Time on Site Dimensions: Page So a report looks like this: Page Avg Time on Page Avg Time on Site /an-article 00:03:14 00:00:11 /another-article 00:05:11 00:01:07 /something-written 00:03:00 00:00:31 Why is it that for each 'page', the 'site views' are significantly lower?

    Read the article

  • No search data in Google Analytics or Webmasters

    - by cjk
    I have a domain that has been registered in Google Webmasters and using Google Analytics for over 4 months. I get lots of analytics data, but am getting no information on Google searches in Webmasters, or Queries in Search Engine Optimisation in Analytics, even though I am getting keywords for traffic coming to my site from search engines. I have a test sub-domain with the same setup (except not HTTPS) that is getting some of this information through, even with much less data and visits. What could be wrong to stop me getting this information?

    Read the article

  • CRM@Oracle Series: CRM Analytics

    - by tony.berk
    What is the most important factor that leads to a successful CRM deployment? Is it the overall strategy, strong governance, defined processes or good data quality? Well, it's definitely a combination of all these, but the most important differentiator from our experience is Business Intelligence. Business Intelligence or Analytics is commonly mentioned as a key aspect to successful CRM and other enterprise deployments. The good news is that Oracle provides pre-built analytics dashboards, which provide real-time, actionable insight, and tools to build custom analyses. However, success with analytics, especially in a large enterprise, still requires a strong strategy, clean data for analysis, and performance. Today's CRM@Oracle slidecast covers Oracle's strategy, architecture and key success factors for deploying CRM Analytics internally at Oracle. CRM@Oracle: CRM Analytics Click here to learn more about Oracle CRM products and here to learn about Oracle Business Intelligence Applications. Have you read our other postings in the CRM@Oracle Series? If you have a particular CRM area or function which you'd like to hear how Oracle implemented it internally, post a comment and we'll get it on our list.

    Read the article

  • No search data in Goolge Analytics or Webmasters

    - by cjk
    I have a domain that has been registered in Google Webmasters and using Google Analytics for over 4 months. I get lots of analytics data, but am getting no information on Google searches in Webmasters, or Queries in Search Engine Optimisation in Analytics, even though I am getting keywords for traffic coming to my site from search engines. I have a test sub-domain with the same setup (except not HTTPS) that is getting some of this information through, even with much less data and visits. What could be wrong to stop me getting this information?

    Read the article

  • Setting up Google Analytics for multiple subdomains

    - by Andrew G. Johnson
    so first here's a snippet of my current Analytics javascript: var _gaq = _gaq || []; _gaq.push(['_setAccount', 'UA-30490730-1']); _gaq.push(['_setDomainName', '.apartmentjunkie.com']); _gaq.push(['_setSiteSpeedSampleRate', 100]); _gaq.push(['_trackPageview']); (function() { var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true; ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js'; var s = document.getElementsByTagName('script')[0];s.parentNode.insertBefore(ga, s); })(); So if you wanna have a quick peak at the site the url is ApartmentJunkie.com, keep in mind the site is pretty bare bones but you'll get the idea -- basically it's very similar to craigslist in the sense that it's in the local space so people pick a city then get sent to a subdomain that is specific for that city, e.g. winnipeg.mb.apartmentjunkie.com. I put that up late last night then had a look at the analytics and found that I am seeing only the request uri portion of the URLs in analytics as I would with any other site only with this one it's a problem as winnipeg.mb.apartmentjunkie.com/map/ and brandon.mb.apartmentjunkie.com/map/ are two different pages and shouldn't be lumped together as /map/ I know the kneejerk response is likely going to be "hey just setup a different google analytics profile for each subdomain" but there will eventually be a lot of subdomains so google's cap of 50 is going to be too limited and even more important I want to see the data in aggregate for the most part. I am thinking of making a change to the javascript, to something like: _gaq.push(['_trackPageview',String(document.domain) + String(document.location)]); But am unsure if this is the best way and figured someone else on wm.se would have had a similar situation that they could talk a bit about.

    Read the article

  • 2 google analytics profiles for 2 sections of the same site

    - by sam
    Ive got a website which for the most part is a portfolio, there is another section of the site mysite.com/micro-site which ranks extremely well for it chosen term / topic, and brings in lots of traffic, but actually has little to do with the core business. It was really made as a piece of content - in the same way sites like this are - http://chrome.com/campaigns/rollit For the main site i use 1 Google analytics profile and set of tags, for the micro site i have a completely different analytics profile and set of tags. The main reason ive done this is because the traffic stats and insights for the micro site are essentially just noise, its nice to have the traffic but they dont help when reading analytics reports, so if they were combined my analytics reports would be a mess. Is there any disadvantage / negatives of doing this ?

    Read the article

  • Google Analytics on Android

    - by pjv
    There is a specific and official analytics SDK for native Android apps (note that I'm not talking about webpages in apps on a phone). This library basically sends pages and events to Google Analytics and you can view your analytics in exactly the same dashboard as for websites. Since my background is apps rather than websites, and since a lot of the Google Analytics terminology seems particularly inapplicable to a native app, I need some pointers. Please discuss my remarks, provide some clarification where you think I'm off-track, and above all share good experiences! 1. Page Views Pages mostly can match different Activities (and Dialogs) being displayed. Activities can be visible behind non-full-screen Activities however, though only the top-level Activity can be interacted. This sort-off clashes with a "(page) view". You'd also want at least one page view for each visit and therefore put one page view tracker in the Application class. However this does not constitute a window or sorts. Usually an Activity will open at the same time, so the time spent on that page will have been 0. This will influence your "time spent" statistics. How are these counted anyway? Moreover, there is a loose coupling between the Activities, by means of Intents. A user can, much like on any website, step in at any Activity, although usually this then concerns resuming the application where he left off. This makes that the hierarchy of Activities usually is very flat. And since there are no url's involved. What meaning would using slashes in page titles have, such as "/Home"? All pages would appear on an equal level in the reports, so no content drilldown. Non-unique page views seem to be counted as some kind of indicator of successfulness: how often does the visitor revisit the page. When the user rotates the screen however usually an Activity resumes again, thus making it a new page view. This happens a lot. Maybe a well-thought-through placement of the call might solve this, or placing several, I'm not sure. How to deal with Page Views? 2. Events I'd say there are two sorts: A user event Something that happened, usually as an indirect consequence of the above. The latter particularly is giving me headaches. First of all, many events aren't written in code any more, but pieced logically together by means of Intents. This means that there is no place to put the analytics call. You'd either have to give up this advantage and start doing it the old-fashioned way in favor of good analytics, or, just be missing some events. Secondly, as a developer you're not so much interested in when a user clicks a button, but if the action that should have been performed really was performed and what the result was. There seems to be no clear way to get resulting data into Google Analytics (what's up with the integers? I want to put in Strings!). The same that applies to the flat pages hierarchy, also goes for the event categories. You could do "vertical" categories (topically, that is), but some code is shared "horizontally" and the tracking will be equally shared. Just as with the Intents mechanism, inheritance makes it hard for you to put the tracking in the right places at all times. And I can't really imagine "horizontal" categories. Unless you start making really small categories, such as all the items form the same menu in one category, I have a hard time grasping the concept. Finally, how do you deal with cancelling? Usually you both have an explicit cancel mechanism by ways of a button, as well as the implicit cancel when the "back"-button is pressed to leave the activity and there were no changes. The latter also applies to "saves", when the back button is pressed and there ARE changes. How are you consequently going to catch all these if not by doing all the "back"-button work yourself? How to deal with events? 3. Goals For goal types I have choice of: URL Destination, Time on Site, and Pages/Visit. Most apps don't have a funnel that leads the user to some "registration done" or "order placed" page. Apps have either already been bought (in which case you want to stimulate the user to love your app, so that he might bring on new buyers) or are paid for by in-app ads. So URL Destination is not a very important goal. Time on Site also seems troublesome. First, I have some doubt on how this would be measured. Second, I don't necessarily want my user to spend a lot of time in my already paid app, just be active and content. Equivalently, why not mention how frequent a user uses your app? Regarding Pages/Visit I already mentioned how screen orientation changes blow up the page view numbers. In an app I'd be most interested in events/visit to measure the user's involvement/activity. If he's intensively using the app then he must be loving it right? Furthermore, I also have some small funnels (that do not lead to conversion though) that I want to see streamlined. In my mind those funnels would end in events rather than page views but that seems not to be possible. I could also measure clickthroughs on in-app ads, but then I'd need to track those as Page Views rather than Events, in view of "URL Destination". What are smart goals for apps and how can you fit them on top of Analytics? 4. Optimisation Is there a smart way to manually do what "Website Optimiser" does for websites? Most importantly, how would I track different landing page designs? 5. Traffic Sources Referrals deal with installation time referrals, if you're smart enough to get them included. But perhaps I'd also want to get some data which third-party app sends users to my app to perform some actions (this app interoperability is possible via Intents). Many of the terminologies related to "Traffic Sources" seem totally meaningless and there is no possibility of connecting in AdSense. What are smart uses of this data? 6. Visitors Of the "Browser capabilities", "Network Properties" and "Mobile" tabs, many things are pointless as they have no influence on / relation with my mostly offline app that won't use flash anyway. Only if you drill down far enough, can you get to OS versions, which do matter a lot. I even forgot where you could check what exact Android devices visited. What are smart uses of this data? How can you make the relevant info more prominent? 7. Other No in-page analytics. I have to register my app as a web-url (What!?)?

    Read the article

  • Anonymouse VS Logged in users on my site & Google Analytics

    - by Flowpoke
    I'd like to be able to run two different 'tracks' for Google Analytics; One for anonymous users of the site and another for Users whom are logged-in. I say "track" because Im not sure of the term--but I definitely know I want it to all be in the same "Analytics Account", I just want to segregate my logged-in users... In the site template, I can very easily add a conditional to display one or the other (Analytics code snippet)... Which Im hoping this comes down to and although Im not sure, it seems that the last digit in your Analytics ID (e.g. UA-15XXXX0-X) could be incremented to gain such additional 'tracks'....? Any tips? Am I doin it wrong? My current footer snippet: <script type="text/javascript"> var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www."); document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E")); </script> <script type="text/javascript"> try { var pageTracker = _gat._getTracker("UA-XXXXXXX-1"); pageTracker._trackPageview(); } catch(err) {} </script>

    Read the article

< Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12  | Next Page >