Search Results

Search found 11362 results on 455 pages for 'big o analysis'.

Page 2/455 | < Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >

The Oldest Big Data Problem: Parsing Human Language

- by dan.mcclary

There's a new whitepaper up on Oracle Technology Network which details the use of Digital Reasoning Systems' Synthesys software on Oracle Big Data Appliance. Digital Reasoning's approach is inherently "big data friendly," as it leverages multiple components of the Hadoop ecosystem. Moreover, the paper addresses the oldest big data problem of them all: extracting knowledge from human text. You can find the paper here. From the Executive Summary: There is a wealth of information to be extracted from natural language, but that extraction is challenging. The volume of human language we generate constitutes a natural Big Data problem, while its complexity and nuance requires a particular expertise to model and mine. In this paper we illustrate the impressive combination of Oracle Big Data Appliance and Digital Reasoning Synthesys software. The combination of Synthesys and Big Data Appliance makes it possible to analyze tens of millions of documents in a matter of hours. Moreover, this powerful combination achieves four times greater throughput than conducting the equivalent analysis on a much larger cloud-deployed Hadoop cluster.

Read the article
White Paper on Analysis Services Tabular Large-scale Solution #ssas #tabular

- by Marco Russo (SQLBI)

Since the first beta of Analysis Services 2012, I worked with many companies designing and implementing solutions based on Analysis Services Tabular. I am glad that Microsoft published a white paper about a case-study using one of these scenarios: An Analysis Services Case Study: Using Tabular Models in a Large-scale Commercial Solution. Alberto Ferrari is the author of the white paper and many people contributed to it. The final result is a very technical document based on a case study, which provides a level of detail that I don’t see often in other case studies (which are usually more marketing-oriented). This white paper has the following structure: Requirements (data model, capacity planning, client tool) Options considered (SQL Server Columnstore Indexes, SSAS Multidimensional, SSAS Tabular) Data Model optimizations (memory compression, query performance, scalability) Partitioning and Processing strategy for near real-time latency Hardware selection (NUMA analysis, Azure VM tests) Scalability tests (estimation of maximum users per node) If you are in charge of evaluating Tabular as analytical engine, or if you have to design your solution based on Tabular, this white paper is a must read. But if you just want to increase your knowledge of Analysis Services, you will find a lot of useful technical information. That said, my favorite quote of the document is the following one, funny but true: […] After several trials, the clear winner was a video gaming machine that one guy on the team used at home. That computer outperformed any available server, running twice as fast as the server-class machines we had in house. At that point, it was clear that the criteria for choosing the server would have to be expanded a bit, simply because it would have been impossible to convince the boss to build a cluster of gaming machines and trust it to serve our customers. But, honestly, if a business has the flexibility to buy gaming machines (assuming the machines can handle capacity) – do this. Owen Graupman, inContact I want to write a longer discussion about how companies are adopting Tabular in scenarios where it is the hidden engine of a more complex solution (and not the classical “BI system”), because it is more frequent than you might expect (and has several advantages over many alternative approaches).

Read the article
Basket Analysis with #dax in #powerpivot and #ssas #tabular

- by Marco Russo (SQLBI)

A few days ago I published a new article on DAX Patterns web site describing how to implement Basket Analysis in DAX. This topic is a very classical one and is also covered in the many-to-many revolution white paper. It has been also discussed in several blog posts, listed here in historical order: Simple Basket Analysis in DAX by Chris Webb PowerPivot, basket analysis and the hidden many to many by Alberto Ferrari Applied Basket Analysis in Power Pivot using DAX by Gerhard Brueckl As usual, in DAX Patterns we try to present the required DAX formulas in a way that is easy to adapt to specific models. We also try to show a good implementation from a performance point of view. Further optimizations are always possible in DAX. However, in order to keep the model simple to adapt in different scenarios, we avoid presenting optimizations that would require particular assumptions or restrictions on the data model. I hope you will find the Basket Analysis pattern useful. Even if you do not need it today, reading the DAX formula is a good exercise to check your knowledge of evaluation contexts in DAX. For example, describing how does it work the following expression is not a trivial task! [Orders with Both Products] := CALCULATE ( DISTINCTCOUNT ( Sales[SalesOrderNumber] ), CALCULATETABLE ( SUMMARIZE ( Sales, Sales[SalesOrderNumber] ), ALL ( Product ), USERELATIONSHIP ( Sales[ProductCode], 'Filter Product'[Filter ProductCode] ) ) ) The good news is that you can use the patterns even if you do not really understand all the details of the DAX formulas you are using! Any feedback on this new pattern is very welcome.

Read the article
How meaningful is the Big-O time complexity of an algorithm?

- by james creasy

Programmers often talk about the time complexity of an algorithm, e.g. O(log n) or O(n^2). Time complexity classifications are made as the input size goes to infinity, but ironically infinite input size in computation is not used. Put another way, the classification of an algorithm is based on a situation that algorithm will never be in: where n = infinity. Also, consider that a polynomial time algorithm where the exponent is huge is just as useless as an exponential time algorithm with tiny base (e.g., 1.00000001^n) is useful. Given this, how much can I rely on the Big-O time complexity to advise choice of an algorithm?

Read the article
Big Data Accelerator

- by Jean-Pierre Dijcks

For everyone who does not regularly listen to earnings calls, Oracle's Q4 call was interesting (as it mostly is). One of the announcements in the call was the Big Data Accelerator from Oracle (Seeking Alpha link here - slightly tweaked for correctness shown below): "The big data accelerator includes some of the standard open source software, HDFS, the file system and a number of other pieces, but also some Oracle components that we think can dramatically speed up the entire map-reduce process. And will be particularly attractive to Java programmers [...]. There are some interesting applications they do, ETL is one. Log processing is another. We're going to have a lot of those features, functions and pre-built applications in our big data accelerator." Not much else we can say right now, more on this (and Big Data in general) at Openworld!

Read the article
Using Analysis Services Processing Task & Analysis Services Execute DDL Task in SSIS

SQL Server Integration Services (SSIS) is a Business Intelligence tool which can be used by database developers or administrators to perform Extract, Transform & Load (ETL) operations. In my previous article Using WMI Data Reader and WMI Event Watcher ... [Read Full Article]

Read the article
Big Data – Buzz Words: What is HDFS – Day 8 of 21

- by Pinal Dave

In yesterday’s blog post we learned what is MapReduce. In this article we will take a quick look at one of the four most important buzz words which goes around Big Data – HDFS. What is HDFS ? HDFS stands for Hadoop Distributed File System and it is a primary storage system used by Hadoop. It provides high performance access to data across Hadoop clusters. It is usually deployed on low-cost commodity hardware. In commodity hardware deployment server failures are very common. Due to the same reason HDFS is built to have high fault tolerance. The data transfer rate between compute nodes in HDFS is very high, which leads to reduced risk of failure. HDFS creates smaller pieces of the big data and distributes it on different nodes. It also copies each smaller piece to multiple times on different nodes. Hence when any node with the data crashes the system is automatically able to use the data from a different node and continue the process. This is the key feature of the HDFS system. Architecture of HDFS The architecture of the HDFS is master/slave architecture. An HDFS cluster always consists of single NameNode. This single NameNode is a master server and it manages the file system as well regulates access to various files. In additional to NameNode there are multiple DataNodes. There is always one DataNode for each data server. In HDFS a big file is split into one or more blocks and those blocks are stored in a set of DataNodes. The primary task of the NameNode is to open, close or rename files and directory and regulate access to the file system, whereas the primary task of the DataNode is read and write to the file systems. DataNode is also responsible for the creation, deletion or replication of the data based on the instruction from NameNode. In reality, NameNode and DataNode are software designed to run on commodity machine build in Java language. Visual Representation of HDFS Architecture Let us understand how HDFS works with the help of the diagram. Client APP or HDFS Client connects to NameSpace as well as DataNode. Client App access to the DataNode is regulated by NameSpace Node. NameSpace Node allows Client App to connect to the DataNode based by allowing the connection to the DataNode directly. A big data file is divided into multiple data blocks (let us assume that those data chunks are A,B,C and D. Client App will later on write data blocks directly to the DataNode. Client App does not have to directly write to all the node. It just has to write to any one of the node and NameNode will decide on which other DataNode it will have to replicate the data. In our example Client App directly writes to DataNode 1 and detained 3. However, data chunks are automatically replicated to other nodes. All the information like in which DataNode which data block is placed is written back to NameNode. High Availability During Disaster Now as multiple DataNode have same data blocks in the case of any DataNode which faces the disaster, the entire process will continue as other DataNode will assume the role to serve the specific data block which was on the failed node. This system provides very high tolerance to disaster and provides high availability. If you notice there is only single NameNode in our architecture. If that node fails our entire Hadoop Application will stop performing as it is a single node where we store all the metadata. As this node is very critical, it is usually replicated on another clustered as well as on another data rack. Though, that replicated node is not operational in architecture, it has all the necessary data to perform the task of the NameNode in the case of the NameNode fails. The entire Hadoop architecture is built to function smoothly even there are node failures or hardware malfunction. It is built on the simple concept that data is so big it is impossible to have come up with a single piece of the hardware which can manage it properly. We need lots of commodity (cheap) hardware to manage our big data and hardware failure is part of the commodity servers. To reduce the impact of hardware failure Hadoop architecture is built to overcome the limitation of the non-functioning hardware. Tomorrow In tomorrow’s blog post we will discuss the importance of the relational database in Big Data. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

Read the article
Big Data Learning Resources

- by Lara Rubbelke

I have recently had several requests from people asking for resources to learn about Big Data and Hadoop. Below is a list of resources that I typically recommend. I'll update this list as I find more resources. Let's crowdsource this... Tell me your favorite resources and I'll get them on the list! Books and Whitepapers Planning for Big Data Free e-book Great primer on the general Big Data space. This is always my recommendation for people who are new to Big Data and are trying to understand it....(read more)

Read the article
E-Book on big data (featuring Analysts, Customers and more)

- by Jean-Pierre Dijcks

As we are gearing up for Openworld, here is a nice E-book on big data to start paging through. It contains Gartner's take on big data, customer and partner interviews and a lot more good info. Enjoy the read so you come prepared for Openworld!! Read the E-Book here. For those coming to Oracle Openworld (or the Americas Cup races around the same time), you can find big data sessions via this URL. Enjoy!!

Read the article
How much system and business analysis should a programmer be reasonably expected to do?

- by Rahul

In most places I have worked for, there were no formal System or Business Analysts and the programmers were expected to perform both the roles. One had to understand all the subsystems and their interdependencies inside out. Further, one was also supposed to have a thorough knowledge of the business logic of the applications and interact directly with the users to gather requirements, answer their queries etc. In my current job, for ex, I spend about 70% time doing system analysis and only 30% time programming. I consider myself a good programmer but struggle with developing a good understanding of the business rules of a complex application. Often, this creates a handicap because while I can write efficient algorithms and thread-safe code, I lose out to guys who may be average programmers but have a much better understanding of the business processes. So I want to know - How much business and systems knowledge should a programmer have ? - How does one go about getting this knowledge in an immensely complex software system (e.g. trading applications) with several interdependent business processes but poorly documented business rules.

Read the article
Analysis Services Tabular books #ssas #tabular

- by Marco Russo (SQLBI)

Many people are looking for books about Analysis Services Tabular. Today there are two books available and they complement each other: Microsoft SQL Server 2012 Analysis Services: The BISM Tabular Model by Marco Russo, Alberto Ferrari and Chris Webb Applied Microsoft SQL Server 2012 Analysis Services: Tabular Modeling by Teo Lachev The book I wrote with Alberto and Chris is a complete guide to create tabular models and has a good coverage about DAX, including how to use it for enriching a semantic model with calculated columns and measures and how to use it for querying a Tabular model. In my experience, DAX as a query language is a very interesting option for custom analytical applications that requires a fast calculation engine, or simply for standard reports running in Reporting Services and accessing a Tabular model. You can freely preview the table of content and read some excerpts from the book on Safari Books Online. The book is in printing and should be shipped within mid-July, so finally it will be very soon on the shelf of all the people already preordered it! The Teo Lachev’s book, covers the full spectrum of Tabular models provided by Microsoft: starting with self-service BI, you have users creating a model with PowerPivot for Excel, publishing it to PowerPivot for SharePoint and exploring data by using Power View; then, the PowerPivot for Excel model can be imported in a Tabular model and published in Analysis Services, adding more control on the model through row-level security and partitioning, for example. Teo’s book follows a step-by-step approach describing each feature that is very good for a beginner that is new to PowerPivot and/or to BISM Tabular. If you need to get the big picture and to start using the products that are part of the new Microsoft wave of BI products, the Teo’s book is for you. After you read the book from Teo, or if you already have a certain confidence with PowerPivot or BISM Tabular and you want to go deeper about internals, best practices, design patterns in just BISM Tabular, then our book is a suggested read: it contains several chapters about DAX, includes discussions about new opportunities in data model design offered by Tabular models, and also provides examples of optimizations you can obtain in DAX and best practices in data modeling and queries. It might seem strange that an author write a review of a book that might seem to compete with his one, but in reality these two books complement each other and are not alternatives. If you have any doubt, buy both: you will be not disappointed! Moreover, Amazon usually offers you a deal to buy three books, including the Visualizing Data with Microsoft Power View, another good choice for getting all the details about Power View.

Read the article
Big data: An evening in the life of an actual buyer

- by Jean-Pierre Dijcks

Here I am, and this is an actual story of one of my evenings, trying to spend money with a company and ultimately failing. I just gave up and bought a service from another vendor, not the incumbent. Here is that story and how I think big data could actually fix this (and potentially prevent some of this from happening). In the end this story should illustrate how big data can benefit me (get me what I want without causing grief) and the company I am trying to buy something from. Note: Lots of details left out, I have no intention of being the annoyed blogger moaning about a specific company. What did I want to get? We watch TV, we have internet and we do have a land line. The land line is from a different vendor then the TV and the internet. I have decided that this makes no sense and I was going to get a bundle (no need to infer who this is, I just picked the generic bundle word as this is what I want to get) of all three services as this seems to save me money. I also want to not talk to people, I just want to click on a website when I feel like it and get it all sorted. I do think that is reality. I want to just do my shopping at 9.30pm while watching silly reruns on TV. Problem 1 - Bad links So, I'm an existing customer of the company I want to buy my bundle from. I go to the website, I click on offers. Turns out they are offers for new customers. After grumbling about how good they are, I click on offers for existing customers. Bummer, it goes to offers for new customers, so I click again on the link for offers for existing customers. No cigar... it just does not work. Big data solutions: 1) Do not show an existing customer the offers for new customers unless they are the same => This is only partially doable without login, but if a customer logs in the application should always know that this is an existing customer. But in general, imagine I do this from my home going through the internet service of this vendor to their domain... an instant filter should move me into the "existing customer route". 2) Flag dead or incorrect links => I've clicked the link for "existing customer offers" at least 3 times in under 5 seconds... Identifying patterns like this is easy in Hadoop and can very quickly make a list of potentially incorrect links. No need for realtime fixing, just the fact that this link can be pro-actively fixed across my entire web domain is a good thing. Preventative maintenance! Problem 2 - Purchase cannot be completed Apart from the fact that the browsing pattern to actually get to what I want is poorly designed, my purchase never gets past a specific point. In other words, I put something into my shopping cart and when I want to move on the application either crashes (with me going to an error page) or hangs or goes into something like chat. So I try again, and again and again. I think I tried this entire path (while being logged in!!) at least 10 times over the course of 20 minutes. I also clicked on the feedback button and, frustrated as I was, tried to explain this did not work... Big Data Solutions: 1) This web site does shopping cart analysis. I got an email next day stating I have things in my shopping cart, just click here to complete my purchase. After the above experience, this just added insult to my pain... 2) What should have happened, is a Hadoop job going over all logged in customers that are on the buy flow. It should flag anyone who is trying (multiple attempts from the same user to do the same thing), analyze the shopping card, the clicks to identify what the customers wants, his feedback provided (note: always own your own website feedback, never just farm this out!!) and in a short turn around time (30 minutes to 2 hours or so) email me with a link to complete my purchase. Not with a link to my shopping cart 12 hours later, but a link to actually achieve what I wanted... Why should this company go through the big data effort? I do believe this is relatively easy to do using our Oracle Event Processing and Big Data Appliance solutions combined. It is almost so simple (to my mind) that it makes no sense that this is not in place? But, now I am ranting... Why is this interesting? It is because of $$$$. After trying really hard, I mean I did this all in the evening, and again in the morning before going to work. I kept on failing, But I really wanted this to work... so an email that said, sorry, we noticed you tried to get a bundle (the log knows what I wanted, where I failed, so easy to generate), here is the link to click and complete your purchase. And here is 2 movies on us as an apology would have kept me as a customer, and got the additional $$$$ per month for the next couple of years. It would also lead to upsell on my phone package etc. Instead, I went to a completely different company, bought service from them. Lost money for company A, negative sentiment for company A and me telling this story at the water cooler so I'm influencing more people to think negatively about company A. All in all, a loss of easy money, a ding in sentiment and image where a relatively simple solution exists and can be in place on the software I describe routinely in this blog... For those who are coming to Openworld and maybe see value in solving the above, or are thinking of how to solve this, come visit us in Moscone North - Oracle Red Lounge or in the Engineered Systems Showcase.

Read the article
Big Data – Buzz Words: What is MapReduce – Day 7 of 21

- by Pinal Dave

In yesterday’s blog post we learned what is Hadoop. In this article we will take a quick look at one of the four most important buzz words which goes around Big Data – MapReduce. What is MapReduce? MapReduce was designed by Google as a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Though, MapReduce was originally Google proprietary technology, it has been quite a generalized term in the recent time. MapReduce comprises a Map() and Reduce() procedures. Procedure Map() performance filtering and sorting operation on data where as procedure Reduce() performs a summary operation of the data. This model is based on modified concepts of the map and reduce functions commonly available in functional programing. The library where procedure Map() and Reduce() belongs is written in many different languages. The most popular free implementation of MapReduce is Apache Hadoop which we will explore tomorrow. Advantages of MapReduce Procedures The MapReduce Framework usually contains distributed servers and it runs various tasks in parallel to each other. There are various components which manages the communications between various nodes of the data and provides the high availability and fault tolerance. Programs written in MapReduce functional styles are automatically parallelized and executed on commodity machines. The MapReduce Framework takes care of the details of partitioning the data and executing the processes on distributed server on run time. During this process if there is any disaster the framework provides high availability and other available modes take care of the responsibility of the failed node. As you can clearly see more this entire MapReduce Frameworks provides much more than just Map() and Reduce() procedures; it provides scalability and fault tolerance as well. A typical implementation of the MapReduce Framework processes many petabytes of data and thousands of the processing machines. How do MapReduce Framework Works? A typical MapReduce Framework contains petabytes of the data and thousands of the nodes. Here is the basic explanation of the MapReduce Procedures which uses this massive commodity of the servers. Map() Procedure There is always a master node in this infrastructure which takes an input. Right after taking input master node divides it into smaller sub-inputs or sub-problems. These sub-problems are distributed to worker nodes. A worker node later processes them and does necessary analysis. Once the worker node completes the process with this sub-problem it returns it back to master node. Reduce() Procedure All the worker nodes return the answer to the sub-problem assigned to them to master node. The master node collects the answer and once again aggregate that in the form of the answer to the original big problem which was assigned master node. The MapReduce Framework does the above Map () and Reduce () procedure in the parallel and independent to each other. All the Map() procedures can run parallel to each other and once each worker node had completed their task they can send it back to master code to compile it with a single answer. This particular procedure can be very effective when it is implemented on a very large amount of data (Big Data). The MapReduce Framework has five different steps: Preparing Map() Input Executing User Provided Map() Code Shuffle Map Output to Reduce Processor Executing User Provided Reduce Code Producing the Final Output Here is the Dataflow of MapReduce Framework: Input Reader Map Function Partition Function Compare Function Reduce Function Output Writer In a future blog post of this 31 day series we will explore various components of MapReduce in Detail. MapReduce in a Single Statement MapReduce is equivalent to SELECT and GROUP BY of a relational database for a very large database. Tomorrow In tomorrow’s blog post we will discuss Buzz Word – HDFS. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

Read the article
Big Data – Operational Databases Supporting Big Data – Key-Value Pair Databases and Document Databases – Day 13 of 21

- by Pinal Dave

In yesterday’s blog post we learned the importance of the Relational Database and NoSQL database in the Big Data Story. In this article we will understand the role of Key-Value Pair Databases and Document Databases Supporting Big Data Story. Now we will see a few of the examples of the operational databases. Relational Databases (Yesterday’s post) NoSQL Databases (Yesterday’s post) Key-Value Pair Databases (This post) Document Databases (This post) Columnar Databases (Tomorrow’s post) Graph Databases (Tomorrow’s post) Spatial Databases (Tomorrow’s post) Key Value Pair Databases Key Value Pair Databases are also known as KVP databases. A key is a field name and attribute, an identifier. The content of that field is its value, the data that is being identified and stored. They have a very simple implementation of NoSQL database concepts. They do not have schema hence they are very flexible as well as scalable. The disadvantages of Key Value Pair (KVP) database are that they do not follow ACID (Atomicity, Consistency, Isolation, Durability) properties. Additionally, it will require data architects to plan for data placement, replication as well as high availability. In KVP databases the data is stored as strings. Here is a simple example of how Key Value Database will look like: Key Value Name Pinal Dave Color Blue Twitter @pinaldave Name Nupur Dave Movie The Hero As the number of users grow in Key Value Pair databases it starts getting difficult to manage the entire database. As there is no specific schema or rules associated with the database, there are chances that database grows exponentially as well. It is very crucial to select the right Key Value Pair Database which offers an additional set of tools to manage the data and provides finer control over various business aspects of the same. Riak Rick is one of the most popular Key Value Database. It is known for its scalability and performance in high volume and velocity database. Additionally, it implements a mechanism for collection key and values which further helps to build manageable system. We will further discuss Riak in future blog posts. Key Value Databases are a good choice for social media, communities, caching layers for connecting other databases. In simpler words, whenever we required flexibility of the data storage keeping scalability in mind – KVP databases are good options to consider. Document Database There are two different kinds of document databases. 1) Full document Content (web pages, word docs etc) and 2) Storing Document Components for storage. The second types of the document database we are talking about over here. They use Javascript Object Notation (JSON) and Binary JSON for the structure of the documents. JSON is very easy to understand language and it is very easy to write for applications. There are two major structures of JSON used for Document Database – 1) Name Value Pairs and 2) Ordered List. MongoDB and CouchDB are two of the most popular Open Source NonRelational Document Database. MongoDB MongoDB databases are called collections. Each collection is build of documents and each document is composed of fields. MongoDB collections can be indexed for optimal performance. MongoDB ecosystem is highly available, supports query services as well as MapReduce. It is often used in high volume content management system. CouchDB CouchDB databases are composed of documents which consists fields and attachments (known as description). It supports ACID properties. The main attraction points of CouchDB are that it will continue to operate even though network connectivity is sketchy. Due to this nature CouchDB prefers local data storage. Document Database is a good choice of the database when users have to generate dynamic reports from elements which are changing very frequently. A good example of document usages is in real time analytics in social networking or content management system. Tomorrow In tomorrow’s blog post we will discuss about various other Operational Databases supporting Big Data. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

Read the article
Big Data – Buzz Words: What is NoSQL – Day 5 of 21

- by Pinal Dave

In yesterday’s blog post we explored the basic architecture of Big Data . In this article we will take a quick look at one of the four most important buzz words which goes around Big Data – NoSQL. What is NoSQL? NoSQL stands for Not Relational SQL or Not Only SQL. Lots of people think that NoSQL means there is No SQL, which is not true – they both sound same but the meaning is totally different. NoSQL does use SQL but it uses more than SQL to achieve its goal. As per Wikipedia’s NoSQL Database Definition – “A NoSQL database provides a mechanism for storage and retrieval of data that uses looser consistency models than traditional relational databases.“ Why use NoSQL? A traditional relation database usually deals with predictable structured data. Whereas as the world has moved forward with unstructured data we often see the limitations of the traditional relational database in dealing with them. For example, nowadays we have data in format of SMS, wave files, photos and video format. It is a bit difficult to manage them by using a traditional relational database. I often see people using BLOB filed to store such a data. BLOB can store the data but when we have to retrieve them or even process them the same BLOB is extremely slow in processing the unstructured data. A NoSQL database is the type of database that can handle unstructured, unorganized and unpredictable data that our business needs it. Along with the support to unstructured data, the other advantage of NoSQL Database is high performance and high availability. Eventual Consistency Additionally to note that NoSQL Database may not provided 100% ACID (Atomicity, Consistency, Isolation, Durability) compliance. Though, NoSQL Database does not support ACID they provide eventual consistency. That means over the long period of time all updates can be expected to propagate eventually through the system and data will be consistent. Taxonomy Taxonomy is the practice of classification of things or concepts and the principles. The NoSQL taxonomy supports column store, document store, key-value stores, and graph databases. We will discuss the taxonomy in detail in later blog posts. Here are few of the examples of the each of the No SQL Category. Column: Hbase, Cassandra, Accumulo Document: MongoDB, Couchbase, Raven Key-value : Dynamo, Riak, Azure, Redis, Cache, GT.m Graph: Neo4J, Allegro, Virtuoso, Bigdata As of now there are over 150 NoSQL Database and you can read everything about them in this single link. Tomorrow In tomorrow’s blog post we will discuss Buzz Word – Hadoop. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

Read the article
Using the Static Code Analysis feature of Visual Studio (Premium/Ultimate) to find memory leakage problems

- by terje

Memory for managed code is handled by the garbage collector, but if you use any kind of unmanaged code, like native resources of any kind, open files, streams and window handles, your application may leak memory if these are not properly handled. To handle such resources the classes that own these in your application should implement the IDisposable interface, and preferably implement it according to the pattern described for that interface. When you suspect a memory leak, the immediate impulse would be to start up a memory profiler and start digging into that. However, before you follow that impulse, do a Static Code Analysis run with a ruleset tuned to finding possible memory leaks in your code. If you get any warnings from this, fix them before you go on with the profiling. How to use a ruleset In Visual Studio 2010 (Premium and Ultimate editions) you can define your own rulesets containing a list of Static Code Analysis checks. I have defined the memory checks as shown in the lists below as ruleset files, which can be downloaded – see bottom of this post. When you get them, you can easily attach them to every project in your solution using the Solution Properties dialog. Right click the solution, and choose Properties at the bottom, or use the Analyze menu and choose “Configure Code Analysis for Solution”: In this dialog you can now choose the Memorycheck ruleset for every project you want to investigate. Pressing Apply or Ok opens every project file and changes the projects code analysis ruleset to the one we have specified here. How to define your own ruleset (skip this if you just download my predefined rulesets) If you want to define the ruleset yourself, open the properties on any project, choose Code Analysis tab near the bottom, choose any ruleset in the drop box and press Open Clear out all the rules by selecting “Source Rule Sets” in the Group By box, and unselect the box Change the Group By box to ID, and select the checks you want to include from the lists below. Note that you can change the action for each check to either warning, error or none, none being the same as unchecking the check. Now go to the properties window and set a new name and description for your ruleset. Then save (File/Save as) the ruleset using the new name as its name, and use it for your projects as detailed above. It can also be wise to add the ruleset to your solution as a solution item. That way it’s there if you want to enable Code Analysis in some of your TFS builds. Running the code analysis In Visual Studio 2010 you can either do your code analysis project by project using the context menu in the solution explorer and choose “Run Code Analysis”, you can define a new solution configuration, call it for example Debug (Code Analysis), in for each project here enable the Enable Code Analysis on Build In Visual Studio Dev-11 it is all much simpler, just go to the Solution root in the Solution explorer, right click and choose “Run code analysis on solution”. The ruleset checks The following list is the essential and critical memory checks. CheckID Message Can be ignored ? Link to description with fix suggestions CA1001 Types that own disposable fields should be disposable No http://msdn.microsoft.com/en-us/library/ms182172.aspx CA1049 Types that own native resources should be disposable Only if the pointers assumed to point to unmanaged resources point to something else http://msdn.microsoft.com/en-us/library/ms182173.aspx CA1063 Implement IDisposable correctly No http://msdn.microsoft.com/en-us/library/ms244737.aspx CA2000 Dispose objects before losing scope No http://msdn.microsoft.com/en-us/library/ms182289.aspx CA2115 1 Call GC.KeepAlive when using native resources See description http://msdn.microsoft.com/en-us/library/ms182300.aspx CA2213 Disposable fields should be disposed If you are not responsible for release, of if Dispose occurs at deeper level http://msdn.microsoft.com/en-us/library/ms182328.aspx CA2215 Dispose methods should call base class dispose Only if call to base happens at deeper calling level http://msdn.microsoft.com/en-us/library/ms182330.aspx CA2216 Disposable types should declare a finalizer Only if type does not implement IDisposable for the purpose of releasing unmanaged resources http://msdn.microsoft.com/en-us/library/ms182329.aspx CA2220 Finalizers should call base class finalizers No http://msdn.microsoft.com/en-us/library/ms182341.aspx Notes: 1) Does not result in memory leak, but may cause the application to crash The list below is a set of optional checks that may be enabled for your ruleset, because the issues these points too often happen as a result of attempting to fix up the warnings from the first set. ID Message Type of fault Can be ignored ? Link to description with fix suggestions CA1060 Move P/invokes to NativeMethods class Security No http://msdn.microsoft.com/en-us/library/ms182161.aspx CA1816 Call GC.SuppressFinalize correctly Performance Sometimes, see description http://msdn.microsoft.com/en-us/library/ms182269.aspx CA1821 Remove empty finalizers Performance No http://msdn.microsoft.com/en-us/library/bb264476.aspx CA2004 Remove calls to GC.KeepAlive Performance and maintainability Only if not technically correct to convert to SafeHandle http://msdn.microsoft.com/en-us/library/ms182293.aspx CA2006 Use SafeHandle to encapsulate native resources Security No http://msdn.microsoft.com/en-us/library/ms182294.aspx CA2202 Do not dispose of objects multiple times Exception (System.ObjectDisposedException) No http://msdn.microsoft.com/en-us/library/ms182334.aspx CA2205 Use managed equivalents of Win32 API Maintainability and complexity Only if the replace doesn’t provide needed functionality http://msdn.microsoft.com/en-us/library/ms182365.aspx CA2221 Finalizers should be protected Incorrect implementation, only possible in MSIL coding No http://msdn.microsoft.com/en-us/library/ms182340.aspx Downloadable ruleset definitions I have defined three rulesets, one called Inmeta.Memorycheck with the rules in the first list above, and Inmeta.Memorycheck.Optionals containing the rules in the second list, and the last one called Inmeta.Memorycheck.All containing the sum of the two first ones. All three rulesets can be found in the zip archive “Inmeta.Memorycheck” downloadable from here. Links to some other resources relevant to Static Code Analysis MSDN Magazine Article by Mickey Gousset on Static Code Analysis in VS2010 MSDN : Analyzing Managed Code Quality by Using Code Analysis, root of the documentation for this Preventing generated code from being analyzed using attributes Online training course on Using Code Analysis with VS2010 Blogpost by Tatham Oddie on custom code analysis rules How to write custom rules, from Microsoft Code Analysis Team Blog Microsoft Code Analysis Team Blog

Read the article
Big Data – Buzz Words: What is Hadoop – Day 6 of 21

- by Pinal Dave

In yesterday’s blog post we learned what is NoSQL. In this article we will take a quick look at one of the four most important buzz words which goes around Big Data – Hadoop. What is Hadoop? Apache Hadoop is an open-source, free and Java based software framework offers a powerful distributed platform to store and manage Big Data. It is licensed under an Apache V2 license. It runs applications on large clusters of commodity hardware and it processes thousands of terabytes of data on thousands of the nodes. Hadoop is inspired from Google’s MapReduce and Google File System (GFS) papers. The major advantage of Hadoop framework is that it provides reliability and high availability. What are the core components of Hadoop? There are two major components of the Hadoop framework and both fo them does two of the important task for it. Hadoop MapReduce is the method to split a larger data problem into smaller chunk and distribute it to many different commodity servers. Each server have their own set of resources and they have processed them locally. Once the commodity server has processed the data they send it back collectively to main server. This is effectively a process where we process large data effectively and efficiently. (We will understand this in tomorrow’s blog post). Hadoop Distributed File System (HDFS) is a virtual file system. There is a big difference between any other file system and Hadoop. When we move a file on HDFS, it is automatically split into many small pieces. These small chunks of the file are replicated and stored on other servers (usually 3) for the fault tolerance or high availability. (We will understand this in the day after tomorrow’s blog post). Besides above two core components Hadoop project also contains following modules as well. Hadoop Common: Common utilities for the other Hadoop modules Hadoop Yarn: A framework for job scheduling and cluster resource management There are a few other projects (like Pig, Hive) related to above Hadoop as well which we will gradually explore in later blog posts. A Multi-node Hadoop Cluster Architecture Now let us quickly see the architecture of the a multi-node Hadoop cluster. A small Hadoop cluster includes a single master node and multiple worker or slave node. As discussed earlier, the entire cluster contains two layers. One of the layer of MapReduce Layer and another is of HDFC Layer. Each of these layer have its own relevant component. The master node consists of a JobTracker, TaskTracker, NameNode and DataNode. A slave or worker node consists of a DataNode and TaskTracker. It is also possible that slave node or worker node is only data or compute node. The matter of the fact that is the key feature of the Hadoop. In this introductory blog post we will stop here while describing the architecture of Hadoop. In a future blog post of this 31 day series we will explore various components of Hadoop Architecture in Detail. Why Use Hadoop? There are many advantages of using Hadoop. Let me quickly list them over here: Robust and Scalable – We can add new nodes as needed as well modify them. Affordable and Cost Effective – We do not need any special hardware for running Hadoop. We can just use commodity server. Adaptive and Flexible – Hadoop is built keeping in mind that it will handle structured and unstructured data. Highly Available and Fault Tolerant – When a node fails, the Hadoop framework automatically fails over to another node. Why Hadoop is named as Hadoop? In year 2005 Hadoop was created by Doug Cutting and Mike Cafarella while working at Yahoo. Doug Cutting named Hadoop after his son’s toy elephant. Tomorrow In tomorrow’s blog post we will discuss Buzz Word – MapReduce. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

Read the article
Books or help on OO Analysis

- by Pat

I have this course where we learn about the domain model, use cases, contracts and eventually leap into class diagrams and sequence diagrams to define good software classes. I just had an exam and I got trashed, but part of the reason is we barely have any practical material, I spent at least two good months without drawing a single class diagram by myself from a case study. I'm not here to blame the system or the class I'm in, I'm just wondering if people have some exercise-style books that either provide domain models with glossaries, system sequence diagrams and ask you to use GRASP to make software classes? I could really use some alone-time practicing going from analysis to conception of software entities. I'm almost done with Larman's book called "Applying UML and Patterns An Introduction to Object-Oriented Analysis and Design and Iterative Development, Third Edition". It's a good book, but I'm not doing anything by myself since it doesn't come with exercises. Thanks.

Read the article
"continue" and "break" for static analysis

- by B. VB.

I know there have been a number of discussions of whether break and continue should be considered harmful generally (with the bottom line being - more or less - that it depends; in some cases they enhance clarity and readability, but in other cases they do not). Suppose a new project is starting development, with plans for nightly builds including a run through a static analyzer. Should it be part of the coding guidelines for the project to avoid (or strongly discourage) the use of continue and break, even if it can sacrifice a little readability and require excessive indentation? I'm most interested in how this applies to C code. Essentially, can the use of these control operators significantly complicate the static analysis of the code possibly resulting in additional false negatives, that would otherwise register a potential fault if break or continue were not used? (Of course a complete static analysis proving the correctness of an aribtrary program is an undecidable proposition, so please keep responses about any hands-on experience with this you have, and not on theoretical impossibilities) Thanks in advance!

Read the article
How to use OO for data analysis? [closed]

- by Konsta

In which ways could object-orientation (OO) make my data analysis more efficient and let me reuse more of my code? The data analysis can be broken up into get data (from db or csv or similar) transform data (filter, group/pivot, ...) display/plot (graph timeseries, create tables, etc.) I mostly use Python and its Pandas and Matplotlib packages for this besides some DB connectivity (SQL). Almost all of my code is a functional/procedural mix. While I have started to create a data object for a certain collection of time series, I wonder if there are OO design patterns/approaches for other parts of the process that might increase efficiency?

Read the article
New Big Data Appliance Security Features

- by mgubar

The Oracle Big Data Appliance (BDA) is an engineered system for big data processing. It greatly simplifies the deployment of an optimized Hadoop Cluster – whether that cluster is used for batch or real-time processing. The vast majority of BDA customers are integrating the appliance with their Oracle Databases and they have certain expectations – especially around security. Oracle Database customers have benefited from a rich set of security features: encryption, redaction, data masking, database firewall, label based access control – and much, much more. They want similar capabilities with their Hadoop cluster. Unfortunately, Hadoop wasn’t developed with security in mind. By default, a Hadoop cluster is insecure – the antithesis of an Oracle Database. Some critical security features have been implemented – but even those capabilities are arduous to setup and configure. Oracle believes that a key element of an optimized appliance is that its data should be secure. Therefore, by default the BDA delivers the “AAA of security”: authentication, authorization and auditing. Security Starts at Authentication A successful security strategy is predicated on strong authentication – for both users and software services. Consider the default configuration for a newly installed Oracle Database; it’s been a long time since you had a legitimate chance at accessing the database using the credentials “system/manager” or “scott/tiger”. The default Oracle Database policy is to lock accounts thereby restricting access; administrators must consciously grant access to users. Default Authentication in Hadoop By default, a Hadoop cluster fails the authentication test. For example, it is easy for a malicious user to masquerade as any other user on the system. Consider the following scenario that illustrates how a user can access any data on a Hadoop cluster by masquerading as a more privileged user. In our scenario, the Hadoop cluster contains sensitive salary information in the file /user/hrdata/salaries.txt. When logged in as the hr user, you can see the following files. Notice, we’re using the Hadoop command line utilities for accessing the data: $ hadoop fs -ls /user/hrdataFound 1 items-rw-r--r-- 1 oracle supergroup 70 2013-10-31 10:38 /user/hrdata/salaries.txt$ hadoop fs -cat /user/hrdata/salaries.txtTom Brady,11000000Tom Hanks,5000000Bob Smith,250000Oprah,300000000 User DrEvil has access to the cluster – and can see that there is an interesting folder called “hrdata”. $ hadoop fs -ls /user Found 1 items drwx------ - hr supergroup 0 2013-10-31 10:38 /user/hrdata However, DrEvil cannot view the contents of the folder due to lack of access privileges: $ hadoop fs -ls /user/hrdata ls: Permission denied: user=drevil, access=READ_EXECUTE, inode="/user/hrdata":oracle:supergroup:drwx------ Accessing this data will not be a problem for DrEvil. He knows that the hr user owns the data by looking at the folder’s ACLs. To overcome this challenge, he will simply masquerade as the hr user. On his local machine, he adds the hr user, assigns that user a password, and then accesses the data on the Hadoop cluster: $ sudo useradd hr $ sudo passwd $ su hr $ hadoop fs -cat /user/hrdata/salaries.txt Tom Brady,11000000 Tom Hanks,5000000 Bob Smith,250000 Oprah,300000000 Hadoop has not authenticated the user; it trusts that the identity that has been presented is indeed the hr user. Therefore, sensitive data has been easily compromised. Clearly, the default security policy is inappropriate and dangerous to many organizations storing critical data in HDFS. Big Data Appliance Provides Secure Authentication The BDA provides secure authentication to the Hadoop cluster by default – preventing the type of masquerading described above. It accomplishes this thru Kerberos integration. Figure 1: Kerberos Integration The Key Distribution Center (KDC) is a server that has two components: an authentication server and a ticket granting service. The authentication server validates the identity of the user and service. Once authenticated, a client must request a ticket from the ticket granting service – allowing it to access the BDA’s NameNode, JobTracker, etc. At installation, you simply point the BDA to an external KDC or automatically install a highly available KDC on the BDA itself. Kerberos will then provide strong authentication for not just the end user – but also for important Hadoop services running on the appliance. You can now guarantee that users are who they claim to be – and rogue services (like fake data nodes) are not added to the system. It is common for organizations to want to leverage existing LDAP servers for common user and group management. Kerberos integrates with LDAP servers – allowing the principals and encryption keys to be stored in the common repository. This simplifies the deployment and administration of the secure environment. Authorize Access to Sensitive Data Kerberos-based authentication ensures secure access to the system and the establishment of a trusted identity – a prerequisite for any authorization scheme. Once this identity is established, you need to authorize access to the data. HDFS will authorize access to files using ACLs with the authorization specification applied using classic Linux-style commands like chmod and chown (e.g. hadoop fs -chown oracle:oracle /user/hrdata changes the ownership of the /user/hrdata folder to oracle). Authorization is applied at the user or group level – utilizing group membership found in the Linux environment (i.e. /etc/group) or in the LDAP server. For SQL-based data stores – like Hive and Impala – finer grained access control is required. Access to databases, tables, columns, etc. must be controlled. And, you want to leverage roles to facilitate administration. Apache Sentry is a new project that delivers fine grained access control; both Cloudera and Oracle are the project’s founding members. Sentry satisfies the following three authorization requirements: Secure Authorization: the ability to control access to data and/or privileges on data for authenticated users. Fine-Grained Authorization: the ability to give users access to a subset of the data (e.g. column) in a database Role-Based Authorization: the ability to create/apply template-based privileges based on functional roles. With Sentry, “all”, “select” or “insert” privileges are granted to an object. The descendants of that object automatically inherit that privilege. A collection of privileges across many objects may be aggregated into a role – and users/groups are then assigned that role. This leads to simplified administration of security across the system. Figure 2: Object Hierarchy – granting a privilege on the database object will be inherited by its tables and views. Sentry is currently used by both Hive and Impala – but it is a framework that other data sources can leverage when offering fine-grained authorization. For example, one can expect Sentry to deliver authorization capabilities to Cloudera Search in the near future. Audit Hadoop Cluster Activity Auditing is a critical component to a secure system and is oftentimes required for SOX, PCI and other regulations. The BDA integrates with Oracle Audit Vault and Database Firewall – tracking different types of activity taking place on the cluster: Figure 3: Monitored Hadoop services. At the lowest level, every operation that accesses data in HDFS is captured. The HDFS audit log identifies the user who accessed the file, the time that file was accessed, the type of access (read, write, delete, list, etc.) and whether or not that file access was successful. The other auditing features include: MapReduce: correlate the MapReduce job that accessed the file Oozie: describes who ran what as part of a workflow Hive: captures changes were made to the Hive metadata The audit data is captured in the Audit Vault Server – which integrates audit activity from a variety of sources, adding databases (Oracle, DB2, SQL Server) and operating systems to activity from the BDA. Figure 4: Consolidated audit data across the enterprise. Once the data is in the Audit Vault server, you can leverage a rich set of prebuilt and custom reports to monitor all the activity in the enterprise. In addition, alerts may be defined to trigger violations of audit policies. Conclusion Security cannot be considered an afterthought in big data deployments. Across most organizations, Hadoop is managing sensitive data that must be protected; it is not simply crunching publicly available information used for search applications. The BDA provides a strong security foundation – ensuring users are only allowed to view authorized data and that data access is audited in a consolidated framework.

Read the article
Big 0 theta notation

- by niggersak

Can some pls help with the solution Use big-O notation to classify the traditional grade school algorithms for addition and multiplication. That is, if asked to add two numbers each having N digits, how many individual additions must be performed? If asked to multiply two N-digit numbers, how many individual multiplications are required? Suppose f is a function that returns the result of reversing the string of symbols given as its input, and g is a function that returns the concatenation of the two strings given as its input. If x is the string hrwa, what is returned by g(f(x),x)? Explain your answer - don't just provide the result!

Read the article
Microsoft SQL Server 2012 Analysis Services – The BISM Tabular Model #ssas #tabular #bism

- by Marco Russo (SQLBI)

I, Alberto and Chris spent many months (many nights, holidays and also working days of the last months) writing the book we would have liked to read when we started working with Analysis Services Tabular. A book that explains how to use Tabular, how to model data with Tabular, how Tabular internally works and how to optimize a Tabular model. All those things you need to start on a real project in order to make an happy customer. You know, we’re all consultants after all, so customer satisfaction is really important to be paid for our job! Now the book writing is finished, we’re in the final stage of editing and reviews and we look forward to get our print copy. Its title is very long: Microsoft SQL Server 2012 Analysis Services – The BISM Tabular Model. But the important thing is that you can already (pre)order it. This is the list of chapters: 01. BISM Architecture 02. Guided Tour on Tabular 03. Loading Data Inside Tabular 04. DAX Basics 05. Understanding Evaluation Contexts 06. Querying Tabular 07. DAX Advanced 08. Understanding Time Intelligence in DAX 09. Vertipaq Engine 10. Using Tabular Hierarchies 11. Data modeling in Tabular 12. Using Advanced Tabular Relationships 13. Tabular Presentation Layer 14. Tabular and PowerPivot for Excel 15. Tabular Security 16. Interfacing with Tabular 17. Tabular Deployment 18. Optimization and Monitoring And this is the book cover – have a good read!

Read the article
Big Blue Bets Big on Red Hat's Virtualization Technology

Virtually Speaking: In the next battleground for the cloud, IBM's choice of Red Hat's virtualization technology may be changing the landscape, but don't count VMware out just yet.

Read the article
Big Data – Buzz Words: What is NewSQL – Day 10 of 21

- by Pinal Dave

In yesterday’s blog post we learned the importance of the relational database. In this article we will take a quick look at the what is NewSQL. What is NewSQL? NewSQL stands for new scalable and high performance SQL Database vendors. The products sold by NewSQL vendors are horizontally scalable. NewSQL is not kind of databases but it is about vendors who supports emerging data products with relational database properties (like ACID, Transaction etc.) along with high performance. Products from NewSQL vendors usually follow in memory data for speedy access as well are available immediate scalability. NewSQL term was coined by 451 groups analyst Matthew Aslett in this particular blog post. On the definition of NewSQL, Aslett writes: “NewSQL” is our shorthand for the various new scalable/high performance SQL database vendors. We have previously referred to these products as ‘ScalableSQL‘ to differentiate them from the incumbent relational database products. Since this implies horizontal scalability, which is not necessarily a feature of all the products, we adopted the term ‘NewSQL’ in the new report. And to clarify, like NoSQL, NewSQL is not to be taken too literally: the new thing about the NewSQL vendors is the vendor, not the SQL. In other words - NewSQL incorporates the concepts and principles of Structured Query Language (SQL) and NoSQL languages. It combines reliability of SQL with the speed and performance of NoSQL. Categories of NewSQL There are three major categories of the NewSQL New Architecture – In this framework each node owns a subset of the data and queries are split into smaller query to sent to nodes to process the data. E.g. NuoDB, Clustrix, VoltDB MySQL Engines – Highly Optimized storage engine for SQL with the interface of MySQ Lare the example of such category. E.g. InnoDB, Akiban Transparent Sharding – This system automatically split database across multiple nodes. E.g. Scalearc Summary In simple words – NewSQL is kind of database following relational database principals and provides scalability like NoSQL. Tomorrow In tomorrow’s blog post we will discuss about the Role of Cloud Computing in Big Data. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

Read the article

Search Results

Search found 11362 results on 455 pages for 'big o analysis'.

Page 2/455 | < Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >

- by dan.mcclary

- by Marco Russo (SQLBI)

- by Marco Russo (SQLBI)

- by james creasy

- by Jean-Pierre Dijcks

- by Pinal Dave

- by Lara Rubbelke

- by Jean-Pierre Dijcks

- by Rahul

- by Marco Russo (SQLBI)

- by Jean-Pierre Dijcks

- by Pinal Dave

- by Pinal Dave

- by Pinal Dave

- by terje

- by Pinal Dave

- by Pat

- by B. VB.

- by Konsta

- by mgubar

- by niggersak

- by Marco Russo (SQLBI)

- by Pinal Dave

< Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >