Search Results

Search found 9062 results on 363 pages for 'big o'.

Page 6/363 | < Previous Page | 2 3 4 5 6 7 8 9 10 11 12 13  | Next Page >

  • How Big Data and Social Won the Election

    - by Mike Stiles
    The story of big data’s influence on the outcome of the US Presidential election is worth a good look, because a) it’s a harbinger of things to come, and b) it’s an example of similar successes available to any enterprise seriously resourcing integrated big data, modeling, and data-driven execution on all assets, including social. Obama campaign manager Jim Messina fielded a data and analytics brain trust 5 times larger than 2008. At that time, there were numerous databases from various sources, few of them talking to each other. This time, the mission was to be metrics-centered and measure everything measurable, and in context with all the other data. Big data showed them exactly what they needed to know and told them what to do about it. It showed them women 40-49 on the west coast would donate big money if they got to eat with George Clooney. Women on the east coast would pony up to hang out with Sarah Jessica Parker. Extensive daily modeling showed them what kinds of email appeals, from who, and to whom, would prove most successful in raising cash, recruiting volunteers, and getting out the vote. Swing state voters were profiled and approached with more customized targeting that at any time in history. Ads were purchased on specific shows watched by the targets, increasing efficiency 14% over traditional media buys. For all the criticism of the candidate’s focus on appearing on comedy and entertainment shows, and local radio morning shows, that’s where the data sent them to reach the voters most likely to turn out for them. And then there was social. Again, more than in any other election, Facebook was used for virtual, highly efficient door-to-door canvasing. Facebook fans got pictures of friends in swing states and were asked to encourage them to act. Using that approach, 1 in 5 peer-to-peer appeals led to the desired action. Assumptions, gut, intuition, campaign experience, all took a backseat to strategy shifts solidly backed up by data. Zeroing in on demographics likely to back the President and tracking their mood daily literally changed the voter landscape. The Romney team watched Obama voters appear seemingly out of thin air. One Obama campaign aide said, “We ran the election 66,000 times every night.” Which brings us to your organization. If you’re starting to feel like the battle-cry of “but this is the way we’ve always done it” is starting to put you in an extremely vulnerable position, you’re right. Social has become a key communication tool of the 21st century. Failing to use it, or failing to invest in a deep understanding of who your customers and prospects are so the content you post there will achieve desired actions and results, will leave you waking up one morning wondering, “What happened?”@mikestilesPhoto stock.xchng

    Read the article

  • Oracle Social Analytics with the Big Data Appliance

    - by thegreeneman
    Found an awesome demo put together by one of the Oracle NoSQL Database partners, eDBA, on using the Big Data Appliance to do social analytics. In this video, James Anthony is showing off the BDA, Hadoop, the Oracle Big Data Connectors and how they can be used and integrated with the Oracle Database to do an end-to-end sentiment analysis leveraging twitter data.   A really great demo worth the view. 

    Read the article

  • SQL Server 2012 disponible en version finale : AlwaysOn, Big Data, Power View, Microsoft tient ses promesses

    SQL Server 2012 disponible en version finale AlwaysOn, Big Data, Power View, la plateforme de gestion et d'analyse d'information de Microsoft tient ses promesses Mise à jour du 03/04/2012 Comme l'avait promis Microsoft, la version finale de SQL Server 2012 est disponible depuis le 1er avril, mais a été annoncée officiellement hier. La plateforme de gestion et d'analyse d'information de Microsoft a été conçue pour être l'environnement de référence des applications critiques d'entreprise, offrir une solution décisionnelle plus complète intégrant le Big Data et permettre une meilleure connexion avec le Cloud. ...

    Read the article

  • NRF Big Show 2011 -- Part 3

    - by David Dorf
    I'm back from the NRF show having been one of the lucky people who's flight was not canceled. The show was very crowded with a reported 20% increase in attendance and everyone seemed in high spirits. After two years of sluggish retail sales, things are really picking up and it was reflected in everyone's mood. The pop-up Disney Store in the Oracle booth was great and attracted lots of interest in their mobile POS. I know many attendees visited the Disney Store in Times Square to see the entire operation. It's an impressive two-story store that keeps kids engaged. The POS demonstration station, where most of our innovations were demoed, was always crowded. Unfortunately most of the demos used WiFi and the signals from other booths prevented anything from working reliably. Nevertheless, the demo team did an excellent job walking people through the scenarios and explaining how shopping is being impacted by mobile, analytics, and RFID. Big Show Links Disney uncovers its store magic Top 10 Things You Missed at the NRF Big Show 2011 Oracle Retail Stores Innovation Station at NRF Big Show 2011 (video) The buzz of the show was again around mobile solutions. Several companies are creating mobile POS using the iPod Touch, including integrations to Oracle POS for the following retailers: Disney Stores with InfoGain Victoria's Secret with InfoGain Urban Outfitters with Starmount The Gap with Global Bay Keeping with the mobile theme, the NRF release a revised version of their Mobile Blueprint at NRF. It will be posted to the NRF site very soon. The alternate payments section had a major rewrite that provides a great overview and proximity and remote payment technologies. NRF Mobile Blueprint Links New mobile blueprint provides fresh insights NRF Mobile Blueprint 2011 (slides) I hope to do some posts on some of the interesting companies I spoke with in the coming weeks.

    Read the article

  • Big Visible Charts

    - by Robert May
    An important part of Agile is the concept of transparency and visibility. In proper functioning teams, stakeholders can look at any team at any time in the iteration or release and see how that team is doing by simply looking at what we call Big Visible Charts. If you’ve done Scrum, you’ve seen these charts. However, interpreting these charts can often be an art form. There are several different charts that can be useful. In this newsletter, I’ll focus on the Iteration Burndown and Cumulative Flow charts. I’ve included a copy of the spreadsheet that I used to create the charts, and if you don’t have a tool that creates them for you, you can use this spreadsheet to do so. Our preferred tool for managing Scrum projects is Rally. Rally creates all of these charts for you, saving you quite a bit of time. The Iteration Burndown and Cumulative Flow Charts This is the main chart that teams use. Although less useful to stakeholders, this chart is critical to the team and provides quite a bit of information to the team about how their iteration is going. Most charts are a combination of the charts below, so you may need to combine aspects of each section to understand what is happening in your iterations. Ideal Ah, isn’t that a pretty picture? Unfortunately, it’s also very unrealistic. I’ve seen iterations that come close to ideal, but never that match perfectly. If your iteration matches perfectly, chances are, someone is playing with the numbers. Reality is just too difficult to have a burndown chart that matches this exactly. Late Planning Iteration started, but the team didn’t. You can tell this by the fact that the real number of estimated hours didn’t appear until day two. In the cumulative flow, you can also see that nothing was defined in Day one and two. You want to avoid situations like this. You’ll note that the team had to burn faster than is ideal to meet the iteration because of the late planning. This often results in long weeks and days. Testing Starved Determining whether or not testing is starved is difficult without the cumulative flow. The pattern in the burndown could be nothing more that developers not completing stories early enough or could be caused by stories being too big. With the cumulative flow, however, you see that only small bites are in progress and stories were completed early, but testing didn’t start testing until the end of the iteration, and didn’t complete testing all stories in the iteration. When this happens, question whether or not your testing resources are sufficient for your team and whether or not acceptance is adequately defined. No Testing With this one, both graphs show the same thing; the team needs testers and testing! Without testing, what was completed cannot be verified to make sure that it is acceptable to the business. If you find yourself in this situation, review your testing practices and acceptance testing process and make changes today. Late Development With this situation, both graphs tell a story. In the top graph, you can see that the hours failed to burn down as quickly as the team expected. This could be caused by the team not correctly estimating their hours or the team could have had illness or some other issue that affected them. Often, when teams are tackling something that is more unknown, they’ll run into technical barriers that cause the burn down to happen slower than expected. In the cumulative flow graph, you can see that not much was completed in the first few days. This could be because of illness or technical barriers or simply poor estimation. Testing was able to keep up with everything that was completed, however. No Tool Updating When you see graphs that look like this, you can be assured that it’s because the team is not updating the tool that generates the graphs. Review your policy for when they are to update. On the teams that I run, I require that each team member updates the tool at least once daily. You should also check to see how well the team is breaking down stories into tasks. If they’re creating few large tasks, graphs can look similar to this. As a general rule, I never allow tasks, other than Unit Testing and Uncertainty, to be greater than eight hours in duration. Scope Increase I always encourage team members to enter in however much time they think they have left on a task, even if that means increasing the total amount of time left to do. You get a much better and more realistic picture this way. Increasing time remaining could explain the burndown graph, but by looking at the cumulative flow graph, we can see that stories were added to the iteration and scope was increased. Since planning should consume all of the hours in the iteration, this is almost always a bad thing. If the scope change happened late in the iteration and the hours remaining were well below the ideal burn, then increasing scope is probably o.k., but estimation needs to get better. However, with the charts above, that’s clearly not what happened and the team was required to do extra work to make the iteration. If you find this happening, your product owner and ScrumMasters need training. The team also needs to learn to say no. Scope Decrease Scope decreases are just as bad as scope increases. Usually, graphs above show that the team did a poor job of estimating their stories and part way through had to reduce scope to change the iteration. This will happen once in a while, but if you find it’s a pattern on your team, you need to re-evaluate planning. Some teams are hopelessly optimistic. In those cases, I’ll introduce a task I call “Uncertainty.” With Uncertainty, the team estimates how many hours they might need if things don’t go well with the tasks they’ve defined. They try to estimate things that could go poorly and increase the time appropriately. Having an Uncertainty task allows them to have a low and high estimate. Uncertainty should not just be an arbitrary buffer. It must correlate to real uncertainty in the tasks that have been defined. Stories are too Big Often, we see graphs like the ones above. Note that the burndown looks fairly good, other than the chunky acceptance of stories. However, when you look at cumulative flow, you can see that at one point, everything is in progress. This is a bad thing. When you see graphs like this, you’re in one of two states. You may just have a very small team and can only handle one or two stories in your iteration. If you have more than one or two people, then the most likely problem is that your stories are far too big. To combat this, break large high hour stories into smaller pieces that can be completed independently and accepted independently. If you don’t, you’ll likely be requiring your testers to do heroic things to complete testing on the last day of the iteration and you’re much more likely to have the entire iteration fail, because of the limited amount of things that can be completed. Summary There are other charts that can be useful when doing scrum. If you don’t have any big visible charts, you really need to evaluate your process and change. These charts can provide the team a wealth of information and help you write better software. If you have any questions about charts that you’re seeing on your team, contact me with a screen capture of the charts and I’ll tell you what I’m seeing in those charts. I always want this information to be useful, so please let me know if you have other questions. Technorati Tags: Agile

    Read the article

  • Windows Azure Recipe: Big Data

    - by Clint Edmonson
    As the name implies, what we’re talking about here is the explosion of electronic data that comes from huge volumes of transactions, devices, and sensors being captured by businesses today. This data often comes in unstructured formats and/or too fast for us to effectively process in real time. Collectively, we call these the 4 big data V’s: Volume, Velocity, Variety, and Variability. These qualities make this type of data best managed by NoSQL systems like Hadoop, rather than by conventional Relational Database Management System (RDBMS). We know that there are patterns hidden inside this data that might provide competitive insight into market trends.  The key is knowing when and how to leverage these “No SQL” tools combined with traditional business such as SQL-based relational databases and warehouses and other business intelligence tools. Drivers Petabyte scale data collection and storage Business intelligence and insight Solution The sketch below shows one of many big data solutions using Hadoop’s unique highly scalable storage and parallel processing capabilities combined with Microsoft Office’s Business Intelligence Components to access the data in the cluster. Ingredients Hadoop – this big data industry heavyweight provides both large scale data storage infrastructure and a highly parallelized map-reduce processing engine to crunch through the data efficiently. Here are the key pieces of the environment: Pig - a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Mahout - a machine learning library with algorithms for clustering, classification and batch based collaborative filtering that are implemented on top of Apache Hadoop using the map/reduce paradigm. Hive - data warehouse software built on top of Apache Hadoop that facilitates querying and managing large datasets residing in distributed storage. Directly accessible to Microsoft Office and other consumers via add-ins and the Hive ODBC data driver. Pegasus - a Peta-scale graph mining system that runs in parallel, distributed manner on top of Hadoop and that provides algorithms for important graph mining tasks such as Degree, PageRank, Random Walk with Restart (RWR), Radius, and Connected Components. Sqoop - a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases. Flume - a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large log data amounts to HDFS. Database – directly accessible to Hadoop via the Sqoop based Microsoft SQL Server Connector for Apache Hadoop, data can be efficiently transferred to traditional relational data stores for replication, reporting, or other needs. Reporting – provides easily consumable reporting when combined with a database being fed from the Hadoop environment. Training These links point to online Windows Azure training labs where you can learn more about the individual ingredients described above. Hadoop Learning Resources (20+ tutorials and labs) Huge collection of resources for learning about all aspects of Apache Hadoop-based development on Windows Azure and the Hadoop and Windows Azure Ecosystems SQL Azure (7 labs) Microsoft SQL Azure delivers on the Microsoft Data Platform vision of extending the SQL Server capabilities to the cloud as web-based services, enabling you to store structured, semi-structured, and unstructured data. See my Windows Azure Resource Guide for more guidance on how to get started, including links web portals, training kits, samples, and blogs related to Windows Azure.

    Read the article

  • Is there something better than a StringBuilder for big blocks of SQL in the code

    - by Eduardo Molteni
    I'm just tired of making a big SQL statement, test it, and then paste the SQL into the code and adding all the sqlstmt.append(" at the beginning and the ") at the end. It's 2011, isn't there a better way the handle a big chunk of strings inside code? Please: don't suggest stored procedures or ORMs. edit Found the answer using XML literals and CData. Thanks to all the people that actually tried to answer the question without questioning me for not using ORM, SPs and using VB edit 2 the question leave me thinking that languages could try to make a better effort for using inline SQL with color syntax, etc. It will be cheaper that developing Linq2SQL. Just something like: dim sql = <sql> SELECT * ... </sql>

    Read the article

  • Skills to Focus on to land Big 5 Software Engineer Position

    - by Megadeth.Metallica
    Guys, I'm in my penultimate quarter of grad school and have a software engineering internship lined up at a big 5 tech company. I have dabbled a lot recently in Python and am average at Java. I want to prepare myself for coding interviews when I apply for new grad positions at the Big 5 tech companies when I graduate at the end of this year. Since I want to have a good shot at all 5 companies (Amazon,Google,Yahoo,Microsoft and Apple) - Should I focus my time and effort on mastering and improving my Java. Or is my time better spent checking out other languages and tools ( Attracted to RoR, Clojure, Git, C# ) I am planning to spend my spring break implementing all the common algorithms and Data structure out of my algorithms textbook in Java.

    Read the article

  • Skynet Big Data Demo Using Hexbug Spider Robot, Raspberry Pi, and Java SE Embedded (Part 4)

    - by hinkmond
    Here's the first sign of life of a Hexbug Spider Robot converted to become a Skynet Big Data model T-1. Yes, this is T-1 the precursor to the Cyberdyne Systems T-101 (and you know where that will lead to...) It is demonstrating a heartbeat using a simple Java SE Embedded program to drive it. See: Skynet Model T-1 Heartbeat It's alive!!! Well, almost alive. At least there's a pulse. We'll program more to its actions next, and then finally connect it to Skynet Big Data to do more advanced stuff, like hunt for Sara Connor. Java SE Embedded programming makes it simple to create the first model in the long line of T-XXX robots to take on the world. Raspberry Pi makes connecting it all together on one simple device, easy. Next post, I'll show how the wires are connected to drive the T-1 robot. Hinkmond

    Read the article

  • how to get accepted at a big company like google [on hold]

    - by prof
    I'm 18 Years old; I started teaching myself programming when I was twelve. I've developed many projects in PHP, Javascript, Ruby, Ruby on Rails. I know a very little about C, C++, Objective C and extending PHP with extensions created in C Programming Language. Now I'm working as a freelance Web Developer with a very low salary :(, My Dream is to get a good career with very high salary so I thought of Big Companies like Google Or Microsoft. My Question is How to get Accepted on those big Companies ? What Pre-requests they want And do you need to finish collage education ?

    Read the article

  • Globacom and mCentric Deploy BDA and NoSQL Database to analyze network traffic 40x faster

    - by Jean-Pierre Dijcks
    In a fast evolving market, speed is of the essence. mCentric and Globacom leveraged Big Data Appliance, Oracle NoSQL Database to save over 35,000 Call-Processing minutes daily and analyze network traffic 40x faster.  Here are some highlights from the profile: Why Oracle “Oracle Big Data Appliance works well for very large amounts of structured and unstructured data. It is the most agile events-storage system for our collect-it-now and analyze-it-later set of business requirements. Moreover, choosing a prebuilt solution drastically reduced implementation time. We got the big data benefits without needing to assemble and tune a custom-built system, and without the hidden costs required to maintain a large number of servers in our data center. A single support license covers both the hardware and the integrated software, and we have one central point of contact for support,” said Sanjib Roy, CTO, Globacom. Implementation Process It took only five days for Oracle partner mCentric to deploy Oracle Big Data Appliance, perform the software install and configuration, certification, and resiliency testing. The entire process—from site planning to phase-I, go-live—was executed in just over ten weeks, well ahead of the four months allocated to complete the project. Oracle partner mCentric leveraged Oracle Advanced Customer Support Services’ implementation methodology to ensure configurations are tailored for peak performance, all patches are applied, and software and communications are consistently tested using proven methodologies and best practices. Read the entire profile here.

    Read the article

  • SQL – Migrate Database from SQL Server to NuoDB – A Quick Tutorial

    - by Pinal Dave
    Data is growing exponentially and every organization with growing data is thinking of next big innovation in the world of Big Data. Big data is a indeed a future for every organization at one point of the time. Just like every other next big thing, big data has its own challenges and issues. The biggest challenge associated with the big data is to find the ideal platform which supports the scalability and growth of the data. If you are a regular reader of this blog, you must be familiar with NuoDB. I have been working with NuoDB for a while and their recent release is the best thus far. NuoDB is an elastically scalable SQL database that can run on local host, datacenter and cloud-based resources. A key feature of the product is that it does not require sharding (read more here). Last week, I was able to install NuoDB in less than 90 seconds and have explored their Explorer and Admin sections. You can read about my experiences in these posts: SQL – Step by Step Guide to Download and Install NuoDB – Getting Started with NuoDB SQL – Quick Start with Admin Sections of NuoDB – Manage NuoDB Database SQL – Quick Start with Explorer Sections of NuoDB – Query NuoDB Database Many SQL Authority readers have been following me in my journey to evaluate NuoDB. One of the frequently asked questions I’ve received from you is if there is any way to migrate data from SQL Server to NuoDB. The fact is that there is indeed a way to do so and NuoDB provides a fantastic tool which can help users to do it. NuoDB Migrator is a command line utility that supports the migration of Microsoft SQL Server, MySQL, Oracle, and PostgreSQL schemas and data to NuoDB. The migration to NuoDB is a three-step process: NuoDB Migrator generates a schema for a target NuoDB database It loads data into the target NuoDB database It dumps data from the source database Let’s see how we can migrate our data from SQL Server to NuoDB using a simple three-step approach. But before we do that we will create a sample database in MSSQL and later we will migrate the same database to NuoDB: Setup Step 1: Build a sample data CREATE DATABASE [Test]; CREATE TABLE [Department]( [DepartmentID] [smallint] NOT NULL, [Name] VARCHAR(100) NOT NULL, [GroupName] VARCHAR(100) NOT NULL, [ModifiedDate] [datetime] NOT NULL, CONSTRAINT [PK_Department_DepartmentID] PRIMARY KEY CLUSTERED ( [DepartmentID] ASC ) ) ON [PRIMARY]; INSERT INTO Department SELECT * FROM AdventureWorks2012.HumanResources.Department; Note that I am using the SQL Server AdventureWorks database to build this sample table but you can build this sample table any way you prefer. Setup Step 2: Install Java 64 bit Before you can begin the migration process to NuoDB, make sure you have 64-bit Java installed on your computer. This is due to the fact that the NuoDB Migrator tool is built in Java. You can download 64-bit Java for Windows, Mac OSX, or Linux from the following link: http://java.com/en/download/manual.jsp. One more thing to remember is that you make sure that the path in your environment settings is set to your JAVA_HOME directory or else the tool will not work. Here is how you can do it: Go to My Computer >> Right Click >> Select Properties >> Click on Advanced System Settings >> Click on Environment Variables >> Click on New and enter the following values. Variable Name: JAVA_HOME Variable Value: C:\Program Files\Java\jre7 Make sure you enter your Java installation directory in the Variable Value field. Setup Step 3: Install JDBC driver for SQL Server. There are two JDBC drivers available for SQL Server.  Select the one you prefer to use by following one of the two links below: Microsoft JDBC Driver jTDS JDBC Driver In this example we will be using jTDS JDBC driver. Once you download the driver, move the driver to your NuoDB installation folder. In my case, I have moved the JAR file of the driver into the C:\Program Files\NuoDB\tools\migrator\jar folder as this is my NuoDB installation directory. Now we are all set to start the three-step migration process from SQL Server to NuoDB: Migration Step 1: NuoDB Schema Generation Here is the command I use to generate a schema of my SQL Server Database in NuoDB. First I go to the folder C:\Program Files\NuoDB\tools\migrator\bin and execute the nuodb-migrator.bat file. Note that my database name is ‘test’. Additionally my username and password is also ‘test’. You can see that my SQL Server database is running on my localhost on port 1433. Additionally, the schema of the table is ‘dbo’. nuodb-migrator schema –source.driver=net.sourceforge.jtds.jdbc.Driver –source.url=jdbc:jtds:sqlserver://localhost:1433/ –source.username=test –source.password=test –source.catalog=test –source.schema=dbo –output.path=/tmp/schema.sql The above script will generate a schema of all my SQL Server tables and will put it in the folder C:\tmp\schema.sql . You can open the schema.sql file and execute this file directly in your NuoDB instance. You can follow the link here to see how you can execute the SQL script in NuoDB. Please note that if you have not yet created the schema in the NuoDB database, you should create it before executing this step. Step 2: Generate the Dump File of the Data Once you have recreated your schema in NuoDB from SQL Server, the next step is very easy. Here we create a CSV format dump file, which will contain all the data from all the tables from the SQL Server database. The command to do so is very similar to the above command. Be aware that this step may take a bit of time based on your database size. nuodb-migrator dump –source.driver=net.sourceforge.jtds.jdbc.Driver –source.url=jdbc:jtds:sqlserver://localhost:1433/ –source.username=test –source.password=test –source.catalog=test –source.schema=dbo –output.type=csv –output.path=/tmp/dump.cat Once the above command is successfully executed you can find your CSV file in the C:\tmp\ folder. However, you do not have to do anything manually. The third and final step will take care of completing the migration process. Migration Step 3: Load the Data into NuoDB After building schema and taking a dump of the data, the very next step is essential and crucial. It will take the CSV file and load it into the NuoDB database. nuodb-migrator load –target.url=jdbc:com.nuodb://localhost:48004/mytest –target.schema=dbo –target.username=test –target.password=test –input.path=/tmp/dump.cat Please note that in the above script we are now targeting the NuoDB database, which we have already created with the name of “MyTest”. If the database does not exist, create it manually before executing the above script. I have kept the username and password as “test”, but please make sure that you create a more secure password for your database for security reasons. Voila!  You’re Done That’s it. You are done. It took 3 setup and 3 migration steps to migrate your SQL Server database to NuoDB.  You can now start exploring the database and build excellent, scale-out applications. In this blog post, I have done my best to come up with simple and easy process, which you can follow to migrate your app from SQL Server to NuoDB. Download NuoDB I strongly encourage you to download NuoDB and go through my 3-step migration tutorial from SQL Server to NuoDB. Additionally here are two very important blog post from NuoDB CTO Seth Proctor. He has written excellent blog posts on the concept of the Administrative Domains. NuoDB has this concept of an Administrative Domain, which is a collection of hosts that can run one or multiple databases.  Each database has its own TEs and SMs, but all are managed within the Admin Console for that particular domain. http://www.nuodb.com/techblog/2013/03/11/getting-started-provisioning-a-domain/ http://www.nuodb.com/techblog/2013/03/14/getting-started-running-a-database/ Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL, Technology Tagged: NuoDB

    Read the article

  • MapRedux - PowerShell and Big Data

    - by Dittenhafer Solutions
    MapRedux – #PowerShell and #Big Data Have you been hearing about “big data”, “map reduce” and other large scale computing terms over the past couple of years and been curious to dig into more detail? Have you read some of the Apache Hadoop online documentation and unfortunately concluded that it wasn't feasible to setup a “test” hadoop environment on your machine? More recently, I have read about some of Microsoft’s work to enable Hadoop on the Azure cloud. Being a "Microsoft"-leaning technologist, I am more inclinded to be successful with experimentation when on the Windows platform. Of course, it is not that I am "religious" about one set of technologies other another, but rather more experienced. Anyway, within the past couple of weeks I have been thinking about PowerShell a bit more as the 2012 PowerShell Scripting Games approach and it occured to me that PowerShell's support for Windows Remote Management (WinRM), and some other inherent features of PowerShell might lend themselves particularly well to a simple implementation of the MapReduce framework. I fired up my PowerShell ISE and started writing just to see where it would take me. Quite simply, the ScriptBlock feature combined with the ability of Invoke-Command to create remote jobs on networked servers provides much of the plumbing of a distributed computing environment. There are some limiting factors of course. Microsoft provided some default settings which prevent PowerShell from taking over a network without administrative approval first. But even with just one adjustment, a given Windows-based machine can become a node in a MapReduce-style distributed computing environment. Ok, so enough introduction. Let's talk about the code. First, any machine that will participate as a remote "node" will need WinRM enabled for remote access, as shown below. This is not exactly practical for hundreds of intended nodes, but for one (or five) machines in a test environment it does just fine. C:> winrm quickconfig WinRM is not set up to receive requests on this machine. The following changes must be made: Set the WinRM service type to auto start. Start the WinRM service. Make these changes [y/n]? y Alternatively, you could take the approach described in the Remotely enable PSRemoting post from the TechNet forum and use PowerShell to create remote scheduled tasks that will call Enable-PSRemoting on each intended node. Invoke-MapRedux Moving on, now that you have one or more remote "nodes" enabled, you can consider the actual Map and Reduce algorithms. Consider the following snippet: $MyMrResults = Invoke-MapRedux -MapReduceItem $Mr -ComputerName $MyNodes -DataSet $dataset -Verbose Invoke-MapRedux takes an instance of a MapReduceItem which references the Map and Reduce scriptblocks, an array of computer names which are the remote nodes, and the initial data set to be processed. As simple as that, you can start working with concepts of big data and the MapReduce paradigm. Now, how did we get there? I have published the initial version of my PsMapRedux PowerShell Module on GitHub. The PsMapRedux module provides the Invoke-MapRedux function described above. Feel free to browse the underlying code and even contribute to the project! In a later post, I plan to show some of the inner workings of the module, but for now let's move on to how the Map and Reduce functions are defined. Map Both the Map and Reduce functions need to follow a prescribed prototype. The prototype for a Map function in the MapRedux module is as follows. A simple scriptblock that takes one PsObject parameter and returns a hashtable. It is important to note that the PsObject $dataset parameter is a MapRedux custom object that has a "Data" property which offers an array of data to be processed by the Map function. $aMap = { Param ( [PsObject] $dataset ) # Indicate the job is running on the remote node. Write-Host ($env:computername + "::Map"); # The hashtable to return $list = @{}; # ... Perform the mapping work and prepare the $list hashtable result with your custom PSObject... # ... The $dataset has a single 'Data' property which contains an array of data rows # which is a subset of the originally submitted data set. # Return the hashtable (Key, PSObject) Write-Output $list; } Reduce Likewise, with the Reduce function a simple prototype must be followed which takes a $key and a result $dataset from the MapRedux's partitioning function (which joins the Map results by key). Again, the $dataset is a MapRedux custom object that has a "Data" property as described in the Map section. $aReduce = { Param ( [object] $key, [PSObject] $dataset ) Write-Host ($env:computername + "::Reduce - Count: " + $dataset.Data.Count) # The hashtable to return $redux = @{}; # Return Write-Output $redux; } All Together Now When everything is put together in a short example script, you implement your Map and Reduce functions, query for some starting data, build the MapReduxItem via New-MapReduxItem and call Invoke-MapRedux to get the process started: # Import the MapRedux and SQL Server providers Import-Module "MapRedux" Import-Module “sqlps” -DisableNameChecking # Query the database for a dataset Set-Location SQLSERVER:\sql\dbserver1\default\databases\myDb $query = "SELECT MyKey, Date, Value1 FROM BigData ORDER BY MyKey"; Write-Host "Query: $query" $dataset = Invoke-SqlCmd -query $query # Build the Map function $MyMap = { Param ( [PsObject] $dataset ) Write-Host ($env:computername + "::Map"); $list = @{}; foreach($row in $dataset.Data) { # Write-Host ("Key: " + $row.MyKey.ToString()); if($list.ContainsKey($row.MyKey) -eq $true) { $s = $list.Item($row.MyKey); $s.Sum += $row.Value1; $s.Count++; } else { $s = New-Object PSObject; $s | Add-Member -Type NoteProperty -Name MyKey -Value $row.MyKey; $s | Add-Member -type NoteProperty -Name Sum -Value $row.Value1; $list.Add($row.MyKey, $s); } } Write-Output $list; } $MyReduce = { Param ( [object] $key, [PSObject] $dataset ) Write-Host ($env:computername + "::Reduce - Count: " + $dataset.Data.Count) $redux = @{}; $count = 0; foreach($s in $dataset.Data) { $sum += $s.Sum; $count += 1; } # Reduce $redux.Add($s.MyKey, $sum / $count); # Return Write-Output $redux; } # Create the item data $Mr = New-MapReduxItem "My Test MapReduce Job" $MyMap $MyReduce # Array of processing nodes... $MyNodes = ("node1", "node2", "node3", "node4", "localhost") # Run the Map Reduce routine... $MyMrResults = Invoke-MapRedux -MapReduceItem $Mr -ComputerName $MyNodes -DataSet $dataset -Verbose # Show the results Set-Location C:\ $MyMrResults | Out-GridView Conclusion I hope you have seen through this article that PowerShell has a significant infrastructure available for distributed computing. While it does take some code to expose a MapReduce-style framework, much of the work is already done and PowerShell could prove to be the the easiest platform to develop and run big data jobs in your corporate data center, potentially in the Azure cloud, or certainly as an academic excerise at home or school. Follow me on Twitter to stay up to date on the continuing progress of my Powershell MapRedux module, and thanks for reading! Daniel

    Read the article

  • Data Integration 12c Raising the Big Data Roof at Oracle OpenWorld

    - by Tanu Sood
    Normal 0 false false false EN-US X-NONE X-NONE /* Style Definitions */ table.MsoNormalTable {mso-style-name:"Table Normal"; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-priority:99; mso-style-qformat:yes; mso-style-parent:""; mso-padding-alt:0in 5.4pt 0in 5.4pt; mso-para-margin:0in; mso-para-margin-bottom:.0001pt; mso-pagination:widow-orphan; font-family:"Times New Roman","serif"; mso-fareast-font-family:"MS Mincho";} Author: Dain Hansen, Director, Oracle It was an exciting OpenWorld 2013 for us in the Data Integration track. Our theme this year was all about ‘being future ready’ - previewing one of our biggest releases this year: Oracle Data Integration 12c. Just this week we followed up with this preview by announcing the general availability of 12c release for Oracle’s key data integration products: Oracle Data Integrator 12c and Oracle GoldenGate 12c. The new release delivers extreme performance, increase IT productivity, and simplify deployment, while helping IT organizations to keep pace with new data-oriented technology trends including cloud computing, big data analytics, real-time business intelligence. Normal 0 false false false EN-US X-NONE X-NONE /* Style Definitions */ table.MsoNormalTable {mso-style-name:"Table Normal"; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-priority:99; mso-style-qformat:yes; mso-style-parent:""; mso-padding-alt:0in 5.4pt 0in 5.4pt; mso-para-margin:0in; mso-para-margin-bottom:.0001pt; mso-pagination:widow-orphan; font-family:"Times New Roman","serif"; mso-fareast-font-family:"MS Mincho";} Mark Hurd's keynote on day one set the tone for the Data Integration sessions. Mark focused on big data analytics and the changing consumer expectations. Especially real-time insight is a key theme for Oracle overall and data integration products. In Mark Hurd's keynote we heard from key customers, such as Airbus and Thomson Reuters, how real-time analysis of operational data including machine data creates value, in some cases even saves lives. Thomas Kurian gave a deeper look into Oracle's big data and fast data solutions. In the initial lead Data Integration track session - Brad Adelberg, VP of Development, presented Oracle’s Data Integration 12c product strategy based on key trends from the initial OpenWorld keynotes. Brad talked about how Oracle's data integration products address the new data integration requirements that evolved with cloud computing, big data, and changing consumer expectations and how they set the key themes in our products’ road map. Brad explained why and how fast-time to value, high-performance and future-ready solutions is the top focus areas for product development. If you were not able to attend OpenWorld or this session I recommend reading the white paper: Five New Data Integration Requirements and How to Meet them with Oracle Data Integration, which provides an in-depth look into how Oracle addresses the new trends in the DI market. Following Brad’s session, Nick Wagner provided in depth review of Oracle GoldenGate’s latest features and roadmap. Nick discussed how Oracle GoldenGate’s tight integration with Oracle Database sets the product apart from the competition. We also heard that heterogeneity of the product is still a major focus for GoldenGate’s development and there will be more news on that front when there is a major release. Normal 0 false false false EN-US X-NONE X-NONE /* Style Definitions */ table.MsoNormalTable {mso-style-name:"Table Normal"; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-priority:99; mso-style-qformat:yes; mso-style-parent:""; mso-padding-alt:0in 5.4pt 0in 5.4pt; mso-para-margin:0in; mso-para-margin-bottom:.0001pt; mso-pagination:widow-orphan; font-family:"Times New Roman","serif"; mso-fareast-font-family:"MS Mincho";} After GoldenGate’s product strategy session, Denis Gray from the PM team presented Oracle Data Integrator’s product strategy session, talking about the latest and greatest on ODI. Another good session was delivered by long-time GoldenGate users, Comcast.  Jason Hurd and Amit Patel of Comcast talked about the various use cases they deploy Oracle GoldenGate throughout their enterprise, from database upgrades, feeding reporting systems, to active-active database synchronization.  The Comcast team shared many good tips on how to use GoldenGate for both zero downtime upgrades and active-active replication with conflict management requirement. One of our other important goals we had this year for the Data Integration track at OpenWorld was hearing from our customers. We ended day 1 on just that, with a wonderful award ceremony for Oracle Excellence Awards for Oracle Fusion Middleware Innovation. The ceremony was held in the Yerba Buena Center for the Arts. Congratulations to Royal Bank of Scotland and Yalumba Wine Company, the winners in the Data Integration category. You can find more information on the award and the winners in our previous blog post: 2013 Oracle Excellence Awards for Fusion Middleware Innovation… Selected for their innovation use of Oracle’s Data Integration products; the winners for the Data Integration Category are Royal Bank of Scotland and The Yalumba Wine Company. Congratulations!!! Royal Bank of Scotland’s Market and International Banking division provides clients across the globe with seamless trading and competitive pricing, underpinned by a deep knowledge of risk management across the full spectrum of financial products. They handle millions of transactions daily to keep the lifeblood of their clients’ businesses flowing – whether through payment management solutions or through bespoke trade finance solutions. Royal Bank of Scotland is leveraging Oracle GoldenGate and Oracle Data Integrator along with Oracle Business Intelligence Enterprise Edition and the Oracle Database for a variety of solutions. Mainly, Oracle GoldenGate and Oracle Data Integrator are used to feed their data warehouse – providing a real-time data integration solution that feeds transactional data to their analytics system in minutes to enable improved decision making with timely, accurate data for their business users. Oracle Data Integrator’s in-database transformation capabilities and its ability to integrate with Oracle GoldenGate for real-time data capture is the foundation of this implementation. This solution makes it such that changes happening in the analytics systems are available the same day they are deployed on the operational system with 100% data quality guaranteed. Additionally, the solution has helped to reduce their operational database size from 150GB to 10GB. Impressive! Now what if I told you this solution was built in 3 months and had a less than 6 month return on investment? That’s outstanding! The Yalumba Wine Company is situated in the Barossa Valley of Australia. It is the oldest family owned winery in Australia with a unique way of aging their wines in specially crafted 100 liter barrels. Did you know that “Yalumba” is Aboriginal for “all the land around”? The Yalumba Wine Company is growing rapidly, and was in need of introducing a more modern standard to the existing manufacturing processes to meet globalization demands, overall time-to-market, and better operational efficiency objectives of product development. The Yalumba Wine Company worked with a partner, Bristlecone to develop a unique solution whereby Oracle Data Integrator is leveraged to pull data from Salesforce.com and JD Edwards, in addition to their other pre-existing source systems, for consumption into their data warehouse. They have emphasized the overall ease of developing integration workflows with Oracle Data Integrator. The solution has brought better visibility for the business users, shorter data loading and transformation performance to their data warehouse with rapid incorporation of new data sources, and a solid future-proof foundation for their organization. Moving forward, they plan on leveraging more from Oracle’s Data Integration portfolio. Terrific! In addition to these two customers on Tuesday we featured many other important Oracle Data Integrator and Oracle GoldenGate customers. On Tuesday the GoldenGate panel included: Land O’Lakes, Smuckers, and Veolia Water. Besides giving us yummy nutrition and healthy water, these companies have another aspect in common. They all use GoldenGate to boost their ERP application. Please read the recap by Irem Radzik. On Wednesday, the ODI Panel included: Barry Ralston and Ryan Weber of Infinity Insurance, Paul Stracke of Paychex Inc., and Ian Wall of Vertex Pharmaceuticals for a session filled with interesting projects, use cases and approaches to leveraging Oracle Data Integrator. Please read the recap by Sandrine Riley for more. Thanks to everyone who joined with us and we hope to stay connected! To hear more about our Data Integration12c products join us in an upcoming webcast to learn more. Follow us www.twitter.com/ORCLGoldenGate or goto our website at www.oracle.com/goto/dataintegration

    Read the article

  • Solving Big Problems with Oracle R Enterprise, Part II

    - by dbayard
    Part II – Solving Big Problems with Oracle R Enterprise In the first post in this series (see https://blogs.oracle.com/R/entry/solving_big_problems_with_oracle), we showed how you can use R to perform historical rate of return calculations against investment data sourced from a spreadsheet.  We demonstrated the calculations against sample data for a small set of accounts.  While this worked fine, in the real-world the problem is much bigger because the amount of data is much bigger.  So much bigger that our approach in the previous post won’t scale to meet the real-world needs. From our previous post, here are the challenges we need to conquer: The actual data that needs to be used lives in a database, not in a spreadsheet The actual data is much, much bigger- too big to fit into the normal R memory space and too big to want to move across the network The overall process needs to run fast- much faster than a single processor The actual data needs to be kept secured- another reason to not want to move it from the database and across the network And the process of calculating the IRR needs to be integrated together with other database ETL activities, so that IRR’s can be calculated as part of the data warehouse refresh processes In this post, we will show how we moved from sample data environment to working with full-scale data.  This post is based on actual work we did for a financial services customer during a recent proof-of-concept. Getting started with the Database At this point, we have some sample data and our IRR function.  We were at a similar point in our customer proof-of-concept exercise- we had sample data but we did not have the full customer data yet.  So our database was empty.  But, this was easily rectified by leveraging the transparency features of Oracle R Enterprise (see https://blogs.oracle.com/R/entry/analyzing_big_data_using_the).  The following code shows how we took our sample data SimpleMWRRData and easily turned it into a new Oracle database table called IRR_DATA via ore.create().  The code also shows how we can access the database table IRR_DATA as if it was a normal R data.frame named IRR_DATA. If we go to sql*plus, we can also check out our new IRR_DATA table: At this point, we now have our sample data loaded in the database as a normal Oracle table called IRR_DATA.  So, we now proceeded to test our R function working with database data. As our first test, we retrieved the data from a single account from the IRR_DATA table, pull it into local R memory, then call our IRR function.  This worked.  No SQL coding required! Going from Crawling to Walking Now that we have shown using our R code with database-resident data for a single account, we wanted to experiment with doing this for multiple accounts.  In other words, we wanted to implement the split-apply-combine technique we discussed in our first post in this series.  Fortunately, Oracle R Enterprise provides a very scalable way to do this with a function called ore.groupApply().  You can read more about ore.groupApply() here: https://blogs.oracle.com/R/entry/analyzing_big_data_using_the1 Here is an example of how we ask ORE to take our IRR_DATA table in the database, split it by the ACCOUNT column, apply a function that calls our SimpleMWRR() calculation, and then combine the results. (If you are following along at home, be sure to have installed our myIRR package on your database server via  “R CMD INSTALL myIRR”). The interesting thing about ore.groupApply is that the calculation is not actually performed in my desktop R environment from which I am running.  What actually happens is that ore.groupApply uses the Oracle database to perform the work.  And the Oracle database is what actually splits the IRR_DATA table by ACCOUNT.  Then the Oracle database takes the data for each account and sends it to an embedded R engine running on the database server to apply our R function.  Then the Oracle database combines all the individual results from the calls to the R function. This is significant because now the embedded R engine only needs to deal with the data for a single account at a time.  Regardless of whether we have 20 accounts or 1 million accounts or more, the R engine that performs the calculation does not care.  Given that normal R has a finite amount of memory to hold data, the ore.groupApply approach overcomes the R memory scalability problem since we only need to fit the data from a single account in R memory (not all of the data for all of the accounts). Additionally, the IRR_DATA does not need to be sent from the database to my desktop R program.  Even though I am invoking ore.groupApply from my desktop R program, because the actual SimpleMWRR calculation is run by the embedded R engine on the database server, the IRR_DATA does not need to leave the database server- this is both a performance benefit because network transmission of large amounts of data take time and a security benefit because it is harder to protect private data once you start shipping around your intranet. Another benefit, which we will discuss in a few paragraphs, is the ability to leverage Oracle database parallelism to run these calculations for dozens of accounts at once. From Walking to Running ore.groupApply is rather nice, but it still has the drawback that I run this from a desktop R instance.  This is not ideal for integrating into typical operational processes like nightly data warehouse refreshes or monthly statement generation.  But, this is not an issue for ORE.  Oracle R Enterprise lets us run this from the database using regular SQL, which is easily integrated into standard operations.  That is extremely exciting and the way we actually did these calculations in the customer proof. As part of Oracle R Enterprise, it provides a SQL equivalent to ore.groupApply which it refers to as “rqGroupEval”.  To use rqGroupEval via SQL, there is a bit of simple setup needed.  Basically, the Oracle Database needs to know the structure of the input table and the grouping column, which we are able to define using the database’s pipeline table function mechanisms. Here is the setup script: At this point, our initial setup of rqGroupEval is done for the IRR_DATA table.  The next step is to define our R function to the database.  We do that via a call to ORE’s rqScriptCreate. Now we can test it.  The SQL you use to run rqGroupEval uses the Oracle database pipeline table function syntax.  The first argument to irr_dataGroupEval is a cursor defining our input.  You can add additional where clauses and subqueries to this cursor as appropriate.  The second argument is any additional inputs to the R function.  The third argument is the text of a dummy select statement.  The dummy select statement is used by the database to identify the columns and datatypes to expect the R function to return.  The fourth argument is the column of the input table to split/group by.  The final argument is the name of the R function as you defined it when you called rqScriptCreate(). The Real-World Results In our real customer proof-of-concept, we had more sophisticated calculation requirements than shown in this simplified blog example.  For instance, we had to perform the rate of return calculations for 5 separate time periods, so the R code was enhanced to do so.  In addition, some accounts needed a time-weighted rate of return to be calculated, so we extended our approach and added an R function to do that.  And finally, there were also a few more real-world data irregularities that we needed to account for, so we added logic to our R functions to deal with those exceptions.  For the full-scale customer test, we loaded the customer data onto a Half-Rack Exadata X2-2 Database Machine.  As our half-rack had 48 physical cores (and 96 threads if you consider hyperthreading), we wanted to take advantage of that CPU horsepower to speed up our calculations.  To do so with ORE, it is as simple as leveraging the Oracle Database Parallel Query features.  Let’s look at the SQL used in the customer proof: Notice that we use a parallel hint on the cursor that is the input to our rqGroupEval function.  That is all we need to do to enable Oracle to use parallel R engines. Here are a few screenshots of what this SQL looked like in the Real-Time SQL Monitor when we ran this during the proof of concept (hint: you might need to right-click on these images to be able to view the images full-screen to see the entire image): From the above, you can notice a few things (numbers 1 thru 5 below correspond with highlighted numbers on the images above.  You may need to right click on the above images and view the images full-screen to see the entire image): The SQL completed in 110 seconds (1.8minutes) We calculated rate of returns for 5 time periods for each of 911k accounts (the number of actual rows returned by the IRRSTAGEGROUPEVAL operation) We accessed 103m rows of detailed cash flow/market value data (the number of actual rows returned by the IRR_STAGE2 operation) We ran with 72 degrees of parallelism spread across 4 database servers Most of our 110seconds was spent in the “External Procedure call” event On average, we performed 8,200 executions of our R function per second (110s/911k accounts) On average, each execution was passed 110 rows of data (103m detail rows/911k accounts) On average, we did 41,000 single time period rate of return calculations per second (each of the 8,200 executions of our R function did rate of return calculations for 5 time periods) On average, we processed over 900,000 rows of database data in R per second (103m detail rows/110s) R + Oracle R Enterprise: Best of R + Best of Oracle Database This blog post series started by describing a real customer problem: how to perform a lot of calculations on a lot of data in a short period of time.  While standard R proved to be a very good fit for writing the necessary calculations, the challenge of working with a lot of data in a short period of time remained. This blog post series showed how Oracle R Enterprise enables R to be used in conjunction with the Oracle Database to overcome the data volume and performance issues (as well as simplifying the operations and security issues).  It also showed that we could calculate 5 time periods of rate of returns for almost a million individual accounts in less than 2 minutes. In a future post, we will take the same R function and show how Oracle R Connector for Hadoop can be used in the Hadoop world.  In that next post, instead of having our data in an Oracle database, our data will live in Hadoop and we will how to use the Oracle R Connector for Hadoop and other Oracle Big Data Connectors to move data between Hadoop, R, and the Oracle Database easily.

    Read the article

  • Splitting big request in multiple small ajax requests

    - by Ionut
    I am unsure regarding the scalability of the following model. I have no experience at all with large systems, big number of requests and so on but I'm trying to build some features considering scalability first. In my scenario there is a user page which contains data for: User's details (name, location, workplace ...) User's activity (blog posts, comments...) User statistics (rating, number of friends...) In order to show all this on the same page, for a request there will be at least 3 different database queries on the back-end. In some cases, I imagine that those queries will be running quite a wile, therefore the user experience may suffer while waiting between requests. This is why I decided to run only step 1 (User's details) as a normal request. After the response is received, two ajax requests are sent for steps 2 and 3. When those responses are received, I only place the data in the destined wrappers. For me at least this makes more sense. However there are 3 requests instead of one for every user page view. Will this affect the system on the long term? I'm assuming that this kind of approach requires more resources but is this trade of UX for resources a good dial or should I stick to one plain big request?

    Read the article

  • Solving Big Problems with Oracle R Enterprise, Part I

    - by dbayard
    Abstract: This blog post will show how we used Oracle R Enterprise to tackle a customer’s big calculation problem across a big data set. Overview: Databases are great for managing large amounts of data in a central place with rigorous enterprise-level controls.  R is great for doing advanced computations.  Sometimes you need to do advanced computations on large amounts of data, subject to rigorous enterprise-level concerns.  This blog post shows how Oracle R Enterprise enables R plus the Oracle Database enabled us to do some pretty sophisticated calculations across 1 million accounts (each with many detailed records) in minutes. The problem: A financial services customer of mine has a need to calculate the historical internal rate of return (IRR) for its customers’ portfolios.  This information is needed for customer statements and the online web application.  In the past, they had solved this with a home-grown application that pulled trade and account data out of their data warehouse and ran the calculations.  But this home-grown application was not able to do this fast enough, plus it was a challenge for them to write and maintain the code that did the IRR calculation. IRR – a problem that R is good at solving: Internal Rate of Return is an interesting calculation in that in most real-world scenarios it is impractical to calculate exactly.  Rather, IRR is a calculation where approximation techniques need to be used.  In this blog post, we will discuss calculating the “money weighted rate of return” but in the actual customer proof of concept we used R to calculate both money weighted rate of returns and time weighted rate of returns.  You can learn more about the money weighted rate of returns here: http://www.wikinvest.com/wiki/Money-weighted_return First Steps- Calculating IRR in R We will start with calculating the IRR in standalone/desktop R.  In our second post, we will show how to take this desktop R function, deploy it to an Oracle Database, and make it work at real-world scale.  The first step we did was to get some sample data.  For a historical IRR calculation, you have a balances and cash flows.  In our case, the customer provided us with several accounts worth of sample data in Microsoft Excel.      The above figure shows part of the spreadsheet of sample data.  The data provides balances and cash flows for a sample account (BMV=beginning market value. FLOW=cash flow in/out of account. EMV=ending market value). Once we had the sample spreadsheet, the next step we did was to read the Excel data into R.  This is something that R does well.  R offers multiple ways to work with spreadsheet data.  For instance, one could save the spreadsheet as a .csv file.  In our case, the customer provided a spreadsheet file containing multiple sheets where each sheet provided data for a different sample account.  To handle this easily, we took advantage of the RODBC package which allowed us to read the Excel data sheet-by-sheet without having to create individual .csv files.  We wrote ourselves a little helper function called getsheet() around the RODBC package.  Then we loaded all of the sample accounts into a data.frame called SimpleMWRRData. Writing the IRR function At this point, it was time to write the money weighted rate of return (MWRR) function itself.  The definition of MWRR is easily found on the internet or if you are old school you can look in an investment performance text book.  In the customer proof, we based our calculations off the ones defined in the The Handbook of Investment Performance: A User’s Guide by David Spaulding since this is the reference book used by the customer.  (One of the nice things we found during the course of this proof-of-concept is that by using R to write our IRR functions we could easily incorporate the specific variations and business rules of the customer into the calculation.) The key thing with calculating IRR is the need to solve a complex equation with a numerical approximation technique.  For IRR, you need to find the value of the rate of return (r) that sets the Net Present Value of all the flows in and out of the account to zero.  With R, we solve this by defining our NPV function: where bmv is the beginning market value, cf is a vector of cash flows, t is a vector of time (relative to the beginning), emv is the ending market value, and tend is the ending time. Since solving for r is a one-dimensional optimization problem, we decided to take advantage of R’s optimize method (http://stat.ethz.ch/R-manual/R-patched/library/stats/html/optimize.html). The optimize method can be used to find a minimum or maximum; to find the value of r where our npv function is closest to zero, we wrapped our npv function inside the abs function and asked optimize to find the minimum.  Here is an example of using optimize: where low and high are scalars that indicate the range to search for an answer.   To test this out, we need to set values for bmv, cf, t, emv, tend, low, and high.  We will set low and high to some reasonable defaults. For example, this account had a negative 2.2% money weighted rate of return. Enhancing and Packaging the IRR function With numerical approximation methods like optimize, sometimes you will not be able to find an answer with your initial set of inputs.  To account for this, our approach was to first try to find an answer for r within a narrow range, then if we did not find an answer, try calling optimize() again with a broader range.  See the R help page on optimize()  for more details about the search range and its algorithm. At this point, we can now write a simplified version of our MWRR function.  (Our real-world version is  more sophisticated in that it calculates rate of returns for 5 different time periods [since inception, last quarter, year-to-date, last year, year before last year] in a single invocation.  In our actual customer proof, we also defined time-weighted rate of return calculations.  The beauty of R is that it was very easy to add these enhancements and additional calculations to our IRR package.)To simplify code deployment, we then created a new package of our IRR functions and sample data.  For this blog post, we only need to include our SimpleMWRR function and our SimpleMWRRData sample data.  We created the shell of the package by calling: To turn this package skeleton into something usable, at a minimum you need to edit the SimpleMWRR.Rd and SimpleMWRRData.Rd files in the \man subdirectory.  In those files, you need to at least provide a value for the “title” section. Once that is done, you can change directory to the IRR directory and type at the command-line: The myIRR package for this blog post (which has both SimpleMWRR source and SimpleMWRRData sample data) is downloadable from here: myIRR package Testing the myIRR package Here is an example of testing our IRR function once it was converted to an installable package: Calculating IRR for All the Accounts So far, we have shown how to calculate IRR for a single account.  The real-world issue is how do you calculate IRR for all of the accounts?This is the kind of situation where we can leverage the “Split-Apply-Combine” approach (see http://www.cscs.umich.edu/~crshalizi/weblog/815.html).  Given that our sample data can fit in memory, one easy approach is to use R’s “by” function.  (Other approaches to Split-Apply-Combine such as plyr can also be used.  See http://4dpiecharts.com/2011/12/16/a-quick-primer-on-split-apply-combine-problems/). Here is an example showing the use of “by” to calculate the money weighted rate of return for each account in our sample data set.  Recap and Next Steps At this point, you’ve seen the power of R being used to calculate IRR.  There were several good things: R could easily work with the spreadsheets of sample data we were given R’s optimize() function provided a nice way to solve for IRR- it was both fast and allowed us to avoid having to code our own iterative approximation algorithm R was a convenient language to express the customer-specific variations, business-rules, and exceptions that often occur in real-world calculations- these could be easily added to our IRR functions The Split-Apply-Combine technique can be used to perform calculations of IRR for multiple accounts at once. However, there are several challenges yet to be conquered at this point in our story: The actual data that needs to be used lives in a database, not in a spreadsheet The actual data is much, much bigger- too big to fit into the normal R memory space and too big to want to move across the network The overall process needs to run fast- much faster than a single processor The actual data needs to be kept secured- another reason to not want to move it from the database and across the network And the process of calculating the IRR needs to be integrated together with other database ETL activities, so that IRR’s can be calculated as part of the data warehouse refresh processes In our next blog post in this series, we will show you how Oracle R Enterprise solved these challenges.

    Read the article

  • When you’re on a high, start something big

    - by BuckWoody
    Most days are pretty average – we have some highs, some lows, and just regular old work to do. But some days the sun is shining, your co-workers are especially nice, and everything just falls into place. You really *enjoy* what you do. Don’t let that moment pass. All of us have “big” projects that we need to tackle. Things that are going to take a long time, and a lot of money. Those kinds of data projects take a LOT of planning, and many times we put that off just to get to the day’s work. I’ve found that the “high” moments are the perfect time to take on these big projects. I’m more focused, and more importantly, more positive. And as the quote goes, “whether you think you can or you think you can’t, you’re probably right.” You’ll find a way to make it happen if you’re in a positive mood. Now – having those “great days” is actually something you can influence, but I’ll save that topic for a future post. I have a project to work on. :) Share this post: email it! | bookmark it! | digg it! | reddit! | kick it! | live it!

    Read the article

  • Go Big or Go Special

    - by Ajarn Mark Caldwell
    Watching Shark Tank tonight and the first presentation was by Mango Mango Preserves and it highlighted an interesting contrast in business trends today and how to capitalize on opportunities.  <Spoiler Alert> Even though every one of the sharks was raving about the product samples they tried, with two of them going for second and third servings, none of them made a deal to invest in the company.</Spoiler>  In fact, one of the sharks, Kevin O’Leary, kept ripping into the owners with statements to the effect that he thinks they are headed over a financial cliff because he felt their costs were way out of line and would be their downfall if they didn’t take action to radically cut costs. He said that he had previously owned a jams and jellies business and knew the cost ratios that you had to have to make it work.  I don’t doubt he knows exactly what he’s talking about and is 100% accurate…for doing business his way, which I’ll call “Go Big”.  But there’s a whole other way to do business today that would be ideal for these ladies to pursue. As I understand it, based on his level of success in various businesses and the fact that he is even in a position to be investing in other companies, Kevin’s approach is to go mass market (Go Big) and make hundreds of millions of dollars in sales (or something along that scale) while squeezing out every ounce of cost that you can to produce an acceptable margin.  But there is a very different way of making a very successful business these days, which is all about building a passionate and loyal community of customers that are rooting for your success and even actively trying to help you succeed by promoting your product or company (Go Special).  This capitalizes on the power of social media, niche marketing, and The Long Tail.  One of the most prolific writers about capitalizing on this trend is Seth Godin, and I hope that the founders of Mango Mango pick up a couple of his books (probably Purple Cow and Tribes would be good starts) or at least read his blog.  I think the adoration expressed by all of the sharks for the product is the biggest hint that they have a remarkable product and that they are perfect for this type of business approach. Both are completely valid business models, and it may certainly be that the scale at which Kevin O’Leary wants to conduct business where he invests his money is well beyond the long tail, but that doesn’t mean that there is not still a lot of money to be made there.  I wish them the best of luck with their endeavors!

    Read the article

  • Is it wise to store a big lump of json on a database row

    - by Ieyasu Sawada
    I have this project which stores product details from amazon into the database. Just to give you an idea on how big it is: [{"title":"Genetic Engineering (Opposing Viewpoints)","short_title":"Genetic Engineering ...","brand":"","condition":"","sales_rank":"7171426","binding":"Book","item_detail_url":"http://localhost/wordpress/product/?asin=0737705124","node_list":"Books > Science & Math > Biological Sciences > Biotechnology","node_category":"Books","subcat":"","model_number":"","item_url":"http://localhost/wordpress/wp-content/ecom-plugin-redirects/ecom_redirector.php?id=128","details_url":"http://localhost/wordpress/product/?asin=0737705124","large_image":"http://localhost/wordpress/wp-content/plugins/ecom/img/large-notfound.png","medium_image":"http://localhost/wordpress/wp-content/plugins/ecom/img/medium-notfound.png","small_image":"http://localhost/wordpress/wp-content/plugins/ecom/img/small-notfound.png","thumbnail_image":"http://localhost/wordpress/wp-content/plugins/ecom/img/thumbnail-notfound.png","tiny_img":"http://localhost/wordpress/wp-content/plugins/ecom/img/tiny-notfound.png","swatch_img":"http://localhost/wordpress/wp-content/plugins/ecom/img/swatch-notfound.png","total_images":"6","amount":"33.70","currency":"$","long_currency":"USD","price":"$33.70","price_type":"List Price","show_price_type":"0","stars_url":"","product_review":"","rating":"","yellow_star_class":"","white_star_class":"","rating_text":" of 5","reviews_url":"","review_label":"","reviews_label":"Read all ","review_count":"","create_review_url":"http://localhost/wordpress/wp-content/ecom-plugin-redirects/ecom_redirector.php?id=132","create_review_label":"Write a review","buy_url":"http://localhost/wordpress/wp-content/ecom-plugin-redirects/ecom_redirector.php?id=19186","add_to_cart_action":"http://localhost/wordpress/wp-content/ecom-plugin-redirects/add_to_cart.php","asin":"0737705124","status":"Only 7 left in stock.","snippet_condition":"in_stock","status_class":"ninstck","customer_images":["http://localhost/wordpress/wp-content/uploads/2013/10/ecom_images/51M2vvFvs2BL.jpg","http://localhost/wordpress/wp-content/uploads/2013/10/ecom_images/31FIM-YIUrL.jpg","http://localhost/wordpress/wp-content/uploads/2013/10/ecom_images/51M2vvFvs2BL.jpg","http://localhost/wordpress/wp-content/uploads/2013/10/ecom_images/51M2vvFvs2BL.jpg"],"disclaimer":"","item_attributes":[{"attr":"Author","value":"Greenhaven Press"},{"attr":"Binding","value":"Hardcover"},{"attr":"EAN","value":"9780737705126"},{"attr":"Edition","value":"1"},{"attr":"ISBN","value":"0737705124"},{"attr":"Label","value":"Greenhaven Press"},{"attr":"Manufacturer","value":"Greenhaven Press"},{"attr":"NumberOfItems","value":"1"},{"attr":"NumberOfPages","value":"224"},{"attr":"ProductGroup","value":"Book"},{"attr":"ProductTypeName","value":"ABIS_BOOK"},{"attr":"PublicationDate","value":"2000-06"},{"attr":"Publisher","value":"Greenhaven Press"},{"attr":"SKU","value":"G0737705124I2N00"},{"attr":"Studio","value":"Greenhaven Press"},{"attr":"Title","value":"Genetic Engineering (Opposing Viewpoints)"}],"customer_review_url":"http://localhost/wordpress/wp-content/ecom-customer-reviews/0737705124.html","flickr_results":["http://localhost/wordpress/wp-content/uploads/2013/10/ecom_images/5105560852_06c7d06f14_m.jpg"],"freebase_text":"No around the web data available yet","freebase_image":"http://localhost/wordpress/wp-content/plugins/ecom/img/freebase-notfound.jpg","ebay_related_items":[{"title":"Genetic Engineering (Introducing Issues With Opposing Viewpoints), , Good Book","image":"http://localhost/wordpress/wp-content/uploads/2013/10/ecom_images/140.jpg","url":"http://localhost/wordpress/wp-content/ecom-plugin-redirects/ecom_redirector.php?id=12165","currency_id":"$","current_price":"26.2"},{"title":"Genetic Engineering Opposing Viewpoints by DAVID BENDER - 1964 Hardcover","image":"http://localhost/wordpress/wp-content/uploads/2013/10/ecom_images/140.jpg","url":"http://localhost/wordpress/wp-content/ecom-plugin-redirects/ecom_redirector.php?id=130","currency_id":"AUD","current_price":"11.99"}],"no_follow":"rel=\"nofollow\"","new_tab":"target=\"_blank\"","related_products":[],"super_saver_shipping":"","shipping_availability":"","total_offers":"7","added_to_cart":""}] So the structure for the table is: asin title details (the product details in json) Will the performance suffer if I have to store like 10,000 products? Is there any other way of doing this? I'm thinking of the following, but the current setup is really the most convenient one since I also have to use the data on the client side: store the product details in a file. So something like ASIN123.json store the product details in one big file. (I'm guessing it will be a drag to extract data from this file) store each of the fields in the details in its own table field Thanks in advance!

    Read the article

< Previous Page | 2 3 4 5 6 7 8 9 10 11 12 13  | Next Page >