Search Results

Search found 66013 results on 2641 pages for 'big data analytics'.

Page 29/2641 | < Previous Page | 25 26 27 28 29 30 31 32 33 34 35 36  | Next Page >

  • make-like build tools for data?

    - by miku
    Make is a standard tools for building software. But make decides whether a target needs to be regenerated by comparing file modification times. Are there any proven, preferably small tools that handle builds not for software but for data? Something that regenerates targets not only on mod times but on certain other properties (e.g. completeness). (Or alternatively some paper that describes such a tool.) As illustration: I'd like to automate the following process: get data (e.g. a tarball) from some regularly updated source copy somewhere if it's not there (based e.g. on some filename-scheme) convert the files to different format (but only if there aren't successfully converted ones there - e.g. from a previous attempt - custom comparison routine) for each file find a certain data element and fetch some additional file from say an URL, but only if that hasn't been downloaded yet (decide on existence of file and file "freshness") finally compute something (e.g. word count for something identifiable and store it in the database, but only if the DB does not have an entry for that exact ID yet) Observations: there are different stages each stage is usually simple to compute or implement in isolation each stage may be simple, but the data volume may be large each stage may produce a few errors each stage may have different signals, on when (re)processing is needed Requirements: builds should be interruptable and idempotent (== robust) when interrupted, already processed objects should be reused to speedup the next run data paths should be easy to adjust (simple syntax, nothing new to learn, internal dsl would be ok) some form of dependency graph, that describes the process would be nice for later visualizations should leverage existing programs, if possible I've done some research on make alternatives like rake and have worked a lot with ant and maven in the past. All these tools naturally focus on code and software build, not on data builds. A system we have in place now for a task similar to the above is pretty much just shell scripts, which are compact (and are a ok glue for a variety of other programs written in other languages), so I wonder if worse is better?

    Read the article

  • Uses of persistent data structures in non-functional languages

    - by Ray Toal
    Languages that are purely functional or near-purely functional benefit from persistent data structures because they are immutable and fit well with the stateless style of functional programming. But from time to time we see libraries of persistent data structures for (state-based, OOP) languages like Java. A claim often heard in favor of persistent data structures is that because they are immutable, they are thread-safe. However, the reason that persistent data structures are thread-safe is that if one thread were to "add" an element to a persistent collection, the operation returns a new collection like the original but with the element added. Other threads therefore see the original collection. The two collections share a lot of internal state, of course -- that's why these persistent structures are efficient. But since different threads see different states of data, it would seem that persistent data structures are not in themselves sufficient to handle scenarios where one thread makes a change that is visible to other threads. For this, it seems we must use devices such as atoms, references, software transactional memory, or even classic locks and synchronization mechanisms. Why then, is the immutability of PDSs touted as something beneficial for "thread safety"? Are there any real examples where PDSs help in synchronization, or solving concurrency problems? Or are PDSs simply a way to provide a stateless interface to an object in support of a functional programming style?

    Read the article

  • Translate report data export from RUEI into HTML for import into OpenOffice Calc Spreadsheets

    - by [email protected]
    A common question of users is, How to import the data from the automated data export of Real User Experience Insight (RUEI) into tools for archiving, dashboarding or combination with other sets of data.XML is well-suited for such a translation via the companion Extensible Stylesheet Language Transformations (XSLT). Basically XSLT utilizes XSL, a template on what to read from your input XML data file and where to place it into the target document. The target document can be anything you like, i.e. XHTML, CSV, or even a OpenOffice Spreadsheet, etc. as long as it is a plain text format.XML 2 OpenOffice.org SpreadsheetFor the XSLT to work as an OpenOffice.org Calc Import Filter:How to add an XML Import Filter to OpenOffice CalcStart OpenOffice.org Calc andselect Tools > XML Filter SettingsNew...Fill in the details as follows:Filter name: RUEI Import filterApplication: OpenOffice.org Calc (.ods)Name of file type: Oracle Real User Experience InsightFile extension: xmlSwitch to the transformation tab and enter/select the following leaving the rest untouchedXSLT for import: ruei_report_data_import_filter.xslPlease see at the end of this blog post for a download of the referenced file.Select RUEI Import filter from list and Test XSLTClick on Browse to selectTransform file: export.php.xmlOpenOffice.org Calc will transform and load the XML file you retrieved from RUEI in a human-readable format.You can now select File > Open... and change the filetype to open your RUEI exports directly in OpenOffice.org Calc, just like any other a native Spreadsheet format.Files of type: Oracle Real User Experience Insight (*.xml)File name: export.php.xml XML 2 XHTMLMost XML-powered browsers provides for inherent XSL Transformation capabilities, you only have to reference the XSLT Stylesheet in the head of your XML file. Then open the file in your favourite Web Browser, Firefox, Opera, Safari or Internet Explorer alike.<?xml version="1.0" encoding="ISO-8859-1"?><!-- inserted line below --> <?xml-stylesheet type="text/xsl" href="ruei_report_data_export_2_xhtml.xsl"?><!-- inserted line above --><report>You can find a patched example export from RUEI plus the above referenced XSL-Stylesheets here: export.php.xml - Example report data export from RUEI ruei_report_data_export_2_xhtml.xsl - RUEI to XHTML XSL Transformation Stylesheetruei_report_data_import_filter.xsl - OpenOffice.org XML import filter for RUEI report export data If you would like to do things like this on the command line you can use either Xalan or xsltproc.The basic command syntax for xsltproc is very simple:xsltproc -o output.file stylesheet.xslt inputfile.xmlYou can use this with the above two stylesheets to translate RUEI Data Exports into XHTML and/or OpenOffice.org Calc ODS-Format. Or you could write your own XSLT to transform into Comma separated Value lists.Please let me know what you think or do with this information in the comments below.Kind regards,Stefan ThiemeReferences used:OpenOffice XML Filter - Create XSLT filters for import and export - http://user.services.openoffice.org/en/forum/viewtopic.php?f=45&t=3490SUN OpenOffice.org XML File Format 1.0 - http://xml.openoffice.org/xml_specification.pdf

    Read the article

  • Security Controls on data for P6 Analytics

    - by Jeffrey McDaniel
    The Star database and P6 Analytics calculates security based on P6 security using OBS, global, project, cost, and resource security considerations. If there is some concern that users are not seeing expected data in P6 Analytics here are some areas to review: 1. Determining if a user has cost security is based on the Project level security privileges - either View Project Costs/Financials or Edit EPS Financials. If expecting to see costs make sure one of these permissions are allocated.  2. User must have OBS access on a Project. Not WBS level. WBS level security is not supported. Make sure user has OBS on project level.  3. Resource Access is determined by what is granted in P6. Verify the resource access granted to this user in P6. Resource security is hierarchical. Project access will override Resource access based on the way security policies are applied. 4. Module access must be given to a P6 user for that user to come over into Star/P6 Analytics. For earlier version of RDB there was a report_user_flag on the Users table. This flag field is no longer used after P6 Reporting Database 2.1. 5. For P6 Reporting Database versions 2.2 and higher, the Extended Schema Security service must be run to calculate all security. Any changes to privileges or security this service must be rerun before any ETL. 6. In P6 Analytics 2.0 or higher, a Weblogic user must exist that matches the P6 username. For example user Tim must exist in P6 and Weblogic users for Tim to be able to log into P6 Analytics and access data based on  P6 security.  In earlier versions the username needed to exist in RPD. 7. Cache in OBI is another area that can sometimes make it seem a user isn't seeing the data they expect. While cache can be beneficial for performance in OBI. If the data is outdated it can retrieve older, stale data. Clearing or turning off cache when rerunning a query can determine if the returned result set was from cache or from the database.

    Read the article

  • Social Analytics and the Customer

    - by David Dorf
    Many successful retailers put the customer at the center of everything they do, so its important that the customer is modeled correctly across all their systems.  The path to omni-channel starts and ends with the customer so at ARTS, our next big project is focused on ensuring a consistent representation of customers across our transactional data model, datawarehouse model, and XML schemas.  Further, we've started a new whitepaper that describes how Big Data and Social Media Analytics should be leveraged by retailers to add and additional level of customer insight. Let's start by taking a closer look at the meaning of social analytics.  Here's my definition: Social Analytics, in the retail context, describes the analysis of data obtained from social media sources in an effort to better comprehend and interact with the community of consumers.  This discipline seeks to understand what’s being said by the community about brands and products (“monitoring”), as well as understand the behaviors of those in the community (“profiling”).  The results are used to enforce the brand image, improve product decisions, and better focus marketing, all of which lead to increased sales. To help illustrate the facets of social analytics, I drew the diagram below which was originally published by Retail Touchpoints. There are lots of tools on the market that allow retailers to monitor social media for brand and product mentions.  These include analysis of sentiment, reach, share of voice, engagement, etc.  When your brand is mentioned, good or bad, its an opportunity to engage with the customer and possibly lead to a sale.  Because products are not always unique, its much more difficult to monitor product mentions, but detecting product trends early can help a retailer make better merchandising decisions, especially in fashion. Once a retailer understands what's being said, the next step is learn more about who's saying it.  That involves profiling customers beyond simple demographics to understand their motivations.  Much can be learned from patterns, and even more when customers voluntarily share their data.  Knowing that a customer is passionate about, for example, mountain biking allows the retailer to make relevant offers on helmets, ask for opinions on hydration, and help spread marketing messages. Social analytics has many facets that benefit retailers, some of which are easy but many of which are hard.  Its important for the CMO and CIO to work closely together to plan for these capabilities and monitor the maturity of tools on the market.  This is an area that will separate winners from losers.

    Read the article

  • Data Auditor by Example

    - by Jinjin.Wang
    OWB has a node Data Auditors under Oracle Module in Projects Navigator. What is data auditor and how to use it? I will give an introduction to data auditor and show its usage by examples. Data auditor is an important tool in ensuring that data quality levels meet business requirements. Data auditor validates data against a set of data rules to determine which records comply and which do not. It gathers statistical metrics on how well the data in a system complies with a rule by auditing and marking how many errors are occurring against the audited table. Data auditors are typically scheduled for regular execution as part of a process flow, to monitor the quality of the data in an operational environment such as a data warehouse or ERP system, either immediately after updates like data loads, or at regular intervals. How to use data auditor to monitor data quality? Only objects with data rules can be monitored, so the first step is to define data rules according to business requirements and apply them to the objects you want to monitor. The objects can be tables, views, materialized views, and external tables. Secondly create a data auditor containing the objects. You can configure the data auditor and set physical deployment parameters for it as optional, which will be used while running the data auditor. Then deploy and run the data auditor either manually or as part of the process flow. After execution, the data auditor sets several output values, and records that are identified as not complying with the defined data rules contained in the data auditor are written to error tables. Here is an example. We have two tables DEPARTMENTS and EMPLOYEES (see pic-1 and pic-2. Click here for DDL and data) imported into OWB. We want to gather statistical metrics on how well data in these two tables satisfies the following requirements: a. Values of the EMPLOYEES.EMPLOYEE_ID attribute are three-digit numbers. b. Valid values for EMPLOYEES.JOB_ID are IT_PROG, SA_REP, SH_CLERK, PU_CLERK, and ST_CLERK. c. EMPLOYEES.EMPLOYEE_ID is related to DEPARTMENTS.MANAGER_ID. Pic-1 EMPLOYEES Pic-2 DEPARTMENTS 1. To determine legal data within EMPLOYEES or legal relationships between data in different columns of the two tables, firstly we define data rules based on the three requirements and apply them to tables. a. The first requirement is about patterns that an attribute is allowed to conform to. We create a Domain Pattern List data rule EMPLOYEE_PATTERN_RULE here. The pattern is defined in the Oracle Database regular expression syntax as ^([0-9]{3})$ Apply data rule EMPLOYEE_PATTERN_RULE to table EMPLOYEES.

    Read the article

  • Master Data Management for Location Data - Oracle Site Hub

    - by david.butler(at)oracle.com
    Most MDM discussions cover key domains such as customer, supplier, product, service, and reference data. It is usually understood that these domains have complex structures and hundreds if not thousands of attributes that need governing. Location, on the other hand, strikes most people as address data. How hard can that be? But for many industries, locations are complex, and site information is critical to efficient operations and relevant analytics. Retail stores and malls, bank branches, construction sites come to mind. But one of the best industries for illustrating the power of a site mastering application is Oil & Gas.   Oracle's Master Data Management solution for location data is the Oracle Site Hub. It is a location mastering solution that enables organizations to centralize site and location specific information from heterogeneous systems, creating a single view of site information that can be leveraged across all functional departments and analytical systems.   Let's take a look at the location entities the Oracle Site Hub can manage for the Oil & Gas industry: organizations, property, land, buildings, roads, oilfield, service center, inventory site, real estate, facilities, refineries, storage tanks, vendor locations, businesses, assets; project site, area, well, basin, pipelines, critical infrastructure, offshore platform, compressor station, gas station, etc. Any site can be classified into multiple hierarchies, like organizational hierarchy, operational hierarchy, geographic hierarchy, divisional hierarchies and so on. Any site can also be associated to multiple clusters, i.e. collections of sites, and these can be used as a foundation for driving reporting, analysis, organize daily work, etc. Hierarchies can also be used to model entities which are structured or non-structured collections of nodes, like for example routes, pipelines and more. The User Defined Attribute Framework provides the needed infrastructure to add single row attributes groups like well base attributes (well IDs, well type, well structure and key characterizing measures, and more) and well geometry, and multi row attribute groups like well applications, permits, production data, activities, operations, logs, treatments, tests, drills, treatments, and KPIs. Site Hub can also model areas, lands, fields, basins, pools, platforms, eco-zones, and stratigraphic layers as specific sites, tracking their base attributes, aliases, descriptions, subcomponents and more. Midstream entities (pipelines, logistic sites, pump stations) and downstream entities (cylinders, tanks, inventories, meters, partner's sites, routes, facilities, gas stations, and competitor sites) can also be easily modeled, together with their specific attributes and relationships. Site Hub can store any type of unstructured data associated to a site. This could be stored directly or on an external content management solution, like Oracle Universal Content Management. Considering a well, for example, Site Hub can store any relevant associated multimedia file such as: CAD drawings of the well profile, structure and/or parts, engineering documents, contracts, applications, permits, logs, pictures, photos, videos and more. For any site entity, Site Hub can associate all the related assets and equipments at the site, as well as all relationships between sites, between a site and multiple parties, and between a site and any purchasable or sellable item, over time. Items can be equipment, instruments, facilities, services, products, production entities, production facilities (pipelines, batteries, compressor stations, gas plants, meters, separators, etc.), support facilities (rigs, roads, transmission or radio towers, airstrips, etc.), supplier products and services, catalogs, and more. Items can just be associated to sites using standard Site Hub features, or they can be fully mastered by implementing Oracle Product Hub. Site locations (addresses or geographical coordinates) are also managed with out-of-the-box address geo-coding capabilities coupled with Google Maps integration to deliver powerful mapping capabilities and spatial data analysis. Locations can be shared between different sites. Centered on the site location, any site can also have associated areas. Site Hub can master any site location specific information, like for example cadastral, ownership, jurisdictional, geological, seismic and more, and any site-centric area specific information, like for example economical, political, risk, weather, logistic, traffic information and more. Now if anyone ever asks you why locations need MDM, think about how all these Oil & Gas entities and attributes would translate into your business locations. To learn more about Oracle's full MDM solution for the digital oil field, here is a link to Roberto Negro's outstanding whitepaper: Oracle Site Master Data Management for mastering wells and other PPDM entities in a digital oilfield context  

    Read the article

  • Optimal Data Structure for our own API

    - by vermiculus
    I'm in the early stages of writing an Emacs major mode for the Stack Exchange network; if you use Emacs regularly, this will benefit you in the end. In order to minimize the number of calls made to Stack Exchange's API (capped at 10000 per IP per day) and to just be a generally responsible citizen, I want to cache the information I receive from the network and store it in memory, waiting to be accessed again. I'm really stuck as to what data structure to store this information in. Obviously, it is going to be a list. However, as with any data structure, the choice must be determined by what data is being stored and what how it will be accessed. What, I would like to be able to store all of this information in a single symbol such as stack-api/cache. So, without further ado, stack-api/cache is a list of conses keyed by last update: `(<csite> <csite> <csite>) where <csite> would be (1362501715 . <site>) At this point, all we've done is define a simple association list. Of course, we must go deeper. Each <site> is a list of the API parameter (unique) followed by a list questions: `("codereview" <cquestion> <cquestion> <cquestion>) Each <cquestion> is, you guessed it, a cons of questions with their last update time: `(1362501715 <question>) (1362501720 . <question>) <question> is a cons of a question structure and a list of answers (again, consed with their last update time): `(<question-structure> <canswer> <canswer> <canswer> and ` `(1362501715 . <answer-structure>) This data structure is likely most accurately described as a tree, but I don't know if there's a better way to do this considering the language, Emacs Lisp (which isn't all that different from the Lisp you know and love at all). The explicit conses are likely unnecessary, but it helps my brain wrap around it better. I'm pretty sure a <csite>, for example, would just turn into (<epoch-time> <api-param> <cquestion> <cquestion> ...) Concerns: Does storing data in a potentially huge structure like this have any performance trade-offs for the system? I would like to avoid storing extraneous data, but I've done what I could and I don't think the dataset is that large in the first place (for normal use) since it's all just human-readable text in reasonable proportion. (I'm planning on culling old data using the times at the head of the list; each inherits its last-update time from its children and so-on down the tree. To what extent this cull should take place: I'm not sure.) Does storing data like this have any performance trade-offs for that which must use it? That is, will set and retrieve operations suffer from the size of the list? Do you have any other suggestions as to what a better structure might look like?

    Read the article

  • How do you handle the fetchxml result data?

    - by Luke Baulch
    I have avoided working with fetchxml as I have been unsure the best way to handle the result data after calling crmService.Fetch(fetchXml). In a couple of situations, I have used an XDocument with LINQ to retrieve the data from this data structure, such as: XDocument resultset = XDocument.Parse(_service.Fetch(fetchXml)); if (resultset.Root == null || !resultset.Root.Elements("result").Any()) { return; } foreach (var displayItem in resultset.Root.Elements("result").Select(item => item.Element(displayAttributeName)).Distinct()) { if (displayItem!= null && displayItem.Value != null) { dropDownList.Items.Add(displayItem.Value); } } What is the best way to handle fetchxml result data, so that it can be easily used. Applications such as passing these records into an ASP.NET datagrid would be quite useful.

    Read the article

  • Generate Entity Data Model from Data Contract

    - by CSmooth.net
    I would like to find a fast way to convert a Data Contract to a Entity Data Model. Consider the following Data Contract: [DataContract] class PigeonHouse { [DataMember] public string housename; [DataMember] public List<Pigeon> pigeons; } [DataContract] class Pigeon { [DataMember] public string name; [DataMember] public int numberOfWings; [DataMember] public int age; } Is there an easy way to create an ADO.NET Entity Data Model from this code?

    Read the article

  • Detecting a Lightweight Core Data Migration

    - by hadronzoo
    I'm using Core Data's automatic lightweight migration successfully. However, when a particular entity gets created during a migration, I'd like to populate it with some data. Of course I could check if the entity is empty every time the application starts, but this seems inefficient when Core Data has a migration framework. Is it possible to detect when a lightweight migration occurs (possibly using KVO or notifications), or does this require implementing standard migrations? I've tried using the NSPersistentStoreCoordinatorStoresDidChangeNotification, but it doesn't fire when migrations occur.

    Read the article

  • Getting started with massive data

    - by Max
    I'm a math guy and occasionally do some statistics/machine learning analysis consulting projects on the side. The data I have access to are usually on the smaller side, at most a couple hundred of megabytes (and almost always far less), but I want to learn more about handling and analyzing data on the gigabyte/terabyte scale. What do I need to know and what are some good resources to learn from? Hadoop/MapReduce is one obvious start. Is there a particular programming language I should pick up? (I primarily work now in Python, Ruby, R, and occasionally Java, but it seems like C and Clojure are often used for large-scale data analysis?) I'm not really familiar with the whole NoSQL movement, except that it's associated with big data. What's a good place to learn about it, and is there a particular implementation (Cassandra, CouchDB, etc.) I should get familiar with? Where can I learn about applying machine learning algorithms to huge amounts of data? My math background is mostly on the theory side, definitely not on the numerical or approximation side, and I'm guessing most of the standard ML algorithms don't really scale. Any other suggestions on things to learn would be great!

    Read the article

  • what is best way to store long term data in iphone Core Data or SQLLite?

    - by AmitSri
    Hi all, I am working on i-Phone app targeting 3.1.3 and later SDK. I want to know the best way to store user's long term data on i-phone without losing performance, consistency and security. I know, that i can use Core Data, PList and SQL-Lite for storing user specific data in custom formats.But, want to know which one is good to use without compromising app performance and scalability in near future. Thanks

    Read the article

  • How to view existing data in Core Data?

    - by mshsayem
    Well, may be this question is silly, but I couldn't find a way (except programmatically). I built a project (for iPhone OS 3.0) which uses Core Data. The xcdatamodel file shows the schema description, but I want to see the data in tabular form (like the management studio for mssql server or phpmyadmin for mysql). Is there any way (except coding)? What is that? Also, which file (in disk/device) those data are stored into? [ I built the tutorial (from apple) on Core Data, named Locations. They used this line somewhere in the code: NSURL *storeUrl = [NSURL fileURLWithPath: [[self applicationDocumentsDirectory] stringByAppendingPathComponent: @"Locations.sqlite"]]; But, I did not see any "xxxxx.sqlite" file in project location (nor in the disk).]

    Read the article

  • Efficient alternatives to merge for larger data.frames R

    - by Etienne Low-Décarie
    I am looking for an efficient (both computer resource wise and learning/implementation wise) method to merge two larger (size1 million / 300 KB RData file) data frames. "merge" in base R and "join" in plyr appear to use up all my memory effectively crashing my system. Example load test data frame and try test.merged<-merge(test, test) or test.merged<-join(test, test, type="all") - The following post provides a list of merge and alternatives: How to join data frames in R (inner, outer, left, right)? The following allows object size inspection: https://heuristically.wordpress.com/2010/01/04/r-memory-usage-statistics-variable/ Data produced by anonym

    Read the article

  • find the top K most frequent numbers in a data stream

    - by Jin
    This is more of a data structure question rather than a coding question. If I am fetching a data stream, i.e, I keep receiving float numbers once at a time, how should I keep track of the top K frequent numbers? Here my memory is 4G and I prefer to have less communication with hard drive unless necessary. I think heap is good for updating the max and min. How should I design the data structure? Thanks

    Read the article

  • Introducing the Industry's First Analytics Machine, Oracle Exalytics

    - by Manan Goel
    Analytics is all about gaining insights from the data for better decision making. The business press is abuzz with examples of leading organizations across the world using data-driven insights for strategic, financial and operational excellence. A recent study on “data-driven decision making” conducted by researchers at MIT and Wharton provides empirical evidence that “firms that adopt data-driven decision making have output and productivity that is 5-6% higher than the competition”. The potential payoff for firms can range from higher shareholder value to a market leadership position. However, the vision of delivering fast, interactive, insightful analytics has remained elusive for most organizations. Most enterprise IT organizations continue to struggle to deliver actionable analytics due to time-sensitive, sprawling requirements and ever tightening budgets. The issue is further exasperated by the fact that most enterprise analytics solutions require dealing with a number of hardware, software, storage and networking vendors and precious resources are wasted integrating the hardware and software components to deliver a complete analytical solution. Oracle Exalytics In-Memory Machine is the world’s first engineered system specifically designed to deliver high performance analysis, modeling and planning. Built using industry-standard hardware, market-leading business intelligence software and in-memory database technology, Oracle Exalytics is an optimized system that delivers answers to all your business questions with unmatched speed, intelligence, simplicity and manageability. Oracle Exalytics’s unmatched speed, visualizations and scalability delivers extreme performance for existing analytical and enterprise performance management applications and enables a new class of intelligent applications like Yield Management, Revenue Management, Demand Forecasting, Inventory Management, Pricing Optimization, Profitability Management, Rolling Forecast and Virtual Close etc. Requiring no application redesign, Oracle Exalytics can be deployed in existing IT environments by itself or in conjunction with Oracle Exadata and/or Oracle Exalogic to enable extreme performance and best in class user experience. Based on proven hardware, software and in-memory technology, Oracle Exalytics lowers the total cost of ownership, reduces operational risk and provides unprecedented analytical capability for workgroup, departmental and enterprise wide deployments. Click here to learn more about Oracle Exalytics.  

    Read the article

  • Solving Big Problems with Oracle R Enterprise, Part I

    - by dbayard
    Abstract: This blog post will show how we used Oracle R Enterprise to tackle a customer’s big calculation problem across a big data set. Overview: Databases are great for managing large amounts of data in a central place with rigorous enterprise-level controls.  R is great for doing advanced computations.  Sometimes you need to do advanced computations on large amounts of data, subject to rigorous enterprise-level concerns.  This blog post shows how Oracle R Enterprise enables R plus the Oracle Database enabled us to do some pretty sophisticated calculations across 1 million accounts (each with many detailed records) in minutes. The problem: A financial services customer of mine has a need to calculate the historical internal rate of return (IRR) for its customers’ portfolios.  This information is needed for customer statements and the online web application.  In the past, they had solved this with a home-grown application that pulled trade and account data out of their data warehouse and ran the calculations.  But this home-grown application was not able to do this fast enough, plus it was a challenge for them to write and maintain the code that did the IRR calculation. IRR – a problem that R is good at solving: Internal Rate of Return is an interesting calculation in that in most real-world scenarios it is impractical to calculate exactly.  Rather, IRR is a calculation where approximation techniques need to be used.  In this blog post, we will discuss calculating the “money weighted rate of return” but in the actual customer proof of concept we used R to calculate both money weighted rate of returns and time weighted rate of returns.  You can learn more about the money weighted rate of returns here: http://www.wikinvest.com/wiki/Money-weighted_return First Steps- Calculating IRR in R We will start with calculating the IRR in standalone/desktop R.  In our second post, we will show how to take this desktop R function, deploy it to an Oracle Database, and make it work at real-world scale.  The first step we did was to get some sample data.  For a historical IRR calculation, you have a balances and cash flows.  In our case, the customer provided us with several accounts worth of sample data in Microsoft Excel.      The above figure shows part of the spreadsheet of sample data.  The data provides balances and cash flows for a sample account (BMV=beginning market value. FLOW=cash flow in/out of account. EMV=ending market value). Once we had the sample spreadsheet, the next step we did was to read the Excel data into R.  This is something that R does well.  R offers multiple ways to work with spreadsheet data.  For instance, one could save the spreadsheet as a .csv file.  In our case, the customer provided a spreadsheet file containing multiple sheets where each sheet provided data for a different sample account.  To handle this easily, we took advantage of the RODBC package which allowed us to read the Excel data sheet-by-sheet without having to create individual .csv files.  We wrote ourselves a little helper function called getsheet() around the RODBC package.  Then we loaded all of the sample accounts into a data.frame called SimpleMWRRData. Writing the IRR function At this point, it was time to write the money weighted rate of return (MWRR) function itself.  The definition of MWRR is easily found on the internet or if you are old school you can look in an investment performance text book.  In the customer proof, we based our calculations off the ones defined in the The Handbook of Investment Performance: A User’s Guide by David Spaulding since this is the reference book used by the customer.  (One of the nice things we found during the course of this proof-of-concept is that by using R to write our IRR functions we could easily incorporate the specific variations and business rules of the customer into the calculation.) The key thing with calculating IRR is the need to solve a complex equation with a numerical approximation technique.  For IRR, you need to find the value of the rate of return (r) that sets the Net Present Value of all the flows in and out of the account to zero.  With R, we solve this by defining our NPV function: where bmv is the beginning market value, cf is a vector of cash flows, t is a vector of time (relative to the beginning), emv is the ending market value, and tend is the ending time. Since solving for r is a one-dimensional optimization problem, we decided to take advantage of R’s optimize method (http://stat.ethz.ch/R-manual/R-patched/library/stats/html/optimize.html). The optimize method can be used to find a minimum or maximum; to find the value of r where our npv function is closest to zero, we wrapped our npv function inside the abs function and asked optimize to find the minimum.  Here is an example of using optimize: where low and high are scalars that indicate the range to search for an answer.   To test this out, we need to set values for bmv, cf, t, emv, tend, low, and high.  We will set low and high to some reasonable defaults. For example, this account had a negative 2.2% money weighted rate of return. Enhancing and Packaging the IRR function With numerical approximation methods like optimize, sometimes you will not be able to find an answer with your initial set of inputs.  To account for this, our approach was to first try to find an answer for r within a narrow range, then if we did not find an answer, try calling optimize() again with a broader range.  See the R help page on optimize()  for more details about the search range and its algorithm. At this point, we can now write a simplified version of our MWRR function.  (Our real-world version is  more sophisticated in that it calculates rate of returns for 5 different time periods [since inception, last quarter, year-to-date, last year, year before last year] in a single invocation.  In our actual customer proof, we also defined time-weighted rate of return calculations.  The beauty of R is that it was very easy to add these enhancements and additional calculations to our IRR package.)To simplify code deployment, we then created a new package of our IRR functions and sample data.  For this blog post, we only need to include our SimpleMWRR function and our SimpleMWRRData sample data.  We created the shell of the package by calling: To turn this package skeleton into something usable, at a minimum you need to edit the SimpleMWRR.Rd and SimpleMWRRData.Rd files in the \man subdirectory.  In those files, you need to at least provide a value for the “title” section. Once that is done, you can change directory to the IRR directory and type at the command-line: The myIRR package for this blog post (which has both SimpleMWRR source and SimpleMWRRData sample data) is downloadable from here: myIRR package Testing the myIRR package Here is an example of testing our IRR function once it was converted to an installable package: Calculating IRR for All the Accounts So far, we have shown how to calculate IRR for a single account.  The real-world issue is how do you calculate IRR for all of the accounts?This is the kind of situation where we can leverage the “Split-Apply-Combine” approach (see http://www.cscs.umich.edu/~crshalizi/weblog/815.html).  Given that our sample data can fit in memory, one easy approach is to use R’s “by” function.  (Other approaches to Split-Apply-Combine such as plyr can also be used.  See http://4dpiecharts.com/2011/12/16/a-quick-primer-on-split-apply-combine-problems/). Here is an example showing the use of “by” to calculate the money weighted rate of return for each account in our sample data set.  Recap and Next Steps At this point, you’ve seen the power of R being used to calculate IRR.  There were several good things: R could easily work with the spreadsheets of sample data we were given R’s optimize() function provided a nice way to solve for IRR- it was both fast and allowed us to avoid having to code our own iterative approximation algorithm R was a convenient language to express the customer-specific variations, business-rules, and exceptions that often occur in real-world calculations- these could be easily added to our IRR functions The Split-Apply-Combine technique can be used to perform calculations of IRR for multiple accounts at once. However, there are several challenges yet to be conquered at this point in our story: The actual data that needs to be used lives in a database, not in a spreadsheet The actual data is much, much bigger- too big to fit into the normal R memory space and too big to want to move across the network The overall process needs to run fast- much faster than a single processor The actual data needs to be kept secured- another reason to not want to move it from the database and across the network And the process of calculating the IRR needs to be integrated together with other database ETL activities, so that IRR’s can be calculated as part of the data warehouse refresh processes In our next blog post in this series, we will show you how Oracle R Enterprise solved these challenges.

    Read the article

  • Best practices on what data to collect in an in-app web analytics

    - by Anton Gogolev
    Hi! In our SaaSy webapp we need to collect Google Analytics-like data (like, what pages were visited, how many 404s where there, etc.). I wonder if there are any best practices on what pieces of information should be collected (like, IP, User Agent, etc.) and how should these logs be stored. Requirements on what statistics we're going to display are not yet fixed, but I want to have a starting point. Can you help me out with this? Thanks.

    Read the article

  • Form Journey Tracking using google analytics

    - by Nick Lowman
    Hi there, Is there anyway, using google analytics, to track a user's journey/selections through a long form so I can see where they drop off? I've created a 'contact us' form which starts with drop down menu which requires the user to make a choice i.e. apply for job, apply for funding etc. and then each option requires the user to fill out a form, which is completed over serval steps. Is there a way to track a user's individual form choices from their initial selection on the Contact Us page through to the form being submitted? That way I could see where in the form journey the users are dropping off.

    Read the article

  • Regular expression to validate a Google Analytics UA Number

    - by Otis
    It's not 100 percent clear to me that the Google Analytics UA Numbers are always 6 digits, a dash, and 2 digits as Google often mentions in their documentation. There are frequent counter-examples that use fewer than 6 for the account portion and 1-4 for the profile. All of the examples always show numbers but it's not even clear that they can't be letters. Does anyone know if Google has published a regex that exactly matches allowable UA Numbers? I'm adding this feature to the admin console of an application I work on and would like to validate the user input.

    Read the article

  • Google Analytics cookies

    - by wokena
    My problem: I erased all cookies from my computer. I sent Post request to the X server log and sent me a "normal" Set-Cookie with its parameters, but then somehow it will send request for Google Analytics (GA), in which the "strange" header (utma, utmac, utmcn ...). This happens when I send request in browser. But when I pass a request to login from my program (I programm in Ruby), so my server will return 302 Found, but no request to the GA sends. And I just need these headers ...

    Read the article

  • Google Analytics--My second profile to an existing domain not tracking

    - by Steve
    Hey Guys, I created a second profile to my Analytics account to track SEO keywords. This is a second profile for my existing working domain. It has been 4 days and I still do not see any traffic on this secondary profile. The profile shows a Staus of "Unknown". But I do not see a way to validate it like you have to when you create your first profile and this should not be necessary for a profile to my existing domain. Any ideas what I am missing? Thanks!

    Read the article

  • Big-O for Eight Year Olds?

    - by Jason Baker
    I'm asking more about what this means to my code. I understand the concepts mathematically, I just have a hard time wrapping my head around what they mean conceptually. For example, if one were to perform an O(1) operation on a data structure, I understand that the amount of operations it has to perform won't grow because there are more items. And an O(n) operation would mean that you would perform a set of operations on each element. Could somebody fill in the blanks here? Like what exactly would an O(n^2) operation do? And what the heck does it mean if an operation is O(n log(n))? And does somebody have to smoke crack to write an O(x!)?

    Read the article

< Previous Page | 25 26 27 28 29 30 31 32 33 34 35 36  | Next Page >