Search Results

Search found 58774 results on 2351 pages for 'ssis data tranformations'.

Page 21/2351 | < Previous Page | 17 18 19 20 21 22 23 24 25 26 27 28  | Next Page >

  • What data structure to use / data persistence

    - by Dave
    I have an app where I need one table of information with the following fields: field 1 - int or char field 2 - string (max 10 char) field 3 - string (max 20 char) field 4 - float I need the program to filter on field 1 based upon a segmented control and select a field 2 from a picker. From this data I need to look up field 4 to use in a calculation. Total records will be about 200. I never see it go above 400 - 500. I am going to use a singleton which I am able to do, I just need help with the structure for this with data persistence. What type of data structure should I use for this and should I use NSNumber, NSString, etc. or old data types like float, Char, etc. I thought about a struct put into an array but there is probably a better way. This is new to me so any help or reference to examples would be great. I also thought about a plist or dictionary but it looks like it is just a lookup and a field which obviously won't work. Core data looked like overkill to me. Also, with any recommendation how should I get initial data into it? I want the user to be able to edit and add to the database. Sorry for the old terms, you can see what generation I am from... Thanks in advance!!!!

    Read the article

  • Weird SSIS Configuration Error

    - by Christopher House
    I ran into an interesting SSIS issue that I thought I'd share in hopes that it may save someone from bruising their head after repeatedly banging it on the desk like I did.  I was trying to setup what I believe is referred to as "indirect configuration" in SSIS.  This is where you store your configuration in some repository like a database or a file, then store the location of that repository in an environment variable and use that to configure the connection to your configuration repository.  In my specific situation, I was using a SQL database.  I had this all working, but for reasons I'll not bore you with, I had to move my SSIS development to a new VM last week.  When I got my new VM, I set about creating a new package.  I finished up development on the package and started setting up configuration.  I created an OLE DB connection that pointed to my configuration table then went through the configuration wizard to have the connection string for this connection set through my environment variable.  I then went through the wizard to set another property through a value stored in the configuration table.  When I got to the point where you select the connection, my connection wasn't in the list: As you can see in the screen capture above, the ConfigurationDb connection isn't in the list of available SQL connections in the configuration wizard.  Strange.  I canceled out of the wizard, went to the properties for ConfigurationDb, tested the connection and it was successful.  I went back to the wizard again and this time ConfigurationDb was there.  I completed the wizard then went to test my package.  Unfortunately the package wouldn't run, I got the following error: Unfortunately, googling for this error code didn't help much as none of the results appears related to package configuration.  I did notice that when I went back through the package configuration and tried to edit a previously saved config entry,  I was getting the following error: I checked the connection string I had stored in my environment variable and noticed that indeed, it did not have a provider name.  I didn't recall having included one on my previous VM, but I figured I'd include it just to see what happened.  That made no difference at all.  After a day and a half of trying to figure out what the problem was, I'm pleased to report that through extensive trial and error, I have resolved the error. As it turns out, the person who setup this new VM for me named the server SQLSERVER2008.  This meant my configuration connection string was: Initial Catalog=SSISConfigDb;Data Source=SQLSERVER2008;Integrated Security=SSPI; Just for the heck of it, I tried changing it to: Initial Catalog=SSISConfigDb;Data Source=(local);Integrated Security=SSPI; That did the trick!  As soon as I restarted BIDS, I was able to run the package with no errors at all.  Crazy.  So, the moral of the story is, don't name your server SQLSERVER2008 if you want SSIS configuration to work when using SQL as your config store.

    Read the article

  • SSIS Technique to Remove/Skip Trailer and/or Bad Data Row in a Flat File

    - by Compudicted
    I noticed that the question on how to skip or bypass a trailer record or a badly formatted/empty row in a SSIS package keeps coming back on the MSDN SSIS Forum. I tried to figure out the reason why and after an extensive search inside the forum and outside it on the entire Web (using several search engines) I indeed found that it seems even thought there is a number of posts and articles on the topic none of them are employing the simplest and the most efficient technique. When I say efficient I mean the shortest time to solution for the fellow developers. OK, enough talk. Let’s face the problem: Typically a flat file (e.g. a comma delimited/CSV) needs to be processed (loaded into a database in most cases really). Oftentimes, such an input file is produced by some sort of an out of control, 3-rd party solution and would come in with some garbage characters and/or even malformed/miss-formatted rows. One such example could be this imaginary file: As you can see several rows have no data and there is an occasional garbage character (1, in this example on row #7). Our task is to produce a clean file that will only capture the meaningful data rows. As an aside, our output/target may be a database table, but for the purpose of this exercise we will simply re-format the source. Let’s outline our course of action to start off: Will use SSIS 2005 to create a DFT; The DFT will use a Flat File Source to our input [bad] flat file; We will use a Conditional Split to process the bad input file; and finally Dump the resulting data to a new [clean] file. Well, only four steps, let’s see if it is too much of work. 1: Start the BIDS and add a DFT to the Control Flow designer (I named it Process Dirty File DFT): 2, and 3: I had added the data viewer to just see what I am getting, alas, surprisingly the data issues were not seen it:   What really is the key in the approach it is to properly set the Conditional Split Transformation. Visually it is: and specifically its SSIS Expression LEN([After CS Column 0]) > 1 The point is to employ the right Boolean expression (yes, the Conditional Split accepts only Boolean conditions). For the sake of this post I re-named the Output Name “No Empty Rows”, but by default it will be named Case 1 (remember to drag your first column into the expression area)! You can close your Conditional Split now. The next part will be crucial – consuming the output of our Conditional Split. Last step - #4: Add a Flat File Destination or any other one you need. Click on the Conditional Split and choose the green arrow to drop onto the target. When you do so make sure you choose the No Empty Rows output and NOT the Conditional Split Default Output. Make the necessary mappings. At this point your package must look like: As the last step will run our package to examine the produced output file. F5: and… it looks great!

    Read the article

  • How to export SSIS to Microsoft Excel without additional software?

    - by Dr. Zim
    This question is long winded because I have been updating the question over a very long time trying to get SSIS to properly export Excel data. I managed to solve this issue, although not correctly. Aside from someone providing a correct answer, the solution listed in this question is not terrible. The only answer I found was to create a single row named range wide enough for my columns. In the named range put sample data and hide it. SSIS appends the data and reads metadata from the single row (that is close enough for it to drop stuff in it). The data takes the format of the hidden single row. This allows headers, etc. WOW what a pain in the butt. It will take over 450 days of exports to recover the time lost. However, I still love SSIS and will continue to use it because it is still way better than Filemaker LOL. My next attempt will be doing the same thing in the report server. Original question notes: If you are in Sql Server Integrations Services designer and want to export data to an Excel file starting on something other than the first line, lets say the forth line, how do you specify this? I tried going in to the Excel Destination of the Data Flow, changed the AccessMode to OpenRowSet from Variable, then set the variable to "YPlatters$A4:I20000" This fails saying it cannot find the sheet. The sheet is called YPlatters. I thought you could specify (Sheet$)(Starting Cell):(Ending Cell)? Update Apparently in Excel you can select a set of cells and name them with the name box. This allows you to select the name instead of the sheet without the $ dollar sign. Oddly enough, whatever the range you specify, it appends the data to the next row after the range. Oddly, as you add data, it increases the named selection's row count. Another odd thing is the data takes the format of the last line of the range specified. My header rows are bold. If I specify a range that ends with the header row, the data appends to the row below, and makes all the entries bold. if you specify one row lower, it puts a blank line between the header row and the data, but the data is not bold. Another update No matter what I try, SSIS samples the "first row" of the file and sets the metadata according to what it finds. However, if you have sample data that has a value of zero but is formatted as the first row, it treats that column as text and inserts numeric values with a single quote in front ('123.34). I also tried headers that do not reflect the data types of the columns. I tried changing the metadata of the Excel destination, but it always changes it back when I run the project, then fails saying it will truncate data. If I tell it to ignore errors, it imports everything except that column. Several days of several hours a piece later... Another update I tried every combination. A mostly working example is to create the named range starting with the column headers. Format your column headers as you want the data to look as the data takes on this format. In my example, these exist from A4 to E4, which is my defined range. SSIS appends to the row after the defined range, so defining A4 to E68 appends the rows starting at A69. You define the Connection as having the first row contains the field names. It takes on the metadata of the header row, oddly, not the second row, and it guesses at the data type, not the formatted data type of the column, i.e., headers are text, so all my metadata is text. If your headers are bold, so is all of your data. I even tried making a sample data row without success... I don't think anyone actually uses Excel with the default MS SSIS export. If you could define the "insert range" (A5 to E5) with no header row and format those columns (currency, not bold, etc.) without it skipping a row in Excel, this would be very helpful. From what I gather, noone uses SSIS to export Excel without a third party connection manager. Any ideas on how to set this up properly so that data is formatted correctly, i.e., the metadata read from Excel is proper to the real data, and formatting inherits from the first row of data, not the headers in Excel? One last update (July 17, 2009) I got this to work very well. One thing I added to Excel was the IMEX=1 in the Excel connection string: "Excel 8.0;HDR=Yes;IMEX=1". This forces Excel (I think) to look at all rows to see what kind of data is in it. Generally, this does not drop information, say for instance if you have a zip code then about 9 rows down you have a zip+4, Excel without this blanks that field entirely without error. With IMEX=1, it recognizes that Zip is actually a character field instead of numeric. And of course, one more update (August 27, 2009) The IMEX=1 will succeed importing data with missing contents in the first 8 rows, but it will fail exporting data where no data exists. So, have it on your import connection string, but not your export Excel connection string. I have to say, after so much fiddling, it works pretty well.

    Read the article

  • Using a "white list" for extracting terms for Text Mining

    - by [email protected]
    In Part 1 of my post on "Generating cluster names from a document clustering model" (part 1, part 2, part 3), I showed how to build a clustering model from text documents using Oracle Data Miner, which automates preparing data for text mining. In this process we specified a custom stoplist and lexer and relied on Oracle Text to identify important terms.  However, there is an alternative approach, the white list, which uses a thesaurus object with the Oracle Text CTXRULE index to allow you to specify the important terms. INTRODUCTIONA stoplist is used to exclude, i.e., black list, specific words in your documents from being indexed. For example, words like a, if, and, or, and but normally add no value when text mining. Other words can also be excluded if they do not help to differentiate documents, e.g., the word Oracle is ubiquitous in the Oracle product literature. One problem with stoplists is determining which words to specify. This usually requires inspecting the terms that are extracted, manually identifying which ones you don't want, and then re-indexing the documents to determine if you missed any. Since a corpus of documents could contain thousands of words, this could be a tedious exercise. Moreover, since every word is considered as an individual token, a term excluded in one context may be needed to help identify a term in another context. For example, in our Oracle product literature example, the words "Oracle Data Mining" taken individually are not particular helpful. The term "Oracle" may be found in nearly all documents, as with the term "Data." The term "Mining" is more unique, but could also refer to the Mining industry. If we exclude "Oracle" and "Data" by specifying them in the stoplist, we lose valuable information. But it we include them, they may introduce too much noise. Still, when you have a broad vocabulary or don't have a list of specific terms of interest, you rely on the text engine to identify important terms, often by computing the term frequency - inverse document frequency metric. (This is effectively a weight associated with each term indicating its relative importance in a document within a collection of documents. We'll revisit this later.) The results using this technique is often quite valuable. As noted above, an alternative to the subtractive nature of the stoplist is to specify a white list, or a list of terms--perhaps multi-word--that we want to extract and use for data mining. The obvious downside to this approach is the need to specify the set of terms of interest. However, this may not be as daunting a task as it seems. For example, in a given domain (Oracle product literature), there is often a recognized glossary, or a list of keywords and phrases (Oracle product names, industry names, product categories, etc.). Being able to identify multi-word terms, e.g., "Oracle Data Mining" or "Customer Relationship Management" as a single token can greatly increase the quality of the data mining results. The remainder of this post and subsequent posts will focus on how to produce a dataset that contains white list terms, suitable for mining. CREATING A WHITE LIST We'll leverage the thesaurus capability of Oracle Text. Using a thesaurus, we create a set of rules that are in effect our mapping from single and multi-word terms to the tokens used to represent those terms. For example, "Oracle Data Mining" becomes "ORACLEDATAMINING." First, we'll create and populate a mapping table called my_term_token_map. All text has been converted to upper case and values in the TERM column are intended to be mapped to the token in the TOKEN column. TERM                                TOKEN DATA MINING                         DATAMINING ORACLE DATA MINING                  ORACLEDATAMINING 11G                                 ORACLE11G JAVA                                JAVA CRM                                 CRM CUSTOMER RELATIONSHIP MANAGEMENT    CRM ... Next, we'll create a thesaurus object my_thesaurus and a rules table my_thesaurus_rules: CTX_THES.CREATE_THESAURUS('my_thesaurus', FALSE); CREATE TABLE my_thesaurus_rules (main_term     VARCHAR2(100),                                  query_string  VARCHAR2(400)); We next populate the thesaurus object and rules table using the term token map. A cursor is defined over my_term_token_map. As we iterate over  the rows, we insert a synonym relationship 'SYN' into the thesaurus. We also insert into the table my_thesaurus_rules the main term, and the corresponding query string, which specifies synonyms for the token in the thesaurus. DECLARE   cursor c2 is     select token, term     from my_term_token_map; BEGIN   for r_c2 in c2 loop     CTX_THES.CREATE_RELATION('my_thesaurus',r_c2.token,'SYN',r_c2.term);     EXECUTE IMMEDIATE 'insert into my_thesaurus_rules values                        (:1,''SYN(' || r_c2.token || ', my_thesaurus)'')'     using r_c2.token;   end loop; END; We are effectively inserting the token to return and the corresponding query that will look up synonyms in our thesaurus into the my_thesaurus_rules table, for example:     'ORACLEDATAMINING'        SYN ('ORACLEDATAMINING', my_thesaurus)At this point, we create a CTXRULE index on the my_thesaurus_rules table: create index my_thesaurus_rules_idx on        my_thesaurus_rules(query_string)        indextype is ctxsys.ctxrule; In my next post, this index will be used to extract the tokens that match each of the rules specified. We'll then compute the tf-idf weights for each of the terms and create a nested table suitable for mining.

    Read the article

  • Using Stored Procedures in SSIS

    - by dataintegration
    The SSIS Data Flow components: the source task and the destination task are the easiest way to transfer data in SSIS. Some data transactions do not fit this model, they are procedural tasks modeled as stored procedures. In this article we show how you can call stored procedures available in RSSBus ADO.NET Providers from SSIS. In this article we will use the CreateJob and the CreateBatch stored procedures available in RSSBus ADO.NET Provider for Salesforce, but the same steps can be used to call a stored procedure in any of our data providers. Step 1: Open Visual Studio and create a new Integration Services Project. Step 2: Add a new Data Flow Task to the Control Flow window. Step 3: Open the Data Flow Task and add a Script Component to the data flow pane. A dialog box will pop-up allowing you to select the Script Component Type: pick the source type as we will be outputting columns from our stored procedure. Step 4: Double click the Script Component to open the editor. Step 5: In the "Inputs and Outputs" settings, enter all the columns you want to output to the data flow. Ensure the correct data type has been set for each output. You can check the data type by selecting the output and then changing the "DataType" property from the property editor. In our example, we'll add the column JobID of type String. Step 6: Select the "Script" option in the left-hand pane and click the "Edit Script" button. This will open a new Visual Studio window with some boiler plate code in it. Step 7: In the CreateOutputRows() function you can add code that executes the stored procedures included with the Salesforce Component. In this example we will be using the CreateJob and CreateBatch stored procedures. You can find a list of the available stored procedures along with their inputs and outputs in the product help. //Configure the connection string to your credentials String connectionString = "Offline=False;user=myusername;password=mypassword;access token=mytoken;"; using (SalesforceConnection conn = new SalesforceConnection(connectionString)) { //Create the command to call the stored procedure CreateJob SalesforceCommand cmd = new SalesforceCommand("CreateJob", conn); cmd.CommandType = CommandType.StoredProcedure; cmd.Parameters.Add(new SalesforceParameter("ObjectName", "Contact")); cmd.Parameters.Add(new SalesforceParameter("Action", "insert")); //Execute CreateJob //CreateBatch requires JobID as input so we store this value for later SalesforceDataReader rdr = cmd.ExecuteReader(); String JobID = ""; while (rdr.Read()) { JobID = (String)rdr["JobID"]; } //Create the command for CreateBatch, for this example we are adding two new rows SalesforceCommand batCmd = new SalesforceCommand("CreateBatch", conn); batCmd.CommandType = CommandType.StoredProcedure; batCmd.Parameters.Add(new SalesforceParameter("JobID", JobID)); batCmd.Parameters.Add(new SalesforceParameter("Aggregate", "<Contact><Row><FirstName>Bill</FirstName>" + "<LastName>White</LastName></Row><Row><FirstName>Bob</FirstName><LastName>Black</LastName></Row></Contact>")); //Execute CreateBatch SalesforceDataReader batRdr = batCmd.ExecuteReader(); } Step 7b: If you had specified output columns earlier, you can now add data into them using the UserComponent Output0Buffer. For example, we had set an output column called JobID of type String so now we can set a value for it. We will modify the DataReader that contains the output of CreateJob like so:. while (rdr.Read()) { Output0Buffer.AddRow(); JobID = (String)rdr["JobID"]; Output0Buffer.JobID = JobID; } Step 8: Note: You will need to modify the connection string to include your credentials. Also ensure that the System.Data.RSSBus.Salesforce assembly is referenced and include the following using statements to the top of the class: using System.Data; using System.Data.RSSBus.Salesforce; Step 9: Once you are done editing your script, save it, and close the window. Click OK in the Script Transformation window to go back to the main pane. Step 10: If had any outputs from the Script Component you can use them in your data flow. For example we will use a Flat File Destination. Configure the Flat File Destination to output the results to a file, and you should see the JobId in the file. Step 11: Your project should be ready to run.

    Read the article

  • ADO.NET (WCF) Data Services Query Interceptor Hangs IIS

    - by PreMagination
    I have an ADO.NET Data Service that's supposed to provide read-only access to a somewhat complex database. Logically I have table-per-type (TPT) inheritance in my data model but the EDM doesn't implement inheritance. (Limitation of EF and navigation properties on derived types. STILL not fixed in EF4!) I can query my EDM directly (using a separate project) using a copy of the query I'm trying to run against the web service, results are returned within 10 seconds. Disabling the query interceptors I'm able to make the same query against the web service, results are returned similarly quickly. I can enable some of the query interceptors and the results are returned slowly, up to a minute or so later. Alternatively, I can enable all the query interceptors, expand less of the properties on the main object I'm querying, and results are returned in a similar period of time. (I've increased some of the timeout periods) Up til this point Sql Profiler indicates the slow-down is the database. (That's a post for a different day) But when I enable all my query interceptors and expand all the properties I'd like to have the IIS worker process pegs the CPU for 20 minutes and a query is never even made against the database. This implies to me that yes, my implementation probably sucks but regardless the Data Services "tier" is having an issue it shouldn't. WCF tracing didn't reveal anything interesting to my untrained eye. Details: Data model: Agent-Person-Student Student has a collection of referrals Students and referrals are private, queries against the web service should only return "your" students and referrals. This means Person and Agent need to be filtered too. Other entities (Agent-Organization-School) can be accessed by anyone who has authenticated. The existing security model is poorly suited to perform this type of filtering for this type of data access, the query interceptors are complicated and cause EF to generate some entertaining sql queries. Sample Interceptor [QueryInterceptor("Agents")] public Expression<Func<Agent, Boolean>> OnQueryAgents() { //Agent is a Person(1), Educator(2), Student(3), or Other Person(13); allow if scope permissions exist return ag => (ag.AgentType.AgentTypeId == 1 || ag.AgentType.AgentTypeId == 2 || ag.AgentType.AgentTypeId == 3 || ag.AgentType.AgentTypeId == 13) && ag.Person.OrganizationPersons.Count<OrganizationPerson>(op => op.Organization.ScopePermissions.Any<ScopePermission> (p => p.ApplicationRoleAccount.Account.UserName == HttpContext.Current.User.Identity.Name && p.ApplicationRoleAccount.Application.ApplicationId == 124) || op.Organization.HierarchyDescendents.Any<OrganizationsHierarchy>(oh => oh.AncestorOrganization.ScopePermissions.Any<ScopePermission> (p => p.ApplicationRoleAccount.Account.UserName == HttpContext.Current.User.Identity.Name && p.ApplicationRoleAccount.Application.ApplicationId == 124))) > 0; } The query interceptors for Person, Student, Referral are all very similar, ie they traverse multiple same/similar tables to look for ScopePermissions as above. Sample Query var referrals = (from r in service.Referrals .Expand("Organization/ParentOrganization") .Expand("Educator/Person/Agent") .Expand("Student/Person/Agent") .Expand("Student") .Expand("Grade") .Expand("ProblemBehavior") .Expand("Location") .Expand("Motivation") .Expand("AdminDecision") .Expand("OthersInvolved") where r.DateCreated >= coupledays && r.DateDeleted == null select r); Any suggestions or tips would be greatly associated, for fixing my current implementation or in developing a new one, with the caveat that the database can't be changed and that ultimately I need to expose a large portion of the database via a web service that limits data access to the data authorized for, for the purpose of data integration with multiple outside parties. THANK YOU!!!

    Read the article

  • Oracle Insurance Gets Innovative with Insurance Business Intelligence

    - by nicole.bruns(at)oracle.com
    Oracle Insurance announced yesterday the availability of Oracle Insurance Insight 7.0, an insurance-specific data warehouse and business intelligence (BI) system that transforms the traditional approach to BI by involving business users in the creation and maintenance."Rapid access to business intelligence is essential to compete and thrive in today's insurance industry," said Srini Venkatasantham, vice president, Product Strategy, Oracle Insurance. "The adaptive data modeling approach of Oracle Insurance Insight 7.0, combined with the insurance-specific data model, offers global insurance companies a faster, easier way to get the intelligence they need to make better-informed business decisions." New Features in Oracle Insurance 7.0 include:"Adaptive Data Modeling" via the new warehouse palette: Gives business users the power to configure lines of business via an easy-to-use warehouse palette tool. Oracle Insurance Insight then automatically creates data warehouse elements - such as line-specific database structures and extract-transform-load (ETL) processes -speeding up time-to-value for BI initiatives. Out-of-the-box insurance models or create-from-scratch option: Includes pre-built content and interfaces for six Property and Casualty (P&C) lines. Additionally, insurers can use the warehouse palette to deploy any and all P&C or General Insurance lines of business from scratch, helping insurers support operations in any country.Leverages Oracle technologies: In addition to Oracle Business Intelligence Enterprise Edition, the solution includes Oracle Database 11g as well as Oracle Data Integrator Enterprise Edition 11g, which delivers Extract, Load and Transform (E-L-T) architecture and eliminates the need for a separate transformation server. Additionally, the expanded Oracle technology infrastructure enables support for Oracle Exadata. Martina Conlon, a Principal with Novarica's Insurance practice, and author of Business Intelligence in Insurance: Current State, Challenges, and Expectations says, "The need for continued investment by insurers in business intelligence capabilities is widely understood, and the industry is acting. Arming the business intelligence implementation with predefined insurance specific content, and flexible and configurable technology will get these projects up and running faster."Learn moreTo see a demo of the Oracle Insurance Insight system, click hereTo read the press announcement, click here

    Read the article

  • SQL SERVER – Installing Data Quality Services (DQS) on SQL Server 2012

    - by pinaldave
    Data Quality Services is very interesting enhancements in SQL Server 2012. My friend and SQL Server Expert Govind Kanshi have written an excellent article on this subject earlier on his blog. Yesterday I stumbled upon his blog one more time and decided to experiment myself with DQS. I have basic understanding of DQS and MDS so I knew I need to start with DQS Client. However, when I tried to find DQS Client I was not able to find it under SQL Server 2012 installation. I quickly realized that I needed to separately install the DQS client. You will find the DQS installer under SQL Server 2012 >> Data Quality Services directory. The pre-requisite of DQS is Master Data Services (MDS) and IIS. If you have not installed IIS, you can follow the simple steps and install IIS in your machine. Once the pre-requisites are installed, click on MDS installer once again and it will install DQS just fine. Be patient with the installer as it can take a bit longer time if your machine is low on configurations. Once the installation is over you will be able to expand SQL Server 2012 >> Data Quality Services directory and you will notice that it will have a new item called Data Quality Client.  Click on it and it will open the client. Well, in future blog post we will go over more details about DQS and detailed practical examples. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, SQL Utility, T SQL, Technology Tagged: Data Quality Services

    Read the article

  • make-like build tools for data?

    - by miku
    Make is a standard tools for building software. But make decides whether a target needs to be regenerated by comparing file modification times. Are there any proven, preferably small tools that handle builds not for software but for data? Something that regenerates targets not only on mod times but on certain other properties (e.g. completeness). (Or alternatively some paper that describes such a tool.) As illustration: I'd like to automate the following process: get data (e.g. a tarball) from some regularly updated source copy somewhere if it's not there (based e.g. on some filename-scheme) convert the files to different format (but only if there aren't successfully converted ones there - e.g. from a previous attempt - custom comparison routine) for each file find a certain data element and fetch some additional file from say an URL, but only if that hasn't been downloaded yet (decide on existence of file and file "freshness") finally compute something (e.g. word count for something identifiable and store it in the database, but only if the DB does not have an entry for that exact ID yet) Observations: there are different stages each stage is usually simple to compute or implement in isolation each stage may be simple, but the data volume may be large each stage may produce a few errors each stage may have different signals, on when (re)processing is needed Requirements: builds should be interruptable and idempotent (== robust) when interrupted, already processed objects should be reused to speedup the next run data paths should be easy to adjust (simple syntax, nothing new to learn, internal dsl would be ok) some form of dependency graph, that describes the process would be nice for later visualizations should leverage existing programs, if possible I've done some research on make alternatives like rake and have worked a lot with ant and maven in the past. All these tools naturally focus on code and software build, not on data builds. A system we have in place now for a task similar to the above is pretty much just shell scripts, which are compact (and are a ok glue for a variety of other programs written in other languages), so I wonder if worse is better?

    Read the article

  • Social Analytics in your current data

    - by Dan McGrath
    By now everyone is aware of the massive boom in social-networking (Twitter, Facebook, LinkedIn) and obviously a big part of its business model revolves around being able to mine this data to create information that can be used to make money for someone. Gartner has identified 'Social Analytics' as one of the top 10 strategic technologies for 2011. Has anyone looked at their existing data structures to determine if they could extract a social graph and then perform further data mining against this? How does it fit in with your other strategic development strategies? What information are you trying to extract from the data? Take for example, a bank. They could conceivably determine a social graph through account relationships and transactions. Obviously there would be open edges on the graph where funds enter/leave the institute, but that shouldn't detract from the usefulness of the data. I'm looking for actual examples with the answers, as well as why/how they did it. References to other sites will be greatly appreciated. Note: I'm not at all referring to mining data out of actual social networks.

    Read the article

  • The Oldest Big Data Problem: Parsing Human Language

    - by dan.mcclary
    There's a new whitepaper up on Oracle Technology Network which details the use of Digital Reasoning Systems' Synthesys software on Oracle Big Data Appliance.  Digital Reasoning's approach is inherently "big data friendly," as it leverages multiple components of the Hadoop ecosystem.  Moreover, the paper addresses the oldest big data problem of them all: extracting knowledge from human text.   You can find the paper here.   From the Executive Summary: There is a wealth of information to be extracted from natural language, but that extraction is challenging. The volume of human language we generate constitutes a natural Big Data problem, while its complexity and nuance requires a particular expertise to model and mine. In this paper we illustrate the impressive combination of Oracle Big Data Appliance and Digital Reasoning Synthesys software. The combination of Synthesys and Big Data Appliance makes it possible to analyze tens of millions of documents in a matter of hours. Moreover, this powerful combination achieves four times greater throughput than conducting the equivalent analysis on a much larger cloud-deployed Hadoop cluster.

    Read the article

  • Perm SSIS Developer Urgently Required

    - by blakmk
      Job Role To provide dedicated data services support to the company, by designing, creating, maintaining and enhancing database objects, ensuring data quality, consistency and integrity. Migrating data from various sources to central SQL 2008 data warehouse will be the primary function. Migration of data from bespoke legacy database’s to SQL 2008 data warehouse. Understand key business requirements, Liaising with various aspects of the company. Create advanced transformations of data, with focus on data cleansing, redundant data and duplication. Creating complex business rules regarding data services, migration, Integrity and support (Best Practices). Experience ·         Minimum 3 year SSIS experience, in a project or BI Development role and involvement in at least 3 full ETL project life cycles, using the following methodologies and tools o    Excellent knowledge of ETL concepts including data migration & integrity, focusing on SSIS. o    Extensive experience with SQL 2005 products, SQL 2008 desirable. o    Working knowledge of SSRS and its integration with other BI products. o    Extensive knowledge of T-SQL, stored procedures, triggers (Table/Database), views, functions in particular coding and querying. o    Data cleansing and harmonisation. o    Understanding and knowledge of indexes, statistics and table structure. o    SQL Agent – Scheduling jobs, optimisation, multiple jobs, DTS. o    Troubleshoot, diagnose and tune database and physical server performance. o    Knowledge and understanding of locking, blocks, table and index design and SQL configuration. ·         Demonstrable ability to understand and analyse business processes. ·         Experience in creating business rules on best practices for data services. ·         Experience in working with, supporting and troubleshooting MS SQL servers running enterprise applications ·         Proven ability to work well within a team and liaise with other technical support staff such as networking administrators, system administrators and support engineers. ·         Ability to create formal documentation, work procedures, and service level agreements. ·         Ability to communicate technical issues at all levels including to a non technical audience. ·         Good working knowledge of MS Word, Excel, PowerPoint, Visio and Project.   Location Based in Crawley with possibility of some remote working Contact me for more info: http://sqlblogcasts.com/blogs/blakmk/contact.aspx      

    Read the article

  • Uses of persistent data structures in non-functional languages

    - by Ray Toal
    Languages that are purely functional or near-purely functional benefit from persistent data structures because they are immutable and fit well with the stateless style of functional programming. But from time to time we see libraries of persistent data structures for (state-based, OOP) languages like Java. A claim often heard in favor of persistent data structures is that because they are immutable, they are thread-safe. However, the reason that persistent data structures are thread-safe is that if one thread were to "add" an element to a persistent collection, the operation returns a new collection like the original but with the element added. Other threads therefore see the original collection. The two collections share a lot of internal state, of course -- that's why these persistent structures are efficient. But since different threads see different states of data, it would seem that persistent data structures are not in themselves sufficient to handle scenarios where one thread makes a change that is visible to other threads. For this, it seems we must use devices such as atoms, references, software transactional memory, or even classic locks and synchronization mechanisms. Why then, is the immutability of PDSs touted as something beneficial for "thread safety"? Are there any real examples where PDSs help in synchronization, or solving concurrency problems? Or are PDSs simply a way to provide a stateless interface to an object in support of a functional programming style?

    Read the article

  • Translate report data export from RUEI into HTML for import into OpenOffice Calc Spreadsheets

    - by [email protected]
    A common question of users is, How to import the data from the automated data export of Real User Experience Insight (RUEI) into tools for archiving, dashboarding or combination with other sets of data.XML is well-suited for such a translation via the companion Extensible Stylesheet Language Transformations (XSLT). Basically XSLT utilizes XSL, a template on what to read from your input XML data file and where to place it into the target document. The target document can be anything you like, i.e. XHTML, CSV, or even a OpenOffice Spreadsheet, etc. as long as it is a plain text format.XML 2 OpenOffice.org SpreadsheetFor the XSLT to work as an OpenOffice.org Calc Import Filter:How to add an XML Import Filter to OpenOffice CalcStart OpenOffice.org Calc andselect Tools > XML Filter SettingsNew...Fill in the details as follows:Filter name: RUEI Import filterApplication: OpenOffice.org Calc (.ods)Name of file type: Oracle Real User Experience InsightFile extension: xmlSwitch to the transformation tab and enter/select the following leaving the rest untouchedXSLT for import: ruei_report_data_import_filter.xslPlease see at the end of this blog post for a download of the referenced file.Select RUEI Import filter from list and Test XSLTClick on Browse to selectTransform file: export.php.xmlOpenOffice.org Calc will transform and load the XML file you retrieved from RUEI in a human-readable format.You can now select File > Open... and change the filetype to open your RUEI exports directly in OpenOffice.org Calc, just like any other a native Spreadsheet format.Files of type: Oracle Real User Experience Insight (*.xml)File name: export.php.xml XML 2 XHTMLMost XML-powered browsers provides for inherent XSL Transformation capabilities, you only have to reference the XSLT Stylesheet in the head of your XML file. Then open the file in your favourite Web Browser, Firefox, Opera, Safari or Internet Explorer alike.<?xml version="1.0" encoding="ISO-8859-1"?><!-- inserted line below --> <?xml-stylesheet type="text/xsl" href="ruei_report_data_export_2_xhtml.xsl"?><!-- inserted line above --><report>You can find a patched example export from RUEI plus the above referenced XSL-Stylesheets here: export.php.xml - Example report data export from RUEI ruei_report_data_export_2_xhtml.xsl - RUEI to XHTML XSL Transformation Stylesheetruei_report_data_import_filter.xsl - OpenOffice.org XML import filter for RUEI report export data If you would like to do things like this on the command line you can use either Xalan or xsltproc.The basic command syntax for xsltproc is very simple:xsltproc -o output.file stylesheet.xslt inputfile.xmlYou can use this with the above two stylesheets to translate RUEI Data Exports into XHTML and/or OpenOffice.org Calc ODS-Format. Or you could write your own XSLT to transform into Comma separated Value lists.Please let me know what you think or do with this information in the comments below.Kind regards,Stefan ThiemeReferences used:OpenOffice XML Filter - Create XSLT filters for import and export - http://user.services.openoffice.org/en/forum/viewtopic.php?f=45&t=3490SUN OpenOffice.org XML File Format 1.0 - http://xml.openoffice.org/xml_specification.pdf

    Read the article

  • Data Auditor by Example

    - by Jinjin.Wang
    OWB has a node Data Auditors under Oracle Module in Projects Navigator. What is data auditor and how to use it? I will give an introduction to data auditor and show its usage by examples. Data auditor is an important tool in ensuring that data quality levels meet business requirements. Data auditor validates data against a set of data rules to determine which records comply and which do not. It gathers statistical metrics on how well the data in a system complies with a rule by auditing and marking how many errors are occurring against the audited table. Data auditors are typically scheduled for regular execution as part of a process flow, to monitor the quality of the data in an operational environment such as a data warehouse or ERP system, either immediately after updates like data loads, or at regular intervals. How to use data auditor to monitor data quality? Only objects with data rules can be monitored, so the first step is to define data rules according to business requirements and apply them to the objects you want to monitor. The objects can be tables, views, materialized views, and external tables. Secondly create a data auditor containing the objects. You can configure the data auditor and set physical deployment parameters for it as optional, which will be used while running the data auditor. Then deploy and run the data auditor either manually or as part of the process flow. After execution, the data auditor sets several output values, and records that are identified as not complying with the defined data rules contained in the data auditor are written to error tables. Here is an example. We have two tables DEPARTMENTS and EMPLOYEES (see pic-1 and pic-2. Click here for DDL and data) imported into OWB. We want to gather statistical metrics on how well data in these two tables satisfies the following requirements: a. Values of the EMPLOYEES.EMPLOYEE_ID attribute are three-digit numbers. b. Valid values for EMPLOYEES.JOB_ID are IT_PROG, SA_REP, SH_CLERK, PU_CLERK, and ST_CLERK. c. EMPLOYEES.EMPLOYEE_ID is related to DEPARTMENTS.MANAGER_ID. Pic-1 EMPLOYEES Pic-2 DEPARTMENTS 1. To determine legal data within EMPLOYEES or legal relationships between data in different columns of the two tables, firstly we define data rules based on the three requirements and apply them to tables. a. The first requirement is about patterns that an attribute is allowed to conform to. We create a Domain Pattern List data rule EMPLOYEE_PATTERN_RULE here. The pattern is defined in the Oracle Database regular expression syntax as ^([0-9]{3})$ Apply data rule EMPLOYEE_PATTERN_RULE to table EMPLOYEES.

    Read the article

  • Big Data – Buzz Words: What is Hadoop – Day 6 of 21

    - by Pinal Dave
    In yesterday’s blog post we learned what is NoSQL. In this article we will take a quick look at one of the four most important buzz words which goes around Big Data – Hadoop. What is Hadoop? Apache Hadoop is an open-source, free and Java based software framework offers a powerful distributed platform to store and manage Big Data. It is licensed under an Apache V2 license. It runs applications on large clusters of commodity hardware and it processes thousands of terabytes of data on thousands of the nodes. Hadoop is inspired from Google’s MapReduce and Google File System (GFS) papers. The major advantage of Hadoop framework is that it provides reliability and high availability. What are the core components of Hadoop? There are two major components of the Hadoop framework and both fo them does two of the important task for it. Hadoop MapReduce is the method to split a larger data problem into smaller chunk and distribute it to many different commodity servers. Each server have their own set of resources and they have processed them locally. Once the commodity server has processed the data they send it back collectively to main server. This is effectively a process where we process large data effectively and efficiently. (We will understand this in tomorrow’s blog post). Hadoop Distributed File System (HDFS) is a virtual file system. There is a big difference between any other file system and Hadoop. When we move a file on HDFS, it is automatically split into many small pieces. These small chunks of the file are replicated and stored on other servers (usually 3) for the fault tolerance or high availability. (We will understand this in the day after tomorrow’s blog post). Besides above two core components Hadoop project also contains following modules as well. Hadoop Common: Common utilities for the other Hadoop modules Hadoop Yarn: A framework for job scheduling and cluster resource management There are a few other projects (like Pig, Hive) related to above Hadoop as well which we will gradually explore in later blog posts. A Multi-node Hadoop Cluster Architecture Now let us quickly see the architecture of the a multi-node Hadoop cluster. A small Hadoop cluster includes a single master node and multiple worker or slave node. As discussed earlier, the entire cluster contains two layers. One of the layer of MapReduce Layer and another is of HDFC Layer. Each of these layer have its own relevant component. The master node consists of a JobTracker, TaskTracker, NameNode and DataNode. A slave or worker node consists of a DataNode and TaskTracker. It is also possible that slave node or worker node is only data or compute node. The matter of the fact that is the key feature of the Hadoop. In this introductory blog post we will stop here while describing the architecture of Hadoop. In a future blog post of this 31 day series we will explore various components of Hadoop Architecture in Detail. Why Use Hadoop? There are many advantages of using Hadoop. Let me quickly list them over here: Robust and Scalable – We can add new nodes as needed as well modify them. Affordable and Cost Effective – We do not need any special hardware for running Hadoop. We can just use commodity server. Adaptive and Flexible – Hadoop is built keeping in mind that it will handle structured and unstructured data. Highly Available and Fault Tolerant – When a node fails, the Hadoop framework automatically fails over to another node. Why Hadoop is named as Hadoop? In year 2005 Hadoop was created by Doug Cutting and Mike Cafarella while working at Yahoo. Doug Cutting named Hadoop after his son’s toy elephant. Tomorrow In tomorrow’s blog post we will discuss Buzz Word – MapReduce. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

    Read the article

  • Master Data Management for Location Data - Oracle Site Hub

    - by david.butler(at)oracle.com
    Most MDM discussions cover key domains such as customer, supplier, product, service, and reference data. It is usually understood that these domains have complex structures and hundreds if not thousands of attributes that need governing. Location, on the other hand, strikes most people as address data. How hard can that be? But for many industries, locations are complex, and site information is critical to efficient operations and relevant analytics. Retail stores and malls, bank branches, construction sites come to mind. But one of the best industries for illustrating the power of a site mastering application is Oil & Gas.   Oracle's Master Data Management solution for location data is the Oracle Site Hub. It is a location mastering solution that enables organizations to centralize site and location specific information from heterogeneous systems, creating a single view of site information that can be leveraged across all functional departments and analytical systems.   Let's take a look at the location entities the Oracle Site Hub can manage for the Oil & Gas industry: organizations, property, land, buildings, roads, oilfield, service center, inventory site, real estate, facilities, refineries, storage tanks, vendor locations, businesses, assets; project site, area, well, basin, pipelines, critical infrastructure, offshore platform, compressor station, gas station, etc. Any site can be classified into multiple hierarchies, like organizational hierarchy, operational hierarchy, geographic hierarchy, divisional hierarchies and so on. Any site can also be associated to multiple clusters, i.e. collections of sites, and these can be used as a foundation for driving reporting, analysis, organize daily work, etc. Hierarchies can also be used to model entities which are structured or non-structured collections of nodes, like for example routes, pipelines and more. The User Defined Attribute Framework provides the needed infrastructure to add single row attributes groups like well base attributes (well IDs, well type, well structure and key characterizing measures, and more) and well geometry, and multi row attribute groups like well applications, permits, production data, activities, operations, logs, treatments, tests, drills, treatments, and KPIs. Site Hub can also model areas, lands, fields, basins, pools, platforms, eco-zones, and stratigraphic layers as specific sites, tracking their base attributes, aliases, descriptions, subcomponents and more. Midstream entities (pipelines, logistic sites, pump stations) and downstream entities (cylinders, tanks, inventories, meters, partner's sites, routes, facilities, gas stations, and competitor sites) can also be easily modeled, together with their specific attributes and relationships. Site Hub can store any type of unstructured data associated to a site. This could be stored directly or on an external content management solution, like Oracle Universal Content Management. Considering a well, for example, Site Hub can store any relevant associated multimedia file such as: CAD drawings of the well profile, structure and/or parts, engineering documents, contracts, applications, permits, logs, pictures, photos, videos and more. For any site entity, Site Hub can associate all the related assets and equipments at the site, as well as all relationships between sites, between a site and multiple parties, and between a site and any purchasable or sellable item, over time. Items can be equipment, instruments, facilities, services, products, production entities, production facilities (pipelines, batteries, compressor stations, gas plants, meters, separators, etc.), support facilities (rigs, roads, transmission or radio towers, airstrips, etc.), supplier products and services, catalogs, and more. Items can just be associated to sites using standard Site Hub features, or they can be fully mastered by implementing Oracle Product Hub. Site locations (addresses or geographical coordinates) are also managed with out-of-the-box address geo-coding capabilities coupled with Google Maps integration to deliver powerful mapping capabilities and spatial data analysis. Locations can be shared between different sites. Centered on the site location, any site can also have associated areas. Site Hub can master any site location specific information, like for example cadastral, ownership, jurisdictional, geological, seismic and more, and any site-centric area specific information, like for example economical, political, risk, weather, logistic, traffic information and more. Now if anyone ever asks you why locations need MDM, think about how all these Oil & Gas entities and attributes would translate into your business locations. To learn more about Oracle's full MDM solution for the digital oil field, here is a link to Roberto Negro's outstanding whitepaper: Oracle Site Master Data Management for mastering wells and other PPDM entities in a digital oilfield context  

    Read the article

  • Optimal Data Structure for our own API

    - by vermiculus
    I'm in the early stages of writing an Emacs major mode for the Stack Exchange network; if you use Emacs regularly, this will benefit you in the end. In order to minimize the number of calls made to Stack Exchange's API (capped at 10000 per IP per day) and to just be a generally responsible citizen, I want to cache the information I receive from the network and store it in memory, waiting to be accessed again. I'm really stuck as to what data structure to store this information in. Obviously, it is going to be a list. However, as with any data structure, the choice must be determined by what data is being stored and what how it will be accessed. What, I would like to be able to store all of this information in a single symbol such as stack-api/cache. So, without further ado, stack-api/cache is a list of conses keyed by last update: `(<csite> <csite> <csite>) where <csite> would be (1362501715 . <site>) At this point, all we've done is define a simple association list. Of course, we must go deeper. Each <site> is a list of the API parameter (unique) followed by a list questions: `("codereview" <cquestion> <cquestion> <cquestion>) Each <cquestion> is, you guessed it, a cons of questions with their last update time: `(1362501715 <question>) (1362501720 . <question>) <question> is a cons of a question structure and a list of answers (again, consed with their last update time): `(<question-structure> <canswer> <canswer> <canswer> and ` `(1362501715 . <answer-structure>) This data structure is likely most accurately described as a tree, but I don't know if there's a better way to do this considering the language, Emacs Lisp (which isn't all that different from the Lisp you know and love at all). The explicit conses are likely unnecessary, but it helps my brain wrap around it better. I'm pretty sure a <csite>, for example, would just turn into (<epoch-time> <api-param> <cquestion> <cquestion> ...) Concerns: Does storing data in a potentially huge structure like this have any performance trade-offs for the system? I would like to avoid storing extraneous data, but I've done what I could and I don't think the dataset is that large in the first place (for normal use) since it's all just human-readable text in reasonable proportion. (I'm planning on culling old data using the times at the head of the list; each inherits its last-update time from its children and so-on down the tree. To what extent this cull should take place: I'm not sure.) Does storing data like this have any performance trade-offs for that which must use it? That is, will set and retrieve operations suffer from the size of the list? Do you have any other suggestions as to what a better structure might look like?

    Read the article

  • How do you handle the fetchxml result data?

    - by Luke Baulch
    I have avoided working with fetchxml as I have been unsure the best way to handle the result data after calling crmService.Fetch(fetchXml). In a couple of situations, I have used an XDocument with LINQ to retrieve the data from this data structure, such as: XDocument resultset = XDocument.Parse(_service.Fetch(fetchXml)); if (resultset.Root == null || !resultset.Root.Elements("result").Any()) { return; } foreach (var displayItem in resultset.Root.Elements("result").Select(item => item.Element(displayAttributeName)).Distinct()) { if (displayItem!= null && displayItem.Value != null) { dropDownList.Items.Add(displayItem.Value); } } What is the best way to handle fetchxml result data, so that it can be easily used. Applications such as passing these records into an ASP.NET datagrid would be quite useful.

    Read the article

  • Generate Entity Data Model from Data Contract

    - by CSmooth.net
    I would like to find a fast way to convert a Data Contract to a Entity Data Model. Consider the following Data Contract: [DataContract] class PigeonHouse { [DataMember] public string housename; [DataMember] public List<Pigeon> pigeons; } [DataContract] class Pigeon { [DataMember] public string name; [DataMember] public int numberOfWings; [DataMember] public int age; } Is there an easy way to create an ADO.NET Entity Data Model from this code?

    Read the article

  • Detecting a Lightweight Core Data Migration

    - by hadronzoo
    I'm using Core Data's automatic lightweight migration successfully. However, when a particular entity gets created during a migration, I'd like to populate it with some data. Of course I could check if the entity is empty every time the application starts, but this seems inefficient when Core Data has a migration framework. Is it possible to detect when a lightweight migration occurs (possibly using KVO or notifications), or does this require implementing standard migrations? I've tried using the NSPersistentStoreCoordinatorStoresDidChangeNotification, but it doesn't fire when migrations occur.

    Read the article

  • Getting started with massive data

    - by Max
    I'm a math guy and occasionally do some statistics/machine learning analysis consulting projects on the side. The data I have access to are usually on the smaller side, at most a couple hundred of megabytes (and almost always far less), but I want to learn more about handling and analyzing data on the gigabyte/terabyte scale. What do I need to know and what are some good resources to learn from? Hadoop/MapReduce is one obvious start. Is there a particular programming language I should pick up? (I primarily work now in Python, Ruby, R, and occasionally Java, but it seems like C and Clojure are often used for large-scale data analysis?) I'm not really familiar with the whole NoSQL movement, except that it's associated with big data. What's a good place to learn about it, and is there a particular implementation (Cassandra, CouchDB, etc.) I should get familiar with? Where can I learn about applying machine learning algorithms to huge amounts of data? My math background is mostly on the theory side, definitely not on the numerical or approximation side, and I'm guessing most of the standard ML algorithms don't really scale. Any other suggestions on things to learn would be great!

    Read the article

  • SSIS - Parallel Execution of Tasks - How efficient is it?

    - by Randy Minder
    I am building an SSIS package that will contain dozens of Sequence tasks. Each Sequence task will contain three tasks. One to truncate a destination table and remove indexes on the table, another to import data from a source table, and a third to add back indexes to the destination table. My question is this. I currently have nine of these Sequences tasks built, and none are dependent on any of the others. When I execute the package, SSIS seems to do a pretty good job of determining which tasks in which Sequence to execute, which, by the way, appears to be quite random. As I continue adding more Sequences, should I attempt to be smarter about how SSIS should execute these Sequences, or is SSIS smart enough to do it itself? Thanks.

    Read the article

  • what is best way to store long term data in iphone Core Data or SQLLite?

    - by AmitSri
    Hi all, I am working on i-Phone app targeting 3.1.3 and later SDK. I want to know the best way to store user's long term data on i-phone without losing performance, consistency and security. I know, that i can use Core Data, PList and SQL-Lite for storing user specific data in custom formats.But, want to know which one is good to use without compromising app performance and scalability in near future. Thanks

    Read the article

< Previous Page | 17 18 19 20 21 22 23 24 25 26 27 28  | Next Page >