Search Results

Search found 2683 results on 108 pages for 'statistical analysis'.

Page 46/108 | < Previous Page | 42 43 44 45 46 47 48 49 50 51 52 53 | Next Page >

Build-time dependency resolving coming to Entity Framework. Now, how about those BI tools too?

- by jamiet

Three months ago I wrote a blog post entitled Some thoughts on Visual Studio database references and how they should be used for SQL Server BI where I shared some thoughts on a feature available to database developers in Visual Studio 2010 that I would love to see added to SQL Server Integration Services (SSIS), Analysis Services (SSAS) and Reporting Services (SSRS). In there I said: Over the past few weeks I have been making heavy use of the Database tools in Visual Studio 2010 and one of the features that has most impressed me has been database references. Database references allow you to have stored procedures in your database project that refer to objects (tables, views, stored procedures etc…) that exist in other database projects and hence when you build your database project it is able to resolve those references. It occurred to me that similar functionality would be incredibly useful for SQL Server Integration Services(SSIS), Analysis Services (SSAS) & Reporting Services (SSRS) projects. After all reports, packages and data source views are rife with references to database objects – why shouldn’t we be able to have design-time dependency checking in our BI projects the same way that database and .Net developers do? In that blog post I shared links to three Connect submissions where I requested this feature be added to SSIS, SSAS & SSRS. In addition I also submitted a request that the feature be extended to .Net projects so that any reference to a database object in a .Net assembly can be resolved at build time. That Connect submission is at [Entity FX] Use database references to constrain the EDM and overnight it received this comment from Microsoft: We have been working on this feature for a while and and will be available soon This is really good news - it improves the Microsoft developer ecosystem by ensuring invalid references to database references get caught at build time (ideally as part of a Continuous integration build) rather than run time. [Hopefully it might nip this code-first nonsense in the bud too (Ooo...way to incite flame comments :) ) ]. If you want to see this feature in action then check out a video from Teched Europe last month entitled SQL Server Developer Tools Code-named "Juneau" where it is demo'd by Lance Delano and Tim Laverty. The point of this blog post though is not just to draw attention to this forthcoming feature for .Net developers, it is to ask you to petition Microsoft to get this feature added to SSIS/SSAS/SSRS too. After all, we already know (from the video above) that the feature is coming to this new code-name Juneau development environment plus we also know that Juneau will be the development environment for SSIS/SSAS/SSRS as well - is it really much of a stretch to expect the BI tools to have access to this great feature too? I don't think so and if you agree with me then I urge you to vote and add a comment to the Connection submissions that are requesting this feature. They are at: [SSAS] Declare Object Dependancies [SSRS] Declare Object Dependancies [SSIS] Declare Object Dependancies (Update, Apparently someone at Microsoft has deemed it necassary to set this to private and I am not able to change it back even though I submitted it. You can still vote on the other two though.) Let's close that SQL Developer Gap! @Jamiet

Read the article
Is Oracle Policy Automation a Fit for My Agency? I'll bet it is.

- by jeffrey.waterman

Normal 0 false false false EN-US X-NONE X-NONE MicrosoftInternetExplorer4 /* Style Definitions */ table.MsoNormalTable {mso-style-name:"Table Normal"; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-priority:99; mso-style-qformat:yes; mso-style-parent:""; mso-padding-alt:0in 5.4pt 0in 5.4pt; mso-para-margin-top:0in; mso-para-margin-right:0in; mso-para-margin-bottom:10.0pt; mso-para-margin-left:0in; line-height:115%; mso-pagination:widow-orphan; font-size:11.0pt; font-family:"Calibri","sans-serif"; mso-ascii-font-family:Calibri; mso-ascii-theme-font:minor-latin; mso-fareast-font-family:"Times New Roman"; mso-fareast-theme-font:minor-fareast; mso-hansi-font-family:Calibri; mso-hansi-theme-font:minor-latin; mso-bidi-font-family:"Times New Roman"; mso-bidi-theme-font:minor-bidi;} Recently, I stumbled upon a new(-ish) whitepaper now posted on the Oracle Technology Network around Oracle Policy Automation (OPA). This paper is certain to become a must read for any customer interested in rules automation. What is OPA? If you are not sitting in your favorite Greek restaurant waiting for that order of Saganaki to appear, OPA is Oracle’s solution for automated streamlining, standardizing, and the maintenance of policy. It is a specialized rules platform that simplifies the automation of rules and policies, putting the analysis in the hands of the analysts, not the IT organization. In other words, OPA allows the organization to be more efficient by eliminating (or at a minimum, reducing the engagement of) the middle man from the process. The whitepaper I mention above is titled, “Is Oracle Policy Automation a Good Fit for My Business?”. This short document walks the reader through use cases and advice for the reader to consider when deciding if OPA is right for their agency. The paper outlines many different scenarios, different uses of OPA in production today and, where OPA may not be a good fit. Many of the use case examples revolve around end user questionnaires or analyst research. What is often overlooked is OPA’s ability to act as a rules engine behind the scenes. That is, take inputs from one source (e.g., personnel data), process that data in OPA and send the output (e.g., pay data with benefits deductions) to a second source. The rules have been automated, no necessary human intervention to perform analysis. A few of my customers have used the embedded OPA solution to improve transaction processing and reduce the time spent analyzing exceptions. I suggest any reader whose organization is reliant on or deals with high complexity, volume or volatility in rules that are based on documentation – or which need to be documented – take a look at Oracle Policy Automation. You can find the white paper on Oracle Technology Network. You can find the white paper in the Oracle Policy Automation of the OTN. You can find more information around OPA on oracle.com. Finally, you can send me a question any time at [email protected] Thank you for reading. If you have any topics around Oracle Applications in the Federal or Public Sector industries you would like to see addressed in this blog, please leave suggestions in the comments section and I will do my best to address in a future post.

Read the article
How much is a subscriber worth?

- by Tom Lewin

This year at Red Gate, we’ve started providing a way to back up SQL Azure databases and Azure storage. We decided to sell this as a service, instead of a product, which means customers only pay for what they use. Unfortunately for us, it makes figuring out revenue much trickier. With a product like SQL Compare, a customer pays for it, and it’s theirs for good. Sure, we offer support and upgrades, but, fundamentally, the sale is a simple, upfront transaction: we’ve made this product, you need this product, we swap product for money and everyone is happy. With software as a service, it isn’t that easy. The money and product don’t change hands up front. Instead, we provide a service in exchange for a recurring fee. We know someone buying SQL Compare will pay us $X, but we don’t know how long service customers will stay with us, or how much they will spend. How do we find this out? We use lifetime value analysis. What is lifetime value? Lifetime value, or LTV, is how much a customer is worth to the business. For Entrepreneurs has a brilliant write up that we followed to conduct our analysis. Basically, it all boils down to this equation: LTV = ARPU x ALC To make it a bit less of an alphabet-soup and a bit more understandable, we can write it out in full: The lifetime value of a customer equals the average revenue per customer per month, times the average time a customer spends with the service Simple, right? A customer is worth the average spend times the average stay. If customers pay on average $50/month, and stay on average for ten months, then a new customer will, on average, bring in $500 over the time they are a customer! Average spend is easy to work out; it’s revenue divided by customers. The problem comes when we realise that we don’t know exactly how long a customer will stay with us. How can we figure out the average lifetime of a customer, if we only have six months’ worth of data? The answer lies in the fact that: Average Lifetime of a Customer = 1 / Churn Rate The churn rate is the percentage of customers that cancel in a month. If half of your customers cancel each month, then your average customer lifetime is two months. The problem we faced was that we didn’t have enough data to make an estimate of one month’s cancellations reliable (because barely anybody cancels)! To deal with this data problem, we can take data from the last three months instead. This means we have more data to play with. We can still use the equation above, we just need to multiply the final result by three (as we worked out how many three month periods customers stay for, and we want our answer to be in months). Now these estimates are likely to be fairly unreliable; when there’s not a lot of data it pays to be cautious with inference. That said, the numbers we have look fairly consistent, and it’s super easy to revise our estimates when new data comes in. At the very least, these numbers give us a vague idea of whether a subscription business is viable. As far as Cloud Services goes, the business looks very viable indeed, and the low cancellation rates are much more than just data points in LTV equations; they show that the product is working out great for our customers, which is exactly what we’re looking for!

Read the article
Advanced Record-Level Business Intelligence with Inner Queries

- by gt0084e1

While business intelligence is generally applied at an aggregate level to large data sets, it's often useful to provide a more streamlined insight into an individual records or to be able to sort and rank them. For instance, a salesperson looking at a specific customer could benefit from basic stats on that account. A marketer trying to define an ideal customer could pull the top entries and look for insights or patterns. Inner queries let you do sophisticated analysis without the overhead of traditional BI or OLAP technologies like Analysis Services. Example - Order History Constancy Let's assume that management has realized that the best thing for our business is to have customers ordering every month. We'll need to identify and rank customers based on how consistently they buy and when their last purchase was so sales & marketing can respond accordingly. Our current application may not be able to provide this and adding an OLAP server like SSAS may be overkill for our needs. Luckily, SQL Server provides the ability to do relatively sophisticated analytics via inner queries. Here's the kind of output we'd like to see. Creating the Queries Before you create a view, you need to create the SQL query that does the calculations. Here we are calculating the total number of orders as well as the number of months since the last order. These fields might be very useful to sort by but may not be available in the app. This approach provides a very streamlined and high performance method of delivering actionable information without radically changing the application. It's also works very well with self-service reporting tools like Izenda. SELECT CustomerID,CompanyName, ( SELECT COUNT(OrderID) FROM Orders WHERE Orders.CustomerID = Customers.CustomerID ) As Orders, DATEDIFF(mm, ( SELECT Max(OrderDate) FROM Orders WHERE Orders.CustomerID = Customers.CustomerID) ,getdate() ) AS MonthsSinceLastOrder FROM Customers Creating Views To turn this or any query into a view, just put CREATE VIEW AS before it. If you want to change it use the statement ALTER VIEW AS. Creating Computed Columns If you'd prefer not to create a view, inner queries can also be applied by using computed columns. Place you SQL in the (Formula) field of the Computed Column Specification or check out this article here. Advanced Scoring and Ranking One of the best uses for this approach is to score leads based on multiple fields. For instance, you may be in a business where customers that don't order every month require more persistent follow up. You could devise a simple formula that shows the continuity of an account. If they ordered every month since their first order, they would be at 100 indicating that they have been ordering 100% of the time. Here's the query that would calculate that. It uses a few SQL tricks to make this happen. We are extracting the count of unique months and then dividing by the months since initial order. This query will give you the following information which can be used to help sales and marketing now where to focus. You could sort by this percentage to know where to start calling or to find patterns describing your best customers. Number of orders First Order Date Last Order Date Percentage of months order was placed since last order. SELECT CustomerID, (SELECT COUNT(OrderID) FROM Orders WHERE Orders.CustomerID = Customers.CustomerID) As Orders, (SELECT Max(OrderDate) FROM Orders WHERE Orders.CustomerID = Customers.CustomerID) AS LastOrder, (SELECT Min(OrderDate) FROM Orders WHERE Orders.CustomerID = Customers.CustomerID) AS FirstOrder, DATEDIFF(mm,(SELECT Min(OrderDate) FROM Orders WHERE Orders.CustomerID = Customers.CustomerID),getdate()) AS MonthsSinceFirstOrder, 100*(SELECT COUNT(DISTINCT 100*DATEPART(yy,OrderDate) + DATEPART(mm,OrderDate)) FROM Orders WHERE Orders.CustomerID = Customers.CustomerID) / DATEDIFF(mm,(SELECT Min(OrderDate) FROM Orders WHERE Orders.CustomerID = Customers.CustomerID),getdate()) As OrderPercent FROM Customers

Read the article
Investigating on xVelocity (VertiPaq) column size

- by Marco Russo (SQLBI)

In January I published an article about how to optimize high cardinality columns in VertiPaq. In the meantime, VertiPaq has been rebranded to xVelocity: the official name is now “xVelocity in-memory analytics engine (VertiPaq)” but using xVelocity and VertiPaq when we talk about Analysis Services has the same meaning. In this post I’ll show how to investigate on columns size of an existing Tabular database so that you can find the most important columns to be optimized. A first approach can be looking in the DataDir of Analysis Services and look for the folder containing the database. Then, look for the biggest files in all subfolders and you will find the name of a file that contains the name of the most expensive column. However, this heuristic process is not very optimized. A better approach is using a DMV that provides the exact information. For example, by using the following query (open SSMS, open an MDX query on the database you are interested to and execute it) you will see all database objects sorted by used size in a descending way. SELECT * FROM $SYSTEM.DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS ORDER BY used_size DESC You can look at the first rows in order to understand what are the most expensive columns in your tabular model. The interesting data provided are: TABLE_ID: it is the name of the object – it can be also a dictionary or an index COLUMN_ID: it is the column name the object belongs to – you can also see ID_TO_POS and POS_TO_ID in case they refer to internal indexes RECORDS_COUNT: it is the number of rows in the column USED_SIZE: it is the used memory for the object By looking at the ration between USED_SIZE and RECORDS_COUNT you can understand what you can do in order to optimize your tabular model. Your options are: Remove the column. Yes, if it contains data you will never use in a query, simply remove the column from the tabular model Change granularity. If you are tracking time and you included milliseconds but seconds would be enough, round the data source column to the nearest second. If you have a floating point number but two decimals are good enough (i.e. the temperature), round the number to the nearest decimal is relevant to you. Split the column. Create two or more columns that have to be combined together in order to produce the original value. This technique is described in VertiPaq optimization article. Sort the table by that column. When you read the data source, you might consider sorting data by this column, so that the compression will be more efficient. However, this technique works better on columns that don’t have too many distinct values and you will probably move the problem to another column. Sorting data starting from the lower density columns (those with a few number of distinct values) and going to higher density columns (those with high cardinality) is the technique that provides the best compression ratio. After the optimization you should be able to reduce the used size and improve the count/size ration you measured before. If you are interested in a longer discussion about internal storage in VertiPaq and you want understand why this approach can save you space (and time), you can attend my 24 Hours of PASS session “VertiPaq Under the Hood” on March 21 at 08:00 GMT.

Read the article
Comparing Apples and Pairs

- by Tony Davis

A recent study, High Costs and Negative Value of Pair Programming, by Capers Jones, pulls no punches in its assessment of the costs-to- benefits ratio of pair programming, two programmers working together, at a single computer, rather than separately. He implies that pair programming is a method rushed into production on a wave of enthusiasm for Agile or Extreme Programming, without any real regard for its effectiveness. Despite admitting that his data represented a far from complete study of the economics of pair programming, his conclusions were stark: it was 2.5 times more expensive, resulted in a 15% drop in productivity, and offered no significant quality benefits. The author provides a more scientific analysis than Jon Evans’ Pair Programming Considered Harmful, but the theme is the same. In terms of upfront-coding costs, pair programming is surely more expensive. The claim of productivity loss is dubious and contested by other studies. The third claim, though, did surprise me. The author’s data suggests that if both the pair and the individual programmers employ static code analysis and testing, then there is no measurable difference in the resulting code quality, in terms of defects per function point. In other words, pair programming incurs a massive extra cost for no tangible return in investment. There were, inevitably, many criticisms of his data and his conclusions, a few of which are persuasive. Firstly, that the driver/observer model of pair programming, on which the study bases its findings, is far from the most effective. For example, many find Ping-Pong pairing, based on use of test-driven development, far more productive. Secondly, that it doesn’t distinguish between “expert” and “novice” pair programmers– that is, independently of other programming skills, how skilled was an individual at pair programming. Thirdly, that his measure of quality is too narrow. This point rings true, certainly at Red Gate, where developers don’t pair program all the time, but use the method in short bursts, while tackling a tricky problem and needing a fresh perspective on the best approach, or more in-depth knowledge in a particular domain. All of them argue that pair programming, and collective code ownership, offers significant rewards, if not in terms of immediate “bug reduction”, then in removing the likelihood of single points of failure, and improving the overall quality and longer-term adaptability/maintainability of the design. There is also a massive learning benefit for both participants. One developer told me how he once worked in the same team over consecutive summers, the first time with no pair programming and the second time pair-programming two-thirds of the time, and described the increased rate of learning the second time as “phenomenal”. There are a great many theories on how we should develop software (Scrum, XP, Lean, etc.), but woefully little scientific research in their effectiveness. For a group that spends so much time crunching other people’s data, I wonder if developers spend enough time crunching data about themselves. Capers Jones’ data may be incomplete, but should cause a pause for thought, especially for any large IT departments, supporting commerce and industry, who are considering pair programming. It certainly shouldn’t discourage teams from exploring new ways of developing software, as long as they also think about how to gather hard data to gauge their effectiveness.

Read the article
The five steps of business intelligence adoption: where are you?

- by Red Gate Software BI Tools Team

When I was in Orlando and New York last month, I spoke to a lot of business intelligence users. What they told me suggested a path of BI adoption. The user’s place on the path depends on the size and sophistication of their organisation. Step 1: A company with a database of customer transactions will often want to examine particular data, like revenue and unit sales over the last period for each product and territory. To do this, they probably use simple SQL queries or stored procedures to produce data on demand. Step 2: The results from step one are saved in an Excel document, so business users can analyse them with filters or pivot tables. Alternatively, SQL Server Reporting Services (SSRS) might be used to generate a report of the SQL query for display on an intranet page. Step 3: If these queries are run frequently, or business users want to explore data from multiple sources more freely, it may become necessary to create a new database structured for analysis rather than CRUD (create, retrieve, update, and delete). For example, data from more than one system — plus external information — may be incorporated into a data warehouse. This can become ‘one source of truth’ for the business’s operational activities. The warehouse will probably have a simple ‘star’ schema, with fact tables representing the measures to be analysed (e.g. unit sales, revenue) and dimension tables defining how this data is aggregated (e.g. by time, region or product). Reports can be generated from the warehouse with Excel, SSRS or other tools. Step 4: Not too long ago, Microsoft introduced an Excel plug-in, PowerPivot, which allows users to bring larger volumes of data into Excel documents and create links between multiple tables. These BISM Tabular documents can be created by the database owners or other expert Excel users and viewed by anyone with Excel PowerPivot. Sometimes, business users may use PowerPivot to create reports directly from the primary database, bypassing the need for a data warehouse. This can introduce problems when there are misunderstandings of the database structure or no single ‘source of truth’ for key data. Step 5: Steps three or four are often enough to satisfy business intelligence needs, especially if users are sophisticated enough to work with the warehouse in Excel or SSRS. However, sometimes the relationships between data are too complex or the queries which aggregate across periods, regions etc are too slow. In these cases, it can be necessary to formalise how the data is analysed and pre-build some of the aggregations. To do this, a business intelligence professional will typically use SQL Server Analysis Services (SSAS) to create a multidimensional model — or “cube” — that more simply represents key measures and aggregates them across specified dimensions. Step five is where our tool, SSAS Compare, becomes useful, as it helps review and deploy changes from development to production. For us at Red Gate, the primary value of SSAS Compare is to establish a dialog with BI users, so we can develop a portfolio of products that support creation and deployment across a range of report and model types. For example, PowerPivot and the new BISM Tabular model create a potential customer base for tools that extend beyond BI professionals. We’re interested in learning where people are in this story, so we’ve created a six-question survey to find out. Whether you’re at step one or step five, we’d love to know how you use BI so we can decide how to build tools that solve your problems. So if you have a sixty seconds to spare, tell us on the survey!

Read the article
Investigating on xVelocity (VertiPaq) column size

- by Marco Russo (SQLBI)

In January I published an article about how to optimize high cardinality columns in VertiPaq. In the meantime, VertiPaq has been rebranded to xVelocity: the official name is now “xVelocity in-memory analytics engine (VertiPaq)” but using xVelocity and VertiPaq when we talk about Analysis Services has the same meaning. In this post I’ll show how to investigate on columns size of an existing Tabular database so that you can find the most important columns to be optimized. A first approach can be looking in the DataDir of Analysis Services and look for the folder containing the database. Then, look for the biggest files in all subfolders and you will find the name of a file that contains the name of the most expensive column. However, this heuristic process is not very optimized. A better approach is using a DMV that provides the exact information. For example, by using the following query (open SSMS, open an MDX query on the database you are interested to and execute it) you will see all database objects sorted by used size in a descending way. SELECT * FROM $SYSTEM.DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS ORDER BY used_size DESC You can look at the first rows in order to understand what are the most expensive columns in your tabular model. The interesting data provided are: TABLE_ID: it is the name of the object – it can be also a dictionary or an index COLUMN_ID: it is the column name the object belongs to – you can also see ID_TO_POS and POS_TO_ID in case they refer to internal indexes RECORDS_COUNT: it is the number of rows in the column USED_SIZE: it is the used memory for the object By looking at the ration between USED_SIZE and RECORDS_COUNT you can understand what you can do in order to optimize your tabular model. Your options are: Remove the column. Yes, if it contains data you will never use in a query, simply remove the column from the tabular model Change granularity. If you are tracking time and you included milliseconds but seconds would be enough, round the data source column to the nearest second. If you have a floating point number but two decimals are good enough (i.e. the temperature), round the number to the nearest decimal is relevant to you. Split the column. Create two or more columns that have to be combined together in order to produce the original value. This technique is described in VertiPaq optimization article. Sort the table by that column. When you read the data source, you might consider sorting data by this column, so that the compression will be more efficient. However, this technique works better on columns that don’t have too many distinct values and you will probably move the problem to another column. Sorting data starting from the lower density columns (those with a few number of distinct values) and going to higher density columns (those with high cardinality) is the technique that provides the best compression ratio. After the optimization you should be able to reduce the used size and improve the count/size ration you measured before. If you are interested in a longer discussion about internal storage in VertiPaq and you want understand why this approach can save you space (and time), you can attend my 24 Hours of PASS session “VertiPaq Under the Hood” on March 21 at 08:00 GMT.

Read the article
Synchronizing Asynchronous request handlers in Silverlight environment

- by Eric Lifka

For our senior design project my group is making a Silverlight application that utilizes graph theory concepts and stores the data in a database on the back end. We have a situation where we add a link between two nodes in the graph and upon doing so we run analysis to re-categorize our clusters of nodes. The problem is that this re-categorization is quite complex and involves multiple queries and updates to the database so if multiple instances of it run at once it quickly garbles data and breaks (by trying to re-insert already used primary keys). Essentially it's not thread safe, and we're trying to make it safe, and that's where we're failing and need help :). The create link function looks like this: private Semaphore dblock = new Semaphore(1, 1); // This function is on our service reference and gets called // by the client code. public int addNeed(int nodeOne, int nodeTwo) { dblock.WaitOne(); submitNewNeed(createNewNeed(nodeOne, nodeTwo)); verifyClusters(nodeOne, nodeTwo); dblock.Release(); return 0; } private void verifyClusters(int nodeOne, int nodeTwo) { // Run analysis of nodeOne and nodeTwo in graph } All copies of addNeed should wait for the first one that comes in to finish before another can execute. But instead they all seem to be running and conflicting with each other in the verifyClusters method. One solution would be to force our front end calls to be made synchronously. And in fact, when we do that everything works fine, so the code logic isn't broken. But when it's launched our application will be deployed within a business setting and used by internal IT staff (or at least that's the plan) so we'll have the same problem. We can't force all clients to submit data at different times, so we really need to get it synchronized on the back end. Thanks for any help you can give, I'd be glad to supply any additional information that you could need!

Read the article
OCR: How to improve accuracy - existing libraries for removing non-text 'furniture', shapes, etc to

- by Rob

I want to remove rectangles etc that enclose text in a screenshot image, so that I can perform optical character recognition to get accurate text from the screenshot. Background: I doing this to extract data from a legacy application for use with other applications. This is the only way to get at this data as associated files are in a closed, proprietary, binary format. I will be using AutoItScript to drive the application to show data in its UI, then I will screenshot this and feed this to tesseract. I've already had some success in automating the UI, and have been able to use tesseract to get plain ascii text out of the bitmap. There are several AutoItScripr forum articles discussing its use with tesseract/OCR but not specifically for my question. http://www.autoitscript.com/forum/index.php?s=6c32c3ece12756e635a619cdf175eff9&showforum=2 What I need to do There are thin, 1-pixel wide rectangles that closely enclose some text, when fed to tesseract, it sees them as I for example for a verticle line of the rectangle. Any thoughts on how to remove the rectangles, or best practices? I'm asking if there is a generic command line based toolset to overwrite rectangles, for example, in .png files. I could then pass the .png through this, then pass it to tesseract. Details on the tesseract release/setup I've used are as follows: Go here: http://code.google.com/p/tesseract-ocr/downloads/list - For the basic english generic character set to get Tesseract up and running and recognising your bitmapped text into ascii text, use tesseract-2.00.eng.tar.gz (current version at time of writing is: "English language data for Tesseract (2.00 and up) Jul 2007 989 KB 84845") Related questions I have already looked at on Stack Overflow http://stackoverflow.com/questions/1335581/how-to-give-best-chance-of-success-to-an-ocr-software http://stackoverflow.com/questions/2296568/analysis-and-transformation-of-the-image-on-the-basis-of-this-analysis-for-better http://stackoverflow.com/questions/2268028/reading-characters-off-of-the-screen In these, my question is not completely answered or a commercial solution is being sold. I do not want to consider a commercial solution at this stage.

Read the article
Text mining on large database (data mining)

- by yox

Hello, I have a large database of resumes (CV), and a certain table skills grouping all users skills. inside that table there's a field skill_text that describes the skill in full text. I'm looking for an algorithm/software/method to extract significant terms/phrases from that table in order to build a new table with standarized skills.. Here are some examples skills extracted from the DB : Sectoral and competitive analysis Business Development (incl. in international settings) Specific structure and road design software - Microstation, Macao, AutoCAD (basic knowledge) Creative work (Photoshop, In-Design, Illustrator) checking and reporting back on campaign progress organising and attending events and exhibitions Development : Aptana Studio, PHP, HTML, CSS, JavaScript, SQL, AJAX Discipline: One to one marketing, E-marketing (SEO & SEA, display, emailing, affiliate program) Mix marketing, Viral Marketing, Social network marketing. The output shoud be something like : Sectoral and competitive analysis Business Development Specific structure and road design software - Macao AutoCAD Photoshop In-Design Illustrator organising events Development Aptana Studio PHP HTML CSS JavaScript SQL AJAX Mix marketing Viral Marketing Social network marketing emailing SEO One to one marketing As you see only skills remains no other representation text. I know this is possible using text mining technics but how to do it ? the database is realy large.. it's a good thing because we can calculate text frequency and decide if it's a real skill or just meaningless text... The big problem is .. how to determin that "blablabla" is a skill ? thanks

Read the article
Agile and Scrum burning me down please help me figuring out the truth

- by jadook

hi all, in the last while I installed MS-TFS 2008 then started to get myself prepared to use Agile Process Guidance template shipped with the TFS. with little googling I passed through Mike Cohn materials: I watched his conference in youtube "sponsored by google: http://www.youtube.com/watch?v=fb9Rzyi8b90 http://www.youtube.com/watch?v=jeT0pOVg0EI Read his book "Agile Estimating and Planning" Watching the video series in his website: http://www.mountaingoatsoftware.com/presentations-tag/video-recorded I was very happy while absorbing and eating the techniques he is using with the teams and how agile and scrum is such a great software process/methodology until I saw Mike answering a question regarding an architect role and talking about the requirements document... at that point everything start falling apart due to the following: Last year I had been assigned to make full analysis "including requirements gathering" for big project "very high priority project". within 2 months of hardwork, dedication and commitment I delivered the whole analysis with full satisfaction of the customer and my BOSS and ZERO amendments. Later on, the project entered the architecting, development ... phases. due to the fact that the system included many competitive and exciting features I requested patenting it and its going in the process... so imagine you are the kind of person who used to love facing all kind of challenges and returning with excellent experience and results for the stakeholders and yourself, How fairly agile and scrum processes will credit and admit your talent and passion while the scrum master/coach treat the team as one unit that accomplish user stories and converge through trial and error approach??!!!! with that dark thoughts about agile and scrum I found many people "anti agile" and on top of them is "Crispin Rogers Johnson": http://agile-crispin.blogspot.com/ that guy made anti statement for everything Mike Cohn used to talk about. I really don't know what to do next! so any guidance will be appreciated. Thanks,

Read the article
rails + compass: advantages vs using haml + blueprint directly

- by egarcia

I've got some experience using haml (+sass) on rails projects. I recently started using them with blueprintcss - the only thing I did was transform blueprint.css into a sass file, and started coding from there. I even have a rails generator that includes all this by default. It seems that Compass does what I do, and other things. I'm trying to understand what those other things are - but the documentation/tutorials weren't very clear. These are my conclusions: Compass comes with built-in sass mixins that implement common CSS idioms, such as links with icons or horizontal lists. My solution doesn't provide anything like that. (1 point for Compass). Compass has several command-line options: you can create a rails project, but you can also "install" it on an existing rails project. A rails generator could be personalized to do the same thing, I guess. (Tie). Compass has two modes of working with blueprint: "basic" and "semantic" usage. I'm not clear about the differences between those. With my rails generator I only have one mode, but it seems enough. (Tie) Apparently, Compass is prepared to use other frameworks, besides blueprint (e.g. YUI). I could not find much documentation about this, and I'm not interested on it anyway - blueprint is ok for me (Tie). Compass' learning curve seems a bit stiff and the documentation seems sparse. Learning could be a bit difficult. On the other hand, I know the ins and outs of my own system and can use it right away. (1 point for my system). With this analysis, I'm hesitant to give Compass a try. Is my analysis correct? Are Am I missing any key points, or have I evaluated any of these points wrongly?

Read the article
structured vs. unstructured data in db

- by Igor

the question is one of design. i'm gathering a big chunk of performance data with lots of key-value pairs. pretty much everything in /proc/cpuinfo, /proc/meminfo/, /proc/loadavg, plus a bunch of other stuff, from several hundred hosts. right now, i just need to display the latest chunk of data in my UI. i will probably end up doing some analysis of the data gathered to figure out performance problems down the road, but this is a new application so i'm not sure what exactly i'm looking for performance-wise just yet. i could structure the data in the db -- have a column for each key i'm gathering. the table would end up being O(100) columns wide, it would be a pain to put into the db, i would have to add new columns if i start gathering a new stat. but it would be easy to sort/analyze the data just using SQL. or i could just dump my unstructured data blob into the table. maybe three columns -- host id, timestamp, and a serialized version of my array, probably using JSON in a TEXT field. which should I do? am i going to be sorry if i go with the unstructured approach? when doing analysis, should i just convert the fields i'm interested in and create a new, more structured table? what are the trade-offs i'm missing here?

Read the article
What is the difference between cubes and the Unified Dimensional Model (if any)?

- by ngm

I'm currently researching SQL Server 2008 as a business intelligence solution, and currently looking at Analysis Services (and I'm pretty new to business intelligence as a whole...) I'm a bit confused by some of the terms in SSAS, particularly the conceptual differences between cubes and MS's Unified Dimensional Model. I believe that a cube in SSAS is basically an OLAP cube -- dimensions, measures, something that sits between the underlying data source and a business user. But then that's kind of what I understand UDM to be as well. The docs for SQL Server 2005 seem to suggest as much: "A cube is essentially synonymous with a Unified Dimensional Model (UDM)". But then the SQL Server 2008 pages sort of suggest that UDM is a wrapper for both multidimensional data (cubes) and relational data: "Use the Unified Dimensional Model to provide one consolidated business view for relational and multidimensional data that includes business entities, business logic, calculations, and metrics." This blog post suggests similarly: "UDM provides a single dimensional model for all OLAP analysis and relational reporting needs. So you can use either MDX or SQL" Is UDM something that sits above cubes? Or are they the same thing? I presume I would develop cubes with the Cube Designer application; what would I develop a UDM with?

Read the article
Is there a disassembler + debugger for java (ala OllyDbg / SoftICE for assembler)?

- by Ran Biron

Is there a utility similar to OllyDbg / SoftICE for java? I.e. execute class (from jar / with class path) and, without source code, show the disassembly of the intermediate code with ability to step through / step over / search for references / edit specific intermediate code in memory / apply edit to file... If not, is it even possible to write something like this (assuming we're willing to live without hotspot for the debug duration)? Edit: I'm not talking about JAD or JD or Cavaj. These are fine decompilers, but I don't want a decompiler for several reasons, most notable is that their output is incorrect (at best, sometimes just plain wrong). I'm not looking for a magical "compiled bytes to java code" - I want to see the actual bytes that are about to be executed. Also, I'd like the ability to change those bytes (just like in an assembly debugger) and, hopefully, write the changed part back to the class file. Edit2: I know javap exists - but it does only one way (and without any sort of analysis). Example (code taken from the vmspec documentation): From java code, we use "javac" to compile this: void setIt(int value) { i = value; } int getIt() { return i; } to a java .class file. Using javap -c I can get this output: Method void setIt(int) 0 aload_0 1 iload_1 2 putfield #4 5 return Method int getIt() 0 aload_0 1 getfield #4 4 ireturn This is OK for the disassembly part (not really good without analysis - "field #4 is Example.i"), but I can't find the two other "tools": A debugger that goes over the instructions themselves (with stack, memory dumps, etc), allowing me to examine the actual code and environment. A way to reverse the process - edit the disassembled code and recreate the .class file (with the edited code).

Read the article
Scripts to parse and download iTunes Connect and AppStore data

- by bradhouse

I'm looking for recommendations of a script or series of scripts that download and parse iTunes Connect sales data and AppStore comments, ratings and rankings data for a defined app. I'm also aware of solutions like: AppViz appsales-mobile iphone-stats Heartbeat.app I'm sure I'll find a few more with more searching. I can't help but feel there must be a really decent set of open source scripts out there to do this, given how many developers are now writing apps for the AppStore. Would be interested to hear any commercial offerings as well (although my personal preference is for open source, so I can at least see what it is doing with my iTunes Connect login credentials). To be clear, I'm really looking for something that hits all of the areas mentioned: App Store (per store) Comments Ratings Category/store rankings iTunes Connect The contents of the sales reports Analysis/graphs of the data is not necessary (but would be a nice to have I guess). I'm not really looking for something like AppSales Mobile above, I would like the raw data so I can do my own analysis and formatting. So far it looks like AppViz (listed above) is the best out there. Any suggestions on what is good/available or should I just go roll my own?

Read the article
Getting started with massive data

- by Max

I'm a math guy and occasionally do some statistics/machine learning analysis consulting projects on the side. The data I have access to are usually on the smaller side, at most a couple hundred of megabytes (and almost always far less), but I want to learn more about handling and analyzing data on the gigabyte/terabyte scale. What do I need to know and what are some good resources to learn from? Hadoop/MapReduce is one obvious start. Is there a particular programming language I should pick up? (I primarily work now in Python, Ruby, R, and occasionally Java, but it seems like C and Clojure are often used for large-scale data analysis?) I'm not really familiar with the whole NoSQL movement, except that it's associated with big data. What's a good place to learn about it, and is there a particular implementation (Cassandra, CouchDB, etc.) I should get familiar with? Where can I learn about applying machine learning algorithms to huge amounts of data? My math background is mostly on the theory side, definitely not on the numerical or approximation side, and I'm guessing most of the standard ML algorithms don't really scale. Any other suggestions on things to learn would be great!

Read the article
deadlock because of foreign key?

- by George2

Hello everyone, I am using SQL Server 2008 Enterprise. I met with deadlock in the following store procedure, but because of my fault, I did not record the deadlock graph. But now I can not reproduce deadlock issue. I want to have a postmortem to find the root cause of deadlock to avoid deadlock in the future. The deadlock happens on delete statement. For the delete statement, Param1 is a column of table FooTable, Param1 is a foreign key of another table (refers to another primary key clustered index column of the other table). There is no index on Param1 itself for table FooTable. FooTable has another column which is used as clustered primary key, but not Param1 column. Here is my guess why there is deadlock, and I want to let people review whether my analysis is correct? Since Param1 column has no index, there will be a table scan, and will acquire table level lock, because of foreign key, the delete operation will also need to check master table (e.g. to acquire lock on master table); Some operation on master table acquires master table lock, but want to acquire lock on FooTable; (1) and (2) cause cycle lock which makes deadlock happen. My analysis correct? Any reproduce scenario? create PROCEDURE [dbo].[FooProc] ( @Param1 int ,@Param2 int ,@Param3 int ) AS DELETE FooTable WHERE Param1 = @Param1 INSERT INTO FooTable ( Param1 ,Param2 ,Param3 ) VALUES ( @Param1 ,@Param2 ,@Param3 ) DECLARE @ID bigint SET @ID = ISNULL(@@Identity,-1) IF @ID > 0 BEGIN SELECT IdentityStr FROM FooTable WHERE ID = @ID END thanks in advance, George

Read the article
Simplest way to automatically alter "const" value in Java during compile time

- by Michael Mao

Hi all: This is a question corresponds to my uni assignment so I am very sorry to say I cannot adopt any one of the following best practices in a short time -- you know -- assignment is due tomorrow :( link to Best way to alter const in Java on StackOverflow Basically the only task (I hope so) left for me is the performance tuning. I've got a bunch of predefined "const" values in my single-class agent source code like this: //static final values private static final long FT_THRESHOLD = 400; private static final long FT_THRESHOLD_MARGIN = 50; private static final long FT_SMOOTH_PRICE_IDICATOR = 20; private static final long FT_BUY_PRICE_MODIFIER = 0; private static final long FT_LAST_ROUNDS_STARTTIME = 90; private static final long FT_AMOUNT_ADJUST_MODIFIER = 5; private static final long FT_HISTORY_PIRCES_LENGTH = 10; private static final long FT_TRACK_DURATION = 5; private static final int MAX_BED_BID_NUM_PER_AUC = 12; I can definitely alter the values manually and then compile the code to give it another go around. But the execution time for a thorough "statistic analysis" usually requires over 2000 times of execution, which will typically lasts more than half an hour on my own laptop... So I hope there is a way to alter values using other ways than dig into the source code to change the "const" values there, so I can automatically distributed compiled code to other people's PC and let them run the statistic analysis instead. One other reason for a automatically value adjustment is that I can try using my own agent to defeat itself by choosing different "const" values. Although my values are derived from previous history and statistical results, they are far from the most optimized values. I hope there is a easy way so I can quickly adopt that so to have a good sleep tonight while the computer does everything for me... :) Any hints on this sort of stuff? Any suggestion is welcomed and much appreciated.

Read the article
Is this a good job description? What title would you give this position?

- by Zack Peterson

Department: Information Technology Reports To: Chief Information Officer Purpose: Company's ________________ is specifically engaged in the development of World Wide Web applications and distributed network applications. This person is concerned with all facets of the software development process and specializes in software product management. He or she contributes to projects in an application architect role and also performs individual programming tasks. Essential Duties & Responsibilities: This person is involved in all aspects of the software development process such as: Participation in software product definitions, including requirements analysis and specification Development and refinement of simulations or prototypes to confirm requirements Feasibility and cost-benefit analysis, including the choice of architecture and framework Application and database design Implementation (e.g. installation, configuration, customization, integration, data migration) Authoring of documentation needed by users and partners Testing, including defining/supporting acceptance testing and gathering feedback from pre-release testers Participation in software release and post-release activities, including support for product launch evangelism (e.g. developing demonstrations and/or samples) and subsequent product build/release cycles Maintenance Qualifications: Bachelor's degree in computer science or software engineering Several years of professional programming experience Proficiency in the general technology of the World Wide Web: Hypertext Transfer Protocol (HTTP) Hypertext Markup Language (HTML) JavaScript Cascading Style Sheets (CSS) Proficiency in the following principles, practices, and techniques: Accessibility Interoperability Usability Security (especially prevention of SQL injection and cross-site scripting (XSS) attacks) Object-oriented programming (e.g. encapsulation, inheritance, modularity, polymorphism, etc.) Relational database design (e.g. normalization, orthogonality) Search engine optimization (SEO) Asynchronous JavaScript and XML (AJAX) Proficiency in the following specific technologies utilized by Company: C# or Visual Basic .NET ADO.NET (including ADO.NET Entity Framework) ASP.NET (including ASP.NET MVC Framework) Windows Presentation Foundation (WPF) Language Integrated Query (LINQ) Extensible Application Markup Language (XAML) jQuery Transact-SQL (T-SQL) Microsoft Visual Studio Microsoft Internet Information Services (IIS) Microsoft SQL Server Adobe Photoshop

Read the article
Can ScalaCheck/Specs warnings safely be ignored when using SBT with ScalaTest?

- by pdbartlett

I have a simple FunSuite-based ScalaTest: package pdbartlett.hello_sbt import org.scalatest.FunSuite class SanityTest extends FunSuite { test("a simple test") { assert(true) } test("a very slightly more complicated test - purposely fails") { assert(42 === (6 * 9)) } } Which I'm running with the following SBT project config: import sbt._ class HelloSbtProject(info: ProjectInfo) extends DefaultProject(info) { // Dummy action, just to show config working OK. lazy val solveQ = task { println("42"); None } // Managed dependencies val scalatest = "org.scalatest" % "scalatest" % "1.0" % "test" } However, when I runsbt test I get the following warnings: ... [info] == test-compile == [info] Source analysis: 0 new/modified, 0 indirectly invalidated, 0 removed. [info] Compiling test sources... [info] Nothing to compile. [warn] Could not load superclass 'org.scalacheck.Properties' : java.lang.ClassNotFoundException: org.scalacheck.Properties [warn] Could not load superclass 'org.specs.Specification' : java.lang.ClassNotFoundException: org.specs.Specification [warn] Could not load superclass 'org.specs.Specification' : java.lang.ClassNotFoundException: org.specs.Specification [info] Post-analysis: 3 classes. [info] == test-compile == For the moment I'm assuming these are just "noise" (caused by the unified test interface?) and that I can safely ignore them. But it is slightly annoying to some inner OCD part of me (though not so annoying that I'm prepared to add dependencies for the other frameworks). Is this a correct assumption, or are there subtle errors in my test/config code? If it is safe to ignore, is there any other way to suppress these errors, or do people routinely include all three frameworks so they can pick and choose the best approach for different tests? TIA, Paul. (ADDED: scala v2.7.7 and sbt v0.7.4)

Read the article
R: Are there any alternatives to loops for subsetting from an optimization standpoint?

- by Adam

A recurring analysis paradigm I encounter in my research is the need to subset based on all different group id values, performing statistical analysis on each group in turn, and putting the results in an output matrix for further processing/summarizing. How I typically do this in R is something like the following: data.mat <- read.csv("...") groupids <- unique(data.mat$ID) #Assume there are then 100 unique groups results <- matrix(rep("NA",300),ncol=3,nrow=100) for(i in 1:100) { tempmat <- subset(data.mat,ID==groupids[i]) #Run various stats on tempmat (correlations, regressions, etc), checking to #make sure this specific group doesn't have NAs in the variables I'm using #and assign results to x, y, and z, for example. results[i,1] <- x results[i,2] <- y results[i,3] <- z } This ends up working for me, but depending on the size of the data and the number of groups I'm working with, this can take up to three days. Besides branching out into parallel processing, is there any "trick" for making something like this run faster? For instance, converting the loops into something else (something like an apply with a function containing the stats I want to run inside the loop), or eliminating the need to actually assign the subset of data to a variable?

Read the article
Using an element against an entire list in Haskell

- by Snick

I have an assignment and am currently caught in one section of what I'm trying to do. Without going in to specific detail here is the basic layout: I'm given a data element, f, that holds four different types inside (each with their own purpose): data F = F Float Int, Int a function: func :: F -> F-> Q Which takes two data elements and (by simple calculations) returns a type that is now an updated version of one of the types in the first f. I now have an entire list of these elements and need to run the given function using one data element and return the type's value (not the data element). My first analysis was to use a foldl function: myfunc :: F -> [F] -> Q myfunc y [] = func y y -- func deals with the same data element calls myfunc y (x:xs) = foldl func y (x:xs) however I keep getting the same error: "Couldn't match expected type 'F' against inferred type 'Q'. In the first argument of 'foldl', namely 'myfunc' In the expression: foldl func y (x:xs) I apologise for such an abstract analysis on my problem but could anyone give me an idea as to what I should do? Should I even use a fold function or is there recursion I'm not thinking about?

Read the article
SPSS - sum of squares change radically with slight model changes in ANOVA??

- by Pat

I have noticed that the sum of squares in my models can change fairly radically with even the slightest adjustment to my models???? Is this normal???? I'm using SPSS 16, and both models presented below used the same data and variables with only one small change - categorizing one of the variables as either a 2 level or 3 level variable. Details - using a 2 x 2 x 6 mixed model ANOVA with the 6 being the repeated measure i get the following in the between group analysis ------------------------------------------------------------ Source | Type III SS | df | MS | F | Sig ------------------------------------------------------------ intercept | 4086.46 | 1 | 4086.46 | 104.93 | .000 X | 224.61 | 1 | 224.61 | 5.77 | .019 Y | 2.60 | 1 | 2.60 | .07 | .80 X by Y | 19.25 | 1 | 19.25 | .49 | .49 Error | 2570.40 | 66 | 38.95 | Then, when I use the exact same data but a slightly different model in which variable Y has 3 levels instead of 2 levels I get the following ------------------------------------------------------------ Source | Type III SS | df | MS | F | Sig ------------------------------------------------------------ intercept | 3603.88 | 1 | 3603.88 | 90.89 | .000 X | 171.89 | 1 | 171.89 | 4.34 | .041 Y | 19.23 | 2 | 9.62 | .24 | .79 X by Y | 17.90 | 2 | 17.90 | .80 | .80 Error | 2537.76 | 64 | 39.65 | I don't understand why variable X would have a different sum of squares simply because variable Y gets devided up into 3 levels instead of 2. This is also the case in the within groups analysis too. Please help me understand :D Thank you in advance Pat

Read the article

Search Results

Search found 2683 results on 108 pages for 'statistical analysis'.

Page 46/108 | < Previous Page | 42 43 44 45 46 47 48 49 50 51 52 53 | Next Page >

- by jamiet

- by jeffrey.waterman

- by Tom Lewin

- by gt0084e1

- by Marco Russo (SQLBI)

- by Tony Davis

- by Red Gate Software BI Tools Team

- by Marco Russo (SQLBI)

- by Eric Lifka

- by Rob

- by yox

- by jadook

- by egarcia

- by Igor

- by ngm

- by Ran Biron

- by bradhouse

- by Max

- by George2

- by Michael Mao

- by Zack Peterson

- by pdbartlett

- by Adam

- by Snick

- by Pat

< Previous Page | 42 43 44 45 46 47 48 49 50 51 52 53 | Next Page >