Search Results

Search found 1692 results on 68 pages for 'trending and statistics'.

Page 15/68 | < Previous Page | 11 12 13 14 15 16 17 18 19 20 21 22  | Next Page >

  • Screening (multi)collinearity in a reggresion model

    - by aL3xa
    I hope that this one is not going to be "ask-and-answer" question... here goes: (multi)collinearity refers to extremely high correlations between predictors in the regression model. How to cure them... well, sometimes you don't need to "cure" collinearity, since it doesn't affect regression model itself, but interpretation of an effect of individual predictors. One way to spot collinearity is to put each predictor as a dependent variable, and other predictors as independent variables, determine R2, and if it's larger than .9 (or .95), we can consider predictor redundant. This is one "method"... what about other approaches? Some of them are time consuming, like excluding predictors from model and watching for b-coefficient changes - they should be noticeably different. Of course, we must always bare in mind specific context/goal of analysis... Sometimes, only remedy is to repeat a research, but right now, I'm interested in various ways of screening redundant predictors when (multi)collinearity occurs in a regression model.

    Read the article

  • how to develop a program to minimize errors in human transcription of hand written surveys

    - by Alex. S.
    I need to develop custom software to do surveys. Questions may be of multiple choice, or free text in a very few cases. I was asked to design a subsystem to check if there is any error in the manual data entry for the multiple choices part. We're trying to speed up the user data entry process and to minimize human input differences between digital forms and the original questionnaires. The surveys are filled with handwritten marks and text by human interviewers, so it's possible to find hard to read marks, or also the user could accidentally select a different value in some question, and we would like to avoid that. The software must include some automatic control to detect possible typing differences. Each answer of the multiple choice questions has the same probability of being selected. This question has two parts: The GUI. The most simple thing I have in mind is to implement the most usable design of the questions display: use of large and readable fonts and space generously the choices. Is there something else? For faster input, I would like to use drop down lists (favoring keyboard over mouse). Given the questions are grouped in sections, I would like to show the answers selected for the questions of that section, but this could slow down the process. Any other ideas? The error checking subsystem. What else can I do to minimize or to check human typos in the multiple choice questions? Is this a solvable problem? is there some statistical methodology to check values that were entered by the users are the same from the hand filled forms? For example, let's suppose the survey has 5 questions, and each has 4 options. Let's say I have n survey forms filled in paper by interviewers, and they're ready to be entered in the software, then how to minimize the accidental differences that can have the manual transcription of the n surveys, without having to double check everything in the 5 questions of the n surveys? My first suggestion is that at the end of the processing of all the hand filled forms, the software could choose some forms randomly to make a double check of the responses in a few instances, but on what criteria can I make this selection? This validation would be enough to cover everything in a significant way? The actual survey is nation level and it has 56 pages with over 200 questions in total, so it will be a lot of hand written pages by many people, and the intention is to reduce the likelihood of errors and to optimize speed in the data entry process. The surveys must filled in paper first, given the complications of taking laptops or handhelds with the interviewers.

    Read the article

  • incremental way of counting quantiles for large set of data

    - by Gacek
    I need to count the quantiles for a large set of data. Let's assume we can get the data only through some portions (i.e. one row of a large matrix). To count the Q3 quantile one need to get all the portions of the data and store it somewhere, then sort it and count the quantile: List<double> allData = new List<double>(); foreach(var row in matrix) // this is only example. In fact the portions of data are not rows of some matrix { allData.AddRange(row); } allData.Sort(); double p = 0.75*allData.Count; int idQ3 = (int)Math.Ceiling(p) - 1; double Q3 = allData[idQ3]; Now, I would like to find a way of counting this without storing the data in some separate variable. The best solution would be to count some parameters od mid-results for first row and then adjust it step by step for next rows. Note: These datasets are really big (ca 5000 elements in each row) The Q3 can be estimated, it doesn't have to be an exact value. I call the portions of data "rows", but they can have different leghts! Usually it varies not so much (+/- few hundred samples) but it varies! This question is similar to this one: http://stackoverflow.com/questions/1058813/on-line-iterator-algorithms-for-estimating-statistical-median-mode-skewness But I need to count quantiles. ALso there are few articles in this topic, i.e.: http://web.cs.wpi.edu/~hofri/medsel.pdf http://portal.acm.org/citation.cfm?id=347195&dl But before I would try to implement these, I wanted to ask you if there are maybe any other, qucker ways of counting the 0.25/0.75 quantiles?

    Read the article

  • How do you combine "Revision Control" with "WorkFlow" for R?

    - by Tal Galili
    Hello all, I remember coming across R users writing that they use "Revision control" (e.g: "Source control"), and I am curious to know: How do you combine "Revision control" with your statistical analysis WorkFlow? Two (very) interesting discussions talk about how to deal with the WorkFlow. But neither of them refer to the revision control element: http://stackoverflow.com/questions/1266279/how-to-organize-large-r-programs http://stackoverflow.com/questions/1429907/workflow-for-statistical-analysis-and-report-writing A Long Update To The Question: Following some of the people's answers, and Dirk's question in the comment, I would like to direct my question a bit more. After reading the Wiki article about "revision control" (which I was previously not familiar with), it was clear to me that when using revision control, what one does is to build a development structure of his code. This structure either leads to a "final product" or to several branches. When building something like, let's say, a website. There is usually one end product you work towards (the website), with some prototypes along the way. But when doing a statistical analysis, the work (to my view) is different. Sometimes you know where you want to get to. But more often, you explore. Explore cleaning the dataset. Explore different methods for statistical analysis, and ask various questions of your data (and I am writing this, knowing how Frank Harrell, and other experience statisticians feels about Data dredging). That is way the WorkFlow question with statistical programming is (in my view) a serious and deep question, raising many issues, The simpler ones are technical: Which revision control software do you use (and why) ? Which IDE do you use(and why) ? The more interesting question are about work process: How do you structure your files? What do you keep as a separate file and what as a revision? or asking in a different way - What should be a "branch" and what should be a "sub project" in your code? For example: When starting to explore your data, should a plot be creating and then erased because it didn't lead any where (but kept as a revision) or should there be a backup file of that path? How you solve this tension was my initial curiosity. The second question is "what might I be missing?". What rules (of thumb) should one follow so to avoid common pitfalls doing statistical programming with version control? In my intuition, I feel that statistical programming is inherently different then software development (I am writing this without being a real expert in statistical programming, and even less so in software development). That's way I am unsure which of the lessons I have read here about version control would be applicable. Thanks a lot, Tal

    Read the article

  • Significance in R

    - by Gemsie
    Ok, this is quite hard to explain, but I'm at a complete loss what to do. I'm a relative newcomer to R and although I can completely admire how powerful it is, I'm not too good at actually using it.... Basically, I have some very contrived data that I need to analyse (it wasn't me who chose this, I can assure you!). I have the right and left hand lengths of lots of people, as well as some numeric data that shows their sociability. Now I would like to know if people who have significantly different lengths of hand are more or less sociable than those who have the same (leading into the research that 'symmetrical' people are more sociable and intelligent, etc. I have got as far as loading the data into R, then I have no idea where to go from there. How on Earth do I start to separate those who are close to symmetrical to those who aren't to then start to do the analysis? Ok, using Sasha's great advice, I did the cor.test and got the following: Pearson's product-moment correlation data: measurements$l.hand - measurements$r.hand and measurements$sociable t = 0.2148, df = 150, p-value = 0.8302 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: -0.1420623 0.1762437 sample estimates: cor 0.01753501 I have never used this test before, so am unsure how to intepret it...you wouldn't think I was on my fourth Scientific degree would you?! :(

    Read the article

  • R: Forecast package: Automatic algorithm for composite model involving ETS and AR

    - by phanikishan
    Hey, I would like to write a code involving automatic selection of a best composite model using ETS as well as autoregressive models. What is the criteria I should base my selection on? Also if I'm using the auto.arima function for deducing number of AR terms and corresponding coefficients from the forecast package in R, does my input series necessarily have to be stationary? or the value for d would be automatically selected thus returning a non-stationary model? Thanks, Phani

    Read the article

  • What's the best way to unit test code that generates random output?

    - by Flynn1179
    Specifically, I've got a method picks n items from a list in such a way that a% of them meet one criterion, and b% meet a second, and so on. A simplified example would be to pick 5 items where 50% have a given property with the value 'true', and 50% 'false'; 50% of the time the method would return 2 true/3 false, and the other 50%, 3 true/2 false. Statistically speaking, this means that over 100 runs, I should get about 250 true/250 false, but because of the randomness, 240/260 is entirely possible. What's the best way to unit test this? I'm assuming that even though technically 300/200 is possible, it should probably fail the test if this happens. Is there a generally accepted tolerance for cases like this, and if so, how do you determine what that is?

    Read the article

  • Discrete problem of probability theory [closed]

    - by calejero
    A jury consists of 12 persons each of which has, before the trial started, a probability of 0.4 to vote in favor of the defendant's innocence. During the trial, the lawyer has a probability of 0.6 to change the mind of each juror who was biased against the accused. How likely is the defendant to be acquitted if he needs 10 votes in favor?

    Read the article

  • Summarising grouped records in a dataframe in R

    - by monch1962
    Hello all, I have a data frame in R that looks like this: > TimeOffset, Source, Length > 0 1 1500 > 0.1 1 1000 > 0.2 1 50 > 0.4 2 25 > 0.6 2 3 > 1.1 1 1500 > 1.4 1 18 > 1.6 2 2500 > 1.9 2 18 > 2.1 1 37 > ... and I want to convert it to > TimeOffset, Source, Length > 0.2 1 2550 > 0.6 2 28 > 1.4 1 1518 > 1.9 2 2518 > ... Trying to put this into English, I want to group consecutive records with the same 'Source' together, then printing out a single record per group showing the highest time offset in that group, the source, and the sum of the lengths in that group. The TimeOffset values will always increase. I suspect this is possible in R, but I really don't know where to start. In a pinch I could export the data frame out and do it in e.g. Python, but I'd prefer to stay within R if possible. Thanks in advance for any assistance you can provide

    Read the article

  • Summarising grouped records in a dataframe in R (...again)

    - by monch1962
    Hello all, (I tried to ask this question earlier today, but later realised I over-simplified the question; the answers I received were correct, but I couldn't use them because of my over-simplification of the problem in the original question. Here's my 2nd attempt...) I have a data frame in R that looks like: "Timestamp", "Source", "Target", "Length", "Content" 0.1 , P1 , P2 , 5 , "ABCDE" 0.2 , P1 , P2 , 3 , "HIJ" 0.4 , P1 , P2 , 4 , "PQRS" 0.5 , P2 , P1 , 2 , "ZY" 0.9 , P2 , P1 , 4 , "SRQP" 1.1 , P1 , P2 , 1 , "B" 1.6 , P1 , P2 , 3 , "DEF" 2.0 , P2 , P1 , 3 , "IJK" ... and I want to convert this to: "StartTime", "EndTime", "Duration", "Source", "Target", "Length", "Content" 0.1 , 0.4 , 0.3 , P1 , P2 , 12 , "ABCDEHIJPQRS" 0.5 , 0.9 , 0.4 , P2 , P1 , 6 , "ZYSRQP" 1.1 , 1.6 , 0.5 , P1 , P2 , 4 , "BDEF" ... Trying to put this into English, I want to group consecutive records with the same 'Source' and 'Target' together, then print out a single record per group showing the StartTime, EndTime & Duration (=EndTime-StartTime) for that group, along with the sum of the Lengths for that group, and a concatenation of the Content (which will all be strings) in that group. The TimeOffset values will always increase throughout the data frame. I had a look at melt/recast and have a feeling that it could be used to solve the problem, but couldn't get my head around the documentation. I suspect it's possible to do this within R, but I really don't know where to start. In a pinch I could export the data frame out and do it in e.g. Python, but I'd prefer to stay within R if possible. Thanks in advance for any assistance you can provide

    Read the article

  • Determining the popularity of a video with ratings and views

    - by user295825
    I am about to embark on a new project - a video website. Users will be able to register, and vote on videos by clicking "like" or "dislike", or something to that effect. In any event, it will be a 2-option voting system, not a 5-star system. Every X number of days, I will be generating a "chart" of the most popular videos. So my question is: how should I determine the popularity of a given video? If I went the route of tallying up the videos with the most views, this could have the effect of exceptionally bad videos making it to the of the charts (just because they're so bad). If I go the route of a scoring system based on the amount of "like" and "dislike" votes (eg. 100 like votes, and 50 dislike votes equals a score of 2), videos with few views could appear on the top of the charts. So, what I need to do is a combination of the two. Barring, of course, spammy views and votes. What's your guys' thoughts on the subject?

    Read the article

  • Generate random number from an arbitrary weighted list

    - by Fernando
    Here's what I need to do, I'll be doing this both in PHP and JavaScript. I have a list of numbers that will range from 1 to 300-500 (I haven't set the limit yet). I will be running a drawing were 10 numbers will be picked at random from the given range. Here's the tricky part: I want some numbers to be less likely to be drawn up. A small set of those 300-500 will be flagged as "lucky numbers". For example, out of 100 drawings, most numbers have equal chances of being drawn, except for a few, that will only be picked once every 30-50 drawings. Basically I need to artificially set the probability of certain numbers to be picked while maintaining an even distribution with the rest of the numbers. The only similar thing I've found so far is this question: Generate A Weighted Random Number, the problem being that my spec has considerably more numbers (up to 500) so the weights would get very small and supposedly this could be a problem with that solution (Rejection Sampling). I'm still trying it, though, but I wonder if there other solutions. Math is not my thing so I appreciate any input. Thanks.

    Read the article

  • How to change the icon in the title bar in R?

    - by Jared
    I just installed R 2.11.0-x64 onto my Windows 7 Professional machine. With my previous installations of R (2.10.1 32 bit was the most recent) the little icon that appeared in the title bar and in the taskbar at the bottom of windows was the R "R." Now however, the icon almost looks like a small windows Task Manager. I know this isn't a code issue, but it affects me as I flip between windows. Is there a way to put the "R" icon back in there? Would it be an R setting or a Windows setting?

    Read the article

  • What percent of web sites use JavaScript?

    - by Claudiu
    I'm wondering just how pervasive JavaScript is. This article states that 73% of websites they tested rely on JavaScript for important functionality, but it seems to me that the number must be larger. Have any surveys been done on this topic? Maybe a better way to phrase this question is - are there any sites that don't use JavaScript? EDIT: By 'use', I don't necessarily mean "rely on for important functionality" - that was just the statistic that one article gave.

    Read the article

  • How to generate correlated binary variables

    - by jonalm
    Dear All I need to generate a series of N random binary variables with a given correlation function. Let x = {x_i} be a series of binary variables (taking the value 0 or 1, i running form 1 to N). The marginal probability is given Pr(x_i = 1) = p, and the values should be correlated in the following way E[ x_i x_j ] = const * |i-j|^-alfa where alfa is a positive number. Is it possible to generate a series like this? preferably in python.

    Read the article

  • Is NoSQL ideal to store stats?

    - by Ivan
    I'm not terribly familiar with NoSQL systems, but I remember reading a while back that they are ideal to handle statistical data. Since I'm about to start writing code that will record data like "how many users were registered on each day", I was thinking I could use this as an opportunity to learn more about NoSQL if it fits the bill. If NoSQL is indeed ideal for this, could you provide me with some information as to why? And which specific systems are best suited for this particular need? Thanks!

    Read the article

  • Suggest a good book for Quantitative Methods & R Programming

    - by Rahul
    Hi folks, Please suggest a good book for beginner in Quantitative Methods/Techniques. Adding to this, a good book for beginners in R programming language, used in Quantitative Methods. And I've a few questions about this: ? Should I have to learn the other subjects like Probability, Statics, etc. before learning Quantitative Methods ? Is there any relation between Quantitative Methods & Data Mining

    Read the article

  • Statistical analysis on large data set to be published on the web

    - by dassouki
    I have a non-computer related data logger, that collects data from the field. This data is stored as text files, and I manually lump the files together and organize them. The current format is through a csv file per year per logger. Each file is around 4,000,000 lines x 7 loggers x 5 years = a lot of data. some of the data is organized as bins item_type, item_class, item_dimension_class, and other data is more unique, such as item_weight, item_color, date_collected, and so on ... Currently, I do statistical analysis on the data using a python/numpy/matplotlib program I wrote. It works fine, but the problem is, I'm the only one who can use it, since it and the data live on my computer. I'd like to publish the data on the web using a postgres db; however, I need to find or implement a statistical tool that'll take a large postgres table, and return statistical results within an adequate time frame. I'm not familiar with python for the web; however, I'm proficient with PHP on the web side, and python on the offline side. users should be allowed to create their own histograms, data analysis. For example, a user can search for all items that are blue shipped between week x and week y, while another user can search for sort the weight distribution of all items by hour for all year long. I was thinking of creating and indexing my own statistical tools, or automate the process somehow to emulate most queries. This seemed inefficient. I'm looking forward to hearing your ideas Thanks

    Read the article

  • Howto Plot "Reverse" Cumulative Frequency Graph With ECDF

    - by neversaint
    I have no problem plotting the following cumulative frequency graph plot like this. library(Hmisc) pre.test <- rnorm(100,50,10) post.test <- rnorm(100,55,10) x <- c(pre.test, post.test) g <- c(rep('Pre',length(pre.test)),rep('Post',length(post.test))) Ecdf(x, group=g, what="f", xlab='Test Results', label.curves=list(keys=1:2)) But I want to show the graph in forms of the "reverse" cumulative frequency of values x ? (i.e. something equivalent to what="1-f"). Is there a way to do it? Other suggestions in R other than using Hmisc are also very much welcomed.

    Read the article

  • Calculate Spearman's Rank Correlation Coefficient

    - by Arvigeus
    I'm trying to calculate correlations using this method. However, when I do the math by hand, I get one result, and when I do it with R - cor(dt, use="complete.obs", method="spearman") - I get totally different thing. Example data below: --------------------------------- AnswerPosID.1007 AnswerPosID.1008 --------------------------------- 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 --------------------------------- Manual result: 0.799777530589544; R cor: -0.07142857

    Read the article

< Previous Page | 11 12 13 14 15 16 17 18 19 20 21 22  | Next Page >