Search Results

Search found 7473 results on 299 pages for 'usage statistics'.

Page 44/299 | < Previous Page | 40 41 42 43 44 45 46 47 48 49 50 51 | Next Page >

R selecting duplicate rows

- by Matt

Okay, I'm fairly new to R and I've tried to search the documentation for what I need to do but here is the problem. I have a data.frame called heeds.data in the following form (some columns omitted for simplicity) eval.num, eval.count, ... fitness, fitness.mean, green.h.0, green.v.0, offset.0, green.h.1, green.v.1,...green.h.7, green.v.7, offset.7... And I have selected a row meeting the following criteria: best.fitness <- min(heeds.data$fitness.mean[heeds.data$eval.count = 10]) best.row <- heeds.data[heeds.data$fitness.mean == best.fitness] Now, what I want are all of the other rows with that have columns green.h.0 to offset.7 (a contiguous section of columns) equal to the best.row Basically I'm looking for rows that have some of the conditions the same as the "best" row. I thought I could just do this, heeds.best <- heeds.data$fitness[ heeds.data$green.h.0 == best.row$green.h.0 & ... ] But with 24 columns it seems like a stupid method. Looking for something a bit simpler with less manual typing. Thanks!

Read the article
Referal links - how does it work ?

- by oneat

Could you explain me? Because in stats I can find Refelar links and I am curious. How are they made? Is it placed somewhere in HTTP request?

Read the article
Screening (multi)collinearity in a regression model

- by aL3xa

I hope that this one is not going to be "ask-and-answer" question... here goes: (multi)collinearity refers to extremely high correlations between predictors in the regression model. How to cure them... well, sometimes you don't need to "cure" collinearity, since it doesn't affect regression model itself, but interpretation of an effect of individual predictors. One way to spot collinearity is to put each predictor as a dependent variable, and other predictors as independent variables, determine R2, and if it's larger than .9 (or .95), we can consider predictor redundant. This is one "method"... what about other approaches? Some of them are time consuming, like excluding predictors from model and watching for b-coefficient changes - they should be noticeably different. Of course, we must always bare in mind specific context/goal of analysis... Sometimes, only remedy is to repeat a research, but right now, I'm interested in various ways of screening redundant predictors when (multi)collinearity occurs in a regression model.

Read the article
incremental way of counting quantiles for large set of data

- by Gacek

I need to count the quantiles for a large set of data. Let's assume we can get the data only through some portions (i.e. one row of a large matrix). To count the Q3 quantile one need to get all the portions of the data and store it somewhere, then sort it and count the quantile: List<double> allData = new List<double>(); foreach(var row in matrix) // this is only example. In fact the portions of data are not rows of some matrix { allData.AddRange(row); } allData.Sort(); double p = 0.75*allData.Count; int idQ3 = (int)Math.Ceiling(p) - 1; double Q3 = allData[idQ3]; Now, I would like to find a way of counting this without storing the data in some separate variable. The best solution would be to count some parameters od mid-results for first row and then adjust it step by step for next rows. Note: These datasets are really big (ca 5000 elements in each row) The Q3 can be estimated, it doesn't have to be an exact value. I call the portions of data "rows", but they can have different leghts! Usually it varies not so much (+/- few hundred samples) but it varies! This question is similar to this one: http://stackoverflow.com/questions/1058813/on-line-iterator-algorithms-for-estimating-statistical-median-mode-skewness But I need to count quantiles. ALso there are few articles in this topic, i.e.: http://web.cs.wpi.edu/~hofri/medsel.pdf http://portal.acm.org/citation.cfm?id=347195&dl But before I would try to implement these, I wanted to ask you if there are maybe any other, qucker ways of counting the 0.25/0.75 quantiles?

Read the article
Naive Bayesian classification (spam filtering) - Doubt in one calculation? Which one is right? Plz c

- by Microkernel

Hi guys, I am implementing Naive Bayesian classifier for spam filtering. I have doubt on some calculation. Please clarify me what to do. Here is my question. In this method, you have to calculate P(S|W) - Probability that Message is spam given word W occurs in it. P(W|S) - Probability that word W occurs in a spam message. P(W|H) - Probability that word W occurs in a Ham message. So to calculate P(W|S), should I do (1) (Number of times W occuring in spam)/(total number of times W occurs in all the messages) OR (2) (Number of times word W occurs in Spam)/(Total number of words in the spam message) So, to calculate P(W|S), should I do (1) or (2)? (I thought it to be (2), but I am not sure, so plz clarify me) I am refering http://en.wikipedia.org/wiki/Bayesian_spam_filtering for the info by the way. I got to complete the implementation by this weekend :( Thanks and regards, MicroKernel :) @sth: Hmm... Shouldn't repeated occurrence of word 'W' increase a message's spam score? In the your approach it wouldn't, right?. Lets take a scenario and discuss... Lets say, we have 100 training messages, out of which 50 are spam and 50 are Ham. and say word_count of each message = 100. And lets say, in spam messages word W occurs 5 times in each message and word W occurs 1 time in Ham message. So total number of times W occuring in all the spam message = 5*50 = 250 times. And total number of times W occuring in all Ham messages = 1*50 = 50 times. Total occurance of W in all of the training messages = (250+50) = 300 times. So, in this scenario, how do u calculate P(W|S) and P(W|H) ? Naturally we should expect, P(W|S) P(W|H)??? right. Please share your thought...

Read the article
Any C++ library for Johansen co-integration test ?

- by Faraz

Any Ideas ? Will be highly appreciated.

Read the article
how to develop a program to minimize errors in human transcription of hand written surveys

- by Alex. S.

I need to develop custom software to do surveys. Questions may be of multiple choice, or free text in a very few cases. I was asked to design a subsystem to check if there is any error in the manual data entry for the multiple choices part. We're trying to speed up the user data entry process and to minimize human input differences between digital forms and the original questionnaires. The surveys are filled with handwritten marks and text by human interviewers, so it's possible to find hard to read marks, or also the user could accidentally select a different value in some question, and we would like to avoid that. The software must include some automatic control to detect possible typing differences. Each answer of the multiple choice questions has the same probability of being selected. This question has two parts: The GUI. The most simple thing I have in mind is to implement the most usable design of the questions display: use of large and readable fonts and space generously the choices. Is there something else? For faster input, I would like to use drop down lists (favoring keyboard over mouse). Given the questions are grouped in sections, I would like to show the answers selected for the questions of that section, but this could slow down the process. Any other ideas? The error checking subsystem. What else can I do to minimize or to check human typos in the multiple choice questions? Is this a solvable problem? is there some statistical methodology to check values that were entered by the users are the same from the hand filled forms? For example, let's suppose the survey has 5 questions, and each has 4 options. Let's say I have n survey forms filled in paper by interviewers, and they're ready to be entered in the software, then how to minimize the accidental differences that can have the manual transcription of the n surveys, without having to double check everything in the 5 questions of the n surveys? My first suggestion is that at the end of the processing of all the hand filled forms, the software could choose some forms randomly to make a double check of the responses in a few instances, but on what criteria can I make this selection? This validation would be enough to cover everything in a significant way? The actual survey is nation level and it has 56 pages with over 200 questions in total, so it will be a lot of hand written pages by many people, and the intention is to reduce the likelihood of errors and to optimize speed in the data entry process. The surveys must filled in paper first, given the complications of taking laptops or handhelds with the interviewers.

Read the article
Screening (multi)collinearity in a reggresion model

- by aL3xa

I hope that this one is not going to be "ask-and-answer" question... here goes: (multi)collinearity refers to extremely high correlations between predictors in the regression model. How to cure them... well, sometimes you don't need to "cure" collinearity, since it doesn't affect regression model itself, but interpretation of an effect of individual predictors. One way to spot collinearity is to put each predictor as a dependent variable, and other predictors as independent variables, determine R2, and if it's larger than .9 (or .95), we can consider predictor redundant. This is one "method"... what about other approaches? Some of them are time consuming, like excluding predictors from model and watching for b-coefficient changes - they should be noticeably different. Of course, we must always bare in mind specific context/goal of analysis... Sometimes, only remedy is to repeat a research, but right now, I'm interested in various ways of screening redundant predictors when (multi)collinearity occurs in a regression model.

Read the article
What's the best way to unit test code that generates random output?

- by Flynn1179

Specifically, I've got a method picks n items from a list in such a way that a% of them meet one criterion, and b% meet a second, and so on. A simplified example would be to pick 5 items where 50% have a given property with the value 'true', and 50% 'false'; 50% of the time the method would return 2 true/3 false, and the other 50%, 3 true/2 false. Statistically speaking, this means that over 100 runs, I should get about 250 true/250 false, but because of the randomness, 240/260 is entirely possible. What's the best way to unit test this? I'm assuming that even though technically 300/200 is possible, it should probably fail the test if this happens. Is there a generally accepted tolerance for cases like this, and if so, how do you determine what that is?

Read the article
How do you combine "Revision Control" with "WorkFlow" for R?

- by Tal Galili

Hello all, I remember coming across R users writing that they use "Revision control" (e.g: "Source control"), and I am curious to know: How do you combine "Revision control" with your statistical analysis WorkFlow? Two (very) interesting discussions talk about how to deal with the WorkFlow. But neither of them refer to the revision control element: http://stackoverflow.com/questions/1266279/how-to-organize-large-r-programs http://stackoverflow.com/questions/1429907/workflow-for-statistical-analysis-and-report-writing A Long Update To The Question: Following some of the people's answers, and Dirk's question in the comment, I would like to direct my question a bit more. After reading the Wiki article about "revision control" (which I was previously not familiar with), it was clear to me that when using revision control, what one does is to build a development structure of his code. This structure either leads to a "final product" or to several branches. When building something like, let's say, a website. There is usually one end product you work towards (the website), with some prototypes along the way. But when doing a statistical analysis, the work (to my view) is different. Sometimes you know where you want to get to. But more often, you explore. Explore cleaning the dataset. Explore different methods for statistical analysis, and ask various questions of your data (and I am writing this, knowing how Frank Harrell, and other experience statisticians feels about Data dredging). That is way the WorkFlow question with statistical programming is (in my view) a serious and deep question, raising many issues, The simpler ones are technical: Which revision control software do you use (and why) ? Which IDE do you use(and why) ? The more interesting question are about work process: How do you structure your files? What do you keep as a separate file and what as a revision? or asking in a different way - What should be a "branch" and what should be a "sub project" in your code? For example: When starting to explore your data, should a plot be creating and then erased because it didn't lead any where (but kept as a revision) or should there be a backup file of that path? How you solve this tension was my initial curiosity. The second question is "what might I be missing?". What rules (of thumb) should one follow so to avoid common pitfalls doing statistical programming with version control? In my intuition, I feel that statistical programming is inherently different then software development (I am writing this without being a real expert in statistical programming, and even less so in software development). That's way I am unsure which of the lessons I have read here about version control would be applicable. Thanks a lot, Tal

Read the article
R: Forecast package: Automatic algorithm for composite model involving ETS and AR

- by phanikishan

Hey, I would like to write a code involving automatic selection of a best composite model using ETS as well as autoregressive models. What is the criteria I should base my selection on? Also if I'm using the auto.arima function for deducing number of AR terms and corresponding coefficients from the forecast package in R, does my input series necessarily have to be stationary? or the value for d would be automatically selected thus returning a non-stationary model? Thanks, Phani

Read the article
Significance in R

- by Gemsie

Ok, this is quite hard to explain, but I'm at a complete loss what to do. I'm a relative newcomer to R and although I can completely admire how powerful it is, I'm not too good at actually using it.... Basically, I have some very contrived data that I need to analyse (it wasn't me who chose this, I can assure you!). I have the right and left hand lengths of lots of people, as well as some numeric data that shows their sociability. Now I would like to know if people who have significantly different lengths of hand are more or less sociable than those who have the same (leading into the research that 'symmetrical' people are more sociable and intelligent, etc. I have got as far as loading the data into R, then I have no idea where to go from there. How on Earth do I start to separate those who are close to symmetrical to those who aren't to then start to do the analysis? Ok, using Sasha's great advice, I did the cor.test and got the following: Pearson's product-moment correlation data: measurements$l.hand - measurements$r.hand and measurements$sociable t = 0.2148, df = 150, p-value = 0.8302 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: -0.1420623 0.1762437 sample estimates: cor 0.01753501 I have never used this test before, so am unsure how to intepret it...you wouldn't think I was on my fourth Scientific degree would you?! :(

Read the article
Is there a good R API for accessing Google Docs?

- by James Thompson

I'm using R for data analysis, and I'm sharing some data with collaborators via Google docs. Is there a simple interface that I can use to access a R data.frame object to and from a Google Docs spreadsheet? If not, is there a similar API in other languages?

Read the article
Summarising grouped records in a dataframe in R (...again)

- by monch1962

Hello all, (I tried to ask this question earlier today, but later realised I over-simplified the question; the answers I received were correct, but I couldn't use them because of my over-simplification of the problem in the original question. Here's my 2nd attempt...) I have a data frame in R that looks like: "Timestamp", "Source", "Target", "Length", "Content" 0.1 , P1 , P2 , 5 , "ABCDE" 0.2 , P1 , P2 , 3 , "HIJ" 0.4 , P1 , P2 , 4 , "PQRS" 0.5 , P2 , P1 , 2 , "ZY" 0.9 , P2 , P1 , 4 , "SRQP" 1.1 , P1 , P2 , 1 , "B" 1.6 , P1 , P2 , 3 , "DEF" 2.0 , P2 , P1 , 3 , "IJK" ... and I want to convert this to: "StartTime", "EndTime", "Duration", "Source", "Target", "Length", "Content" 0.1 , 0.4 , 0.3 , P1 , P2 , 12 , "ABCDEHIJPQRS" 0.5 , 0.9 , 0.4 , P2 , P1 , 6 , "ZYSRQP" 1.1 , 1.6 , 0.5 , P1 , P2 , 4 , "BDEF" ... Trying to put this into English, I want to group consecutive records with the same 'Source' and 'Target' together, then print out a single record per group showing the StartTime, EndTime & Duration (=EndTime-StartTime) for that group, along with the sum of the Lengths for that group, and a concatenation of the Content (which will all be strings) in that group. The TimeOffset values will always increase throughout the data frame. I had a look at melt/recast and have a feeling that it could be used to solve the problem, but couldn't get my head around the documentation. I suspect it's possible to do this within R, but I really don't know where to start. In a pinch I could export the data frame out and do it in e.g. Python, but I'd prefer to stay within R if possible. Thanks in advance for any assistance you can provide

Read the article
Summarising grouped records in a dataframe in R

- by monch1962

Hello all, I have a data frame in R that looks like this: > TimeOffset, Source, Length > 0 1 1500 > 0.1 1 1000 > 0.2 1 50 > 0.4 2 25 > 0.6 2 3 > 1.1 1 1500 > 1.4 1 18 > 1.6 2 2500 > 1.9 2 18 > 2.1 1 37 > ... and I want to convert it to > TimeOffset, Source, Length > 0.2 1 2550 > 0.6 2 28 > 1.4 1 1518 > 1.9 2 2518 > ... Trying to put this into English, I want to group consecutive records with the same 'Source' together, then printing out a single record per group showing the highest time offset in that group, the source, and the sum of the lengths in that group. The TimeOffset values will always increase. I suspect this is possible in R, but I really don't know where to start. In a pinch I could export the data frame out and do it in e.g. Python, but I'd prefer to stay within R if possible. Thanks in advance for any assistance you can provide

Read the article
Discrete problem of probability theory [closed]

- by calejero

A jury consists of 12 persons each of which has, before the trial started, a probability of 0.4 to vote in favor of the defendant's innocence. During the trial, the lawyer has a probability of 0.6 to change the mind of each juror who was biased against the accused. How likely is the defendant to be acquitted if he needs 10 votes in favor?

Read the article
How popular is Groovy/Grails in the corporate world?

- by Patrick O Connell

Are there any figures for its adoption in corporate environments? Does anyone know of large corporations that have adopted it for projects?

Read the article
Cumulative Normal Distribution function in objective C

- by William Jockusch

Anyone know of a good implementation of this whose license is compatible with non-free iPhone apps? As suggested in this question, Boost looks absolutely wonderful. But as best I can tell, it is only available in C++. http://stackoverflow.com/questions/2328258/cumulative-normal-distribution-function-in-c

Read the article
Generate random number from an arbitrary weighted list

- by Fernando

Here's what I need to do, I'll be doing this both in PHP and JavaScript. I have a list of numbers that will range from 1 to 300-500 (I haven't set the limit yet). I will be running a drawing were 10 numbers will be picked at random from the given range. Here's the tricky part: I want some numbers to be less likely to be drawn up. A small set of those 300-500 will be flagged as "lucky numbers". For example, out of 100 drawings, most numbers have equal chances of being drawn, except for a few, that will only be picked once every 30-50 drawings. Basically I need to artificially set the probability of certain numbers to be picked while maintaining an even distribution with the rest of the numbers. The only similar thing I've found so far is this question: Generate A Weighted Random Number, the problem being that my spec has considerably more numbers (up to 500) so the weights would get very small and supposedly this could be a problem with that solution (Rejection Sampling). I'm still trying it, though, but I wonder if there other solutions. Math is not my thing so I appreciate any input. Thanks.

Read the article
Determining the popularity of a video with ratings and views

- by user295825

I am about to embark on a new project - a video website. Users will be able to register, and vote on videos by clicking "like" or "dislike", or something to that effect. In any event, it will be a 2-option voting system, not a 5-star system. Every X number of days, I will be generating a "chart" of the most popular videos. So my question is: how should I determine the popularity of a given video? If I went the route of tallying up the videos with the most views, this could have the effect of exceptionally bad videos making it to the of the charts (just because they're so bad). If I go the route of a scoring system based on the amount of "like" and "dislike" votes (eg. 100 like votes, and 50 dislike votes equals a score of 2), videos with few views could appear on the top of the charts. So, what I need to do is a combination of the two. Barring, of course, spammy views and votes. What's your guys' thoughts on the subject?

Read the article
Need a R package for piecewise linear regression ?

- by ablimit

Does anybody aware of a package for "piecewise linear regression" ?

Read the article
Suggest a good book for Quantitative Methods & R Programming

- by Rahul

Hi folks, Please suggest a good book for beginner in Quantitative Methods/Techniques. Adding to this, a good book for beginners in R programming language, used in Quantitative Methods. And I've a few questions about this: ? Should I have to learn the other subjects like Probability, Statics, etc. before learning Quantitative Methods ? Is there any relation between Quantitative Methods & Data Mining

Read the article
How to generate correlated binary variables

- by jonalm

Dear All I need to generate a series of N random binary variables with a given correlation function. Let x = {x_i} be a series of binary variables (taking the value 0 or 1, i running form 1 to N). The marginal probability is given Pr(x_i = 1) = p, and the values should be correlated in the following way E[ x_i x_j ] = const * |i-j|^-alfa where alfa is a positive number. Is it possible to generate a series like this? preferably in python.

Read the article
How to change the icon in the title bar in R?

- by Jared

I just installed R 2.11.0-x64 onto my Windows 7 Professional machine. With my previous installations of R (2.10.1 32 bit was the most recent) the little icon that appeared in the title bar and in the taskbar at the bottom of windows was the R "R." Now however, the icon almost looks like a small windows Task Manager. I know this isn't a code issue, but it affects me as I flip between windows. Is there a way to put the "R" icon back in there? Would it be an R setting or a Windows setting?

Read the article
Is NoSQL ideal to store stats?

- by Ivan

I'm not terribly familiar with NoSQL systems, but I remember reading a while back that they are ideal to handle statistical data. Since I'm about to start writing code that will record data like "how many users were registered on each day", I was thinking I could use this as an opportunity to learn more about NoSQL if it fits the bill. If NoSQL is indeed ideal for this, could you provide me with some information as to why? And which specific systems are best suited for this particular need? Thanks!

Read the article

< Previous Page | 40 41 42 43 44 45 46 47 48 49 50 51 | Next Page >