Search Results

Search found 7473 results on 299 pages for 'usage statistics'.

Page 45/299 | < Previous Page | 41 42 43 44 45 46 47 48 49 50 51 52 | Next Page >

Statistical analysis on large data set to be published on the web

- by dassouki

I have a non-computer related data logger, that collects data from the field. This data is stored as text files, and I manually lump the files together and organize them. The current format is through a csv file per year per logger. Each file is around 4,000,000 lines x 7 loggers x 5 years = a lot of data. some of the data is organized as bins item_type, item_class, item_dimension_class, and other data is more unique, such as item_weight, item_color, date_collected, and so on ... Currently, I do statistical analysis on the data using a python/numpy/matplotlib program I wrote. It works fine, but the problem is, I'm the only one who can use it, since it and the data live on my computer. I'd like to publish the data on the web using a postgres db; however, I need to find or implement a statistical tool that'll take a large postgres table, and return statistical results within an adequate time frame. I'm not familiar with python for the web; however, I'm proficient with PHP on the web side, and python on the offline side. users should be allowed to create their own histograms, data analysis. For example, a user can search for all items that are blue shipped between week x and week y, while another user can search for sort the weight distribution of all items by hour for all year long. I was thinking of creating and indexing my own statistical tools, or automate the process somehow to emulate most queries. This seemed inefficient. I'm looking forward to hearing your ideas Thanks

Read the article
Howto Plot "Reverse" Cumulative Frequency Graph With ECDF

- by neversaint

I have no problem plotting the following cumulative frequency graph plot like this. library(Hmisc) pre.test <- rnorm(100,50,10) post.test <- rnorm(100,55,10) x <- c(pre.test, post.test) g <- c(rep('Pre',length(pre.test)),rep('Post',length(post.test))) Ecdf(x, group=g, what="f", xlab='Test Results', label.curves=list(keys=1:2)) But I want to show the graph in forms of the "reverse" cumulative frequency of values x ? (i.e. something equivalent to what="1-f"). Is there a way to do it? Other suggestions in R other than using Hmisc are also very much welcomed.

Read the article
what is the best way to track unique visitors?

- by dnkira

hello what i want is to make user counter as true as possible. exluding bots, and clever users as much as possible. as 4 what i know, it can be done in several ways: ip (trouble with dinamic ones and proxy's) cookies (with session id maybe, but can be deleted or browser can be changed) flash cookies (not all users have it) any other ways? and what is the best?

Read the article
R question. Using lappy on a data.frame and creating new variables w/ output.

- by Michael

Hello, I have 13 quantitative variables in a data.frame (called 'UNCA'). The variables are named q01_a, q01_b, ...q01_m. I want to create 13 new variables that have the same values but are coded as a factor. I would like to name these 13 new variables q01_a.F, q01_b.F, ...q01_m.F. Any help would be greatly appreciated!

Read the article
Dimension Reduction in Categorical Data with missing values

- by user227290

I have a regression model in which the dependent variable is continuous but ninety percent of the independent variables are categorical(both ordered and unordered) and around thirty percent of the records have missing values(to make matters worse they are missing randomly without any pattern, that is, more that forty five percent of the data hava at least one missing value). There is no a priori theory to choose the specification of the model so one of the key tasks is dimension reduction before running the regression. While I am aware of several methods for dimension reduction for continuous variables I am not aware of a similar statical literature for categorical data (except, perhaps, as a part of correspondence analysis which is basically a variation of principal component analysis on frequency table). Let me also add that the dataset is of moderate size 500000 observations with 200 variables. I have two questions. Is there a good statistical reference out there for dimension reduction for categorical data along with robust imputation (I think the first issue is imputation and then dimension reduction)? This is linked to implementation of above problem. I have used R extensively earlier and tend to use transcan and impute function heavily for continuous variables and use a variation of tree method to impute categorical values. I have a working knowledge of Python so if something is nice out there for this purpose then I will use it. Any implementation pointers in python or R will be of great help. Thank you.

Read the article
Calculate Spearman's Rank Correlation Coefficient

- by Arvigeus

I'm trying to calculate correlations using this method. However, when I do the math by hand, I get one result, and when I do it with R - cor(dt, use="complete.obs", method="spearman") - I get totally different thing. Example data below: --------------------------------- AnswerPosID.1007 AnswerPosID.1008 --------------------------------- 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 --------------------------------- Manual result: 0.799777530589544; R cor: -0.07142857

Read the article
What is the best way to store categorical references in SQL tables?

- by jlafay

I'm wanting to store a wide array of categorical data in MySQL database tables. Let's say that for instance I want to to information on "widgets" and want to categorize attributes in certain ways, i.e. shape category. For instance, the widgets could be classified as: round, square, triangular, spherical, etc. Should these categories be stored within a table to reference them best from an application? Another possibility, I would imagine, would be to add a column to widgets that contained a shape column that contained a tiny int. That way my application could search shapes by that and then use a coordinating enum type that would map the shape int meanings. Which would be best? Or is there another solution that I'm not thinking of yet?

Read the article
What is the relationship between a R.V N(0,1) and others continuous random variables

- by calejero

Hello. I have a question. I need to kwon what is the relationship between a Random Variable with Normal distribuation (N(0,1)) and others continuous random variables. Can you write to me an example? Thank You Hola a todos. Necesito saber qué relación existe entre una variable aleatoria Normal (0,1) y cualquier variable aleatoria continua. Además, me vendría bien un ejemplo. Gracias por vuestra ayuda

Read the article
Determining the chances of an event occurring when it hasn't occurred yet

- by sanity

A user visits my website at time t, and they may or may not click on a particular link I care about, if they do I record the fact that they clicked the link, and also the duration since t that they clicked it, call this d. I need an algorithm that allows me to create a class like this: class ClickProbabilityEstimate { public void reportImpression(long id); public void reportClick(long id); public double estimateClickProbability(long id); } Every impression gets a unique id, and this is used when reporting a click to indicate which impression the click belongs to. I need an algorithm that will return a probability, based on how much time has past since an impression was reported, that the impression will receive a click, based on how long previous clicks required. Clearly one would expect that this probability will decrease over time if there is still no click. If necessary, we can set an upper-bound, beyond which we consider the click probability to be 0 (eg. if its been an hour since the impression occurred, we can be pretty sure there won't be a click). The algorithm should be both space and time efficient, and hopefully make as few assumptions as possible, while being elegant. Ease of implementation would also be nice. Any ideas?

Read the article
Computing, storing, and retrieving values to and from an N-Dimensional matrix

- by Adam S

This question is probably quite different from what you are used to reading here - I hope it can provide a fun challenge. Essentially I have an algorithm that uses 5(or more) variables to compute a single value, called outcome. Now I have to implement this algorithm on an embedded device which has no memory limitations, but has very harsh processing constraints. Because of this, I would like to run a calculation engine which computes outcome for, say, 20 different values of each variable and stores this information in a file. You may think of this as a 5(or more)-dimensional matrix or 5(or more)-dimensional array, each dimension being 20 entries long. In any modern language, filling this array is as simple as having 5(or more) nested for loops. The tricky part is that I need to dump these values into a file that can then be placed onto the embedded device so that the device can use it as a lookup table. The questions now, are: What format(s) might be acceptable for storing the data? What programs (MATLAB, C#, etc) might be best suited to compute the data? C# must be used to import the data on the device - is this possible given your answer to #1?

Read the article
Given a document select a relevant snippet.

- by BCS

When I ask a question here, the tool tips for the question returned by the auto search given the first little bit of the question, but a decent percentage of them don't give any text that is any more useful for understanding the question than the title. Does anyone have an idea about how to make a filter to trim out useless bits of a question? My first idea is to trim any leading sentences that contain only words in some list (for instance, stop words, plus words from the title, plus words from the SO corpus that have very weak correlation with tags, that is that are equally likely to occur in any question regardless of it's tags)

Read the article
Probability Question

- by Juddling

if i pick 10 numbers from a possible 80. what is the probablity for someone else picking the same 10 numbers as me? or just 4 numbers? or no numbers? i think it's no matches: 10/80 one match: 10*9/80*79 so the formula would be (10!/matches!)/(80!/matches!) is this right? i've only just started doing this at A-level and i need it for a game script i'm making.

Read the article
Howto plot two cumulative frequency graph together

- by neversaint

I have data that looks like this: #val Freq1 Freq2 0.000 178 202 0.001 4611 5300 0.002 99 112 0.003 26 30 0.004 17 20 0.005 15 20 0.006 11 14 0.007 11 13 0.008 13 13 ...many more lines.. Full data can be found here: http://dpaste.com/173536/plain/ What I intend to do is to have a cumulative graph with "val" as x-axis with "Freq1" & "Freq2" as y-axis, plot together in 1 graph. I have this code. But it creates two plots instead of 1. dat <- read.table("stat.txt",header=F); val<-dat$V1 freq1<-dat$V2 freq2<-dat$V3 valf1<-rep(val,freq1) valf2<-rep(val,freq2) valfreq1table<- table(valf1) valfreq2table<- table(valf2) cumfreq1=c(0,cumsum(valfreq1table)) cumfreq2=c(0,cumsum(valfreq2table)) plot(cumfreq1, ylab="CumFreq",xlab="Loglik Ratio") lines(cumfreq1) plot(cumfreq2, ylab="CumFreq",xlab="Loglik Ratio") lines(cumfreq2) What's the right way to approach this?

Read the article
Measuring the limit of a point on a smooth.spline in R

- by Subtle Array

I'm not sure if that's the right terminology. I've entered some data into R, and I've put a smoothingSpline through it using the following command. smoothingSpline = smooth.spline(year, rate, spar=0.35) plot(x,y) lines(smoothingSpline) Now I'd like to measure some limits (or where the curve is at a given y point), and maybe to some predictive analysis on points that extend beyond the graph. Are there commands in R for doing this?

Read the article
How to calculate cointegrations of two lists?

- by Damiano

Hello everybody! Thank you in advance for your help! I have two lists with some stocks prices, example: a = [10.23, 11.65, 12.36, 12.96] b = [5.23, 6.10, 8.3, 4.98] I can calculate the correlation of these two lists, with: import scipy.stats scipy.stats.pearsonr(a, b)[0] But, I didn't found a method to calculate the co-integration of two lists. Could you give me some advices? Thank you very much!

Read the article
How can I neatly clean my R workspace while preserving certain objects?

- by briandk

Suppose I'm messing about with some data by binding vectors together, as I'm wont to do on a lazy sunday afternoon. x <- rnorm(25, mean = 65, sd = 10) y <- rnorm(25, mean = 75, sd = 7) z <- 1:25 dd <- data.frame(mscore = x, vscore = y, caseid = z) I've now got my new dataframe dd, which is wonderful. But there's also still the detritus from my prior slicings and dicings: > ls() [1] "dd" "x" "y" "z" What's a simple way to clean up my workspace if I no longer need my "source" columns, but I want to keep the dataframe? That is, now that I'm done manipulating data I'd like to just have dd and none of the smaller variables that might inadvertently mask further analysis: > ls() [1] "dd" I feel like the solution must be of the form rm(ls[ -(dd) ]) or something, but I can't quite figure out how to say "please clean up everything BUT the following objects."

Read the article
R: Given a set of random numbers drawn from a continuous univariate distribution, find the distribut

- by knorv

Given a set of real numbers drawn from a unknown continuous univariate distribution (let's say is is one of beta, Cauchy, chi-square, exponential, F, gamma, Laplace, log-normal, normal, Pareto, Student's t, uniform and Weibull).. x <- c(15.771062,14.741310,9.081269,11.276436,11.534672,17.980860,13.550017,13.853336,11.262280,11.049087,14.752701,4.481159,11.680758,11.451909,10.001488,11.106817,7.999088,10.591574,8.141551,12.401899,11.215275,13.358770,8.388508,11.875838,3.137448,8.675275,17.381322,12.362328,10.987731,7.600881,14.360674,5.443649,16.024247,11.247233,9.549301,9.709091,13.642511,10.892652,11.760685,11.717966,11.373979,10.543105,10.230631,9.918293,10.565087,8.891209,10.021141,9.152660,10.384917,8.739189,5.554605,8.575793,12.016232,10.862214,4.938752,14.046626,5.279255,11.907347,8.621476,7.933702,10.799049,8.567466,9.914821,7.483575,11.098477,8.033768,10.954300,8.031797,14.288100,9.813787,5.883826,7.829455,9.462013,9.176897,10.153627,4.922607,6.818439,9.480758,8.166601,12.017158,13.279630,14.464876,13.319124,12.331335,3.194438,9.866487,11.337083,8.958164,8.241395,4.289313,5.508243,4.737891,7.577698,9.626720,16.558392,10.309173,11.740863,8.761573,7.099866,10.032640) .. is there some easy way in R to programmatically and automatically find the most likely distribution and the estimated distribution parameters?

Read the article
Cosmic Rays: what is the probability they will affect a program?

- by Mark Harrison

Once again I was in a design review, and encountered the claim that the probability of a particular scenario was "less than the risk of cosmic rays" affecting the program, and it occurred to me that I didn't have the faintest idea what that probability is. What is the probability of a cosmic ray hitting a computer and affecting the execution of the program?

Read the article
Proper use of use of "cor" function in R

- by order

I am interested to know what a proper x (vector matrix or data frame) input looks like. I am currently using the function in two different sorts of matrices. However, I am not sure how R would interpret my data the way I intend. I will explain the types of matrix by example. Type 1 Gene1 Gene2 Gene3 sample1 sample2 Type 2 Sample1 Sample2 Sample3 gene 1 gene 2 gene 3 Are either of these formats valid x parameters? I input both of types of matrices and get some results, but without knowing whether or not this a proper use the function, these are just random numbers. Thank you for your time. I apologize that this isn't more interesting.

Read the article
Which reporting tool would you recommend?

- by grady

Hi, I have some reporting functionality in my app and I want to improve it a bit. Its only SQL in a XML file which is read by some parser. There can be some params for that SQL and when its parsed and the params are injected, it is executed against my DB (SQL Server). I want to improve that a bit so that the results look better and are more flexible. This are the most important points I need to have: subtotals layout that can change dynamically according to settings in DB (like logo, slogan) possibility to use the same report template for several customers (same fields, but different logos,colors, slogans etc.) should run from an ASP.NET application It should be as dynamic as possible. I know of Crystal Reports and the Microsoft Reporting Tool. Are there any others that might be of interest and are my above points possible at all? Thanks for some ideas and hints :-)...

Read the article
Why to use local pipes instead of sockets for programs communication inside one computer?

- by Ole Jak

Why to use local pipes instead of sockets for programs communication inside one computer? Has Any one FRESH stats on who is faster and how nmuch, what is more sequre or less?

Read the article
In R: How do I choose the first 4 Rows of a Data Frame

- by Moe

How Do I go about getting the first 4 rows of my Dataframe: Weight Response 1 Control 59 0.0 2 Treatment 90 0.8 3 Treatment 47 0.1 4 Treamment 106 0.1 5 Control 85 0.7 6 Treatment 73 0.6 7 Control 61 0.2 In 'R'?

Read the article
In R, How to add Row for Information:

- by Moe

Hi, I'm trying to add a Row to my data.frame to spit out the average of the column. This is my data frame: Weight Response 1 Control 59 0.0 2 Treatment 90 0.8 3 Treatment 47 0.1 4 Treamment 106 0.1 5 Control 85 0.7 6 Treatment 73 0.6 7 Control 61 0.2 I'd like it to become: Weight Response 1 Control 59 0.0 2 Treatment 90 0.8 3 Treatment 47 0.1 4 Treamment 106 0.1 5 Control 85 0.7 6 Treatment 73 0.6 7 Control 61 0.2 8 AVERAGES 74 0.3 Thanks!

Read the article
Histrogram matching - image processing - c/c++

- by Raj

Hello I have two histograms. int Hist1[10] = {1,4,3,5,2,5,4,6,3,2}; int Hist1[10] = {1,4,3,15,12,15,4,6,3,2}; Hist1's distribution is of type multi-modal; Hist2's distribution is of type uni-modal with single prominent peak. My questions are Is there any way that i could determine the type of distribution programmatically? How to quantify whether these two histograms are similar/dissimilar? Thanks

Read the article
Pie chart of *nix shell use [closed]

- by hayk.mart

I've encountered a situation where it would be very helpful to know the breakdown of shell use by percentage. For example, I'm looking for something like bash: X%, sh: Y%, csh, tcsh, zsh, ksh, dash, etc.. Obviously, I know there are several complications - multiple shells, the definition of "use", uncertainty and so forth, but I would like to see an informed answer derived from actual data and based on some stated metric, even if the result could be horribly wrong. Bonus if there is historical data demonstrating a shift in preferences.

Read the article

< Previous Page | 41 42 43 44 45 46 47 48 49 50 51 52 | Next Page >