Search Results

Search found 1692 results on 68 pages for 'trending and statistics'.

Page 12/68 | < Previous Page | 8 9 10 11 12 13 14 15 16 17 18 19  | Next Page >

  • Modeling distribution of performance measurements

    - by peterchen
    How would you mathematically model the distribution of repeated real life performance measurements - "Real life" meaning you are not just looping over the code in question, but it is just a short snippet within a large application running in a typical user scenario? My experience shows that you usually have a peak around the average execution time that can be modeled adequately with a Gaussian distribution. In addition, there's a "long tail" containing outliers - often with a multiple of the average time. (The behavior is understandable considering the factors contributing to first execution penalty). My goal is to model aggregate values that reasonably reflect this, and can be calculated from aggregate values (like for the Gaussian, calculate mu and sigma from N, sum of values and sum of squares). In other terms, number of repetitions is unlimited, but memory and calculation requirements should be minimized. A normal Gaussian distribution can't model the long tail appropriately and will have the average biased strongly even by a very small percentage of outliers. I am looking for ideas, especially if this has been attempted/analysed before. I've checked various distributions models, and I think I could work out something, but my statistics is rusty and I might end up with an overblown solution. Oh, a complete shrink-wrapped solution would be fine, too ;) Other aspects / ideas: Sometimes you get "two humps" distributions, which would be acceptable in my scenario with a single mu/sigma covering both, but ideally would be identified separately. Extrapolating this, another approach would be a "floating probability density calculation" that uses only a limited buffer and adjusts automatically to the range (due to the long tail, bins may not be spaced evenly) - haven't found anything, but with some assumptions about the distribution it should be possible in principle. Why (since it was asked) - For a complex process we need to make guarantees such as "only 0.1% of runs exceed a limit of 3 seconds, and the average processing time is 2.8 seconds". The performance of an isolated piece of code can be very different from a normal run-time environment involving varying levels of disk and network access, background services, scheduled events that occur within a day, etc. This can be solved trivially by accumulating all data. However, to accumulate this data in production, the data produced needs to be limited. For analysis of isolated pieces of code, a gaussian deviation plus first run penalty is ok. That doesn't work anymore for the distributions found above. [edit] I've already got very good answers (and finally - maybe - some time to work on this). I'm starting a bounty to look for more input / ideas.

    Read the article

  • How can I calculate a trend line in PHP?

    - by Stephen
    So I've read the two related questions for calculating a trend line for a graph, but I'm still lost. I have an array of xy coordinates, and I want to come up with another array of xy coordinates (can be fewer coordinates) that represent a logarithmic trend line using PHP. I'm passing these arrays to javascript to plot graphs on the client side.

    Read the article

  • Do I need to recreate statistics if I had to drop them to add a foreign key

    - by Adam J.R. Erickson
    I have a database which had all it's foreign-key relationships dropped at some unknown time in the past (don't ask). I have an old copy of the database which isn't good to restore from, but the schema has the relationships. I'm working from that to create a script to restore the keys. In updating the tables, I've had to drop statistics from several tables. Do I need to manually recreate those, or can I just run the statistics update procedure when all the tables are updated?

    Read the article

  • How reliable is the battery data in ubuntu the power statistics?

    - by nbubis
    Right now the power statistics show that: Energy when full: 25.5 Wh Energy (design): 93.2 Wh And indeed the battery doesn't seem to be lasting as long as it used too. My question: Is this data reliable? Does it really indicate that I should replace the battery, or could it be the charger, laptop, or OS that is stopping the battery from fully charging? Is any way of validating the battery is indeed to blame? I'd like to be sure before shelling out 90$ for a new battery. (If it helps, the battery is a 3 year old dell 9 cell rated at 90 Wh).

    Read the article

  • Are there statistics or time series of open bugs in Ubuntu?

    - by aroque
    I would like to know how the number of bugs in Ubuntu (open, closed, critical, etc) has evolved with time. It's a sort of scientific curiosity I have, but it would also give me a feeling how the community has changed over time, how it has coped with the challenges (I think of Unity in particular) and what's its status now. Has anyone collected these data over the years? If yes, are they publicly available? I know this information can be gathered from Launchpad itself and actually I found a website that had data from mid 2008 to early 2009. I found Ubuntu live stats, which shows live messages related to Ubuntu, but does not aggregate bug statistics. Finally there are some stats on the Ubuntu Weekly Newsletter but they only show diffs of bugs closed during the last week.

    Read the article

  • Do you want to find out when statistics have been updated?

    - by simonsabin
    If you need to find out when statistics were last updated run the following select OBJECT_NAME ( s . object_id ) object    , s . name    , STATS_DATE ( s . object_id , s . stats_id ) StatsDate    , s . auto_created    , s . filter_definition    , s . has_filter    , s . no_recompute    , s . user_created    , stuff (( select ',' + col . name      from sys . stats_columns sc    join sys...(read more)

    Read the article

  • How to get the trending tags using php and mysql?

    - by Tom
    Hello I have a table : tags(tagname,entryid,stamp) and i want to make a section for the most trending tags today, the tagname column has no unique value, because many entries has the same tag, so the php code that i want should display the most attached tags today. Note: the "stamp" column is the date of adding the tag in UNIX time stamp format. Thanks

    Read the article

  • Graphing perpendicular offsets in a least squares regression plot in R

    - by D W
    I'm interested in making a plot with a least squares regression line and line segments connecting the datapoints to the regression line as illustrated here in the graphic called perpendicular offsets: http://mathworld.wolfram.com/LeastSquaresFitting.html I have the plot and regression line done here: ## Dataset from http://www.apsnet.org/education/advancedplantpath/topics/RModules/doc1/04_Linear_regression.html ## Disease severity as a function of temperature # Response variable, disease severity diseasesev<-c(1.9,3.1,3.3,4.8,5.3,6.1,6.4,7.6,9.8,12.4) # Predictor variable, (Centigrade) temperature<-c(2,1,5,5,20,20,23,10,30,25) ## Fit a linear model for the data and summarize the output from function lm() severity.lm <- lm(diseasesev~temperature,data=severity) # Take a look at the data plot( diseasesev~temperature, data=severity, xlab="Temperature", ylab="% Disease Severity", pch=16 ) abline(severity.lm,lty=1) title(main="Graph of % Disease Severity vs Temperature") Should I use some kind of for loop and segments http://www.iiap.res.in/astrostat/School07/R/html/graphics/html/segments.html to do the perpendicular offsets? Is there a more efficient way? Please provide an example if possible.

    Read the article

  • Reordering matrix elements to reflect column and row clustering in naiive python

    - by bgbg
    Hello, I'm looking for a way to perform clustering separately on matrix rows and than on its columns, reorder the data in the matrix to reflect the clustering and putting it all together. The clustering problem is easily solvable, so is the dendrogram creation (for example in this blog or in "Programming collective intelligence"). However, how to reorder the data remains unclear for me. Eventually, I'm looking for a way of creating graphs similar to the one below using naive Python (with any "standard" library such as numpy, matplotlib etc, but without using R or other external tools).

    Read the article

  • How do I determine a best-fit distribution in java?

    - by Eadwacer
    I have a bunch of sets of data (between 50 to 500 points, each of which can take a positive integral value) and need to determine which distribution best describes them. I have done this manually for several of them, but need to automate this going forward. Some of the sets are completely modal (every datum has the value of 15), some are strongly modal or bimodal, some are bell-curves (often skewed and with differing degrees of kertosis/pointiness), some are roughly flat, and there are any number of other possible distributions (possion, power-law, etc.). I need a way to determine which distribution best describes the data and (ideally) also provides me with a fitness metric so that I know how confident I am in the analysis. Existing open-source libraries would be ideal, followed by well documented algorithms that I can implement myself.

    Read the article

  • naive bayesian spam filter question

    - by Microkernel
    Hi guys, I am planning to implement spam filter using Naive Bayesian classification model. Online I see a lot of info on Naive Bayesian classification, but the problem is its a lot of mathematical stuff, than clearly stating how its done. And the problem is I am more of a programmer than a mathematician (yes I had learnt Probability and Bayesian theorem back in school, but out of touch for a long long time, and I don't have luxury of learning it now (Have nearly 3 weeks to come-up with a working prototype)). So if someone can explain or point me to location where its explained for programmers than a mathematician, it would be a great help. PS: By the way I have to implement it in C, if you want to know. :( Regards, Microkernel

    Read the article

  • Question with R. Element wise multiplication, addition, and division with 2 data.frames with varying

    - by Michael
    I have a various data.frames with columns of the same length where I am trying to multiple 2 rows together element-wise and then sum this up. For example, below are two vectors I would like to perform this operation with. > a.1[186,] q01_a q01_b q01_c q01_d q01_e q01_f q01_g q01_h q01_i q01_j q01_k q01_l q01_m 3 3 3 3 2 2 2 3 1 NA NA 2 2 and > u.1[186,] q04_avl_a q04_avl_b q04_avl_c q04_avl_d q04_avl_e q04_avl_f q04_avl_g q04_avl_h q04_avl_i q04_avl_j q04_avl_k q04_avl_l q04_avl_m 4 2 3 4 3 4 4 4 3 4 3 3 3` The issue is that various rows have varying numbers of NA's. What I would like to do is skip the multiplication with any missing values ( the 10th and 11th position from my above example), and then after the addition divide by the number of elements that were multiplied (11 from the above example). Most rows are complete and would just be multiplied by 13. Thank you!

    Read the article

  • Plot 2 graphs in same plot in R?

    - by Sandra Schlichting
    I would like to plot y1 and y2 in the same plot. x <- seq(-2, 2, 0.05) y1 <- pnorm(x) y2 <- pnorm(x,1,1) plot(x,y1,type="l",col="red") plot(x,y2,type="l",col="green") But when I do it like this, they are not plotted in the same plot together. In Matlab can one do hold on, but does anyone know how to do this in R?

    Read the article

  • simple Stata program

    - by Cyrus S
    I am trying to write a simple program to combine coefficient and standard error estimates from a set of regression fits. I run, say, 5 regressions, and store the coefficient(s) and standard error(s) of interest into vectors (Stata matrix objects, actually). Then, I need to do the following: Find the mean value of the coefficient estimates. Combine the standard error estimates according to the formula suggested for combining results from "multiple imputation". The formula is the square root of the formula for "T" on page 6 of the following document: http://bit.ly/b05WX3 I have written Stata code that does this once, but I want to write this as a function (or "program", in Stata speak) that takes as arguments the vector (or matrix, if possible, to combine multiple estimates at once) of regression coefficient estimates and the vector (or matrix) of corresponding standard error estimates, and then generates 1 and 2 above. Here is the code that I wrote: (breg is a 1x5 vector of the regression coefficient estimates, and sereg is a 1x5 vector of the associated standard error estimates) mat ones = (1,1,1,1,1) mat bregmean = (1/5)*(ones*breg’) scalar bregmean_s = bregmean[1,1] mat seregmean = (1/5)*(ones*sereg’) mat seregbtv = (1/4)*(breg - bregmean#ones)* (breg - bregmean#ones)’ mat varregmi = (1/5)*(sereg*sereg’) + (1+(1/5))* seregbtv scalar varregmi_s = varregmi[1,1] scalar seregmi = sqrt(varregmi_s) disp bregmean_s disp seregmi This gives the right answer for a single instance. Any pointers would be great! UPDATE: I completed the code for combining estimates in a kXm matrix of coefficients/parameters (k is the number of parameters, m the number of imputations). Code can be found here: http://bit.ly/cXJRw1 Thanks to Tristan and Gabi for the pointers.

    Read the article

  • Fitting Gaussian KDE in numpy/scipy in Python

    - by user248237
    I am fitting a Gaussian kernel density estimator to a variable that is the difference of two vectors, called "diff", as follows: gaussian_kde_covfact(diff, smoothing_param) -- where gaussian_kde_covfact is defined as: class gaussian_kde_covfact(stats.gaussian_kde): def __init__(self, dataset, covfact = 'scotts'): self.covfact = covfact scipy.stats.gaussian_kde.__init__(self, dataset) def _compute_covariance_(self): '''not used''' self.inv_cov = np.linalg.inv(self.covariance) self._norm_factor = sqrt(np.linalg.det(2*np.pi*self.covariance)) * self.n def covariance_factor(self): if self.covfact in ['sc', 'scotts']: return self.scotts_factor() if self.covfact in ['si', 'silverman']: return self.silverman_factor() elif self.covfact: return float(self.covfact) else: raise ValueError, \ 'covariance factor has to be scotts, silverman or a number' def reset_covfact(self, covfact): self.covfact = covfact self.covariance_factor() self._compute_covariance() This works, but there is an edge case where the diff is a vector of all 0s. In that case, I get the error: File "/srv/pkg/python/python-packages/python26/scipy/scipy-0.7.1/lib/python2.6/site-packages/scipy/stats/kde.py", line 334, in _compute_covariance self.inv_cov = linalg.inv(self.covariance) File "/srv/pkg/python/python-packages/python26/scipy/scipy-0.7.1/lib/python2.6/site-packages/scipy/linalg/basic.py", line 382, in inv if info>0: raise LinAlgError, "singular matrix" numpy.linalg.linalg.LinAlgError: singular matrix What's a way to get around this? In this case, I'd like it to return a density that's essentially peaked completely at a difference of 0, with no mass everywhere else. thanks.

    Read the article

  • Where can I find free and open data?

    - by kitsune
    Sooner or later, coders will feel the need to have access to "open data" in one of their projects, from knowing a city's zip to a more obscure information such as the axial tilt of Pluto. I know data.un.org which offers access to the UN's extensive array of databases that deal with human development and other socio-economic issues. The other usual suspects are NASA and the USGS for planetary data. There's an article at readwriteweb with more links. infochimps.org seems to stand out. Personally, I need to find historic commodity prices, stock values and other financial data. All these data sets seem to cost money however. Clarification To clarify, I'm interested in all kinds of open data, because sooner or later, I know I will be in a situation where I could need it. I will try to edit this answer and include the suggestions in a structured manners. A link for financial data was hidden in that readwriteweb article, doh! It's called opentick.com. Looks good so far! Update I stumbled over semantic data in another question of mine on here. There is opencyc ('the world's largest and most complete general knowledge base and commonsense reasoning engine'). A project called UMBEL provides a light-weight, distilled version of opencyc. Umbel has semantic data in rdf/owl/skos n3 syntax. The Worldbank also released a very nice API. It offers data from the last 50 years for about 200 countries

    Read the article

  • How to know if two stocks move togheter?

    - by Damiano
    Hello, I have two stocks with their prices, example: STOCK1: 10.56 11.23 12.32 8.90 STOCK2: 1.26 5.80 3.26 10.3 I only found Pearson correlation, but, is there another method to know if two stocks move togheter? (esample: co-integration??) Thank you so much!

    Read the article

  • How to get statistical distributions out of C++ Code?

    - by Bader
    I want some help in programming a random generator for different types of distribution using C++ language. for the following: Geometric distribution Hypergeometric distribution Weibull distribution Rayleigh distribution Erlang distribution Gamma distribution Poisson distribution Thanks.

    Read the article

  • Latent Dirichlet Allocation, pitfalls, tips and programs

    - by Gregg Lind
    I'm experimenting with Latent Dirichlet Allocation for topic disambiguation and assignment, and I'm looking for advice. Which program is the "best", where best is some combination of easiest to use, best prior estimation, fast How do I incorporate my intuitions about topicality. Let's say I think I know that some items in the corpus are really in the same category, like all articles by the same author. Can I add that into the analysis? Any unexpected pitfalls or tips I should know before embarking? I'd prefer is there are R or Python front ends for whatever program, but I expect (and accept) that I'll be dealing with C.

    Read the article

  • science.js’s loess() output is identical to input

    - by user3710111
    Rendered project available here. The line is supposed to be a trend line (as rendered with LOESS), but it merely follows each data point instead. I am no stats wonk, so maybe it makes sense that a LOESS function’s output would match the input as seen in the above example, but it strikes me as being wrong. Here is the relevant bit of code: var loess = science.stats.loess().bandwidth(.2); var xVal = data.map(function(d) { return d.date; }); var yVal = data.map(function(d) { return d.A; }); var loessData = loess([xVal], [yVal])[0]; console.log(yVal); console.log(loessData);

    Read the article

  • Student's t distribution in JavaScript

    - by Sai Emrys
    Google Spreadsheets currently does not support the standard function TDIST - i.e. the Student's t-distribution. This function is critical for calculating p-values. It seems that this is related to the fact that no integral-using functions (AFAICT) are implemented either. However, Google Docs allows people to add and publish their own scripts, in JavaScript. So ideally we should have something like: function tdist(t_value, degrees_of_freedom, two_tailed [defaults true]) {...} Anyone know of either an extant implementation of this (my google-fu has not turned up one, but may be weaker than yours) or a good idea for how to do it? I'd like to publish this together with some other useful functions that are currently calculable but a bit of a pain (like Student's t-test itself). Thanks!

    Read the article

  • Is there a Pair-Wise PostHoc Comparisons for the Chi-Square Test in R?

    - by Tal Galili
    Hi all, I am wondering if there exists in R a package/function to perform the: "Post Hoc Pair-Wise Comparisons for the Chi-Square Test of Homogeneity of Proportions" (or an equivalent of it) Which is described here: http://epm.sagepub.com/cgi/content/abstract/53/4/951 My situation is of just making a chi test, on a 2 by X matrix. I found a difference, but I want to know which of the columns is "responsible" for the difference. Thanks, Tal

    Read the article

  • Non-linear regression models in PostgreSQL using R

    - by Dave Jarvis
    Background I have climate data (temperature, precipitation, snow depth) for all of Canada between 1900 and 2009. I have written a basic website and the simplest page allows users to choose category and city. They then get back a very simple report (without the parameters and calculations section): The primary purpose of the web application is to provide a simple user interface so that the general public can explore the data in meaningful ways. (A list of numbers is not meaningful to the general public, nor is a website that provides too many inputs.) The secondary purpose of the application is to provide climatologists and other scientists with deeper ways to view the data. (Using too many inputs, of course.) Tool Set The database is PostgreSQL with R (mostly) installed. The reports are written using iReport and generated using JasperReports. Poor Model Choice Currently, a linear regression model is applied against annual averages of daily data. The linear regression model is calculated within a PostgreSQL function as follows: SELECT regr_slope( amount, year_taken ), regr_intercept( amount, year_taken ), corr( amount, year_taken ) FROM temp_regression INTO STRICT slope, intercept, correlation; The results are returned to JasperReports using: SELECT year_taken, amount, year_taken * slope + intercept, slope, intercept, correlation, total_measurements INTO result; JasperReports calls into PostgreSQL using the following parameterized analysis function: SELECT year_taken, amount, measurements, regression_line, slope, intercept, correlation, total_measurements, execute_time FROM climate.analysis( $P{CityId}, $P{Elevation1}, $P{Elevation2}, $P{Radius}, $P{CategoryId}, $P{Year1}, $P{Year2} ) ORDER BY year_taken This is not an optimal solution because it gives the false impression that the climate is changing at a slow, but steady rate. Questions Using functions that take two parameters (e.g., year [X] and amount [Y]), such as PostgreSQL's regr_slope: What is a better regression model to apply? What CPAN-R packages provide such models? (Installable, ideally, using apt-get.) How can the R functions be called within a PostgreSQL function? If no such functions exist: What parameters should I try to obtain for functions that will produce the desired fit? How would you recommend showing the best fit curve? Keep in mind that this is a web app for use by the general public. If the only way to analyse the data is from an R shell, then the purpose has been defeated. (I know this is not the case for most R functions I have looked at so far.) Thank you!

    Read the article

  • probability and relative frequency

    - by Alexandru
    If I use relative frequency to estimate the probability of an event, how good is my estimate based on the number of experiments? Is standard deviation a good measure? A paper/link/online book would be perfect. http://en.wikipedia.org/wiki/Frequentist

    Read the article

< Previous Page | 8 9 10 11 12 13 14 15 16 17 18 19  | Next Page >