Search Results

Search found 114 results on 5 pages for 'median'.

Page 2/5 | < Previous Page | 1 2 3 4 5  | Next Page >

  • Finding the median of the merged array of two sorted arrays in O(logN)?

    - by user176517
    Refering to the solution present at MIT handout I have tried to figure out the solution myself but have got stuck and I believe I need help to understand the following points. In the function header used in the solution MEDIAN -SEARCH (A[1 . . l], B[1 . . m], max(1,n/2 - m), min(l, n/2)) I do not understand the last two arguments why not simply 1, l why the max and min respectively. Thanking You.

    Read the article

  • k-d tree implementation [closed]

    - by user466441
    when i run my code and debugged,i got this error - this 0x00093584 {_Myproxy=0x00000000 _Mynextiter=0x00000000 } std::_Iterator_base12 * const - _Myproxy 0x00000000 {_Mycont=??? _Myfirstiter=??? } std::_Container_proxy * _Mycont CXX0017: Error: symbol "" not found _Myfirstiter CXX0030: Error: expression cannot be evaluated + _Mynextiter 0x00000000 {_Myproxy=??? _Mynextiter=??? } std::_Iterator_base12 * but i dont know what does it means,code is this #include<iostream> #include<vector> #include<algorithm> using namespace std; struct point { float x,y; }; vector<point>pointleft(4); vector<point>pointright(4); //we are going to implement two comparison function for x and y coordinates,we need it in calculation of median (we should sort vector //by x or y according to depth informaton,is depth even or odd. bool sortby_X(point &a,point &b) { return a.x<b.x; } bool sortby_Y(point &a,point &b) { return a.y<b.y; } //so i am going to implement to median finding algorithm,one for finding median by x and another find median by y point medianx(vector<point>points) { point temp; sort(points.begin(),points.end(),sortby_X); temp=points[(points.size()/2)]; return temp; } point mediany(vector<point>points) { point temp; sort(points.begin(),points.end(),sortby_Y); temp=points[(points.size()/2)]; return temp; } //now construct basic tree structure struct Tree { float x,y; Tree(point a) { x=a.x; y=a.y; } Tree *left; Tree *right; }; Tree * build_kd( Tree *root,vector<point>points,int depth) { point temp; if(points.size()==1)// that point is as a leaf { if(root==NULL) root=new Tree(points[0]); return root; } if(depth%2==0) { temp=medianx(points); root=new Tree(temp); for(int i=0;i<points.size();i++) { if (points[i].x<temp.x) pointleft[i]=points[i]; else pointright[i]=points[i]; } } else { temp=mediany(points); root=new Tree(temp); for(int i=0;i<points.size();i++) { if(points[i].y<temp.y) pointleft[i]=points[i]; else pointright[i]=points[i]; } } return build_kd(root->left,pointleft,depth+1); return build_kd(root->right,pointright,depth+1); } void print(Tree *root) { while(root!=NULL) { cout<<root->x<<" " <<root->y; print(root->left); print(root->right); } } int main() { int depth=0; Tree *root=NULL; vector<point>points(4); float x,y; int n=4; for(int i=0;i<n;i++) { cin>>x>>y; points[i].x=x; points[i].y=y; } root=build_kd(root,points,depth); print(root); return 0; } i am trying ti implement in c++ this pseudo code tuple function build_kd_tree(int depth, set points): if points contains only one point: return that point as a leaf. if depth is even: Calculate the median x-value. Create a set of points (pointsLeft) that have x-values less than the median. Create a set of points (pointsRight) that have x-values greater than or equal to the median. else: Calculate the median y-value. Create a set of points (pointsLeft) that have y-values less than the median. Create a set of points (pointsRight) that have y-values greater than or equal to the median. treeLeft = build_kd_tree(depth + 1, pointsLeft) treeRight = build_kd_tree(depth + 1, pointsRight) return(median, treeLeft, treeRight) please help me what this error means?

    Read the article

  • How do I get the median/mode/range of a column in SQL using Java?

    - by Derek
    I have to get the median, mode and range of test scores from one column in a table but I am unsure how to go about doing that. When you connect to the database using java, you are normally returned a ResultSet that you can make a table or something out of but how do you get particular numbers or digits? Is there an SQL command to get the median/mode/range or will I have to calculate this myself, and how do you pull out numbers from the table in order to be able to calculate the mode/median/range? Thanks.

    Read the article

  • Using a type parameter and a pointer to the same type parameter in a function template

    - by Darel
    Hello, I've written a template function to determine the median of any vector or array of any type that can be sorted with sort. The function and a small test program are below: #include <algorithm> #include <vector> #include <iostream> using namespace::std; template <class T, class X> void median(T vec, size_t size, X& ret) { sort(vec, vec + size); size_t mid = size/2; ret = size % 2 == 0 ? (vec[mid] + vec[mid-1]) / 2 : vec[mid]; } int main() { vector<double> v; v.push_back(2); v.push_back(8); v.push_back(7); v.push_back(4); v.push_back(9); double a[5] = {2, 8, 7, 4, 9}; double r; median(v.begin(), v.size(), r); cout << r << endl; median(a, 5, r); cout << r << endl; return 0; } As you can see, the median function takes a pointer as an argument, T vec. Also in the argument list is a reference variable X ret, which is modified by the function to store the computed median value. However I don't find this a very elegant solution. T vec will always be a pointer to the same type as X ret. My initial attempts to write median had a header like this: template<class T> T median(T *vec, size_t size) { sort(vec, vec + size); size_t mid = size/2; return size % 2 == 0 ? (vec[mid] + vec[mid-1]) / 2 : vec[mid]; } I also tried: template<class T, class X> X median(T vec, size_t size) { sort(vec, vec + size); size_t mid = size/2; return size % 2 == 0 ? (vec[mid] + vec[mid-1]) / 2 : vec[mid]; } I couldn't get either of these to work. My question is, can anyone show me a working implementation of either of my alternatives? Thanks for looking!

    Read the article

  • IN r, how to combine the summary together

    - by alex
    say i have 5 summary for 5 sets of data. how can i get those number out or combine the summary in to 1 rather than 5 V1 V2 V3 V4 Min. : 670.2 Min. : 682.3 Min. : 690.7 Min. : 637.6 1st Qu.: 739.9 1st Qu.: 737.2 1st Qu.: 707.7 1st Qu.: 690.7 Median : 838.6 Median : 798.6 Median : 748.3 Median : 748.3 Mean : 886.7 Mean : 871.0 Mean : 869.6 Mean : 865.4 3rd Qu.:1076.8 3rd Qu.:1027.6 3rd Qu.:1070.0 3rd Qu.: 960.8 Max. :1107.8 Max. :1109.3 Max. :1131.3 Max. :1289.6 V5 Min. : 637.6 1st Qu.: 690.7 Median : 748.3 Mean : 924.3 3rd Qu.: 960.8 Max. :1584.3 how can i have 1 table looks like v1 v2 v3 v4 v5 Min. : 1st Qu.: Median : Mean : 3rd Qu.: Max. : or how to save those number as vector so i can use matrix to generate a table

    Read the article

  • Using R to Analyze G1GC Log Files

    - by user12620111
    Using R to Analyze G1GC Log Files body, td { font-family: sans-serif; background-color: white; font-size: 12px; margin: 8px; } tt, code, pre { font-family: 'DejaVu Sans Mono', 'Droid Sans Mono', 'Lucida Console', Consolas, Monaco, monospace; } h1 { font-size:2.2em; } h2 { font-size:1.8em; } h3 { font-size:1.4em; } h4 { font-size:1.0em; } h5 { font-size:0.9em; } h6 { font-size:0.8em; } a:visited { color: rgb(50%, 0%, 50%); } pre { margin-top: 0; max-width: 95%; border: 1px solid #ccc; white-space: pre-wrap; } pre code { display: block; padding: 0.5em; } code.r, code.cpp { background-color: #F8F8F8; } table, td, th { border: none; } blockquote { color:#666666; margin:0; padding-left: 1em; border-left: 0.5em #EEE solid; } hr { height: 0px; border-bottom: none; border-top-width: thin; border-top-style: dotted; border-top-color: #999999; } @media print { * { background: transparent !important; color: black !important; filter:none !important; -ms-filter: none !important; } body { font-size:12pt; max-width:100%; } a, a:visited { text-decoration: underline; } hr { visibility: hidden; page-break-before: always; } pre, blockquote { padding-right: 1em; page-break-inside: avoid; } tr, img { page-break-inside: avoid; } img { max-width: 100% !important; } @page :left { margin: 15mm 20mm 15mm 10mm; } @page :right { margin: 15mm 10mm 15mm 20mm; } p, h2, h3 { orphans: 3; widows: 3; } h2, h3 { page-break-after: avoid; } } pre .operator, pre .paren { color: rgb(104, 118, 135) } pre .literal { color: rgb(88, 72, 246) } pre .number { color: rgb(0, 0, 205); } pre .comment { color: rgb(76, 136, 107); } pre .keyword { color: rgb(0, 0, 255); } pre .identifier { color: rgb(0, 0, 0); } pre .string { color: rgb(3, 106, 7); } var hljs=new function(){function m(p){return p.replace(/&/gm,"&").replace(/"}while(y.length||w.length){var v=u().splice(0,1)[0];z+=m(x.substr(q,v.offset-q));q=v.offset;if(v.event=="start"){z+=t(v.node);s.push(v.node)}else{if(v.event=="stop"){var p,r=s.length;do{r--;p=s[r];z+=("")}while(p!=v.node);s.splice(r,1);while(r'+M[0]+""}else{r+=M[0]}O=P.lR.lastIndex;M=P.lR.exec(L)}return r+L.substr(O,L.length-O)}function J(L,M){if(M.sL&&e[M.sL]){var r=d(M.sL,L);x+=r.keyword_count;return r.value}else{return F(L,M)}}function I(M,r){var L=M.cN?'':"";if(M.rB){y+=L;M.buffer=""}else{if(M.eB){y+=m(r)+L;M.buffer=""}else{y+=L;M.buffer=r}}D.push(M);A+=M.r}function G(N,M,Q){var R=D[D.length-1];if(Q){y+=J(R.buffer+N,R);return false}var P=q(M,R);if(P){y+=J(R.buffer+N,R);I(P,M);return P.rB}var L=v(D.length-1,M);if(L){var O=R.cN?"":"";if(R.rE){y+=J(R.buffer+N,R)+O}else{if(R.eE){y+=J(R.buffer+N,R)+O+m(M)}else{y+=J(R.buffer+N+M,R)+O}}while(L1){O=D[D.length-2].cN?"":"";y+=O;L--;D.length--}var r=D[D.length-1];D.length--;D[D.length-1].buffer="";if(r.starts){I(r.starts,"")}return R.rE}if(w(M,R)){throw"Illegal"}}var E=e[B];var D=[E.dM];var A=0;var x=0;var y="";try{var s,u=0;E.dM.buffer="";do{s=p(C,u);var t=G(s[0],s[1],s[2]);u+=s[0].length;if(!t){u+=s[1].length}}while(!s[2]);if(D.length1){throw"Illegal"}return{r:A,keyword_count:x,value:y}}catch(H){if(H=="Illegal"){return{r:0,keyword_count:0,value:m(C)}}else{throw H}}}function g(t){var p={keyword_count:0,r:0,value:m(t)};var r=p;for(var q in e){if(!e.hasOwnProperty(q)){continue}var s=d(q,t);s.language=q;if(s.keyword_count+s.rr.keyword_count+r.r){r=s}if(s.keyword_count+s.rp.keyword_count+p.r){r=p;p=s}}if(r.language){p.second_best=r}return p}function i(r,q,p){if(q){r=r.replace(/^((]+|\t)+)/gm,function(t,w,v,u){return w.replace(/\t/g,q)})}if(p){r=r.replace(/\n/g,"")}return r}function n(t,w,r){var x=h(t,r);var v=a(t);var y,s;if(v){y=d(v,x)}else{return}var q=c(t);if(q.length){s=document.createElement("pre");s.innerHTML=y.value;y.value=k(q,c(s),x)}y.value=i(y.value,w,r);var u=t.className;if(!u.match("(\\s|^)(language-)?"+v+"(\\s|$)")){u=u?(u+" "+v):v}if(/MSIE [678]/.test(navigator.userAgent)&&t.tagName=="CODE"&&t.parentNode.tagName=="PRE"){s=t.parentNode;var p=document.createElement("div");p.innerHTML=""+y.value+"";t=p.firstChild.firstChild;p.firstChild.cN=s.cN;s.parentNode.replaceChild(p.firstChild,s)}else{t.innerHTML=y.value}t.className=u;t.result={language:v,kw:y.keyword_count,re:y.r};if(y.second_best){t.second_best={language:y.second_best.language,kw:y.second_best.keyword_count,re:y.second_best.r}}}function o(){if(o.called){return}o.called=true;var r=document.getElementsByTagName("pre");for(var p=0;p|=||=||=|\\?|\\[|\\{|\\(|\\^|\\^=|\\||\\|=|\\|\\||~";this.ER="(?![\\s\\S])";this.BE={b:"\\\\.",r:0};this.ASM={cN:"string",b:"'",e:"'",i:"\\n",c:[this.BE],r:0};this.QSM={cN:"string",b:'"',e:'"',i:"\\n",c:[this.BE],r:0};this.CLCM={cN:"comment",b:"//",e:"$"};this.CBLCLM={cN:"comment",b:"/\\*",e:"\\*/"};this.HCM={cN:"comment",b:"#",e:"$"};this.NM={cN:"number",b:this.NR,r:0};this.CNM={cN:"number",b:this.CNR,r:0};this.BNM={cN:"number",b:this.BNR,r:0};this.inherit=function(r,s){var p={};for(var q in r){p[q]=r[q]}if(s){for(var q in s){p[q]=s[q]}}return p}}();hljs.LANGUAGES.cpp=function(){var a={keyword:{"false":1,"int":1,"float":1,"while":1,"private":1,"char":1,"catch":1,"export":1,virtual:1,operator:2,sizeof:2,dynamic_cast:2,typedef:2,const_cast:2,"const":1,struct:1,"for":1,static_cast:2,union:1,namespace:1,unsigned:1,"long":1,"throw":1,"volatile":2,"static":1,"protected":1,bool:1,template:1,mutable:1,"if":1,"public":1,friend:2,"do":1,"return":1,"goto":1,auto:1,"void":2,"enum":1,"else":1,"break":1,"new":1,extern:1,using:1,"true":1,"class":1,asm:1,"case":1,typeid:1,"short":1,reinterpret_cast:2,"default":1,"double":1,register:1,explicit:1,signed:1,typename:1,"try":1,"this":1,"switch":1,"continue":1,wchar_t:1,inline:1,"delete":1,alignof:1,char16_t:1,char32_t:1,constexpr:1,decltype:1,noexcept:1,nullptr:1,static_assert:1,thread_local:1,restrict:1,_Bool:1,complex:1},built_in:{std:1,string:1,cin:1,cout:1,cerr:1,clog:1,stringstream:1,istringstream:1,ostringstream:1,auto_ptr:1,deque:1,list:1,queue:1,stack:1,vector:1,map:1,set:1,bitset:1,multiset:1,multimap:1,unordered_set:1,unordered_map:1,unordered_multiset:1,unordered_multimap:1,array:1,shared_ptr:1}};return{dM:{k:a,i:"",k:a,r:10,c:["self"]}]}}}();hljs.LANGUAGES.r={dM:{c:[hljs.HCM,{cN:"number",b:"\\b0[xX][0-9a-fA-F]+[Li]?\\b",e:hljs.IMMEDIATE_RE,r:0},{cN:"number",b:"\\b\\d+(?:[eE][+\\-]?\\d*)?L\\b",e:hljs.IMMEDIATE_RE,r:0},{cN:"number",b:"\\b\\d+\\.(?!\\d)(?:i\\b)?",e:hljs.IMMEDIATE_RE,r:1},{cN:"number",b:"\\b\\d+(?:\\.\\d*)?(?:[eE][+\\-]?\\d*)?i?\\b",e:hljs.IMMEDIATE_RE,r:0},{cN:"number",b:"\\.\\d+(?:[eE][+\\-]?\\d*)?i?\\b",e:hljs.IMMEDIATE_RE,r:1},{cN:"keyword",b:"(?:tryCatch|library|setGeneric|setGroupGeneric)\\b",e:hljs.IMMEDIATE_RE,r:10},{cN:"keyword",b:"\\.\\.\\.",e:hljs.IMMEDIATE_RE,r:10},{cN:"keyword",b:"\\.\\.\\d+(?![\\w.])",e:hljs.IMMEDIATE_RE,r:10},{cN:"keyword",b:"\\b(?:function)",e:hljs.IMMEDIATE_RE,r:2},{cN:"keyword",b:"(?:if|in|break|next|repeat|else|for|return|switch|while|try|stop|warning|require|attach|detach|source|setMethod|setClass)\\b",e:hljs.IMMEDIATE_RE,r:1},{cN:"literal",b:"(?:NA|NA_integer_|NA_real_|NA_character_|NA_complex_)\\b",e:hljs.IMMEDIATE_RE,r:10},{cN:"literal",b:"(?:NULL|TRUE|FALSE|T|F|Inf|NaN)\\b",e:hljs.IMMEDIATE_RE,r:1},{cN:"identifier",b:"[a-zA-Z.][a-zA-Z0-9._]*\\b",e:hljs.IMMEDIATE_RE,r:0},{cN:"operator",b:"|=||   Using R to Analyze G1GC Log Files   Using R to Analyze G1GC Log Files Introduction Working in Oracle Platform Integration gives an engineer opportunities to work on a wide array of technologies. My team’s goal is to make Oracle applications run best on the Solaris/SPARC platform. When looking for bottlenecks in a modern applications, one needs to be aware of not only how the CPUs and operating system are executing, but also network, storage, and in some cases, the Java Virtual Machine. I was recently presented with about 1.5 GB of Java Garbage First Garbage Collector log file data. If you’re not familiar with the subject, you might want to review Garbage First Garbage Collector Tuning by Monica Beckwith. The customer had been running Java HotSpot 1.6.0_31 to host a web application server. I was told that the Solaris/SPARC server was running a Java process launched using a commmand line that included the following flags: -d64 -Xms9g -Xmx9g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=80 -XX:PermSize=256m -XX:MaxPermSize=256m -XX:+PrintGC -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -XX:+PrintGCDateStamps -XX:+PrintFlagsFinal -XX:+DisableExplicitGC -XX:+UnlockExperimentalVMOptions -XX:ParallelGCThreads=8 Several sources on the internet indicate that if I were to print out the 1.5 GB of log files, it would require enough paper to fill the bed of a pick up truck. Of course, it would be fruitless to try to scan the log files by hand. Tools will be required to summarize the contents of the log files. Others have encountered large Java garbage collection log files. There are existing tools to analyze the log files: IBM’s GC toolkit The chewiebug GCViewer gchisto HPjmeter Instead of using one of the other tools listed, I decide to parse the log files with standard Unix tools, and analyze the data with R. Data Cleansing The log files arrived in two different formats. I guess that the difference is that one set of log files was generated using a more verbose option, maybe -XX:+PrintHeapAtGC, and the other set of log files was generated without that option. Format 1 In some of the log files, the log files with the less verbose format, a single trace, i.e. the report of a singe garbage collection event, looks like this: {Heap before GC invocations=12280 (full 61): garbage-first heap total 9437184K, used 7499918K [0xfffffffd00000000, 0xffffffff40000000, 0xffffffff40000000) region size 4096K, 1 young (4096K), 0 survivors (0K) compacting perm gen total 262144K, used 144077K [0xffffffff40000000, 0xffffffff50000000, 0xffffffff50000000) the space 262144K, 54% used [0xffffffff40000000, 0xffffffff48cb3758, 0xffffffff48cb3800, 0xffffffff50000000) No shared spaces configured. 2014-05-14T07:24:00.988-0700: 60586.353: [GC pause (young) 7324M->7320M(9216M), 0.1567265 secs] Heap after GC invocations=12281 (full 61): garbage-first heap total 9437184K, used 7496533K [0xfffffffd00000000, 0xffffffff40000000, 0xffffffff40000000) region size 4096K, 0 young (0K), 0 survivors (0K) compacting perm gen total 262144K, used 144077K [0xffffffff40000000, 0xffffffff50000000, 0xffffffff50000000) the space 262144K, 54% used [0xffffffff40000000, 0xffffffff48cb3758, 0xffffffff48cb3800, 0xffffffff50000000) No shared spaces configured. } A simple grep can be used to extract a summary: $ grep "\[ GC pause (young" g1gc.log 2014-05-13T13:24:35.091-0700: 3.109: [GC pause (young) 20M->5029K(9216M), 0.0146328 secs] 2014-05-13T13:24:35.440-0700: 3.459: [GC pause (young) 9125K->6077K(9216M), 0.0086723 secs] 2014-05-13T13:24:37.581-0700: 5.599: [GC pause (young) 25M->8470K(9216M), 0.0203820 secs] 2014-05-13T13:24:42.686-0700: 10.704: [GC pause (young) 44M->15M(9216M), 0.0288848 secs] 2014-05-13T13:24:48.941-0700: 16.958: [GC pause (young) 51M->20M(9216M), 0.0491244 secs] 2014-05-13T13:24:56.049-0700: 24.066: [GC pause (young) 92M->26M(9216M), 0.0525368 secs] 2014-05-13T13:25:34.368-0700: 62.383: [GC pause (young) 602M->68M(9216M), 0.1721173 secs] But that format wasn't easily read into R, so I needed to be a bit more tricky. I used the following Unix command to create a summary file that was easy for R to read. $ echo "SecondsSinceLaunch BeforeSize AfterSize TotalSize RealTime" $ grep "\[GC pause (young" g1gc.log | grep -v mark | sed -e 's/[A-SU-z\(\),]/ /g' -e 's/->/ /' -e 's/: / /g' | more SecondsSinceLaunch BeforeSize AfterSize TotalSize RealTime 2014-05-13T13:24:35.091-0700 3.109 20 5029 9216 0.0146328 2014-05-13T13:24:35.440-0700 3.459 9125 6077 9216 0.0086723 2014-05-13T13:24:37.581-0700 5.599 25 8470 9216 0.0203820 2014-05-13T13:24:42.686-0700 10.704 44 15 9216 0.0288848 2014-05-13T13:24:48.941-0700 16.958 51 20 9216 0.0491244 2014-05-13T13:24:56.049-0700 24.066 92 26 9216 0.0525368 2014-05-13T13:25:34.368-0700 62.383 602 68 9216 0.1721173 Format 2 In some of the log files, the log files with the more verbose format, a single trace, i.e. the report of a singe garbage collection event, was more complicated than Format 1. Here is a text file with an example of a single G1GC trace in the second format. As you can see, it is quite complicated. It is nice that there is so much information available, but the level of detail can be overwhelming. I wrote this awk script (download) to summarize each trace on a single line. #!/usr/bin/env awk -f BEGIN { printf("SecondsSinceLaunch IncrementalCount FullCount UserTime SysTime RealTime BeforeSize AfterSize TotalSize\n") } ###################### # Save count data from lines that are at the start of each G1GC trace. # Each trace starts out like this: # {Heap before GC invocations=14 (full 0): # garbage-first heap total 9437184K, used 325496K [0xfffffffd00000000, 0xffffffff40000000, 0xffffffff40000000) ###################### /{Heap.*full/{ gsub ( "\\)" , "" ); nf=split($0,a,"="); split(a[2],b," "); getline; if ( match($0, "first") ) { G1GC=1; IncrementalCount=b[1]; FullCount=substr( b[3], 1, length(b[3])-1 ); } else { G1GC=0; } } ###################### # Pull out time stamps that are in lines with this format: # 2014-05-12T14:02:06.025-0700: 94.312: [GC pause (young), 0.08870154 secs] ###################### /GC pause/ { DateTime=$1; SecondsSinceLaunch=substr($2, 1, length($2)-1); } ###################### # Heap sizes are in lines that look like this: # [ 4842M->4838M(9216M)] ###################### /\[ .*]$/ { gsub ( "\\[" , "" ); gsub ( "\ \]" , "" ); gsub ( "->" , " " ); gsub ( "\\( " , " " ); gsub ( "\ \)" , " " ); split($0,a," "); if ( split(a[1],b,"M") > 1 ) {BeforeSize=b[1]*1024;} if ( split(a[1],b,"K") > 1 ) {BeforeSize=b[1];} if ( split(a[2],b,"M") > 1 ) {AfterSize=b[1]*1024;} if ( split(a[2],b,"K") > 1 ) {AfterSize=b[1];} if ( split(a[3],b,"M") > 1 ) {TotalSize=b[1]*1024;} if ( split(a[3],b,"K") > 1 ) {TotalSize=b[1];} } ###################### # Emit an output line when you find input that looks like this: # [Times: user=1.41 sys=0.08, real=0.24 secs] ###################### /\[Times/ { if (G1GC==1) { gsub ( "," , "" ); split($2,a,"="); UserTime=a[2]; split($3,a,"="); SysTime=a[2]; split($4,a,"="); RealTime=a[2]; print DateTime,SecondsSinceLaunch,IncrementalCount,FullCount,UserTime,SysTime,RealTime,BeforeSize,AfterSize,TotalSize; G1GC=0; } } The resulting summary is about 25X smaller that the original file, but still difficult for a human to digest. SecondsSinceLaunch IncrementalCount FullCount UserTime SysTime RealTime BeforeSize AfterSize TotalSize ... 2014-05-12T18:36:34.669-0700: 3985.744 561 0 0.57 0.06 0.16 1724416 1720320 9437184 2014-05-12T18:36:34.839-0700: 3985.914 562 0 0.51 0.06 0.19 1724416 1720320 9437184 2014-05-12T18:36:35.069-0700: 3986.144 563 0 0.60 0.04 0.27 1724416 1721344 9437184 2014-05-12T18:36:35.354-0700: 3986.429 564 0 0.33 0.04 0.09 1725440 1722368 9437184 2014-05-12T18:36:35.545-0700: 3986.620 565 0 0.58 0.04 0.17 1726464 1722368 9437184 2014-05-12T18:36:35.726-0700: 3986.801 566 0 0.43 0.05 0.12 1726464 1722368 9437184 2014-05-12T18:36:35.856-0700: 3986.930 567 0 0.30 0.04 0.07 1726464 1723392 9437184 2014-05-12T18:36:35.947-0700: 3987.023 568 0 0.61 0.04 0.26 1727488 1723392 9437184 2014-05-12T18:36:36.228-0700: 3987.302 569 0 0.46 0.04 0.16 1731584 1724416 9437184 Reading the Data into R Once the GC log data had been cleansed, either by processing the first format with the shell script, or by processing the second format with the awk script, it was easy to read the data into R. g1gc.df = read.csv("summary.txt", row.names = NULL, stringsAsFactors=FALSE,sep="") str(g1gc.df) ## 'data.frame': 8307 obs. of 10 variables: ## $ row.names : chr "2014-05-12T14:00:32.868-0700:" "2014-05-12T14:00:33.179-0700:" "2014-05-12T14:00:33.677-0700:" "2014-05-12T14:00:35.538-0700:" ... ## $ SecondsSinceLaunch: num 1.16 1.47 1.97 3.83 6.1 ... ## $ IncrementalCount : int 0 1 2 3 4 5 6 7 8 9 ... ## $ FullCount : int 0 0 0 0 0 0 0 0 0 0 ... ## $ UserTime : num 0.11 0.05 0.04 0.21 0.08 0.26 0.31 0.33 0.34 0.56 ... ## $ SysTime : num 0.04 0.01 0.01 0.05 0.01 0.06 0.07 0.06 0.07 0.09 ... ## $ RealTime : num 0.02 0.02 0.01 0.04 0.02 0.04 0.05 0.04 0.04 0.06 ... ## $ BeforeSize : int 8192 5496 5768 22528 24576 43008 34816 53248 55296 93184 ... ## $ AfterSize : int 1400 1672 2557 4907 7072 14336 16384 18432 19456 21504 ... ## $ TotalSize : int 9437184 9437184 9437184 9437184 9437184 9437184 9437184 9437184 9437184 9437184 ... head(g1gc.df) ## row.names SecondsSinceLaunch IncrementalCount ## 1 2014-05-12T14:00:32.868-0700: 1.161 0 ## 2 2014-05-12T14:00:33.179-0700: 1.472 1 ## 3 2014-05-12T14:00:33.677-0700: 1.969 2 ## 4 2014-05-12T14:00:35.538-0700: 3.830 3 ## 5 2014-05-12T14:00:37.811-0700: 6.103 4 ## 6 2014-05-12T14:00:41.428-0700: 9.720 5 ## FullCount UserTime SysTime RealTime BeforeSize AfterSize TotalSize ## 1 0 0.11 0.04 0.02 8192 1400 9437184 ## 2 0 0.05 0.01 0.02 5496 1672 9437184 ## 3 0 0.04 0.01 0.01 5768 2557 9437184 ## 4 0 0.21 0.05 0.04 22528 4907 9437184 ## 5 0 0.08 0.01 0.02 24576 7072 9437184 ## 6 0 0.26 0.06 0.04 43008 14336 9437184 Basic Statistics Once the data has been read into R, simple statistics are very easy to generate. All of the numbers from high school statistics are available via simple commands. For example, generate a summary of every column: summary(g1gc.df) ## row.names SecondsSinceLaunch IncrementalCount FullCount ## Length:8307 Min. : 1 Min. : 0 Min. : 0.0 ## Class :character 1st Qu.: 9977 1st Qu.:2048 1st Qu.: 0.0 ## Mode :character Median :12855 Median :4136 Median : 12.0 ## Mean :12527 Mean :4156 Mean : 31.6 ## 3rd Qu.:15758 3rd Qu.:6262 3rd Qu.: 61.0 ## Max. :55484 Max. :8391 Max. :113.0 ## UserTime SysTime RealTime BeforeSize ## Min. :0.040 Min. :0.0000 Min. : 0.0 Min. : 5476 ## 1st Qu.:0.470 1st Qu.:0.0300 1st Qu.: 0.1 1st Qu.:5137920 ## Median :0.620 Median :0.0300 Median : 0.1 Median :6574080 ## Mean :0.751 Mean :0.0355 Mean : 0.3 Mean :5841855 ## 3rd Qu.:0.920 3rd Qu.:0.0400 3rd Qu.: 0.2 3rd Qu.:7084032 ## Max. :3.370 Max. :1.5600 Max. :488.1 Max. :8696832 ## AfterSize TotalSize ## Min. : 1380 Min. :9437184 ## 1st Qu.:5002752 1st Qu.:9437184 ## Median :6559744 Median :9437184 ## Mean :5785454 Mean :9437184 ## 3rd Qu.:7054336 3rd Qu.:9437184 ## Max. :8482816 Max. :9437184 Q: What is the total amount of User CPU time spent in garbage collection? sum(g1gc.df$UserTime) ## [1] 6236 As you can see, less than two hours of CPU time was spent in garbage collection. Is that too much? To find the percentage of time spent in garbage collection, divide the number above by total_elapsed_time*CPU_count. In this case, there are a lot of CPU’s and it turns out the the overall amount of CPU time spent in garbage collection isn’t a problem when viewed in isolation. When calculating rates, i.e. events per unit time, you need to ask yourself if the rate is homogenous across the time period in the log file. Does the log file include spikes of high activity that should be separately analyzed? Averaging in data from nights and weekends with data from business hours may alias problems. If you have a reason to suspect that the garbage collection rates include peaks and valleys that need independent analysis, see the “Time Series” section, below. Q: How much garbage is collected on each pass? The amount of heap space that is recovered per GC pass is surprisingly low: At least one collection didn’t recover any data. (“Min.=0”) 25% of the passes recovered 3MB or less. (“1st Qu.=3072”) Half of the GC passes recovered 4MB or less. (“Median=4096”) The average amount recovered was 56MB. (“Mean=56390”) 75% of the passes recovered 36MB or less. (“3rd Qu.=36860”) At least one pass recovered 2GB. (“Max.=2121000”) g1gc.df$Delta = g1gc.df$BeforeSize - g1gc.df$AfterSize summary(g1gc.df$Delta) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0 3070 4100 56400 36900 2120000 Q: What is the maximum User CPU time for a single collection? The worst garbage collection (“Max.”) is many standard deviations away from the mean. The data appears to be right skewed. summary(g1gc.df$UserTime) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.040 0.470 0.620 0.751 0.920 3.370 sd(g1gc.df$UserTime) ## [1] 0.3966 Basic Graphics Once the data is in R, it is trivial to plot the data with formats including dot plots, line charts, bar charts (simple, stacked, grouped), pie charts, boxplots, scatter plots histograms, and kernel density plots. Histogram of User CPU Time per Collection I don't think that this graph requires any explanation. hist(g1gc.df$UserTime, main="User CPU Time per Collection", xlab="Seconds", ylab="Frequency") Box plot to identify outliers When the initial data is viewed with a box plot, you can see the one crazy outlier in the real time per GC. Save this data point for future analysis and drop the outlier so that it’s not throwing off our statistics. Now the box plot shows many outliers, which will be examined later, using times series analysis. Notice that the scale of the x-axis changes drastically once the crazy outlier is removed. par(mfrow=c(2,1)) boxplot(g1gc.df$UserTime,g1gc.df$SysTime,g1gc.df$RealTime, main="Box Plot of Time per GC\n(dominated by a crazy outlier)", names=c("usr","sys","elapsed"), xlab="Seconds per GC", ylab="Time (Seconds)", horizontal = TRUE, outcol="red") crazy.outlier.df=g1gc.df[g1gc.df$RealTime > 400,] g1gc.df=g1gc.df[g1gc.df$RealTime < 400,] boxplot(g1gc.df$UserTime,g1gc.df$SysTime,g1gc.df$RealTime, main="Box Plot of Time per GC\n(crazy outlier excluded)", names=c("usr","sys","elapsed"), xlab="Seconds per GC", ylab="Time (Seconds)", horizontal = TRUE, outcol="red") box(which = "outer", lty = "solid") Here is the crazy outlier for future analysis: crazy.outlier.df ## row.names SecondsSinceLaunch IncrementalCount ## 8233 2014-05-12T23:15:43.903-0700: 20741 8316 ## FullCount UserTime SysTime RealTime BeforeSize AfterSize TotalSize ## 8233 112 0.55 0.42 488.1 8381440 8235008 9437184 ## Delta ## 8233 146432 R Time Series Data To analyze the garbage collection as a time series, I’ll use Z’s Ordered Observations (zoo). “zoo is the creator for an S3 class of indexed totally ordered observations which includes irregular time series.” require(zoo) ## Loading required package: zoo ## ## Attaching package: 'zoo' ## ## The following objects are masked from 'package:base': ## ## as.Date, as.Date.numeric head(g1gc.df[,1]) ## [1] "2014-05-12T14:00:32.868-0700:" "2014-05-12T14:00:33.179-0700:" ## [3] "2014-05-12T14:00:33.677-0700:" "2014-05-12T14:00:35.538-0700:" ## [5] "2014-05-12T14:00:37.811-0700:" "2014-05-12T14:00:41.428-0700:" options("digits.secs"=3) times=as.POSIXct( g1gc.df[,1], format="%Y-%m-%dT%H:%M:%OS%z:") g1gc.z = zoo(g1gc.df[,-c(1)], order.by=times) head(g1gc.z) ## SecondsSinceLaunch IncrementalCount FullCount ## 2014-05-12 17:00:32.868 1.161 0 0 ## 2014-05-12 17:00:33.178 1.472 1 0 ## 2014-05-12 17:00:33.677 1.969 2 0 ## 2014-05-12 17:00:35.538 3.830 3 0 ## 2014-05-12 17:00:37.811 6.103 4 0 ## 2014-05-12 17:00:41.427 9.720 5 0 ## UserTime SysTime RealTime BeforeSize AfterSize ## 2014-05-12 17:00:32.868 0.11 0.04 0.02 8192 1400 ## 2014-05-12 17:00:33.178 0.05 0.01 0.02 5496 1672 ## 2014-05-12 17:00:33.677 0.04 0.01 0.01 5768 2557 ## 2014-05-12 17:00:35.538 0.21 0.05 0.04 22528 4907 ## 2014-05-12 17:00:37.811 0.08 0.01 0.02 24576 7072 ## 2014-05-12 17:00:41.427 0.26 0.06 0.04 43008 14336 ## TotalSize Delta ## 2014-05-12 17:00:32.868 9437184 6792 ## 2014-05-12 17:00:33.178 9437184 3824 ## 2014-05-12 17:00:33.677 9437184 3211 ## 2014-05-12 17:00:35.538 9437184 17621 ## 2014-05-12 17:00:37.811 9437184 17504 ## 2014-05-12 17:00:41.427 9437184 28672 Example of Two Benchmark Runs in One Log File The data in the following graph is from a different log file, not the one of primary interest to this article. I’m including this image because it is an example of idle periods followed by busy periods. It would be uninteresting to average the rate of garbage collection over the entire log file period. More interesting would be the rate of garbage collect in the two busy periods. Are they the same or different? Your production data may be similar, for example, bursts when employees return from lunch and idle times on weekend evenings, etc. Once the data is in an R Time Series, you can analyze isolated time windows. Clipping the Time Series data Flashing back to our test case… Viewing the data as a time series is interesting. You can see that the work intensive time period is between 9:00 PM and 3:00 AM. Lets clip the data to the interesting period:     par(mfrow=c(2,1)) plot(g1gc.z$UserTime, type="h", main="User Time per GC\nTime: Complete Log File", xlab="Time of Day", ylab="CPU Seconds per GC", col="#1b9e77") clipped.g1gc.z=window(g1gc.z, start=as.POSIXct("2014-05-12 21:00:00"), end=as.POSIXct("2014-05-13 03:00:00")) plot(clipped.g1gc.z$UserTime, type="h", main="User Time per GC\nTime: Limited to Benchmark Execution", xlab="Time of Day", ylab="CPU Seconds per GC", col="#1b9e77") box(which = "outer", lty = "solid") Cumulative Incremental and Full GC count Here is the cumulative incremental and full GC count. When the line is very steep, it indicates that the GCs are repeating very quickly. Notice that the scale on the Y axis is different for full vs. incremental. plot(clipped.g1gc.z[,c(2:3)], main="Cumulative Incremental and Full GC count", xlab="Time of Day", col="#1b9e77") GC Analysis of Benchmark Execution using Time Series data In the following series of 3 graphs: The “After Size” show the amount of heap space in use after each garbage collection. Many Java objects are still referenced, i.e. alive, during each garbage collection. This may indicate that the application has a memory leak, or may indicate that the application has a very large memory footprint. Typically, an application's memory footprint plateau's in the early stage of execution. One would expect this graph to have a flat top. The steep decline in the heap space may indicate that the application crashed after 2:00. The second graph shows that the outliers in real execution time, discussed above, occur near 2:00. when the Java heap seems to be quite full. The third graph shows that Full GCs are infrequent during the first few hours of execution. The rate of Full GC's, (the slope of the cummulative Full GC line), changes near midnight.   plot(clipped.g1gc.z[,c("AfterSize","RealTime","FullCount")], xlab="Time of Day", col=c("#1b9e77","red","#1b9e77")) GC Analysis of heap recovered Each GC trace includes the amount of heap space in use before and after the individual GC event. During garbage coolection, unreferenced objects are identified, the space holding the unreferenced objects is freed, and thus, the difference in before and after usage indicates how much space has been freed. The following box plot and bar chart both demonstrate the same point - the amount of heap space freed per garbage colloection is surprisingly low. par(mfrow=c(2,1)) boxplot(as.vector(clipped.g1gc.z$Delta), main="Amount of Heap Recovered per GC Pass", xlab="Size in KB", horizontal = TRUE, col="red") hist(as.vector(clipped.g1gc.z$Delta), main="Amount of Heap Recovered per GC Pass", xlab="Size in KB", breaks=100, col="red") box(which = "outer", lty = "solid") This graph is the most interesting. The dark blue area shows how much heap is occupied by referenced Java objects. This represents memory that holds live data. The red fringe at the top shows how much data was recovered after each garbage collection. barplot(clipped.g1gc.z[,c("AfterSize","Delta")], col=c("#7570b3","#e7298a"), xlab="Time of Day", border=NA) legend("topleft", c("Live Objects","Heap Recovered on GC"), fill=c("#7570b3","#e7298a")) box(which = "outer", lty = "solid") When I discuss the data in the log files with the customer, I will ask for an explaination for the large amount of referenced data resident in the Java heap. There are two are posibilities: There is a memory leak and the amount of space required to hold referenced objects will continue to grow, limited only by the maximum heap size. After the maximum heap size is reached, the JVM will throw an “Out of Memory” exception every time that the application tries to allocate a new object. If this is the case, the aplication needs to be debugged to identify why old objects are referenced when they are no longer needed. The application has a legitimate requirement to keep a large amount of data in memory. The customer may want to further increase the maximum heap size. Another possible solution would be to partition the application across multiple cluster nodes, where each node has responsibility for managing a unique subset of the data. Conclusion In conclusion, R is a very powerful tool for the analysis of Java garbage collection log files. The primary difficulty is data cleansing so that information can be read into an R data frame. Once the data has been read into R, a rich set of tools may be used for thorough evaluation.

    Read the article

  • Django QuerySet filter + order_by + limit

    - by handsofaten
    So I have a Django app that processes test results, and I'm trying to find the median score for a certain assessment. I would think that this would work: e = Exam.objects.all() total = e.count() median = int(round(total / 2)) median_exam = Exam.objects.filter(assessment=assessment.id).order_by('score')[median:1] median_score = median_exam.score But it always returns an empty list. I can get the result I want with this: e = Exam.objects.all() total = e.count() median = int(round(total / 2)) exams = Exam.objects.filter(assessment=assessment.id).order_by('score') median_score = median_exam[median].score I would just prefer not to have to query the entire set of exams. I thought about just writing a raw MySQL query that looks something like: SELECT score FROM assess_exam WHERE assessment_id = 5 ORDER BY score LIMIT 690,1 But if possible, I'd like to stay within Django's ORM. Mostly, it's just bothering me that I can't seem to use order_by with a filter and a limit. Any ideas?

    Read the article

  • Java xpath selection

    - by Travis
    I'm having a little trouble getting values out of an XML document. The document looks like this: <marketstat> <type id="35"> <sell> <median>6.00</median> </sell> </type> <type id="34"> <sell> <median>2.77</median> </sell> </type> </marketstat> I need to get the median where type = x. I've always had trouble figuring out xpath with Java and I can never find any good tutorials or references for this. If anyone could help me figure this out that would be great.

    Read the article

  • how do I stop excel from adding double quotes to my formula

    - by Alex
    this works: {=MEDIAN((Table1[MonthFinish]=201012)*(Table1[Days]))} but if I put 201012 into cell A3, this doesn't done work: {=MEDIAN((Table1[MonthFinish]=A3)*(Table1[Days]))} when i do Evaluate Formula on the 2nd one...I see that there are double quotes about the 201012 that was pulled from A3...like so: {=MEDIAN((Table1[MonthFinish]="201012")*(Table1[Days]))} and as such, all the 201012s pulled from the MonthFinsh row come back as FALSE when compared to "201012" (ie, 201012="201012" ) where as they come back as TRUE when I hard code 201012 as it shows up as 201012=201012. how do i get even to not put those quotes around the number?

    Read the article

  • min() and max() give error: TypeError: 'float' object is not iterable

    - by PythonUser3.3
    markList=[] Lmark=0 Hmark=0 while True: mark=float(input("Enter your marks here(Click -1 to exit)")) if mark == -1: break markList.append(mark) markList.sort() mid = len(markList)//2 if len(markList)%2==0: median=(markList[mid]+ markList[mid-1])/2 print("Median:", median) else: print("Median:" , markList[mid]) Lmark==(min(mark)) print("The lowest mark is", Lmark) Hmark==(max(mark)) print("The highest mark is", Hmark) My program is a basic grade calculator using lists. My program asks the user to input their grades into a list in which it then calculates your average and finds your lowest and highest mark. I have found the average but I can't seem to figure out how to find the lowest and highest grade. Can you please show me pr tell me what to do?

    Read the article

  • I'm looking for a way to evaluate reading rate in several languages

    - by i30817
    I have a software that is page oriented instead of scrollbar oriented so i can easily count the words, but i'd like a way to filter outliers and some default value for the text language (that is known). The goal is from the remaining text to calculate the remaining time. I'm not sure what is the best unit to use. WPM (words per minute) from here seems very fuzzy and human oriented. Besides i don't know how many "words" remain in the text. http://www.sfsu.edu/~testing/CalReadRate.htm So i came up with this: The user is reading the text. The total text size in characters is known. His position in the text is known. So the remaining characters to read is also known. If a language has a median word length of say 5 chars, then if i had a WPM speed for the user, i could calculate the remaining time. 3 things are needed for this: 1) A table of the median word length of the language. 2) A table of the median WPM of a median user per language. 3) Update the WPM to fit the user as data becomes available, filtering outliers. However i can't find these tables. And i'm not sure how precise it is assuming median word length.

    Read the article

  • Turn-around time in PHP

    - by user73409
    Is there any one who had tried to build/convert a php version of the Excel method in computing Turn-around time(excluding holidays, weekends and non-business hours)? Excel Turn-around Time Computation: =(NETWORKDAYS(A2,B2,H$1:H$10)-1)*("17:00"-"8:00")+IF(NETWORKDAYS(B2,B2,H$1:H$10),MEDIAN(MOD(B2,1),"17:00","8:00"),"17:00")-MEDIAN(NETWORKDAYS(A2,A2,H$1:H$10)*MOD(A2,1),"17:00","8:00") :REF-URL[http://www.mrexcel.com/forum/excel-questions/514097-i-need-formual-calculate-turn-around-time.html] Thanks.

    Read the article

  • Specific issue on data pump API in oracle

    - by Median Hilal
    I have a client/server architecture. Using an Oracle dbms on the database server side. I need to perform a user-triggered (from client side) backup of the database, where the best way to perform that is using a stored procedure on the server side which the client may call, as the client has no oracle tools to perform the backup. I've searched thorough inside available solutions and have found that using a stored procedure is the best way. Well, then I found that using oracle data pump API is the best way to use inside a PL/SQl stored procedure. My specific questions about the API are... I would like to ask about two issues ... ---- The first ----- the detach function to detach the handler, is it necessary to be used at the end of the procedure? and what if I don't use it? I read the Oracle documentation but I didn't get their point, they say it doesn't terminate the job but indicates that the user is not interested in it, an when I use detach at the end of my procedure the exported .dmp file disappears. ---- The second ----- to perform a user (client side) triggered back up as the modification are only to the data, I used TABLE parameter for the export operation. But the version parameter... what should it be? I also read the documentation but couldn't determine what I need (LATEST or COMPATIBLE) ? Thanks

    Read the article

  • SQL SERVER – Introduction to PERCENTILE_CONT() – Analytic Functions Introduced in SQL Server 2012

    - by pinaldave
    SQL Server 2012 introduces new analytical function PERCENTILE_CONT(). The book online gives following definition of this function: Computes a specific percentile for sorted values in an entire rowset or within distinct partitions of a rowset in Microsoft SQL Server 2012 Release Candidate 0 (RC 0). For a given percentile value P, PERCENTILE_DISC sorts the values of the expression in the ORDER BY clause and returns the value with the smallest CUME_DIST value (with respect to the same sort specification) that is greater than or equal to P. If you are clear with understanding of the function – no need to read further. If you got lost here is the same in simple words – it is lot like finding median with percentile value. Now let’s have fun following query: USE AdventureWorks GO SELECT SalesOrderID, OrderQty, ProductID, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY ProductID) OVER (PARTITION BY SalesOrderID) AS MedianCont FROM Sales.SalesOrderDetail WHERE SalesOrderID IN (43670, 43669, 43667, 43663) ORDER BY SalesOrderID DESC GO The above query will give us the following result: You can see that I have used PERCENTILE_COUNT(0.5) in query, which is similar to finding median. Let me explain above diagram with little more explanation. The defination of median is as following: In case of Even Number of elements = In ordered list add the two digits from the middle and devide by 2 In case of Odd Numbers of elements = In ordered list select the digits from the middle I hope this example gives clear idea how PERCENTILE_CONT() works. Reference: Pinal Dave (http://blog.SQLAuthority.com) Filed under: Pinal Dave, PostADay, SQL, SQL Authority, SQL Function, SQL Query, SQL Scripts, SQL Server, SQL Tips and Tricks, T SQL, Technology

    Read the article

  • How to customize notches in ggplot boxplot

    - by cjy8709
    I had a question on how to change/customize the upper and lower limit of a notch on a boxplot created by ggplot2. I looked through the function stat_boxplot and found that ggplot calculates the notch limits with the equation median +/- 1.58 * iqr / sqrt(n). However instead of that equation I wanted to change it with my own set of upper and lower notch limits. My data has 4 factors and for each factor I calculated the median and did a bootstrap to get a 95% confidence interval of that median. Thus in the end I would like to change every boxplot to have its own unique notch upper and lower limit. I'm not sure if this is even possible in ggplot and was wondering if people have an idea on how to do this? Thanks again!

    Read the article

  • SQL SERVER – Puzzle to Win Print Book – Explain Value of PERCENTILE_CONT() Using Simple Example

    - by pinaldave
    From last several days I am working on various Denali Analytical functions and it is indeed really fun to refresh the concept which I studied in the school. Earlier I wrote article where I explained how we can use PERCENTILE_CONT() to find median over here SQL SERVER – Introduction to PERCENTILE_CONT() – Analytic Functions Introduced in SQL Server 2012. Today I am going to ask question based on the same blog post. Again just like last time the intention of this puzzle is as following: Learn new concept of SQL Server 2012 Learn new concept of SQL Server 2012 even if you are on earlier version of SQL Server. On another note, SQL Server 2012 RC0 has been announced and available to download SQL SERVER – 2012 RC0 Various Resources and Downloads. Now let’s have fun following query: USE AdventureWorks GO SELECT SalesOrderID, OrderQty, ProductID, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY ProductID) OVER (PARTITION BY SalesOrderID) AS MedianCont FROM Sales.SalesOrderDetail WHERE SalesOrderID IN (43670, 43669, 43667, 43663) ORDER BY SalesOrderID DESC GO The above query will give us the following result: The reason we get median is because we are passing value .05 to PERCENTILE_COUNT() function. Now run read the puzzle. Puzzle: Run following T-SQL code: USE AdventureWorks GO SELECT SalesOrderID, OrderQty, ProductID, PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY ProductID) OVER (PARTITION BY SalesOrderID) AS MedianCont FROM Sales.SalesOrderDetail WHERE SalesOrderID IN (43670, 43669, 43667, 43663) ORDER BY SalesOrderID DESC GO Observe the result and you will notice that MidianCont has different value than before, the reason is PERCENTILE_CONT function has 0.9 value passed. For first four value the value is 775.1. Now run following T-SQL code: USE AdventureWorks GO SELECT SalesOrderID, OrderQty, ProductID, PERCENTILE_CONT(0.1) WITHIN GROUP (ORDER BY ProductID) OVER (PARTITION BY SalesOrderID) AS MedianCont FROM Sales.SalesOrderDetail WHERE SalesOrderID IN (43670, 43669, 43667, 43663) ORDER BY SalesOrderID DESC GO Observe the result and you will notice that MidianCont has different value than before, the reason is PERCENTILE_CONT function has 0.1 value passed. For first four value the value is 709.3. Now in my example I have explained how the median is found using this function. You have to explain using mathematics and explain (in easy words) why the value in last columns are 709.3 and 775.1 Hint: SQL SERVER – Introduction to PERCENTILE_CONT() – Analytic Functions Introduced in SQL Server 2012 Rules Leave a comment with your detailed answer by Nov 25's blog post. Open world-wide (where Amazon ships books) If you blog about puzzle’s solution and if you win, you win additional surprise gift as well. Prizes Print copy of my new book SQL Server Interview Questions Amazon|Flipkart If you already have this book, you can opt for any of my other books SQL Wait Stats [Amazon|Flipkart|Kindle] and SQL Programming [Amazon|Flipkart|Kindle]. Reference: Pinal Dave (http://blog.SQLAuthority.com) Filed under: Pinal Dave, PostADay, SQL, SQL Authority, SQL Function, SQL Puzzle, SQL Query, SQL Scripts, SQL Server, SQL Tips and Tricks, T SQL, Technology

    Read the article

  • PMDB Block Size Choice

    - by Brian Diehl
    Choosing a block size for the P6 PMDB database is not a difficult task. In fact, taking the default of 8k is going to be just fine. Block size is one of those things that is always hotly debated. Everyone has their personal preference and can sight plenty of good reasons for their choice. To add to the confusion, Oracle supports multiple block sizes withing the same instance. So how to decide and what is the justification? Like most OLTP systems, Oracle Primavera P6 has a wide variety of data. A typical table's average row size may be less than 50 bytes or upwards of 500 bytes. There are also several tables with BLOB types but the LOB data tends not to be very large. It is likely that no single block size would be perfect for every table. So how to choose? My preference is for the 8k (8192 bytes) block size. It is a good compromise that is not too small for the wider rows, yet not to big for the thin rows. It is also important to remember that database blocks are the smallest unit of change and caching. I prefer to have more, individual "working units" in my database. For an instance with 4gb of buffer cache, an 8k block will provide 524,288 blocks of cache. The following SQL*Plus script returns the average, median, min, and max rows per block. column "AVG(CNT)" format 999.99 set verify off select avg(cnt), median(cnt), min(cnt), max(cnt), count(*) from ( select dbms_rowid.ROWID_RELATIVE_FNO(rowid) , dbms_rowid.ROWID_BLOCK_NUMBER(rowid) , count(*) cnt from &tab group by dbms_rowid.ROWID_RELATIVE_FNO(rowid) , dbms_rowid.ROWID_BLOCK_NUMBER(rowid) ) Running this for the TASK table, I get this result on a database with an 8k block size. Each activity, on average, has about 19 rows per block. Enter value for tab: task AVG(CNT) MEDIAN(CNT) MIN(CNT) MAX(CNT) COUNT(*) -------- ----------- ---------- ---------- ---------- 18.72 19 3 28 415917 I recommend an 8k block size for the P6 transactional database. All of our internal performance and scalability test are done with this block size. This does not mean that other block sizes will not work. Instead, like many other parameters, this is the safest choice.

    Read the article

  • Performance: Nginx SSL slowness or just SSL slowness in general?

    - by Mauvis Ledford
    I have an Amazon Web Services setup with an Apache instance behind Nginx with Nginx handling SSL and serving everything but the .php pages. In my ApacheBench tests I'm seeing this for my most expensive API call (which cache via Memcached): 100 concurrent calls to API call (http): 115ms (median) 260ms (max) 100 concurrent calls to API call (https): 6.1s (median) 11.9s (max) I've done a bit of research, disabled the most expensive SSL ciphers and enabled SSL caching (I know it doesn't help in this particular test.) Can you tell me why my SSL is taking so long? I've set up a massive EC2 server with 8CPUs and even applying consistent load to it only brings it up to 50% total CPU. I have 8 Nginx workers set and a bunch of Apache. Currently this whole setup is on one EC2 box but I plan to split it up and load balance it. There have been a few questions on this topic but none of those answers (disable expensive ciphers, cache ssl, seem to do anything.) Sample results below: $ ab -k -n 100 -c 100 https://URL This is ApacheBench, Version 2.3 <$Revision: 655654 $> Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/ Licensed to The Apache Software Foundation, http://www.apache.org/ Benchmarking URL.com (be patient).....done Server Software: nginx/1.0.15 Server Hostname: URL.com Server Port: 443 SSL/TLS Protocol: TLSv1/SSLv3,AES256-SHA,2048,256 Document Path: /PATH Document Length: 73142 bytes Concurrency Level: 100 Time taken for tests: 12.204 seconds Complete requests: 100 Failed requests: 0 Write errors: 0 Keep-Alive requests: 0 Total transferred: 7351097 bytes HTML transferred: 7314200 bytes Requests per second: 8.19 [#/sec] (mean) Time per request: 12203.589 [ms] (mean) Time per request: 122.036 [ms] (mean, across all concurrent requests) Transfer rate: 588.25 [Kbytes/sec] received Connection Times (ms) min mean[+/-sd] median max Connect: 65 168 64.1 162 268 Processing: 385 6096 3438.6 6199 11928 Waiting: 379 6091 3438.5 6194 11923 Total: 449 6264 3476.4 6323 12196 Percentage of the requests served within a certain time (ms) 50% 6323 66% 8244 75% 9321 80% 9919 90% 11119 95% 11720 98% 12076 99% 12196 100% 12196 (longest request)

    Read the article

  • stats::reorder vs Hmisc::reorder

    - by learnr
    I am trying to get around the strange overlap of stats::reorder vs Hmisc::reorder. Without Hmisc loaded I get the result I want, i.e. an unordered factor: > with(InsectSprays, reorder(spray, count, median)) [1] A A A A A A A A A A A A B B B B B B B B B B B B C C C C C C C C C C C C D D [39] D D D D D D D D D D E E E E E E E E E E E E F F F F F F F F F F F F attr(,"scores") A B C D E F 14.0 16.5 1.5 5.0 3.0 15.0 Levels: C E D A F B Now after loading Hmisc the result is an ordered factor: > library(Hmisc) Loading required package: survival Loading required package: splines Attaching package: 'Hmisc' The following object(s) are masked from 'package:survival': untangle.specials The following object(s) are masked from 'package:base': format.pval, round.POSIXt, trunc.POSIXt, units > with(InsectSprays, reorder(spray, count, median)) [1] A A A A A A A A A A A A B B B B B B B B B B B B C C C C C C C C C C C C D D [39] D D D D D D D D D D E E E E E E E E E E E E F F F F F F F F F F F F Levels: C < E < D < A < F < B In calling stats::reorder directly, I now for some reason get an ordered factor. > with(InsectSprays, stats::reorder(spray, count, median)) [1] A A A A A A A A A A A A B B B B B B B B B B B B C C C C C C C C C C C C D D [39] D D D D D D D D D D E E E E E E E E E E E E F F F F F F F F F F F F Levels: C < E < D < A < F < B Specifying, that I would need an unordered factor results in an error suggesting that stats::reorder is not used? > with(InsectSprays, stats::reorder(spray, count, median, order = FALSE)) Error in FUN(X[[1L]], ...) : unused argument(s) (order = FALSE) So the question really is how do I get an unordered factor with Hmisc loaded?

    Read the article

  • in R, question about generate table

    - by alex
    a = matrix(1:25,5,5) B = capture.output(for (X in 1:5){ A = c(min(a[,X]),quantile(a[,X],0.25),median(a[,X]),quantile(a[,X],0.75),max(a[,X]),mean(a[,X]),sd(a[,X])/m^(1/2),var(a[,X])) cat(A,"\n") }) matrix(B,8,5) what i was trying to do is to generate a table which each column has those element in A and in that order. i try to use the matrix, but seems like it dont reli work here...can anyone help 1 2 3 4 5 min 1st quartile median SEM VAR THIS IS WHAT I WANT THE TABLE LOOKS LIKE ..

    Read the article

  • Stata Nearest neighbor of percentile

    - by Kyle Billings
    This has probably already been answered, but I must just be searching for the wrong terms. Suppose I am using the built in Stata data set auto: sysuse auto, clear and say for example I am working with 1 independent and 1 dependent variable and I want to essentially compress down to the IQR elements, min, p(25), median, p(75), max... so I use command, keep weight mpg sum weight, detail return list local min=r(min) local lqr=r(p25) local med = r(p50) local uqr = r(p75) local max = r(max) keep if weight==`min' | weight==`max' | weight==`med' | weight==`lqr' | weight==`uqr' Hence, I want to compress the data set down to only those 5 observations, and for example in this situation the median is not actually an element of the weight vector. there is an observation above and an observation below (due to the definition of median this is no surprise). is there a way that I can tell stata to look for the nearest neighbor above the percentile. ie. if r(p50) is not an element of weight then search above that value for the next observation? The end result is I am trying to get the data down to 2 vectors, say weight and mpg such that for each of the 5 elements of weight in the IQR have their matching response in mpg. Any thoughts?

    Read the article

  • SPARC T4-4 Delivers World Record Performance on Oracle OLAP Perf Version 2 Benchmark

    - by Brian
    Oracle's SPARC T4-4 server delivered world record performance with subsecond response time on the Oracle OLAP Perf Version 2 benchmark using Oracle Database 11g Release 2 running on Oracle Solaris 11. The SPARC T4-4 server achieved throughput of 430,000 cube-queries/hour with an average response time of 0.85 seconds and the median response time of 0.43 seconds. This was achieved by using only 60% of the available CPU resources leaving plenty of headroom for future growth. The SPARC T4-4 server operated on an Oracle OLAP cube with a 4 billion row fact table of sales data containing 4 dimensions. This represents as many as 90 quintillion aggregate rows (90 followed by 18 zeros). Performance Landscape Oracle OLAP Perf Version 2 Benchmark 4 Billion Fact Table Rows System Queries/hour Users* Response Time (sec) Average Median SPARC T4-4 430,000 7,300 0.85 0.43 * Users - the supported number of users with a given think time of 60 seconds Configuration Summary and Results Hardware Configuration: SPARC T4-4 server with 4 x SPARC T4 processors, 3.0 GHz 1 TB memory Data Storage 1 x Sun Fire X4275 (using COMSTAR) 2 x Sun Storage F5100 Flash Array (each with 80 FMODs) Redo Storage 1 x Sun Fire X4275 (using COMSTAR with 8 HDD) Software Configuration: Oracle Solaris 11 11/11 Oracle Database 11g Release 2 (11.2.0.3) with Oracle OLAP option Benchmark Description The Oracle OLAP Perf Version 2 benchmark is a workload designed to demonstrate and stress the Oracle OLAP product's core features of fast query, fast update, and rich calculations on a multi-dimensional model to support enhanced Data Warehousing. The bulk of the benchmark entails running a number of concurrent users, each issuing typical multidimensional queries against an Oracle OLAP cube consisting of a number of years of sales data with fully pre-computed aggregations. The cube has four dimensions: time, product, customer, and channel. Each query user issues approximately 150 different queries. One query chain may ask for total sales in a particular region (e.g South America) for a particular time period (e.g. Q4 of 2010) followed by additional queries which drill down into sales for individual countries (e.g. Chile, Peru, etc.) with further queries drilling down into individual stores, etc. Another query chain may ask for yearly comparisons of total sales for some product category (e.g. major household appliances) and then issue further queries drilling down into particular products (e.g. refrigerators, stoves. etc.), particular regions, particular customers, etc. Results from version 2 of the benchmark are not comparable with version 1. The primary difference is the type of queries along with the query mix. Key Points and Best Practices Since typical BI users are often likely to issue similar queries, with different constants in the where clauses, setting the init.ora prameter "cursor_sharing" to "force" will provide for additional query throughput and a larger number of potential users. Except for this setting, together with making full use of available memory, out of the box performance for the OLAP Perf workload should provide results similar to what is reported here. For a given number of query users with zero think time, the main measured metrics are the average query response time, the median query response time, and the query throughput. A derived metric is the maximum number of users the system can support achieving the measured response time assuming some non-zero think time. The calculation of the maximum number of users follows from the well-known response-time law N = (rt + tt) * tp where rt is the average response time, tt is the think time and tp is the measured throughput. Setting tt to 60 seconds, rt to 0.85 seconds and tp to 119.44 queries/sec (430,000 queries/hour), the above formula shows that the T4-4 server will support 7,300 concurrent users with a think time of 60 seconds and an average response time of 0.85 seconds. For more information see chapter 3 from the book "Quantitative System Performance" cited below. -- See Also Quantitative System Performance Computer System Analysis Using Queueing Network Models Edward D. Lazowska, John Zahorjan, G. Scott Graham, Kenneth C. Sevcik external local Oracle Database 11g – Oracle OLAP oracle.com OTN SPARC T4-4 Server oracle.com OTN Oracle Solaris oracle.com OTN Oracle Database 11g Release 2 oracle.com OTN Disclosure Statement Copyright 2012, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 11/2/2012.

    Read the article

< Previous Page | 1 2 3 4 5  | Next Page >