Search Results

Search found 1172 results on 47 pages for 'sparc spread'.

Page 37/47 | < Previous Page | 33 34 35 36 37 38 39 40 41 42 43 44  | Next Page >

  • How does one make sure or even guarantee server time are sync correctly between dozens of servers across multiple datacenter on different location?

    - by forestclown
    Currently our web applications contain a logic to check if the data sent to the web server is expired or not by comparing the timestamp of the data with the date/time of the server. Everything goes will, until some dude from data center accidentally modify one of the web server date/time and causes some disruptions in our web services. My managers are of course not happy with this, and said we shouldn't use timestamp to check expiry in the first place...anyway.... Network Time Protocol is implemented, because of data centers are spread across different continents so we have one NTP server in each data center. The servers within the data center will have cron jobs to check against the time with their NTP server from the same data center. If time is out of sync it will auto update the server date/time. But then with our managers not happy with it, and think it could still easily causes the same problem. e.g. what if someone accidentally modify the NTP date/time? what if all the NTP servers are out of sync with each other? which NTP servers we can really trust? and blah blah.. So my questions are: What are the current practice to sync date/time between servers across multiple data centers or locations? How does one manages time stamp between web apps? e.g. Server A send data (contain timestamp of Server A) to Server B (compare timestamp between Server B and the timestamp from the data to see if it has expired or not. This is to avoid HTTP replay) Should we really not use timestamp check? Thanks & Best Regards

    Read the article

  • How to debug silent hang on shutdown of Solaris 10?

    - by jblaine
    We're experiencing a mysterious hang on shutdown of a newly-imaged Oracle/Sun Solaris 10 SPARC box. It is repeatable (in the same spot ... from what we can tell). We let it try to work itself out multiple times for 5-10 minutes and it never progressed. I've never seen this happen before. The last thing displayed on the console is that syslogd was sent signal 15. Prior to us disabling snmpdx on the box, the last thing on the console was that snmpdx was sent signal 15 (after syslogd was sent signal 15). While very rare to find, in Solaris days past, I'd have a better idea from experience where the problem might be, and could then narrow things down further with silly (but effective) debugging echo statments in /etc/*.d scripts. With SMF in the picture, I'm not really quite sure where to start. We forced a crash dump via sync at the {ok} prompt for later analysis, and then let the box come up because it's a production server and our scheduled outage window was closing. /var/adm/messages shows nothing of use. How would you debug this situation? mdb ps of the savecore shows the following processes were running at hang time (afsd is the OpenAFS client and that many are expected): > > ::ps S PID PPID PGID SID UID FLAGS ADDR NAME R 0 0 0 0 0 0x00000001 00000000018387c0 sched R 108 0 0 0 0 0x00020001 00000600110fe010 zpool-silmaril-p R 3 0 0 0 0 0x00020001 0000060010b29848 fsflush R 2 0 0 0 0 0x00020001 0000060010b2a468 pageout R 1 0 0 0 0 0x4a024000 0000060010b2b088 init R 1327 1 1327 329 0 0x4a024002 00000600176ab0c0 reboot R 747 1 7 7 0 0x42020001 0000060017f9d0e0 afsd R 749 1 7 7 0 0x42020001 00000600180104d0 afsd R 752 1 7 7 0 0x42020001 0000060017cb44b8 afsd R 754 1 7 7 0 0x42020001 0000060017fc8068 afsd R 756 1 7 7 0 0x42020001 0000060017fcb0e8 afsd R 760 1 7 7 0 0x42020001 00000600177f4048 afsd R 762 1 7 7 0 0x42020001 000006001800f8b0 afsd R 764 1 7 7 0 0x42020001 000006001800ec90 afsd R 378 1 378 378 0 0x42020000 0000060013aee480 inetd R 7 1 7 7 0 0x42020000 0000060010b28008 svc.startd R 329 7 329 329 0 0x4a024000 00000600110ff850 sh Z 317 7 317 317 0 0x4a014002 0000060013b3a490 sac

    Read the article

  • How do I collect SNMP readings from intermittently-connected sites?

    - by Luke404
    I am collecting SNMP data on-site for a number of systems, currently using Cacti. These systems are spread on a number of sites that aren't always connected to internet, but I also need to centralize the data on a single system (datacenter housed server) and get graphs out of it. If I directly poll remote systems with a centralized Cacti I'd loose data when a site is not connected to internet. I should record data on-site (I have a server at each site and I can run whatever I want on it) and then 'sync' everything to the central system. One hack could be a cacti or directly an rrdtool on site and then periodically rsync RRD data to the central Cacti system, but that doesn't sound like a 'clean' solution: every RRD would have to be defined at both places and rsync scripts setup with the specific file names. Can you suggest a better solution? Cacti is not a requirement but I'd like to use something like that on the central system. On-site systems need only to collect data I don't need to graph it there or manage users rights to view data and stuff like that, users will only access the centralized system.

    Read the article

  • Does btrfs balance also defragment files?

    - by pauldoo
    When I run btrfs filesystem balance, does this implicitly defragment files? I could imagine that balance simply reallocates each file extent separately, preserving the existing fragmentation. There is an FAQ entry, 'What does "balance" do?', which is unclear on this point: btrfs filesystem balance is an operation which simply takes all of the data and metadata on the filesystem, and re-writes it in a different place on the disks, passing it through the allocator algorithm on the way. It was originally designed for multi-device filesystems, to spread data more evenly across the devices (i.e. to "balance" their usage). This is particularly useful when adding new devices to a nearly-full filesystem. Due to the way that balance works, it also has some useful side-effects: If there is a lot of allocated but unused data or metadata chunks, a balance may reclaim some of that allocated space. This is the main reason for running a balance on a single-device filesystem. On a filesystem with damaged replication (e.g. a RAID-1 FS with a dead and removed disk), it will force the FS to rebuild the missing copy of the data on one of the currently active devices, restoring the RAID-1 capability of the filesystem.

    Read the article

  • lsof not showing what port a proc is listening on

    - by ericslaw
    I have many processes on a box listening on several ports. I am trying to map ports to pids. The problem is that lsof is not telling me what ports belong to which process. Given an apache listening on port 80, I can see it listening via netstat: user@host% netstat -an|grep LISTEN|grep 80 *.80 *.* 0 0 49152 0 LISTEN But when I try to map port 80 to a pid I get nothing: user@host% lsof -iTCP:80 -t When I try seeing what sockets that specific pid is using I get: user@host% lsof -lnP -p31 -a -i COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME libhttpd. 31 0 15u IPv4 0x6002d970b80 0t0 TCP *:65535 (LISTEN) Notice the *:65535 in the NAME column. Does anyone know why lsof is not reporting the port in use? I am running as root. I am using a mix of lsof and os versions: lsof v4.77 on Solaris10 sparc lsof v4.72 on Redhat4.2 etc I know that linux solutions can use "netstat -p", so I guess I'm only looking for why solaris isn't working, but I find lsof is frequently silent and not showing me expected data.

    Read the article

  • IPC between multiple processes on multiple servers

    - by z8000
    Let's say you have 2 servers each with 8 CPU cores each. The servers each run 8 network services that each host an arbitrary number of long-lived TCP/IP client connections. Clients send messages to the services. The services do something based on the messages, and potentially notify N1 of the clients of state changes. Sure, it sounds like a botnet but it isn't. Consider how IRC works with c2s and s2s connections and s2s message relaying. The servers are in the same data center. The servers can communicate over a private VLAN @1GigE. Messages are < 1KB in size. How would you coordinate which services on which host should receive and relay messages to connected clients for state change messages? There's an infinite number of ways to solve this problem efficiently. AMQP (RabbitMQ, ZeroMQ, etc.) Spread Toolkit N^2 connections between allservices (bad) Heck, even run IRC! ... I'm looking for a solution that: perhaps exploits the fact that there's only a small closed cluster is easy to admin scales well is "dumb" (no weird edge cases) What are your experiences? What do you recommend? Thanks!

    Read the article

  • Scaling databases with cheap SSD hard drives

    - by Dennis Kashkin
    Hey guys! I hope that many of you are working with high traffic database-driven websites, and chances are that your main scalability issues are in the database. I noticed a couple of things lately: Most large databases require a team of DBAs in order to scale. They constantly struggle with limitations of hard drives and end up with very expensive solutions (SANs or large RAIDs, frequent maintenance windows for defragging and repartitioning, etc.) The actual annual cost of maintaining such databases is in $100K-$1M range which is too steep for me :) Finally, we got several companies like Intel, Samsung, FusionIO, etc. that just started selling extremely fast yet affordable SSD hard drives based on SLC Flash technology. These drives are 100 times faster in random read/writes than the best spinning hard drives on the market (up to 50,000 random writes per second). Their seek time is pretty much zero, so the cost of random I/O is the same as sequential I/O, which is awesome for databases. These SSD drives cost around $10-$20 per gigabyte, and they are relatively small (64GB). So, there seems to be an opportunity to avoid the HUGE costs of scaling databases the traditional way by simply building a big enough RAID 5 array of SSD drives (which would cost only a few thousand dollars). Then we don't care if the database file is fragmented, and we can afford 100 times more disk writes per second without having to spread the database across 100 spindles. . Is anybody else interested in this? I've been testing a few SSD drives and can share my results. If anybody on this site has already solved their I/O bottleneck with SSDs, I would love to hear your war stories! PS. I know that there are plenty of expensive solutions out there that help with scalability, for example the time proven RAM-based SANs. I want to be clear that even $50K is too expensive for my project. I have to find a solution that costs no more than $10K and does not take much time to implement.

    Read the article

  • Transferring an SQL Processor License to a virtual hosted environment

    - by Andrew Shepherd
    My company is currently hosting a service in-house, and we want to move to an externally hosted environment. We would then be using a virtual server. I understand that this might be spread across multiple machines, but from my perspective as a customer, this layer is abstracted away - I shouldn't know or care about the hardware that the OS is hosted on. We have a licensed edition of SQL Server 2008. This is one Processor license. Will it be a violation of the licensing agreement to use this in a virtual environment. From the reference guide here it says When licensed Per Processor With Workgroup, Web, and Standard editions, for each server to which you have assigned the required number of per processor licenses, you may run, at any one time, any number of instances of the server software in physical and virtual operating system environments on the licensed server. However, the total number of physical and virtual processors used by those operating system environments cannot exceed the number of software licenses assigned to that server For enterprise edition there is an added option: if all physical processors in a machine have been licensed, then you may run unlimited instances of SQL server 2008 in one physical and an unlimited number of virtual operating environments on that same machine. I'm having trouble getting my head around this. Would I theoretically have to get a license for every processor in this virtual environment (which is effectively impossible because I have no way of knowing how many processors there actually are)? Or can I just say that it's hosted on one "virtual" server, so that's OK?

    Read the article

  • Random and Selective ARP blindness in VMWare ESXi 4.1

    - by Peter Grace
    We have multiple VMWare ESX servers spread out amongst our company, doing various tasks. One particular ESXi host is exhibiting very peculiar behavior. We detect it when our monitoring system (Orion) notifies us that it can no longer ping the box. Upon jumping on the local console of the guest in question, we see that it cannot ping any new addresses that aren't already in its ARP table. At first we thought that the problem was just related to one of our guests, as the problem seemed to always happen to another guest, DevRedis. However, this afternoon the problem swapped and started happening on ApacheBox rather than DevRedis. When I have been fortunate to catch the problem, I have run tcpdump on both sides of the connection (one side being vmware, the other side being a physical webserver) and have noticed the following course of events: Guest ApacheBox sends an ARP request for the physical address of server WindowsBeast WindowsBeast tenders an ARP is-at back to the network indicating its physical mac address. ApacheBox never sees the ARP is-at response. The ESX host in question is running VMware ESXi, 4.1.0, 348481 The two guests (DevRedis and ApacheBox) are both running CentOS 6.3, however they are running two separate kernel versions ( 2.6.32-279.9.1.el6.x86_64 and 2.6.32-279.el6.x86_64 ) so I'm not entirely sure it's a CentOS problem. Does anyone have any thoughts on what might cause this? Has anyone run into it before?

    Read the article

  • Can Octopussy use messages other than syslog style?

    - by Lee Lowder
    I am currently exploring different options for a centralized log server. We use both Linux (Ubuntu 10.04 / 12.04, LTS for both) and Windows, though for this specific issue only Linux is relevant. I like the interface that octopussy has and it's feature list, but I am hesitant due to a few things. One of the biggest concerns I have is that it seems to be syslog only. The end goal is to have a centralized place for our devs and admins to be able to search through the logs generated by Apache, Tomcat and 70+ web apps spread out among a cluster, for both our prod and test environments. While I did see that octopussy has support for plugins, I haven't been able to find any sort of plugin repo or in depth guides as to what can be done with them. Does anyone know if plugins can be used to allow octopussy to non-syslog messages? Specifically log4j type log messages that may include multi-line stack traces and such. Also, is there a user community for this software, such as a mailing list or forum? I've been unable to locate any so far. Thank you.

    Read the article

  • Solaris 10: How to remove devices from a zpool with /usr currently mounted?

    - by cali-spc
    I use Solaris 10 on SPARC. I have /usr legacy mounted on a zpool 'usr-pool'. I now need to move some of the devices in usr-pool to another zpool which is running out of room. What is the safest way for me to do this? I already know that (since my zpool is not mirrored) I need to destroy and recreate the zpool. I know how to backup and restore a zfs snapshot. However... I'm stumped on how to unmount usr-pool without losing access to the commands I need on /usr to complete the backup/restore. Cursory research indicated that I should boot to OpenBoot (init 0) and then 'boot cdrom -s'. I did this but none of the zpools are accessible on that runlevel. I also read I could just copy /usr to another location, symlink /usr to that location, then do my backup/restore. Is that safe to do? I would appreciate some guidance. S.

    Read the article

  • Setting up test an dlive enviornment - how?

    - by Sean
    I am a bit new to servers and stuff so had a question. I have my development team working on my website. They are in different countries and currently they put all the work live on the test site. But the test site is open to anyone who knows the URL. It is behind a directory but this effects my QA process because i cannot use the accurate URL structures to prevent the general public from seeing it. So what I want to do it: Have my site live on the net but only for me and my team, so like an internal network. Also I will need to mirror this to my live site when i put it live. So i guess this is something like setting up a staging and live environment. So how to do it and are both environments on the same physical server or do i need to buy two servers? And if i setup a staging environment how will i access it and my team since we are all spread out so i assume we need to log into something to access it? What about the URL - do i need a different URL for the test site or can i use the same live url for the test site? I plan to get a dedicated server + CDN for my site.

    Read the article

  • AMD processors witn graphics card bundled [closed]

    - by shybovycha
    Sorry for posting this question here - just don't know to which StackExchange website i should be writing. I've heard AMD created processors with video card bundled. So now these processors should work as fast as just usual processors with discrete video card but AMD's ones should use less power and spread less warm. Some googling around gave me the result like "AMD processors of A-series". They were mentioned to be build using that technology i described above. But on the other hand, we have a small rate of publishing and not-very-good quality of AMD drivers. I am a game-developer and web-developer so i need a powerful processor and graphics card and a lot of RAM on board (to make it possible to create a sample Grails application, for example or to create some 3D models in Maya/Cinema4D, for instance). Still i want my battery to be a long-living one, so power usage is a bit critical for me. So, my questions are: are there any processor building technology like i've described and which series they are (if they exist)? which processor shall fit the laptop the best: AMD one or i5/i7 on with nVidia graphics card for the purposes mentioned above?

    Read the article

  • Can my employer force me to backup my personal machine? [closed]

    - by Eric B
    Here's the background: Approximately 1.25 years ago, the company I work for was acquired by a larger 400 person company. Before acquisition (and today still) we are all remote employees using our own personal hardware for work-related duties (coding, email, etc). We are approximately 15 employees within the larger organization. Some time after acquisition, the now owning company was slapped with a civil lawsuit. Part of this lawsuit (discovery) is requiring them to retrieve & store from us any related information. Because we were a separate company up until acquisition, there is a high probability that our personal machines might contain information about what the lawsuit alleges (email, documents, chat logs?, etc). Obviously, this depends largely on the person's job function (engineer vs. customer support vs. CEO). All employees are being required to comply. Since acquisition (1.25 yrs), the new company has not provided us with company laptops/desktops. We continue to use personal hardware, licenses, etc for work. Email is via POP3s and not hanging around on the mail server - it's on everyone's client. Documents are spread across personal machines. So, now they want us each to backup our complete personal machines. They are allowing us to create a "personal" folder where we can place personal documents. That single folder will be excluded from backup. Of course, that means total re-arrangement of documents, etc. For most of us, 99% of the data on the machine is NOT related to work. So, what's the consensus? Should we comply? What is their recourse if we do not?

    Read the article

  • Concatenating gziped Apache logs

    - by markdrayton
    We rotate and compress our Apache logs each day but it's become apparent that this isn't frequently enough. An uncompressed log is about 6G, which is getting close to filling our log partition (yep, we'll make it bigger in the future!) as well as taking a lot of time and CPU to compress each day. We have to produce a gziped log for each day for our stats processing. Obviously we could move our logs to a partition with more space but I also want to spread the compression overhead throughout the day. Using Apache's rotatelogs we can rotate and compress the log more often -- hourly, say -- but how can I concatenate all the hourly compressed logs into a running compressed log for the day, without decompressing the previous logs? I don't want to uncompress 24 hours' worth of data and recompress it because that has all the disadvantages of our current solution. Gzip doesn't seem to offer any append or concatenate option but perhaps I've missed something obvious. This question suggests straight shell concatenation "works" in that the archive can be decompressed but that gzip -l doesn't work seems a bit dodgy. Alternatively, perhaps this is still a bad way to do things. Other suggestions are welcome -- our only constraints are our relatively small log partitions and the need to provide a daily compressed log.

    Read the article

  • Locate devices within a building

    - by ams0
    The situation: Our company is spread between two floors in a building. Every employee has a laptop (macbook Air or MacbookPro) and an iPhone. We have static DHCP mappings and DNS resolution so every mobile gets a name like employeeiphone.example.com, every macbook air gets a employeelaptop.example.com and every macbook pro gets a employeelaptop.example.com on the Ethernet interface (the wifi gets a dynamic IP from a small range dedicated for the purpose). We know each and every MAC address of phones and laptops, since we do DHCP static mapping (ISC DHCP server runs on linux). At each floor we have a Netgear stack of two switches, connected via 10GB fiber to each other. No VLANs so far. At every floor there are 4 Airport Extreme making a single SSID network with WPA2 authentication. The request: Our CTO wants to know who is present at which floor. My solution (so far): Every switch contains an table listing MAC address and originating port. On each switch stack, all the MAC addresses coming from the other floor are listed as coming on port 48 (the fiber link). So I came up with: 1) Get the table from each switch via SNMP 2) Filter out the ones associated with port 48 3) Grep dhcpd.conf, removing all entries not *laptop and not *iphone 4) Match the two lists for each switch, output in JSON or XML 5) present the results on a dashboard for all to see I wrote it in bash with a lot of awk and sed, it kinda works but I always have for some reason stale entries in the switch lookup tables, making it unreliable; some people may have put their laptop to sleep, their iphones drop connections after a while, if not woken up and so on..I searched left and right, we are prepared to spend a little on the project too (RFIDs?), does anybody do something similar? I can provide with the script if needed (although it's really specific to our switches and naming scheme). Thanks! p.s. perhaps is this a question for stackoverflow? please move if it so.

    Read the article

  • How to keep multiple servers in sync file wise?

    - by GForceSys
    I'm currently managing a cluster of PHP-FPM servers, all of which tend to get out of sync with each other. The application that I'm using on top of the app servers (Magento) allows for admins to modify various files on the system, but now that the site is in a clustered set up modifying a file only modifies it on a single instance (on one of the app servers) of the various machines in the cluster. Is there an open-source application for Linux that may allow me to keep all of these servers in sync? I have no problem with creating a small VM instance that can listen for changes from machines to sync. In theory, the perfect application would have small clients that run on each machine to be synced, which would talk to the master server which would then decide how/what to sync from each machine. I have already examined the possibilities of running a centralized file server, but unfortunately my app servers are spread out between EC2 and physical machines, which makes this unfeasible. As there are multiple app servers (some of which are dynamically created depending on the load of the site), simply setting up a rsync cron job is not efficient as the cron job would have to be modified on each machine to send files to every other machine in the cluster, and that would just be a whole bunch of unnecessary data transfers/ssh connections.

    Read the article

  • how to build network across buildings ?

    - by Omie
    Hi ! I need some help in building a network between hundreds of computers spread across multiple buildings of my college. Yes, I'll be doing this as a part of my college project. Please see this image, it will give you enough idea of what I'm trying to achieve. http://i.imgur.com/rOohx.png All the computers in all buildings should be able to connect server. Once network is up, there will be a set of services over intranet and network use will be moderate. well, say there will be an email server and a http server. My point is, I cannot afford much of performance loss. It feels easy to connect computers inside 1 building to each other, however, I'm clueless as to how to connect all of them to server. I mean, just 1 cable won't be enough to connect 1 building to server, right ? How should I go with it ? I am not expecting detailed configuration. Just heads up will do :) Thanks

    Read the article

  • what is best multi-server configuration with OpenVPN

    - by sebut
    We have a number of Database severs running MongoDB on Debian plus a number of Application servers also on Debian. The db servers hold replicating db clusters, so they need to talk to each other. Application servers need to talk to all db servers (for reasons of fault tolerance). The servers are potentially spread across multiple hosting centers, so we need secure channels between all servers. The number of servers is bound to grow, so we need a VPN solution that's easy to maintain and expand. This is why I feel that SSH that we use for testing might not be up to the task and OpenVPN seems the way to go. I have ruled out TAP, since I understand that this would mean all traffic going to all the servers - perhaps this is a misunderstanding and TAP acts more like a switch? With TUN devices I imagine that all DB servers would live in their own separate subnet, they would also need a client configured to be able to connect to each of their peers. The application servers could live in a common subnet range with a client config only. Does this sound like a reasonable setup? Strangely, on the web I did not find anything about multi-server with OpenVPN. Thanks for all insights and ideas!

    Read the article

  • Using R to Analyze G1GC Log Files

    - by user12620111
    Using R to Analyze G1GC Log Files body, td { font-family: sans-serif; background-color: white; font-size: 12px; margin: 8px; } tt, code, pre { font-family: 'DejaVu Sans Mono', 'Droid Sans Mono', 'Lucida Console', Consolas, Monaco, monospace; } h1 { font-size:2.2em; } h2 { font-size:1.8em; } h3 { font-size:1.4em; } h4 { font-size:1.0em; } h5 { font-size:0.9em; } h6 { font-size:0.8em; } a:visited { color: rgb(50%, 0%, 50%); } pre { margin-top: 0; max-width: 95%; border: 1px solid #ccc; white-space: pre-wrap; } pre code { display: block; padding: 0.5em; } code.r, code.cpp { background-color: #F8F8F8; } table, td, th { border: none; } blockquote { color:#666666; margin:0; padding-left: 1em; border-left: 0.5em #EEE solid; } hr { height: 0px; border-bottom: none; border-top-width: thin; border-top-style: dotted; border-top-color: #999999; } @media print { * { background: transparent !important; color: black !important; filter:none !important; -ms-filter: none !important; } body { font-size:12pt; max-width:100%; } a, a:visited { text-decoration: underline; } hr { visibility: hidden; page-break-before: always; } pre, blockquote { padding-right: 1em; page-break-inside: avoid; } tr, img { page-break-inside: avoid; } img { max-width: 100% !important; } @page :left { margin: 15mm 20mm 15mm 10mm; } @page :right { margin: 15mm 10mm 15mm 20mm; } p, h2, h3 { orphans: 3; widows: 3; } h2, h3 { page-break-after: avoid; } } pre .operator, pre .paren { color: rgb(104, 118, 135) } pre .literal { color: rgb(88, 72, 246) } pre .number { color: rgb(0, 0, 205); } pre .comment { color: rgb(76, 136, 107); } pre .keyword { color: rgb(0, 0, 255); } pre .identifier { color: rgb(0, 0, 0); } pre .string { color: rgb(3, 106, 7); } var hljs=new function(){function m(p){return p.replace(/&/gm,"&").replace(/"}while(y.length||w.length){var v=u().splice(0,1)[0];z+=m(x.substr(q,v.offset-q));q=v.offset;if(v.event=="start"){z+=t(v.node);s.push(v.node)}else{if(v.event=="stop"){var p,r=s.length;do{r--;p=s[r];z+=("")}while(p!=v.node);s.splice(r,1);while(r'+M[0]+""}else{r+=M[0]}O=P.lR.lastIndex;M=P.lR.exec(L)}return r+L.substr(O,L.length-O)}function J(L,M){if(M.sL&&e[M.sL]){var r=d(M.sL,L);x+=r.keyword_count;return r.value}else{return F(L,M)}}function I(M,r){var L=M.cN?'':"";if(M.rB){y+=L;M.buffer=""}else{if(M.eB){y+=m(r)+L;M.buffer=""}else{y+=L;M.buffer=r}}D.push(M);A+=M.r}function G(N,M,Q){var R=D[D.length-1];if(Q){y+=J(R.buffer+N,R);return false}var P=q(M,R);if(P){y+=J(R.buffer+N,R);I(P,M);return P.rB}var L=v(D.length-1,M);if(L){var O=R.cN?"":"";if(R.rE){y+=J(R.buffer+N,R)+O}else{if(R.eE){y+=J(R.buffer+N,R)+O+m(M)}else{y+=J(R.buffer+N+M,R)+O}}while(L1){O=D[D.length-2].cN?"":"";y+=O;L--;D.length--}var r=D[D.length-1];D.length--;D[D.length-1].buffer="";if(r.starts){I(r.starts,"")}return R.rE}if(w(M,R)){throw"Illegal"}}var E=e[B];var D=[E.dM];var A=0;var x=0;var y="";try{var s,u=0;E.dM.buffer="";do{s=p(C,u);var t=G(s[0],s[1],s[2]);u+=s[0].length;if(!t){u+=s[1].length}}while(!s[2]);if(D.length1){throw"Illegal"}return{r:A,keyword_count:x,value:y}}catch(H){if(H=="Illegal"){return{r:0,keyword_count:0,value:m(C)}}else{throw H}}}function g(t){var p={keyword_count:0,r:0,value:m(t)};var r=p;for(var q in e){if(!e.hasOwnProperty(q)){continue}var s=d(q,t);s.language=q;if(s.keyword_count+s.rr.keyword_count+r.r){r=s}if(s.keyword_count+s.rp.keyword_count+p.r){r=p;p=s}}if(r.language){p.second_best=r}return p}function i(r,q,p){if(q){r=r.replace(/^((]+|\t)+)/gm,function(t,w,v,u){return w.replace(/\t/g,q)})}if(p){r=r.replace(/\n/g,"")}return r}function n(t,w,r){var x=h(t,r);var v=a(t);var y,s;if(v){y=d(v,x)}else{return}var q=c(t);if(q.length){s=document.createElement("pre");s.innerHTML=y.value;y.value=k(q,c(s),x)}y.value=i(y.value,w,r);var u=t.className;if(!u.match("(\\s|^)(language-)?"+v+"(\\s|$)")){u=u?(u+" "+v):v}if(/MSIE [678]/.test(navigator.userAgent)&&t.tagName=="CODE"&&t.parentNode.tagName=="PRE"){s=t.parentNode;var p=document.createElement("div");p.innerHTML=""+y.value+"";t=p.firstChild.firstChild;p.firstChild.cN=s.cN;s.parentNode.replaceChild(p.firstChild,s)}else{t.innerHTML=y.value}t.className=u;t.result={language:v,kw:y.keyword_count,re:y.r};if(y.second_best){t.second_best={language:y.second_best.language,kw:y.second_best.keyword_count,re:y.second_best.r}}}function o(){if(o.called){return}o.called=true;var r=document.getElementsByTagName("pre");for(var p=0;p|=||=||=|\\?|\\[|\\{|\\(|\\^|\\^=|\\||\\|=|\\|\\||~";this.ER="(?![\\s\\S])";this.BE={b:"\\\\.",r:0};this.ASM={cN:"string",b:"'",e:"'",i:"\\n",c:[this.BE],r:0};this.QSM={cN:"string",b:'"',e:'"',i:"\\n",c:[this.BE],r:0};this.CLCM={cN:"comment",b:"//",e:"$"};this.CBLCLM={cN:"comment",b:"/\\*",e:"\\*/"};this.HCM={cN:"comment",b:"#",e:"$"};this.NM={cN:"number",b:this.NR,r:0};this.CNM={cN:"number",b:this.CNR,r:0};this.BNM={cN:"number",b:this.BNR,r:0};this.inherit=function(r,s){var p={};for(var q in r){p[q]=r[q]}if(s){for(var q in s){p[q]=s[q]}}return p}}();hljs.LANGUAGES.cpp=function(){var a={keyword:{"false":1,"int":1,"float":1,"while":1,"private":1,"char":1,"catch":1,"export":1,virtual:1,operator:2,sizeof:2,dynamic_cast:2,typedef:2,const_cast:2,"const":1,struct:1,"for":1,static_cast:2,union:1,namespace:1,unsigned:1,"long":1,"throw":1,"volatile":2,"static":1,"protected":1,bool:1,template:1,mutable:1,"if":1,"public":1,friend:2,"do":1,"return":1,"goto":1,auto:1,"void":2,"enum":1,"else":1,"break":1,"new":1,extern:1,using:1,"true":1,"class":1,asm:1,"case":1,typeid:1,"short":1,reinterpret_cast:2,"default":1,"double":1,register:1,explicit:1,signed:1,typename:1,"try":1,"this":1,"switch":1,"continue":1,wchar_t:1,inline:1,"delete":1,alignof:1,char16_t:1,char32_t:1,constexpr:1,decltype:1,noexcept:1,nullptr:1,static_assert:1,thread_local:1,restrict:1,_Bool:1,complex:1},built_in:{std:1,string:1,cin:1,cout:1,cerr:1,clog:1,stringstream:1,istringstream:1,ostringstream:1,auto_ptr:1,deque:1,list:1,queue:1,stack:1,vector:1,map:1,set:1,bitset:1,multiset:1,multimap:1,unordered_set:1,unordered_map:1,unordered_multiset:1,unordered_multimap:1,array:1,shared_ptr:1}};return{dM:{k:a,i:"",k:a,r:10,c:["self"]}]}}}();hljs.LANGUAGES.r={dM:{c:[hljs.HCM,{cN:"number",b:"\\b0[xX][0-9a-fA-F]+[Li]?\\b",e:hljs.IMMEDIATE_RE,r:0},{cN:"number",b:"\\b\\d+(?:[eE][+\\-]?\\d*)?L\\b",e:hljs.IMMEDIATE_RE,r:0},{cN:"number",b:"\\b\\d+\\.(?!\\d)(?:i\\b)?",e:hljs.IMMEDIATE_RE,r:1},{cN:"number",b:"\\b\\d+(?:\\.\\d*)?(?:[eE][+\\-]?\\d*)?i?\\b",e:hljs.IMMEDIATE_RE,r:0},{cN:"number",b:"\\.\\d+(?:[eE][+\\-]?\\d*)?i?\\b",e:hljs.IMMEDIATE_RE,r:1},{cN:"keyword",b:"(?:tryCatch|library|setGeneric|setGroupGeneric)\\b",e:hljs.IMMEDIATE_RE,r:10},{cN:"keyword",b:"\\.\\.\\.",e:hljs.IMMEDIATE_RE,r:10},{cN:"keyword",b:"\\.\\.\\d+(?![\\w.])",e:hljs.IMMEDIATE_RE,r:10},{cN:"keyword",b:"\\b(?:function)",e:hljs.IMMEDIATE_RE,r:2},{cN:"keyword",b:"(?:if|in|break|next|repeat|else|for|return|switch|while|try|stop|warning|require|attach|detach|source|setMethod|setClass)\\b",e:hljs.IMMEDIATE_RE,r:1},{cN:"literal",b:"(?:NA|NA_integer_|NA_real_|NA_character_|NA_complex_)\\b",e:hljs.IMMEDIATE_RE,r:10},{cN:"literal",b:"(?:NULL|TRUE|FALSE|T|F|Inf|NaN)\\b",e:hljs.IMMEDIATE_RE,r:1},{cN:"identifier",b:"[a-zA-Z.][a-zA-Z0-9._]*\\b",e:hljs.IMMEDIATE_RE,r:0},{cN:"operator",b:"|=||   Using R to Analyze G1GC Log Files   Using R to Analyze G1GC Log Files Introduction Working in Oracle Platform Integration gives an engineer opportunities to work on a wide array of technologies. My team’s goal is to make Oracle applications run best on the Solaris/SPARC platform. When looking for bottlenecks in a modern applications, one needs to be aware of not only how the CPUs and operating system are executing, but also network, storage, and in some cases, the Java Virtual Machine. I was recently presented with about 1.5 GB of Java Garbage First Garbage Collector log file data. If you’re not familiar with the subject, you might want to review Garbage First Garbage Collector Tuning by Monica Beckwith. The customer had been running Java HotSpot 1.6.0_31 to host a web application server. I was told that the Solaris/SPARC server was running a Java process launched using a commmand line that included the following flags: -d64 -Xms9g -Xmx9g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=80 -XX:PermSize=256m -XX:MaxPermSize=256m -XX:+PrintGC -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -XX:+PrintGCDateStamps -XX:+PrintFlagsFinal -XX:+DisableExplicitGC -XX:+UnlockExperimentalVMOptions -XX:ParallelGCThreads=8 Several sources on the internet indicate that if I were to print out the 1.5 GB of log files, it would require enough paper to fill the bed of a pick up truck. Of course, it would be fruitless to try to scan the log files by hand. Tools will be required to summarize the contents of the log files. Others have encountered large Java garbage collection log files. There are existing tools to analyze the log files: IBM’s GC toolkit The chewiebug GCViewer gchisto HPjmeter Instead of using one of the other tools listed, I decide to parse the log files with standard Unix tools, and analyze the data with R. Data Cleansing The log files arrived in two different formats. I guess that the difference is that one set of log files was generated using a more verbose option, maybe -XX:+PrintHeapAtGC, and the other set of log files was generated without that option. Format 1 In some of the log files, the log files with the less verbose format, a single trace, i.e. the report of a singe garbage collection event, looks like this: {Heap before GC invocations=12280 (full 61): garbage-first heap total 9437184K, used 7499918K [0xfffffffd00000000, 0xffffffff40000000, 0xffffffff40000000) region size 4096K, 1 young (4096K), 0 survivors (0K) compacting perm gen total 262144K, used 144077K [0xffffffff40000000, 0xffffffff50000000, 0xffffffff50000000) the space 262144K, 54% used [0xffffffff40000000, 0xffffffff48cb3758, 0xffffffff48cb3800, 0xffffffff50000000) No shared spaces configured. 2014-05-14T07:24:00.988-0700: 60586.353: [GC pause (young) 7324M->7320M(9216M), 0.1567265 secs] Heap after GC invocations=12281 (full 61): garbage-first heap total 9437184K, used 7496533K [0xfffffffd00000000, 0xffffffff40000000, 0xffffffff40000000) region size 4096K, 0 young (0K), 0 survivors (0K) compacting perm gen total 262144K, used 144077K [0xffffffff40000000, 0xffffffff50000000, 0xffffffff50000000) the space 262144K, 54% used [0xffffffff40000000, 0xffffffff48cb3758, 0xffffffff48cb3800, 0xffffffff50000000) No shared spaces configured. } A simple grep can be used to extract a summary: $ grep "\[ GC pause (young" g1gc.log 2014-05-13T13:24:35.091-0700: 3.109: [GC pause (young) 20M->5029K(9216M), 0.0146328 secs] 2014-05-13T13:24:35.440-0700: 3.459: [GC pause (young) 9125K->6077K(9216M), 0.0086723 secs] 2014-05-13T13:24:37.581-0700: 5.599: [GC pause (young) 25M->8470K(9216M), 0.0203820 secs] 2014-05-13T13:24:42.686-0700: 10.704: [GC pause (young) 44M->15M(9216M), 0.0288848 secs] 2014-05-13T13:24:48.941-0700: 16.958: [GC pause (young) 51M->20M(9216M), 0.0491244 secs] 2014-05-13T13:24:56.049-0700: 24.066: [GC pause (young) 92M->26M(9216M), 0.0525368 secs] 2014-05-13T13:25:34.368-0700: 62.383: [GC pause (young) 602M->68M(9216M), 0.1721173 secs] But that format wasn't easily read into R, so I needed to be a bit more tricky. I used the following Unix command to create a summary file that was easy for R to read. $ echo "SecondsSinceLaunch BeforeSize AfterSize TotalSize RealTime" $ grep "\[GC pause (young" g1gc.log | grep -v mark | sed -e 's/[A-SU-z\(\),]/ /g' -e 's/->/ /' -e 's/: / /g' | more SecondsSinceLaunch BeforeSize AfterSize TotalSize RealTime 2014-05-13T13:24:35.091-0700 3.109 20 5029 9216 0.0146328 2014-05-13T13:24:35.440-0700 3.459 9125 6077 9216 0.0086723 2014-05-13T13:24:37.581-0700 5.599 25 8470 9216 0.0203820 2014-05-13T13:24:42.686-0700 10.704 44 15 9216 0.0288848 2014-05-13T13:24:48.941-0700 16.958 51 20 9216 0.0491244 2014-05-13T13:24:56.049-0700 24.066 92 26 9216 0.0525368 2014-05-13T13:25:34.368-0700 62.383 602 68 9216 0.1721173 Format 2 In some of the log files, the log files with the more verbose format, a single trace, i.e. the report of a singe garbage collection event, was more complicated than Format 1. Here is a text file with an example of a single G1GC trace in the second format. As you can see, it is quite complicated. It is nice that there is so much information available, but the level of detail can be overwhelming. I wrote this awk script (download) to summarize each trace on a single line. #!/usr/bin/env awk -f BEGIN { printf("SecondsSinceLaunch IncrementalCount FullCount UserTime SysTime RealTime BeforeSize AfterSize TotalSize\n") } ###################### # Save count data from lines that are at the start of each G1GC trace. # Each trace starts out like this: # {Heap before GC invocations=14 (full 0): # garbage-first heap total 9437184K, used 325496K [0xfffffffd00000000, 0xffffffff40000000, 0xffffffff40000000) ###################### /{Heap.*full/{ gsub ( "\\)" , "" ); nf=split($0,a,"="); split(a[2],b," "); getline; if ( match($0, "first") ) { G1GC=1; IncrementalCount=b[1]; FullCount=substr( b[3], 1, length(b[3])-1 ); } else { G1GC=0; } } ###################### # Pull out time stamps that are in lines with this format: # 2014-05-12T14:02:06.025-0700: 94.312: [GC pause (young), 0.08870154 secs] ###################### /GC pause/ { DateTime=$1; SecondsSinceLaunch=substr($2, 1, length($2)-1); } ###################### # Heap sizes are in lines that look like this: # [ 4842M->4838M(9216M)] ###################### /\[ .*]$/ { gsub ( "\\[" , "" ); gsub ( "\ \]" , "" ); gsub ( "->" , " " ); gsub ( "\\( " , " " ); gsub ( "\ \)" , " " ); split($0,a," "); if ( split(a[1],b,"M") > 1 ) {BeforeSize=b[1]*1024;} if ( split(a[1],b,"K") > 1 ) {BeforeSize=b[1];} if ( split(a[2],b,"M") > 1 ) {AfterSize=b[1]*1024;} if ( split(a[2],b,"K") > 1 ) {AfterSize=b[1];} if ( split(a[3],b,"M") > 1 ) {TotalSize=b[1]*1024;} if ( split(a[3],b,"K") > 1 ) {TotalSize=b[1];} } ###################### # Emit an output line when you find input that looks like this: # [Times: user=1.41 sys=0.08, real=0.24 secs] ###################### /\[Times/ { if (G1GC==1) { gsub ( "," , "" ); split($2,a,"="); UserTime=a[2]; split($3,a,"="); SysTime=a[2]; split($4,a,"="); RealTime=a[2]; print DateTime,SecondsSinceLaunch,IncrementalCount,FullCount,UserTime,SysTime,RealTime,BeforeSize,AfterSize,TotalSize; G1GC=0; } } The resulting summary is about 25X smaller that the original file, but still difficult for a human to digest. SecondsSinceLaunch IncrementalCount FullCount UserTime SysTime RealTime BeforeSize AfterSize TotalSize ... 2014-05-12T18:36:34.669-0700: 3985.744 561 0 0.57 0.06 0.16 1724416 1720320 9437184 2014-05-12T18:36:34.839-0700: 3985.914 562 0 0.51 0.06 0.19 1724416 1720320 9437184 2014-05-12T18:36:35.069-0700: 3986.144 563 0 0.60 0.04 0.27 1724416 1721344 9437184 2014-05-12T18:36:35.354-0700: 3986.429 564 0 0.33 0.04 0.09 1725440 1722368 9437184 2014-05-12T18:36:35.545-0700: 3986.620 565 0 0.58 0.04 0.17 1726464 1722368 9437184 2014-05-12T18:36:35.726-0700: 3986.801 566 0 0.43 0.05 0.12 1726464 1722368 9437184 2014-05-12T18:36:35.856-0700: 3986.930 567 0 0.30 0.04 0.07 1726464 1723392 9437184 2014-05-12T18:36:35.947-0700: 3987.023 568 0 0.61 0.04 0.26 1727488 1723392 9437184 2014-05-12T18:36:36.228-0700: 3987.302 569 0 0.46 0.04 0.16 1731584 1724416 9437184 Reading the Data into R Once the GC log data had been cleansed, either by processing the first format with the shell script, or by processing the second format with the awk script, it was easy to read the data into R. g1gc.df = read.csv("summary.txt", row.names = NULL, stringsAsFactors=FALSE,sep="") str(g1gc.df) ## 'data.frame': 8307 obs. of 10 variables: ## $ row.names : chr "2014-05-12T14:00:32.868-0700:" "2014-05-12T14:00:33.179-0700:" "2014-05-12T14:00:33.677-0700:" "2014-05-12T14:00:35.538-0700:" ... ## $ SecondsSinceLaunch: num 1.16 1.47 1.97 3.83 6.1 ... ## $ IncrementalCount : int 0 1 2 3 4 5 6 7 8 9 ... ## $ FullCount : int 0 0 0 0 0 0 0 0 0 0 ... ## $ UserTime : num 0.11 0.05 0.04 0.21 0.08 0.26 0.31 0.33 0.34 0.56 ... ## $ SysTime : num 0.04 0.01 0.01 0.05 0.01 0.06 0.07 0.06 0.07 0.09 ... ## $ RealTime : num 0.02 0.02 0.01 0.04 0.02 0.04 0.05 0.04 0.04 0.06 ... ## $ BeforeSize : int 8192 5496 5768 22528 24576 43008 34816 53248 55296 93184 ... ## $ AfterSize : int 1400 1672 2557 4907 7072 14336 16384 18432 19456 21504 ... ## $ TotalSize : int 9437184 9437184 9437184 9437184 9437184 9437184 9437184 9437184 9437184 9437184 ... head(g1gc.df) ## row.names SecondsSinceLaunch IncrementalCount ## 1 2014-05-12T14:00:32.868-0700: 1.161 0 ## 2 2014-05-12T14:00:33.179-0700: 1.472 1 ## 3 2014-05-12T14:00:33.677-0700: 1.969 2 ## 4 2014-05-12T14:00:35.538-0700: 3.830 3 ## 5 2014-05-12T14:00:37.811-0700: 6.103 4 ## 6 2014-05-12T14:00:41.428-0700: 9.720 5 ## FullCount UserTime SysTime RealTime BeforeSize AfterSize TotalSize ## 1 0 0.11 0.04 0.02 8192 1400 9437184 ## 2 0 0.05 0.01 0.02 5496 1672 9437184 ## 3 0 0.04 0.01 0.01 5768 2557 9437184 ## 4 0 0.21 0.05 0.04 22528 4907 9437184 ## 5 0 0.08 0.01 0.02 24576 7072 9437184 ## 6 0 0.26 0.06 0.04 43008 14336 9437184 Basic Statistics Once the data has been read into R, simple statistics are very easy to generate. All of the numbers from high school statistics are available via simple commands. For example, generate a summary of every column: summary(g1gc.df) ## row.names SecondsSinceLaunch IncrementalCount FullCount ## Length:8307 Min. : 1 Min. : 0 Min. : 0.0 ## Class :character 1st Qu.: 9977 1st Qu.:2048 1st Qu.: 0.0 ## Mode :character Median :12855 Median :4136 Median : 12.0 ## Mean :12527 Mean :4156 Mean : 31.6 ## 3rd Qu.:15758 3rd Qu.:6262 3rd Qu.: 61.0 ## Max. :55484 Max. :8391 Max. :113.0 ## UserTime SysTime RealTime BeforeSize ## Min. :0.040 Min. :0.0000 Min. : 0.0 Min. : 5476 ## 1st Qu.:0.470 1st Qu.:0.0300 1st Qu.: 0.1 1st Qu.:5137920 ## Median :0.620 Median :0.0300 Median : 0.1 Median :6574080 ## Mean :0.751 Mean :0.0355 Mean : 0.3 Mean :5841855 ## 3rd Qu.:0.920 3rd Qu.:0.0400 3rd Qu.: 0.2 3rd Qu.:7084032 ## Max. :3.370 Max. :1.5600 Max. :488.1 Max. :8696832 ## AfterSize TotalSize ## Min. : 1380 Min. :9437184 ## 1st Qu.:5002752 1st Qu.:9437184 ## Median :6559744 Median :9437184 ## Mean :5785454 Mean :9437184 ## 3rd Qu.:7054336 3rd Qu.:9437184 ## Max. :8482816 Max. :9437184 Q: What is the total amount of User CPU time spent in garbage collection? sum(g1gc.df$UserTime) ## [1] 6236 As you can see, less than two hours of CPU time was spent in garbage collection. Is that too much? To find the percentage of time spent in garbage collection, divide the number above by total_elapsed_time*CPU_count. In this case, there are a lot of CPU’s and it turns out the the overall amount of CPU time spent in garbage collection isn’t a problem when viewed in isolation. When calculating rates, i.e. events per unit time, you need to ask yourself if the rate is homogenous across the time period in the log file. Does the log file include spikes of high activity that should be separately analyzed? Averaging in data from nights and weekends with data from business hours may alias problems. If you have a reason to suspect that the garbage collection rates include peaks and valleys that need independent analysis, see the “Time Series” section, below. Q: How much garbage is collected on each pass? The amount of heap space that is recovered per GC pass is surprisingly low: At least one collection didn’t recover any data. (“Min.=0”) 25% of the passes recovered 3MB or less. (“1st Qu.=3072”) Half of the GC passes recovered 4MB or less. (“Median=4096”) The average amount recovered was 56MB. (“Mean=56390”) 75% of the passes recovered 36MB or less. (“3rd Qu.=36860”) At least one pass recovered 2GB. (“Max.=2121000”) g1gc.df$Delta = g1gc.df$BeforeSize - g1gc.df$AfterSize summary(g1gc.df$Delta) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0 3070 4100 56400 36900 2120000 Q: What is the maximum User CPU time for a single collection? The worst garbage collection (“Max.”) is many standard deviations away from the mean. The data appears to be right skewed. summary(g1gc.df$UserTime) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.040 0.470 0.620 0.751 0.920 3.370 sd(g1gc.df$UserTime) ## [1] 0.3966 Basic Graphics Once the data is in R, it is trivial to plot the data with formats including dot plots, line charts, bar charts (simple, stacked, grouped), pie charts, boxplots, scatter plots histograms, and kernel density plots. Histogram of User CPU Time per Collection I don't think that this graph requires any explanation. hist(g1gc.df$UserTime, main="User CPU Time per Collection", xlab="Seconds", ylab="Frequency") Box plot to identify outliers When the initial data is viewed with a box plot, you can see the one crazy outlier in the real time per GC. Save this data point for future analysis and drop the outlier so that it’s not throwing off our statistics. Now the box plot shows many outliers, which will be examined later, using times series analysis. Notice that the scale of the x-axis changes drastically once the crazy outlier is removed. par(mfrow=c(2,1)) boxplot(g1gc.df$UserTime,g1gc.df$SysTime,g1gc.df$RealTime, main="Box Plot of Time per GC\n(dominated by a crazy outlier)", names=c("usr","sys","elapsed"), xlab="Seconds per GC", ylab="Time (Seconds)", horizontal = TRUE, outcol="red") crazy.outlier.df=g1gc.df[g1gc.df$RealTime > 400,] g1gc.df=g1gc.df[g1gc.df$RealTime < 400,] boxplot(g1gc.df$UserTime,g1gc.df$SysTime,g1gc.df$RealTime, main="Box Plot of Time per GC\n(crazy outlier excluded)", names=c("usr","sys","elapsed"), xlab="Seconds per GC", ylab="Time (Seconds)", horizontal = TRUE, outcol="red") box(which = "outer", lty = "solid") Here is the crazy outlier for future analysis: crazy.outlier.df ## row.names SecondsSinceLaunch IncrementalCount ## 8233 2014-05-12T23:15:43.903-0700: 20741 8316 ## FullCount UserTime SysTime RealTime BeforeSize AfterSize TotalSize ## 8233 112 0.55 0.42 488.1 8381440 8235008 9437184 ## Delta ## 8233 146432 R Time Series Data To analyze the garbage collection as a time series, I’ll use Z’s Ordered Observations (zoo). “zoo is the creator for an S3 class of indexed totally ordered observations which includes irregular time series.” require(zoo) ## Loading required package: zoo ## ## Attaching package: 'zoo' ## ## The following objects are masked from 'package:base': ## ## as.Date, as.Date.numeric head(g1gc.df[,1]) ## [1] "2014-05-12T14:00:32.868-0700:" "2014-05-12T14:00:33.179-0700:" ## [3] "2014-05-12T14:00:33.677-0700:" "2014-05-12T14:00:35.538-0700:" ## [5] "2014-05-12T14:00:37.811-0700:" "2014-05-12T14:00:41.428-0700:" options("digits.secs"=3) times=as.POSIXct( g1gc.df[,1], format="%Y-%m-%dT%H:%M:%OS%z:") g1gc.z = zoo(g1gc.df[,-c(1)], order.by=times) head(g1gc.z) ## SecondsSinceLaunch IncrementalCount FullCount ## 2014-05-12 17:00:32.868 1.161 0 0 ## 2014-05-12 17:00:33.178 1.472 1 0 ## 2014-05-12 17:00:33.677 1.969 2 0 ## 2014-05-12 17:00:35.538 3.830 3 0 ## 2014-05-12 17:00:37.811 6.103 4 0 ## 2014-05-12 17:00:41.427 9.720 5 0 ## UserTime SysTime RealTime BeforeSize AfterSize ## 2014-05-12 17:00:32.868 0.11 0.04 0.02 8192 1400 ## 2014-05-12 17:00:33.178 0.05 0.01 0.02 5496 1672 ## 2014-05-12 17:00:33.677 0.04 0.01 0.01 5768 2557 ## 2014-05-12 17:00:35.538 0.21 0.05 0.04 22528 4907 ## 2014-05-12 17:00:37.811 0.08 0.01 0.02 24576 7072 ## 2014-05-12 17:00:41.427 0.26 0.06 0.04 43008 14336 ## TotalSize Delta ## 2014-05-12 17:00:32.868 9437184 6792 ## 2014-05-12 17:00:33.178 9437184 3824 ## 2014-05-12 17:00:33.677 9437184 3211 ## 2014-05-12 17:00:35.538 9437184 17621 ## 2014-05-12 17:00:37.811 9437184 17504 ## 2014-05-12 17:00:41.427 9437184 28672 Example of Two Benchmark Runs in One Log File The data in the following graph is from a different log file, not the one of primary interest to this article. I’m including this image because it is an example of idle periods followed by busy periods. It would be uninteresting to average the rate of garbage collection over the entire log file period. More interesting would be the rate of garbage collect in the two busy periods. Are they the same or different? Your production data may be similar, for example, bursts when employees return from lunch and idle times on weekend evenings, etc. Once the data is in an R Time Series, you can analyze isolated time windows. Clipping the Time Series data Flashing back to our test case… Viewing the data as a time series is interesting. You can see that the work intensive time period is between 9:00 PM and 3:00 AM. Lets clip the data to the interesting period:     par(mfrow=c(2,1)) plot(g1gc.z$UserTime, type="h", main="User Time per GC\nTime: Complete Log File", xlab="Time of Day", ylab="CPU Seconds per GC", col="#1b9e77") clipped.g1gc.z=window(g1gc.z, start=as.POSIXct("2014-05-12 21:00:00"), end=as.POSIXct("2014-05-13 03:00:00")) plot(clipped.g1gc.z$UserTime, type="h", main="User Time per GC\nTime: Limited to Benchmark Execution", xlab="Time of Day", ylab="CPU Seconds per GC", col="#1b9e77") box(which = "outer", lty = "solid") Cumulative Incremental and Full GC count Here is the cumulative incremental and full GC count. When the line is very steep, it indicates that the GCs are repeating very quickly. Notice that the scale on the Y axis is different for full vs. incremental. plot(clipped.g1gc.z[,c(2:3)], main="Cumulative Incremental and Full GC count", xlab="Time of Day", col="#1b9e77") GC Analysis of Benchmark Execution using Time Series data In the following series of 3 graphs: The “After Size” show the amount of heap space in use after each garbage collection. Many Java objects are still referenced, i.e. alive, during each garbage collection. This may indicate that the application has a memory leak, or may indicate that the application has a very large memory footprint. Typically, an application's memory footprint plateau's in the early stage of execution. One would expect this graph to have a flat top. The steep decline in the heap space may indicate that the application crashed after 2:00. The second graph shows that the outliers in real execution time, discussed above, occur near 2:00. when the Java heap seems to be quite full. The third graph shows that Full GCs are infrequent during the first few hours of execution. The rate of Full GC's, (the slope of the cummulative Full GC line), changes near midnight.   plot(clipped.g1gc.z[,c("AfterSize","RealTime","FullCount")], xlab="Time of Day", col=c("#1b9e77","red","#1b9e77")) GC Analysis of heap recovered Each GC trace includes the amount of heap space in use before and after the individual GC event. During garbage coolection, unreferenced objects are identified, the space holding the unreferenced objects is freed, and thus, the difference in before and after usage indicates how much space has been freed. The following box plot and bar chart both demonstrate the same point - the amount of heap space freed per garbage colloection is surprisingly low. par(mfrow=c(2,1)) boxplot(as.vector(clipped.g1gc.z$Delta), main="Amount of Heap Recovered per GC Pass", xlab="Size in KB", horizontal = TRUE, col="red") hist(as.vector(clipped.g1gc.z$Delta), main="Amount of Heap Recovered per GC Pass", xlab="Size in KB", breaks=100, col="red") box(which = "outer", lty = "solid") This graph is the most interesting. The dark blue area shows how much heap is occupied by referenced Java objects. This represents memory that holds live data. The red fringe at the top shows how much data was recovered after each garbage collection. barplot(clipped.g1gc.z[,c("AfterSize","Delta")], col=c("#7570b3","#e7298a"), xlab="Time of Day", border=NA) legend("topleft", c("Live Objects","Heap Recovered on GC"), fill=c("#7570b3","#e7298a")) box(which = "outer", lty = "solid") When I discuss the data in the log files with the customer, I will ask for an explaination for the large amount of referenced data resident in the Java heap. There are two are posibilities: There is a memory leak and the amount of space required to hold referenced objects will continue to grow, limited only by the maximum heap size. After the maximum heap size is reached, the JVM will throw an “Out of Memory” exception every time that the application tries to allocate a new object. If this is the case, the aplication needs to be debugged to identify why old objects are referenced when they are no longer needed. The application has a legitimate requirement to keep a large amount of data in memory. The customer may want to further increase the maximum heap size. Another possible solution would be to partition the application across multiple cluster nodes, where each node has responsibility for managing a unique subset of the data. Conclusion In conclusion, R is a very powerful tool for the analysis of Java garbage collection log files. The primary difficulty is data cleansing so that information can be read into an R data frame. Once the data has been read into R, a rich set of tools may be used for thorough evaluation.

    Read the article

  • TFS Build errors TF224003, TF215085, TF215076

    - by iamdudley
    Hi, I am using TFS2008 and VS2008. I run nightly builds for about 20 applications using one build agent and the builds are scheduled for either 1am or 2am. Most of the build succeed, however 6 of them fail regularly with similar errors. The errors are either the first two below, or the third one by itself: TF215085: An error occurred while connecting to agent \xxxx\BUILDMACHINE: TF215076: Team Foundation Build on computer BUILDMACHINE (port 9191) is not responding. (Detail Message: The request was aborted: The operation has timed out.) 11/04/2010 2:10:10 AM TF224003: An exception occurred on the build computer BUILDMACHINE: The build (vstfs:///Build/Build/2632) has already completed and cannot be started again.. TF215085: An error occurred while connecting to agent \yyyyy\BA_WKSTFSBUILD: Team Foundation services are not available from server srvtfs. Technical information (for administrator): The operation has timed out It looks to me like some kind of communication error, maybe the port gets over loaded - can this happen? Should I spread the builds out a bit more? In the build definition it says "Queue the build on the default build agent at", so I figured if I scheduled them to start at the same time they would be queued and occur sequentially. Most of the suggestions I've found online for these errors are for all or nothing scenarios where no builds work at all whereas my problem is most build but some consistently do not. Judging by the dates of the last successful builds of these 6 failing builds I believe it is the same 6 failing every night. (I'm editing the build definitions now to keep the failed builds so I can get some more info on the problem) Any help on this would be much appreciated. James.

    Read the article

  • Algorithm design, "randomising" timetable schedule in Python although open to other languages.

    - by S1syphus
    Before I start I should add I am a musician and not a native programmer, this was undertook to make my life easier. Here is the situation, at work I'm given a new csv file each which contains a list of sound files, their length, and the minimum total amount of time they must be played. I create a playlist of exactly 60 minutes, from this excel file. Each sample played the by the minimum number of instances, but spread out from each other; so there will never be a period where for where one sound is played twice in a row or in close proximity to itself. Secondly, if the minimum instances of each song has been used, and there is still time with in the 60 min, it needs to fill the remaining time using sounds till 60 minutes is reached, while adhering to above. The smallest duration possible is 15 seconds, and then multiples of 15 seconds. Here is what I came up with in python and the problems I'm having with it, and as one user said its buggy due to the random library used in it. So I'm guessing a total rethink is on the table, here is where I need your help. Whats is the best way to solve the issue, I have had a brief look at things like knapsack and bin packing algorithms, while both are relevant neither are appropriate and maybe a bit beyond me.

    Read the article

  • HTML 4, HTML 5, XHTML, MIME types - the definitive resource

    - by deceze
    The topics of HTML vs. XHTML and XHTML as text/html vs. XHTML as XHTML are quite complex. Unfortunately it's hard to get a complete picture, since information is spread mostly in bits and pieces around the web or is buried deep in W3C tech jargon. In addition there's some misinformation being circulated. I propose to make this the definite SO resource about the topic, describing the most important aspects of: HTML 4 HTML 5 XHTML 1.0/1.1 as text/html XHTML 1.0/1.1 as XHTML What are the practical implications of each? What are common pitfalls? What is the importance of proper MIME types for each? How do different browsers handle them? I'd like to see one answer per technology. I'm making this a community wiki, so rather than contributing redundant answers, please edit answers to complete the picture. Feel free to start with stubs. Also feel free to edit this question.

    Read the article

  • HTML 4, HTML 5, XHTML, MIME types - the definite resource

    - by deceze
    The topics of HTML vs. XHTML and XHTML as text/html vs. XHTML as XHTML are quite complex. Unfortunately it's hard to get a complete picture, since information is spread mostly in bits and pieces around the web or is buried deep in W3C tech jargon. In addition there's some misinformation being circulated. I propose to make this the definite SO resource about the topic, describing the most important aspects of: HTML 4 HTML 5 XHTML 1.0/1.1 as text/html XHTML 1.0/1.1 as XHTML What are the practical implications of each? What are common pitfalls? What is the importance of proper MIME types for each? How do different browsers handle them? I'd like to see one answer per technology. I'm making this a community wiki, so rather than contributing redundant answers, please edit answers to complete the picture. Feel free to start with stubs. Also feel free to edit this question.

    Read the article

  • C# using namespace directive in nested namespaces

    - by MoSlo
    Right, I've usually used 'using' directives as follows using System; using System.Collections.Generic; using System.Linq; using System.Text; namespace AwesomeLib { //awesome award winning class declarations making use of Linq } i've recently seen examples of such as using System; using System.Collections.Generic; using System.Linq; using System.Text; namespace AwesomeLib { //awesome award winning class declarations making use of Linq namespace DataLibrary { using System.Data; //Data access layers and whatnot } } Granted, i understand that i can put USING inside of my namespace declaration. Such a thing makes sense to me if your namespaces are in the same root (they organized). System; namespace 1 {} namespace 2 { System.data; } But what of nested namespaces? Personally, I would leave all USING declarations at the top where you can find them easily. Instead, it looks like they're being spread all over the source file. Is there benefit to the USING directives being used this way in nested namespaces? Such as memory management or the JIT compiler?

    Read the article

< Previous Page | 33 34 35 36 37 38 39 40 41 42 43 44  | Next Page >