Not sure if this would be better suited for ServerFault, but since I am not an admin but a developer I figured I would try SO.
We've been struggling to keep our multi-server configuration stable for quite some time now. At the end of last month we were running under CF 7.0.2 on a two servers setup (one instance each). At that point we managed to get our uptime to around 1 week per instance before they would restart by themselves. Since the beginning of the month we upgraded to CF 9 and we're back to square one with multi-restart a day.
Our current configuration is 2 Win2k3 servers, running a cluster of 4 instances, 2 instances per server. At this point we are pretty certain this is due to improper JVM settings.
We've been toying with them and while some are more stable than others we never quite got it right.
From the default:
java.args=-server -Xmx512m -Dsun.io.useCanonCaches=false -XX:MaxPermSize=192m -XX:+UseParallelGC -Dcoldfusion.rootDir={application.home}/
To currently:
java.args=-server -Xmx896m -Dsun.io.useCanonCaches=false -XX:MaxPermSize=512m -XX:SurvivorRatio=8 -XX:TargetSurvivorRatio=90 -XX:+UseParallelGC -Dcoldfusion.rootDir={application.home}/ -verbose:gc -Xloggc:c:/Jrun4/logs/gc/gcInstance1b.log
We have determined that we do need more than the default 512MB simply by monitoring with FusionReactor, on average our amount of memory consumed is hovering in the mid 300MB and can go up to low 700MB under heavy load.
Most of the crash will be logged in jrun4/bin/hs_err_pid*.log always an "Out of swap space"
I've attached links to the hs_err and garbage collector log file from yesterday at the bottom of the post.
The relevant part is (I think) this:
Heap
PSYoungGen total 89856K, used 19025K [0x55490000, 0x5b6f0000, 0x5b810000)
eden space 79232K, 16% used [0x55490000,0x561a64c0,0x5a1f0000)
from space 10624K, 52% used [0x5ac90000,0x5b20e2f8,0x5b6f0000)
to space 10752K, 0% used [0x5a1f0000,0x5a1f0000,0x5ac70000)
PSOldGen total 460416K, used 308422K [0x23810000, 0x3f9b0000, 0x55490000)
object space 460416K, 66% used [0x23810000,0x36541bb8,0x3f9b0000)
PSPermGen total 107520K, used 106079K [0x03810000, 0x0a110000, 0x23810000)
object space 107520K, 98% used [0x03810000,0x09fa7e40,0x0a110000)
From it, I gather that its the PSPermGen that is full (most logs will show the same before a crash), which is why we increased MaxPermSize but the total still show as 107520K!??!
No one here is a jRun expert, so any help or even ideas on what to try next would be greatly appreciated!!
The log files:
Sorry I know sendspace isn't the friendliest of places - if you have other host suggestion for log files let me know and I'll update the post (SO doesn't like them inline, it blows up the format of the post).
The hs_err log file: http://www.sendspace.com/file/fgak8l
The gc log: http://www.sendspace.com/file/w0r2ct