Hello,
We have a CF8 app that runs for 20-25 minutes before crashing under heavy load (~1200 users). The load is generated by our load testing tool: 1200 users ramped up over 5 minutes (approximating the behavior of our users), then running for an hour.
We have this app on Solaris 10, Apache 2, JRun 4 and Oracle 10g. Java version is 1.6.
During the initial load tests, the thread dumps showed monitor deadlocks involving sessions:
"jrpp-173":
waiting to lock monitor 0x019fdc60 (object 0x6b893530, a java.util.Hashtable),
which is held by "scheduler-1"
"scheduler-1":
waiting to lock monitor 0x026c3ce0 (object 0x6abe2f20, a coldfusion.monitor.memory.SessionMemoryMonitor$TopMemoryUsedSessions),
which is held by "jrpp-167"
"jrpp-167":
waiting to lock monitor 0x019fdc60 (object 0x6b893530, a java.util.Hashtable),
which is held by "scheduler-1"
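For anyone wanting to check for this kind of cycle programmatically instead of reading full thread dumps, here is a minimal sketch using the standard java.lang.management API (available in Java 1.6). The class name DeadlockCheck and its method are made up for illustration, and the code only inspects the JVM it runs in, so it would have to run inside the CF/JRun JVM (an assumption about how it would be deployed) to see the jrpp and scheduler threads.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of ColdFusion or JRun.
class DeadlockCheck {

    // Returns one line per thread that is deadlocked on an object monitor,
    // or an empty array when the JVM reports no monitor deadlock.
    static String[] describeDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findMonitorDeadlockedThreads(); // null when there is no deadlock
        if (ids == null) {
            return new String[0];
        }
        ThreadInfo[] infos = mx.getThreadInfo(ids);
        String[] lines = new String[infos.length];
        for (int i = 0; i < infos.length; i++) {
            ThreadInfo t = infos[i];
            if (t == null) {
                // The thread ended between the two calls.
                lines[i] = "thread " + ids[i] + " is no longer alive";
            } else {
                lines[i] = "\"" + t.getThreadName() + "\" waiting to lock "
                        + t.getLockName() + ", held by \"" + t.getLockOwnerName() + "\"";
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String[] lines = describeDeadlocks();
        System.out.println("deadlocked threads: " + lines.length);
        for (String line : lines) {
            System.out.println(line);
        }
    }
}
```

The output mirrors the "waiting to lock ... which is held by ..." summary that the JVM prints at the end of a thread dump.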
We increased the number of simultaneous threads relative to the number of CPUs (48 simultaneous threads against 32 CPUs), and the deadlock went away. While varying the simultaneous threads helped a little with response time, the CF server still tanked after 20-25 minutes in all of these tests.
We ran more thread dumps and saw threads holding a lock on a monitor, for example:
"jrpp-475" prio=3 tid=0x02230800 nid=0x2c5 runnable [0x4397d000]
java.lang.Thread.State: RUNNABLE
at java.util.HashMap.getEntry(HashMap.java:347)
at java.util.HashMap.containsKey(HashMap.java:335)
at java.util.HashSet.contains(HashSet.java:184)
at coldfusion.monitor.memory.MemoryTracker.onAddObject(MemoryTracker.java:124)
at coldfusion.monitor.memory.MemoryTrackerProxy.onReplaceValue(MemoryTrackerProxy.java:598)
at coldfusion.monitor.memory.MemoryTrackerProxy.onPut(MemoryTrackerProxy.java:510)
at coldfusion.util.CaseInsensitiveMap.put(CaseInsensitiveMap.java:250)
at coldfusion.util.FastHashtable.put(FastHashtable.java:43)
- locked <0x6f7e1a78> (a coldfusion.runtime.Struct)
at coldfusion.runtime.CfJspPage._arrayset(CfJspPage.java:1027)
at coldfusion.runtime.CfJspPage._arraySetAt(CfJspPage.java:2117)
at cfvalidation2ecfc1052964961$funcSETUSERAUDITDATA.runFunction(/app/docs/apply/cfcs/validation.cfc:377)
As you can see in the last line above, the stack traces referenced several CFMs and CFCs, and the referenced lines have "cflock" tags, which were scoped to the "application". We (the dev team) then changed them to use a "name" instead.
After more load tests, there is no locking going on and there are no deadlocks, but now the application tanks in 7-10 minutes.
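To illustrate why the scope change matters: a lock scoped to the application serializes every block that uses it, while named locks only serialize blocks that share the same name. Below is a rough Java analogy of named locking. It is not how ColdFusion implements cflock internally, and the class NamedLocks and its methods are made up for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical analogy for name-based locking; not ColdFusion code.
class NamedLocks {
    private final ConcurrentMap<String, ReentrantLock> locks =
            new ConcurrentHashMap<String, ReentrantLock>();

    // Returns the single lock object associated with a name,
    // creating it on first use.
    ReentrantLock lockFor(String name) {
        ReentrantLock lock = locks.get(name);
        if (lock == null) {
            ReentrantLock fresh = new ReentrantLock();
            ReentrantLock existing = locks.putIfAbsent(name, fresh);
            lock = (existing != null) ? existing : fresh;
        }
        return lock;
    }

    // Runs the body while holding only the lock for this name, so work
    // under a different name is not blocked by it.
    void withLock(String name, Runnable body) {
        ReentrantLock lock = lockFor(name);
        lock.lock();
        try {
            body.run();
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        NamedLocks named = new NamedLocks();
        named.withLock("userAudit", new Runnable() {
            public void run() {
                System.out.println("holding only the userAudit lock");
            }
        });
    }
}
```

With a single application-wide lock, every caller would contend on one lock object; here two callers contend only when they pass the same name.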
We've gotten system, network and DB reports from the respective admins, and none of those resources are being taxed; we even watched the server stats with the server monitor, top and prstat, and ran sar reports. So we believe the issue is with the CF server or maybe the JVM.
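If the JVM is the suspect, heap usage and garbage collection activity are the usual numbers to watch over the course of a test. Here is a minimal sketch of reading them through the standard java.lang.management API. The class name JvmSnapshot is made up, and the code reports on the JVM it runs in, so it would need to run inside the CF/JRun JVM (or the same beans would need to be read over remote JMX) to say anything about the server.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper for sampling JVM health; not part of ColdFusion or JRun.
class JvmSnapshot {

    // Bytes of heap currently in use.
    static long heapUsedBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    // Maximum heap size in bytes, or -1 if the JVM does not define one.
    static long heapMaxBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getMax();
    }

    // Total collections across all collectors since JVM start.
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 when undefined
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    // Approximate total time spent in collections, in milliseconds.
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when undefined
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("heap used (bytes): " + heapUsedBytes());
        System.out.println("heap max (bytes):  " + heapMaxBytes());
        System.out.println("GC count:          " + totalGcCount());
        System.out.println("GC time (ms):      " + totalGcTimeMillis());
    }
}
```

Sampling these values every few seconds during a run would show whether heap use climbs toward the maximum, or GC time grows sharply, in the minutes before the server stops responding.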
I am running out of ideas as to what else we can try. Disclaimer: I am not a CF developer or admin. I am just running the load tests, analyzing the reports, thread dumps, etc., sharing the results with the dev and admin teams, trying the next change, and so on. So far no dice.
Has anyone run into something similar? How did you go about diagnosing and troubleshooting? All thoughts and pointers welcome.
Thank you for your time!
KM