Random servers in Citrix farm suddenly bluescreen (mostly 0x0000008e and 0x0000007e)
Posted by Rasmus Rask on Server Fault
Published on 2012-12-11T02:31:47Z
Indexed on 2012/12/11 5:06 UTC
I'm responsible for a Citrix Presentation Server 4.5 farm. On Friday, 30 November, my servers started to crash randomly. So far we've experienced 80 crashes, so it's becoming an increasingly big problem for us. I have 12+ years of experience in IT, so I know the difference between 0 and 1, but I'm having a hard time cracking this.
We've rolled back any recent changes I can think of for different groups of servers, but all groups still seem to crash. I don't have the skills to interpret the memory dumps to find the culprit.
- Has anyone encountered the same or a similar problem? It might be a generic Windows issue.
- Other than executing "!analyze -v" in WinDbg, how do I work my way through the memory dumps to see what actually triggered the BSOD?
- Any suggested steps in getting to the bottom of this?
Any help is greatly appreciated. I can also provide links to kernel memory dumps or WinDbg output if necessary.
Thanks!
Problem description
The majority of the STOP errors we encounter are:
- 0x0000008e KERNEL_MODE_EXCEPTION_NOT_HANDLED (50%)
- 0x0000007e SYSTEM_THREAD_EXCEPTION_NOT_HANDLED (26%)
- 0x00000050 PAGE_FAULT_IN_NONPAGED_AREA (21%)
We also see a few 0x0000000a IRQL_NOT_LESS_OR_EQUAL (3%).
For both 0x0000008e and 0x0000007e bug checks, the exception code is 0xc0000005 (Access Violation). When I open the dump files in WinDbg, most details are identical across all the 0x0000008e bug checks, and likewise across all the 0x0000007e bug checks:
0x0000008e
- Exception address: 0x808bc9e3
- Trap frame: [varies]
- FAILURE_BUCKET_ID: 0x8E_nt!HvpGetCellMapped+97
- Probably Caused by (IMAGE_NAME): ntkrpamp.exe
0x0000007e
- Exception address: 0x808369b6
- Exception record address: 0xf70d3be0
- Context record address: 0xf70d38dc
- FAILURE_BUCKET_ID: 0x7E_nt!MmPurgeSection+14
- Probably Caused by: memory_corruption
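With 80 dumps, comparing the WinDbg output by hand gets tedious. The following is a minimal Python sketch of how the breakdown above could be tallied, assuming the "!analyze -v" output of each dump has been saved as text. The function names are hypothetical, and only the FAILURE_BUCKET_ID and IMAGE_NAME fields quoted above are parsed.

```python
import re
from collections import Counter

# Matches lines such as "FAILURE_BUCKET_ID:  0x8E_nt!HvpGetCellMapped+97".
# Only the two field names quoted in the post are recognized.
FIELD_RE = re.compile(r"^(FAILURE_BUCKET_ID|IMAGE_NAME):\s*(\S+)", re.MULTILINE)


def summarize_analysis(text):
    """Return the fields of interest found in one saved analysis output."""
    return {name: value for name, value in FIELD_RE.findall(text)}


def tally_buckets(outputs):
    """Count how many analysis outputs fall into each failure bucket."""
    counts = Counter()
    for text in outputs:
        bucket = summarize_analysis(text).get("FAILURE_BUCKET_ID", "unknown")
        counts[bucket] += 1
    return counts


def bucket_shares(counts):
    """Convert bucket counts into rounded percentages of all dumps."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bucket: round(100 * n / total) for bucket, n in counts.items()}
```

Feeding it one string per dump, for example read from a folder of saved log files, gives a per-bucket count. A bucket that dominates across otherwise unrelated servers points at a common cause rather than at one bad machine.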
About 30% of the crashes happen between 17:00 and 19:00, which leads me to believe this tends to happen more often during logoffs. But then again, only ~15% occur between 15:00 and 17:00.
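One way to test the logoff theory is to compare crashes per hour against logoffs per hour, rather than looking at raw crash counts. The sketch below is a rough Python illustration. It assumes the crash times are available as datetime values and that logoff counts per hour can be pulled from some session log. Both inputs and all function names are hypothetical.

```python
from collections import Counter
from datetime import datetime


def crashes_per_hour(timestamps):
    """Count crashes by hour of day (0-23)."""
    return Counter(ts.hour for ts in timestamps)


def share_in_window(timestamps, start_hour, end_hour):
    """Fraction of crashes with start_hour <= hour < end_hour."""
    if not timestamps:
        return 0.0
    hits = sum(1 for ts in timestamps if start_hour <= ts.hour < end_hour)
    return hits / len(timestamps)


def crashes_per_logoff(crash_counts, logoff_counts):
    """Crashes divided by logoffs, for each hour with at least one logoff."""
    return {
        hour: crash_counts.get(hour, 0) / logoffs
        for hour, logoffs in logoff_counts.items()
        if logoffs > 0
    }


if __name__ == "__main__":
    # Made-up timestamps purely for illustration, not data from the farm.
    sample = [
        datetime(2012, 12, 3, 17, 30),
        datetime(2012, 12, 3, 18, 5),
        datetime(2012, 12, 4, 10, 0),
        datetime(2012, 12, 4, 16, 0),
    ]
    print(share_in_window(sample, 17, 19))
    print(crashes_per_logoff(crashes_per_hour(sample), {16: 10, 17: 4}))
```

If the crashes-per-logoff ratio is roughly flat across the day, logoffs are a plausible trigger. If it spikes only in the evening, something else that runs at that time is more likely.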
Summary of farm
- Citrix Presentation Server 4.5 R06 on Windows Server 2003 R2 SP2
- All high-priority patches installed, at least as of October
- Virtualized using VMware ESX/vSphere 4.1 on HP ProLiant BL460c G6 blade servers
- About 53 Presentation Servers in production, divided into three silos - only one of which, the largest, is affected
- 2 vCPUs (5 GHz reserved), 8 GB RAM (all reserved) for each Presentation Server
- Plenty of free disk space
- Very few printer drivers - automated deletion of non-approved drivers every night
- ~1,000 peak concurrent users, reached at around 10:30 (on weekdays)
- The number of sessions steadily declines between 15:00 and 19:00 to ~230
© Server Fault or respective owner