How do I troubleshoot the root cause of a hung Windows (2003) server?
- by GregW
I have a pair of Windows (2003 Server) servers, both running MS SQL Server (2008 EE), that each hang every few months. This has been occurring intermittently :( for the past 15 months, pretty much since we started using the servers.
The symptoms are as follows:
I cannot remote desktop in to troubleshoot; when I attempt to, I get stuck on a blank black screen and am never offered a login prompt.
I can still ping the servers.
I can still open a SQL connection to the server, and, CURIOUSLY/BIZARRELY, when I do a "select getdate()", the time it returns appears to be stuck on the exact fraction of a second when (I presume) the server hung. Repeated "select getdate()" calls keep returning that same timestamp, suggesting that the clock is frozen.
Attempts to connect to a file share on the hung server fail with the error message: "\\ServerName is not accessible. You might not have permissions to use this network resource. Contact the administrator of this server to find out if you have access permissions. The server's clock is not synchronized with the primary domain controller's clock." This is consistent with a frozen clock.
Post-reboot, when I check the Windows Event Viewer logs, I can see many security log entries (from me and others) that I recognize as login attempts made during the "down" period, but all of them carry the same timestamp: the moment the server hung. This also suggests the clock is frozen. There is no clear cause in the Application or System event logs.
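To make the "frozen clock" check from the getdate() symptom concrete, here is a minimal sketch in Python. It is only an illustration: the function name `clock_is_frozen` and the sample timestamps are hypothetical, and in practice each server reading would come from running "select getdate()" over the SQL connection that still works during a hang.

```python
from datetime import datetime, timedelta


def clock_is_frozen(samples, min_elapsed=timedelta(seconds=5)):
    """Decide whether the server clock looks frozen.

    samples: list of (client_time, server_time) datetime pairs, in the
    order they were taken. client_time is read on the machine running
    the check; server_time is what the server reported (for example the
    result of "select getdate()").

    Returns True when the client clock advanced by at least min_elapsed
    while every server reading stayed identical.
    """
    if len(samples) < 2:
        return False  # a single reading cannot show a stall
    client_elapsed = samples[-1][0] - samples[0][0]
    distinct_server_times = {server_time for _, server_time in samples}
    return client_elapsed >= min_elapsed and len(distinct_server_times) == 1


# Hypothetical readings: the client moves forward 10 seconds,
# but the server keeps returning the same instant.
stuck = datetime(2010, 1, 1, 11, 59, 3, 123000)
hung_samples = [
    (datetime(2010, 1, 1, 12, 0, 0), stuck),
    (datetime(2010, 1, 1, 12, 0, 10), stuck),
]

# Hypothetical readings from a healthy server: both clocks advance.
healthy_samples = [
    (datetime(2010, 1, 1, 12, 0, 0), datetime(2010, 1, 1, 12, 0, 0)),
    (datetime(2010, 1, 1, 12, 0, 10), datetime(2010, 1, 1, 12, 0, 10)),
]

print(clock_is_frozen(hung_samples))     # True
print(clock_is_frozen(healthy_samples))  # False
```

A check like this, run on a schedule from another machine, could at least record the moment the clock stops, which would narrow down where to look in the logs afterwards.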
I have a local Admin account on the server and am in the process of getting a domain-credentialed Admin account for better remote admin access.
HP is supposed to be supporting these machines and has some low-level iLO 2 access, but they seem incapable of finding the root cause.
A reboot will "fix" the problem, but I would like to get to the root cause and solve the issue. Has anyone ever seen something like this odd clock behavior?! (If it were just one server I'd perhaps say a bad hardware clock, but two?) Can anyone advise me on what I should try in order to troubleshoot this sort of situation and find the root cause (or what I should tell HP to try)?