The Importance of Fully Specifying a Problem
- by Alan
I had a customer call this week where we were provided a forced crashdump and asked to determine why the system was hung.
Normally when you are looking at a hung system, you will find a lot of threads blocked on various locks, and most likely very little actually running on the system (unless it's threads spinning on busy wait type locks).
This vmcore showed none of that. In fact we were seeing hundreds of threads actively on cpu in the second before the dump was forced.
This prompted the question back to the customer:
What exactly were you seeing that made you believe that the system was hung?
It took a few days to get a response, but the response that I got back was that they were not able to ssh into the system and when they tried to login to the console, they got the login prompt, but after typing "root" and hitting return, the console was no longer responsive.
This description puts a whole new light on the "hang". You immediately start thinking "name services".
Looking at the crashdump, yes the sshds are all in door calls to nscd, and nscd is idle waiting on responses from the network.
Looking at the connections I see a lot of connections to the secure ldap port in CLOSE_WAIT, but more interestingly I am seeing a few connections over the non-secure ldap port to a different LDAP server just sitting open.
My feeling at this point is that we have an either non-responding LDAP server, or one that is responding slowly, the resolution being to investigate that server.
Moral
When you log a service ticket for a "system hang", it's great to get the forced crashdump first up, but it's even better to get a description of what you observed to make to believe that the system was hung.