I'd like to debug an issue I'm having with a Linux (Debian stable) server, but I'm running out of ideas for how to confirm any diagnosis.
Some background: the servers are DL160-class boxes with hardware RAID across two disks. They run a lot of services, mostly network- and CPU-bound. There are 8 CPUs, and the 7 "main" (most CPU-hungry) processes are bound to one core each via CPU affinity; other random background scripts aren't pinned anywhere. The filesystem writes ~1.5k blocks/s the whole time (going above 2k/s at peak times). Normal CPU usage for these servers is ~60% on 7 cores and some minimal usage on the last one (usually whatever is running in shells).
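For reference, this is roughly how the pinning is done and where the block-write figure comes from (the core number and PID below are placeholders, not our real values):

taskset -pc 2 12345     # pin one of the "main" services to a single core
taskset -pc 12345       # verify the affinity of a running process
vmstat 1                # the ~1.5k blocks/s figure is the "bo" column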
What actually happens is that at some point the "main" services start using 100% CPU, mostly stuck in kernel time. After a couple of seconds the load average goes over 400 and we lose any way to connect to the box (KVM is on its way, but not there yet). Sometimes the kernel reports a hung task (but not always):
[118951.272884] INFO: task zsh:15911 blocked for more than 120 seconds.
[118951.272955] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[118951.273037] zsh D 0000000000000000 0 15911 1
[118951.273093] ffff8101898c3c48 0000000000000046 0000000000000000 ffffffffa0155e0a
[118951.273183] ffff8101a753a080 ffff81021f1c5570 ffff8101a753a308 000000051f0fd740
[118951.273274] 0000000000000246 0000000000000000 00000000ffffffbd 0000000000000001
[118951.273335] Call Trace:
[118951.273424] [<ffffffffa0155e0a>] :ext3:__ext3_journal_dirty_metadata+0x1e/0x46
[118951.273510] [<ffffffff804294f6>] schedule_timeout+0x1e/0xad
[118951.273563] [<ffffffff8027577c>] __pagevec_free+0x21/0x2e
[118951.273613] [<ffffffff80428b0b>] wait_for_common+0xcf/0x13a
[118951.273692] [<ffffffff8022c168>] default_wake_function+0x0/0xe
....
This would point to a RAID / disk failure; however, sometimes the tasks are hung on the kernel's gettsc instead, which would indicate some general weird hardware behaviour.
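Regarding the gettsc hangs: one thing I still want to rule out (just an idea, not a confirmed cause) is the TSC clocksource itself, i.e. whether the boxes are actually running on TSC and whether the kernel has ever complained about it:

cat /sys/devices/system/clocksource/clocksource0/current_clocksource
cat /sys/devices/system/clocksource/clocksource0/available_clocksource
dmesg | grep -i -e tsc -e clocksource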
The boxes also run MySQL (almost read-only, 99% cache hit rate), which seems to spawn a lot more threads while the system is having problems. During the day it does ~200k queries/s (selects) and ~10 queries/s (writes).
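The thread observation is based on polling MySQL's status counters, roughly like this (the 5-second interval is arbitrary):

mysqladmin -u root -p -i 5 extended-status | grep -E 'Threads_(connected|created|running)'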
The host never runs out of memory or swaps, and no OOM reports have been spotted.
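(That claim is based on checks along these lines, assuming the default Debian log locations:)

grep -i -e 'out of memory' -e oom-killer /var/log/kern.log*
vmstat 1 5    # si/so stay at 0, and swap usage in "free -m" stays flat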
We've got many boxes with similar/identical hardware and they all seem to behave this way, but I'm not sure which part is failing, so it's probably not a good idea to just grab something more powerful and hope the problem goes away.
The applications themselves don't report anything wrong while they're running. I can safely run anything on the same hardware in an isolated environment. What can I do to narrow down the problem? Where else should I look for an explanation?
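In the meantime I've left a lightweight collector (sysstat tools) running on one of the affected boxes, hoping it survives long enough to show what spikes first when the lock-up starts (the paths and the 1-second interval are just my choice):

vmstat 1 >> /var/log/incident-vmstat.log &
iostat -x 1 >> /var/log/incident-iostat.log &
mpstat -P ALL 1 >> /var/log/incident-mpstat.log &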