How to get more information from a system crash
- by viraptor
I'd like to debug an issue I'm having with a Linux (Debian stable) server, but I'm running out of ideas for how to confirm any diagnosis.
Some background: the servers are DL160-class machines with hardware RAID across two disks. They run a lot of services, mostly using the network interface and CPU. There are 8 CPUs; the 7 "main", most CPU-hungry processes are each bound to one core via CPU affinity, and other random background scripts are not pinned anywhere. The filesystem writes ~1.5k blocks/s constantly (going above 2k/s at peak times). Normal CPU usage for these servers is ~60% on 7 cores, with some minimal usage on the last one (usually whatever is running in shells).
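For reference, this is roughly how I've been confirming the affinity and write-rate figures above (the PID is just a placeholder, and iostat assumes the sysstat package is installed):

# show which core a "main" process is pinned to (placeholder PID)
taskset -pc 12345

# block writes per second - watch the "bo" column
vmstat 1

# per-disk utilisation and wait times (needs sysstat)
iostat -x 1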
What actually happens is that at some point the "main" services start using 100% CPU, mostly stuck in kernel time. After a couple of seconds the load average goes over 400 and we lose any way to connect to the box (KVM is on its way, but not there yet). Sometimes we see the kernel reporting a hung task (but not always):
[118951.272884] INFO: task zsh:15911 blocked for more than 120 seconds.
[118951.272955] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[118951.273037] zsh D 0000000000000000 0 15911 1
[118951.273093] ffff8101898c3c48 0000000000000046 0000000000000000 ffffffffa0155e0a
[118951.273183] ffff8101a753a080 ffff81021f1c5570 ffff8101a753a308 000000051f0fd740
[118951.273274] 0000000000000246 0000000000000000 00000000ffffffbd 0000000000000001
[118951.273335] Call Trace:
[118951.273424] [<ffffffffa0155e0a>] :ext3:__ext3_journal_dirty_metadata+0x1e/0x46
[118951.273510] [<ffffffff804294f6>] schedule_timeout+0x1e/0xad
[118951.273563] [<ffffffff8027577c>] __pagevec_free+0x21/0x2e
[118951.273613] [<ffffffff80428b0b>] wait_for_common+0xcf/0x13a
[118951.273692] [<ffffffff8022c168>] default_wake_function+0x0/0xe
....
This would point at a RAID/disk failure; however, sometimes the tasks are hung on the kernel's gettsc, which would indicate some generally weird hardware behaviour.
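In case it's relevant, this is how I check which clocksource the kernel is using (assuming the clocksource sysfs interface is available on this kernel):

# current and available clocksources
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
cat /sys/devices/system/clocksource/clocksource0/available_clocksource

# any TSC-related warnings from the kernel
dmesg | grep -i tsc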
It's also running MySQL (almost read-only, 99% cache hit rate), which seems to spawn a lot more threads while the system is having problems. During the day it does ~200k queries/s (selects) and ~10 queries/s (writes).
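This is roughly how I watch the MySQL thread counters when the problem starts (credentials omitted, interval arbitrary):

# thread/connection counters every 2 seconds
watch -n2 "mysqladmin extended-status | grep -iE 'Threads_(running|connected|created)'"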
The host never runs out of memory or swaps, and no OOM reports have been spotted.
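For completeness, this is how I've been checking for memory pressure (the log path assumes Debian's default kern.log location):

# memory and swap usage
free -m

# si/so columns show swap-in/swap-out activity
vmstat 1

# any OOM killer messages in the kernel log
grep -iE 'oom|out of memory' /var/log/kern.log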
We've got many boxes with similar/identical hardware and they all seem to behave this way, but I'm not sure which part is failing, so it's probably not a good idea to just grab something more powerful and hope the problem goes away.
The applications themselves don't report anything wrong while they're running, and I can run anything safely on the same hardware in an isolated environment. What can I do to narrow down the problem? Where else should I look for an explanation?