More interrupts than CPU context switches
- by Christopher Valles
I have a machine running Debian GNU/Linux 5.0.8 (lenny) with 8 cores and 12 GB of RAM.
One core sits permanently at around 40% ~ 60% I/O wait time, and while trying to spot what is happening I noticed that we have more interrupts than CPU context switches. From what I have read, the normal ratio is roughly 10x more context switches than interrupts, but on my server the values are completely different.
backend1:~# vmstat -s
12330788 K total memory
12221676 K used memory
3668624 K active memory
6121724 K inactive memory
109112 K free memory
3929400 K buffer memory
4095536 K swap cache
4194296 K total swap
7988 K used swap
4186308 K free swap
44547459 non-nice user cpu ticks
702408 nice user cpu ticks
13346333 system cpu ticks
1607583668 idle cpu ticks
374043393 IO-wait cpu ticks
4144149 IRQ cpu ticks
3994255 softirq cpu ticks
0 stolen cpu ticks
4445557114 pages paged in
2910596714 pages paged out
128642 pages swapped in
267400 pages swapped out
3519307319 interrupts
2464686911 CPU context switches
1306744317 boot time
11555115 forks
Any ideas whether this is an issue? And if so, how can I track down the cause and fix it?
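For reference, the numbers above are cumulative counters since boot. To compare the live per-second rates instead (the in and cs columns of vmstat), and to see a per-core breakdown including %iowait and %irq, I sample like this (mpstat comes from the sysstat package):
backend1:~# vmstat 1 5
backend1:~# mpstat -P ALL 1 5
Both print five one-second samples; mpstat does it per core, which is how I narrowed the wait time down to a single core.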
Update
Following the suggestions in the comments and focusing on the core stuck in I/O wait, I checked the processes attached to that core; the list is below (a one-liner to reproduce this view follows the list):
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
24 root RT -5 0 0 0 S 0 0.0 0:03.42 7 migration/7
25 root 15 -5 0 0 0 S 0 0.0 0:04.78 7 ksoftirqd/7
26 root RT -5 0 0 0 S 0 0.0 0:00.00 7 watchdog/7
34 root 15 -5 0 0 0 S 0 0.0 1:18.90 7 events/7
83 root 15 -5 0 0 0 S 0 0.0 1:10.68 7 kblockd/7
291 root 15 -5 0 0 0 S 0 0.0 0:00.00 7 aio/7
569 root 15 -5 0 0 0 S 0 0.0 0:00.00 7 ata/7
1545 root 15 -5 0 0 0 S 0 0.0 0:00.00 7 ksnapd
1644 root 15 -5 0 0 0 S 0 0.0 0:36.73 7 kjournald
1725 root 16 -4 16940 1152 488 S 0 0.0 0:00.00 7 udevd
2342 root 20 0 8828 1140 956 S 0 0.0 0:00.00 7 sh
2375 root 20 0 8848 1220 1016 S 0 0.0 0:00.00 7 locate
2421 root 30 10 8896 1268 1016 S 0 0.0 0:00.00 7 updatedb.findut
2430 root 30 10 58272 49m 616 S 0 0.4 0:17.44 7 sort
2431 root 30 10 3792 448 360 S 0 0.0 0:00.00 7 frcode
2682 root 15 -5 0 0 0 S 0 0.0 3:25.98 7 kjournald
2683 root 15 -5 0 0 0 S 0 0.0 0:00.64 7 kjournald
2687 root 15 -5 0 0 0 S 0 0.0 1:31.30 7 kjournald
3261 root 15 -5 0 0 0 S 0 0.0 2:30.56 7 kondemand/7
3364 root 20 0 3796 596 476 S 0 0.0 0:00.00 7 acpid
3575 root 20 0 8828 1140 956 S 0 0.0 0:00.00 7 sh
3597 root 20 0 8848 1216 1016 S 0 0.0 0:00.00 7 locate
3603 root 30 10 8896 1268 1016 S 0 0.0 0:00.00 7 updatedb.findut
3612 root 30 10 58272 49m 616 S 0 0.4 0:27.04 7 sort
3655 root 20 0 11056 2852 516 S 0 0.0 5:36.46 7 redis-server
3706 root 20 0 19832 1056 816 S 0 0.0 0:01.64 7 cron
3746 root 20 0 3796 580 484 S 0 0.0 0:00.00 7 getty
3748 root 20 0 3796 580 484 S 0 0.0 0:00.00 7 getty
7674 root 20 0 28376 1000 736 S 0 0.0 0:00.00 7 cron
7675 root 20 0 8828 1140 956 S 0 0.0 0:00.00 7 sh
7708 root 30 10 58272 49m 616 S 0 0.4 0:03.36 7 sort
22049 root 20 0 8828 1136 956 S 0 0.0 0:00.00 7 sh
22095 root 20 0 8848 1220 1016 S 0 0.0 0:00.00 7 locate
22099 root 30 10 8896 1264 1016 S 0 0.0 0:00.00 7 updatedb.findut
22108 root 30 10 58272 49m 616 S 0 0.4 0:44.55 7 sort
22109 root 30 10 3792 452 360 S 0 0.0 0:00.00 7 frcode
26927 root 20 0 8828 1140 956 S 0 0.0 0:00.00 7 sh
26947 root 20 0 8848 1216 1016 S 0 0.0 0:00.00 7 locate
26951 root 30 10 8896 1268 1016 S 0 0.0 0:00.00 7 updatedb.findut
26960 root 30 10 58272 49m 616 S 0 0.4 0:10.24 7 sort
26961 root 30 10 3792 452 360 S 0 0.0 0:00.00 7 frcode
27952 root 20 0 65948 3028 2400 S 0 0.0 0:00.00 7 sshd
30731 root 20 0 0 0 0 S 0 0.0 0:01.34 7 pdflush
31204 root 20 0 0 0 0 S 0 0.0 0:00.24 7 pdflush
21857 deploy 20 0 1227m 2240 868 S 0 0.0 2:44.22 7 nginx
21858 deploy 20 0 1228m 2784 868 S 0 0.0 2:42.45 7 nginx
21862 deploy 20 0 1228m 2732 868 S 0 0.0 2:43.90 7 nginx
21869 deploy 20 0 1228m 2840 868 S 0 0.0 2:44.14 7 nginx
27994 deploy 20 0 19372 2216 1380 S 0 0.0 0:00.00 7 bash
28493 deploy 20 0 331m 32m 16m S 4 0.3 0:00.40 7 apache2
21856 deploy 20 0 1228m 2844 868 S 0 0.0 2:43.64 7 nginx
3622 nobody 30 10 21156 10m 916 D 0 0.1 4:42.31 7 find
7716 nobody 30 10 12268 1280 888 D 0 0.0 0:43.50 7 find
22116 nobody 30 10 12612 1696 916 D 0 0.0 6:32.26 7 find
26968 nobody 30 10 12268 1284 888 D 0 0.0 1:56.92 7 find
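In case it is useful, the per-core view can be reproduced by enabling top's last-used-CPU column (press f and select P), or non-interactively with ps, filtering on the psr column (the processor the task last ran on; core 7 is hard-coded here):
backend1:~# ps -eo pid,user,psr,comm | awk '$3 == 7'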
Update
As suggested, I took a look at /proc/interrupts; here is the info:
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
0: 35 0 0 1469085485 0 0 0 0 IO-APIC-edge timer
1: 0 0 0 8 0 0 0 0 IO-APIC-edge i8042
8: 0 0 0 1 0 0 0 0 IO-APIC-edge rtc0
9: 0 0 0 0 0 0 0 0 IO-APIC-fasteoi acpi
12: 0 0 0 105 0 0 0 0 IO-APIC-edge i8042
16: 0 0 0 0 0 0 0 580212114 IO-APIC-fasteoi 3w-9xxx, uhci_hcd:usb1
18: 0 0 142 0 0 0 0 0 IO-APIC-fasteoi uhci_hcd:usb6, ehci_hcd:usb7
19: 9 0 0 0 0 0 0 0 IO-APIC-fasteoi uhci_hcd:usb3, uhci_hcd:usb5
21: 0 0 0 0 0 0 0 0 IO-APIC-fasteoi uhci_hcd:usb2
23: 0 0 0 0 0 0 0 0 IO-APIC-fasteoi uhci_hcd:usb4, ehci_hcd:usb8
1273: 0 0 1600400502 0 0 0 0 0 PCI-MSI-edge eth0
1274: 0 0 0 0 0 0 0 0 PCI-MSI-edge ahci
NMI: 0 0 0 0 0 0 0 0 Non-maskable interrupts
LOC: 214252181 69439018 317298553 21943690 72562482 56448835 137923978 407514738 Local timer interrupts
RES: 27516446 16935944 26430972 44957009 24935543 19881887 57746906 24298747 Rescheduling interrupts
CAL: 10655 10705 10685 10567 10689 10669 10667 396 function call interrupts
TLB: 529548 462587 801138 596193 922202 747313 2027966 946594 TLB shootdowns
TRM: 0 0 0 0 0 0 0 0 Thermal event interrupts
THR: 0 0 0 0 0 0 0 0 Threshold APIC interrupts
SPU: 0 0 0 0 0 0 0 0 Spurious interrupts
ERR: 0
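Since these counters are also cumulative since boot, I watch the deltas to see which IRQ is actually firing right now (1-second refresh):
backend1:~# watch -n1 cat /proc/interrupts
or, tracking just the suspicious line for IRQ 16:
backend1:~# watch -n1 "grep '^ *16:' /proc/interrupts"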
All the values seem more or less the same across the cores, but this one, IO-APIC-fasteoi 3w-9xxx, uhci_hcd:usb1, only hits core 7 (the same core showing the 40% ~ 60% wait time). Could something attached to the USB port be causing the issue?
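Looking at that line again, IRQ 16 is shared between the 3w-9xxx driver (the 3ware RAID controller) and uhci_hcd:usb1, and all of its ~580 million interrupts landed on CPU7. One test I am considering is loosening that IRQ's CPU affinity to see whether the wait time follows it (assuming irqbalance is not running, since it would rewrite the mask):
backend1:~# cat /proc/irq/16/smp_affinity
backend1:~# echo ff > /proc/irq/16/smp_affinity
The mask is hexadecimal: given the counters above I expect the current value to be 80 (CPU7 only), while ff allows any of CPUs 0-7 to service the IRQ.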
Thanks in advance.