IRQ problem with 2.6.32/2.6.39 kernel on Debian Squeeze x86_64
- by MasterM
I recently assembled a new computer, so all of the hardware is fairly new. Since then I've been having IRQ problems when running Debian 6.0. At random times, usually after an hour or so of uptime, I hear a beep and this shows up in dmesg:
[ 3537.762795] irq 16: nobody cared (try booting with the "irqpoll" option)
[ 3537.762797] Pid: 0, comm: swapper Tainted: P W O 2.6.39-2-amd64 #1
[ 3537.762798] Call Trace:
[ 3537.762799] <IRQ> [<ffffffff810924d4>] ? __report_bad_irq+0x3a/0xa2
[ 3537.762803] [<ffffffff810926a4>] ? note_interrupt+0x168/0x1da
[ 3537.762805] [<ffffffff81090dd4>] ? handle_irq_event_percpu+0x171/0x18f
[ 3537.762807] [<ffffffff8100e0e2>] ? read_tsc+0x5/0x16
[ 3537.762809] [<ffffffff8106b8a2>] ? update_ts_time_stats+0x32/0x6b
[ 3537.762810] [<ffffffff81090e26>] ? handle_irq_event+0x34/0x52
[ 3537.762812] [<ffffffff81063fb7>] ? sched_clock_idle_wakeup_event+0x12/0x1c
[ 3537.762813] [<ffffffff81092df2>] ? handle_fasteoi_irq+0x82/0xa4
[ 3537.762815] [<ffffffff8100aadb>] ? handle_irq+0x1a/0x23
[ 3537.762816] [<ffffffff8100a384>] ? do_IRQ+0x45/0xaa
[ 3537.762818] [<ffffffff81332c93>] ? common_interrupt+0x13/0x13
[ 3537.762818] <EOI> [<ffffffff81332c8e>] ? common_interrupt+0xe/0x13
[ 3537.762821] [<ffffffff81026800>] ? native_safe_halt+0x2/0x3
[ 3537.762829] [<ffffffffa016ed58>] ? acpi_idle_do_entry+0x39/0x62 [processor]
[ 3537.762831] [<ffffffffa016edde>] ? acpi_idle_enter_c1+0x5d/0xad [processor]
[ 3537.762834] [<ffffffff81261033>] ? cpuidle_idle_call+0x11f/0x1cc
[ 3537.762835] [<ffffffff81008dd2>] ? cpu_idle+0xab/0xe1
[ 3537.762837] [<ffffffff8169fc60>] ? start_kernel+0x3e0/0x3eb
[ 3537.762838] [<ffffffff8169f3c8>] ? x86_64_start_kernel+0x102/0x10f
[ 3537.762839] handlers:
[ 3537.762840] [<ffffffffa0358d5a>] (rtl8169_interrupt+0x0/0x2d7 [r8169])
[ 3537.762842] [<ffffffffa08ff2ca>] (nv_kern_isr+0x0/0x54 [nvidia])
[ 3537.762902] Disabling IRQ #16
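To make the key facts in that trace easier to spot, here is a rough Python sketch that pulls the IRQ number and the registered handlers out of a "nobody cared" report. It is not part of any existing tool: the function name `unhandled_irq_report` and the regular expressions are my own, and they assume the message layout shown above (other kernel versions may format it differently).

```python
import re

# Patterns assume the 2.6.39-style report shown above (an assumption, not a kernel API).
IRQ_RE = re.compile(r"irq (\d+): nobody cared")
HANDLER_RE = re.compile(r"\((\w+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[(\w+)\]\)")

def unhandled_irq_report(log):
    """Return (irq, [(handler_function, module), ...]) for the last report, or None."""
    irq = None
    handlers = []
    in_handlers = False
    for line in log.splitlines():
        m = IRQ_RE.search(line)
        if m:
            # A new report starts: remember the IRQ and reset the handler list.
            irq = int(m.group(1))
            handlers = []
            in_handlers = False
            continue
        if irq is None:
            continue
        if line.rstrip().endswith("handlers:"):
            in_handlers = True
            continue
        if in_handlers:
            h = HANDLER_RE.search(line)
            if h:
                handlers.append((h.group(1), h.group(2)))
            else:
                # First non-handler line (e.g. "Disabling IRQ #16") ends the list.
                in_handlers = False
    return None if irq is None else (irq, handlers)

SAMPLE = """\
[ 3537.762795] irq 16: nobody cared (try booting with the "irqpoll" option)
[ 3537.762839] handlers:
[ 3537.762840] [<ffffffffa0358d5a>] (rtl8169_interrupt+0x0/0x2d7 [r8169])
[ 3537.762842] [<ffffffffa08ff2ca>] (nv_kern_isr+0x0/0x54 [nvidia])
[ 3537.762902] Disabling IRQ #16
"""

print(unhandled_irq_report(SAMPLE))
```

For the trace above this picks out IRQ 16 with two handlers, one from the r8169 module and one from the nvidia module, which are the two drivers sharing that line.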
After IRQ 16 is disabled, Xorg either hogs the CPU or becomes unstable (up to hanging the system completely). When I restart Xorg everything is fine again, and the problem doesn't occur again until after the next reboot.
I tried upgrading the kernel from the stock 2.6.32 to 2.6.39 from the unstable repository, but that didn't help. Booting with the irqpoll option only seems to lengthen the time before the problem first occurs.
I'm using the latest NVIDIA drivers and the Realtek firmware from the firmware-realtek package. I have two GTX 560 Ti cards running in SLI. Neither disabling SLI nor removing one card completely solves the problem.
Output of uname -a is:
Linux whitestar 2.6.39-2-amd64 #1 SMP Wed Jun 8 11:01:04 UTC 2011 x86_64 GNU/Linux
Output of lspci is:
00:00.0 Host bridge: Intel Corporation Sandy Bridge DRAM Controller (rev 09)
00:01.0 PCI bridge: Intel Corporation Sandy Bridge PCI Express Root Port (rev 09)
00:01.1 PCI bridge: Intel Corporation Sandy Bridge PCI Express Root Port (rev 09)
00:16.0 Communication controller: Intel Corporation Cougar Point HECI Controller #1 (rev 04)
00:19.0 Ethernet controller: Intel Corporation 82579V Gigabit Network Connection (rev 05)
00:1a.0 USB Controller: Intel Corporation Cougar Point USB Enhanced Host Controller #2 (rev 05)
00:1b.0 Audio device: Intel Corporation Cougar Point High Definition Audio Controller (rev 05)
00:1c.0 PCI bridge: Intel Corporation Cougar Point PCI Express Root Port 1 (rev b5)
00:1c.1 PCI bridge: Intel Corporation Cougar Point PCI Express Root Port 2 (rev b5)
00:1c.2 PCI bridge: Intel Corporation Cougar Point PCI Express Root Port 3 (rev b5)
00:1c.4 PCI bridge: Intel Corporation Cougar Point PCI Express Root Port 5 (rev b5)
00:1c.6 PCI bridge: Intel Corporation 82801 PCI Bridge (rev b5)
00:1d.0 USB Controller: Intel Corporation Cougar Point USB Enhanced Host Controller #1 (rev 05)
00:1f.0 ISA bridge: Intel Corporation Cougar Point LPC Controller (rev 05)
00:1f.2 SATA controller: Intel Corporation Cougar Point 6 port SATA AHCI Controller (rev 05)
00:1f.3 SMBus: Intel Corporation Cougar Point SMBus Controller (rev 05)
01:00.0 VGA compatible controller: nVidia Corporation Device 1200 (rev a1)
01:00.1 Audio device: nVidia Corporation Device 0e0c (rev a1)
02:00.0 VGA compatible controller: nVidia Corporation Device 1200 (rev a1)
02:00.1 Audio device: nVidia Corporation Device 0e0c (rev a1)
04:00.0 USB Controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev 04)
06:00.0 USB Controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev 04)
07:00.0 PCI bridge: Device 1b21:1080 (rev 01)
08:02.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8110SC/8169SC Gigabit Ethernet (rev 10)
08:03.0 FireWire (IEEE 1394): VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller (rev c0)
Contents of /proc/interrupts:
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
0: 77 0 0 0 0 0 0 0 IO-APIC-edge timer
1: 2 0 0 0 0 0 0 0 IO-APIC-edge i8042
8: 1 0 0 0 0 0 0 0 IO-APIC-edge rtc0
9: 0 0 0 0 0 0 0 0 IO-APIC-fasteoi acpi
12: 4 0 0 0 0 0 0 0 IO-APIC-edge i8042
16: 699083 0 0 0 0 0 0 0 IO-APIC-fasteoi nvidia, eth0
17: 87810 0 0 0 0 0 0 0 IO-APIC-fasteoi firewire_ohci, hda_intel, nvidia
18: 242 0 0 0 0 0 0 0 IO-APIC-fasteoi hda_intel
23: 85925 0 0 0 0 0 0 0 IO-APIC-fasteoi ehci_hcd:usb5, ehci_hcd:usb6
40: 0 0 0 0 0 0 0 0 PCI-MSI-edge PCIe PME
41: 0 0 0 0 0 0 0 0 PCI-MSI-edge PCIe PME
42: 0 0 0 0 0 0 0 0 PCI-MSI-edge PCIe PME
43: 0 0 0 0 0 0 0 0 PCI-MSI-edge PCIe PME
44: 0 0 0 0 0 0 0 0 PCI-MSI-edge PCIe PME
45: 0 0 0 0 0 0 0 0 PCI-MSI-edge PCIe PME
46: 79853 0 0 0 0 0 0 0 PCI-MSI-edge ahci
48: 1 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
49: 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
50: 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
51: 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
52: 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
53: 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
54: 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
55: 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
56: 1 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
57: 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
58: 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
59: 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
60: 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
61: 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
62: 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
63: 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
64: 173506 0 0 0 0 0 0 0 PCI-MSI-edge hda_intel
NMI: 482 89 25 13 277 24 11 10 Non-maskable interrupts
LOC: 783857 194752 114133 70577 372438 179065 117179 162016 Local timer interrupts
SPU: 0 0 0 0 0 0 0 0 Spurious interrupts
PMI: 482 89 25 13 277 24 11 10 Performance monitoring interrupts
IWI: 0 0 0 0 0 0 0 0 IRQ work interrupts
RES: 131917 46750 7432 3291 150003 9576 3435 3067 Rescheduling interrupts
CAL: 2759 6563 7150 6997 5387 7140 7269 6678 Function call interrupts
TLB: 4396 2038 1336 492 5434 1896 1121 606 TLB shootdowns
TRM: 0 0 0 0 0 0 0 0 Thermal event interrupts
THR: 0 0 0 0 0 0 0 0 Threshold APIC interrupts
MCE: 0 0 0 0 0 0 0 0 Machine check exceptions
MCP: 37 37 37 37 37 37 37 37 Machine check polls
ERR: 0
MIS: 0
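Since the failing line is a shared one, here is a small Python sketch that lists every numbered IRQ with more than one device attached. The function name `shared_irqs` is hypothetical, and the parsing assumes the column layout shown above (one counter per CPU, then a single controller/type field, then a comma-separated device list); newer kernels print extra fields and would need adjusting.

```python
def shared_irqs(text):
    """Map each numbered IRQ that has more than one device to its device list."""
    lines = text.strip().splitlines()
    if not lines:
        return {}
    # The header row has one column per CPU (CPU0, CPU1, ...).
    ncpu = len(lines[0].split())
    shared = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep or not label.isdigit():
            continue  # skip NMI, LOC, ERR and other non-numeric rows
        fields = rest.split()
        # Skip the per-CPU counters and the controller/type field;
        # whatever remains is the device list.
        devices = [d.strip() for d in " ".join(fields[ncpu + 1:]).split(",")]
        devices = [d for d in devices if d]
        if len(devices) > 1:
            shared[int(label)] = devices
    return shared

SAMPLE = """\
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
16: 699083 0 0 0 0 0 0 0 IO-APIC-fasteoi nvidia, eth0
17: 87810 0 0 0 0 0 0 0 IO-APIC-fasteoi firewire_ohci, hda_intel, nvidia
18: 242 0 0 0 0 0 0 0 IO-APIC-fasteoi hda_intel
46: 79853 0 0 0 0 0 0 0 PCI-MSI-edge ahci
"""

for irq, devices in sorted(shared_irqs(SAMPLE).items()):
    print(irq, devices)
```

To run it against a live system, pass the contents of /proc/interrupts to `shared_irqs` instead of the sample. In the table above the shared lines are 16 (nvidia, eth0), 17 (firewire_ohci, hda_intel, nvidia) and 23 (the two ehci_hcd controllers).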
Last but not least, these lines usually appear in dmesg right after boot-up:
[ 18.367094] hda-intel: IRQ timing workaround is activated for card #1. Suggest a bigger bdl_pos_adj.
[ 18.458859] hda-intel: IRQ timing workaround is activated for card #2. Suggest a bigger bdl_pos_adj.
I'm not sure whether this is related or a symptom of a bigger problem, so I'm posting it just in case.
I don't really know what other information might be relevant here. Don't hesitate to ask for more in the comments.