XCP Project Kronos syslog error: "irq ... : nobody cared" on Dom0 host
- by Vlad Fedin
One of our production clusters driven by XCP suddenly went uresponsive. After restart and some investigation we found such logs in dom0 machine syslog:
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659040] irq 339: nobody cared (try booting with the "irqpoll" option)
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659058] Pid: 0, comm: swapper/3 Tainted: G C O 3.2.0-24-generic #37-Ubuntu
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659060] Call Trace:
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659062] <IRQ> [<ffffffff810db37d>] __report_bad_irq+0x3d/0xe0
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659071] [<ffffffff810db605>] note_interrupt+0x135/0x190
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659074] [<ffffffff810d8e69>] handle_irq_event_percpu+0xa9/0x220
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659078] [<ffffffff8130ff3b>] ? radix_tree_lookup+0xb/0x10
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659081] [<ffffffff810d9031>] handle_irq_event+0x51/0x80
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659084] [<ffffffff810dc187>] handle_edge_irq+0x87/0x140
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659089] [<ffffffff813a8829>] __xen_evtchn_do_upcall+0x199/0x250
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659092] [<ffffffff813aa96f>] xen_evtchn_do_upcall+0x2f/0x50
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659096] [<ffffffff81666d3e>] xen_do_hypervisor_callback+0x1e/0x30
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659097] <EOI> [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659104] [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659107] [<ffffffff8100a1d0>] ? xen_safe_halt+0x10/0x20
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659110] [<fff
IRQ 339 in cat /proc/interrupts:
339: ... xen-pirq-msi-x eth0
where eth0 is hardware NIC.
While host machine seems to hang, guest machines continue to work, so our tiny internal monitoring on one of the virtual hosts logged something like that:
[2012-10-26 20:31:51] [OK......] 200 OK : 113159149 ns
[2012-10-26 20:32:40] [DISASTER] 500 Can't connect to [hostname]:80 (No route to host) : 47763284432 ns
...
[2012-10-26 20:34:40] [DISASTER] 500 Can't connect to [hostname]:80 (No route to host) : 46894835070 ns
[2012-10-26 20:34:57] [DISASTER] 500 Can't connect to [hostname]:80 (Bad hostname) : 16821741955 ns
...
[2012-10-26 20:38:18] [DISASTER] 500 Can't connect to [hostname]:80 (Bad hostname) : 20103298289 ns
[2012-10-26 20:38:37] [DISASTER] 500 Can't connect to [hostname]:80 (Bad hostname) : 17895754943 ns
Host and guest OS: Ubuntu 12.04 LTS,
05:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
Subsystem: ASUSTeK Computer Inc. Device 8369
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx+
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 17
Region 0: Memory at fe500000 (32-bit, non-prefetchable) [size=128K]
Region 2: I/O ports at e000 [size=32]
Region 3: Memory at fe520000 (32-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel driver in use: e1000e
Kernel modules: e1000e
Any hints how to debug this?