How can I make Linux boot reliably on multi-CPU machines?
- by Adam Tabi
I've got two machines: one with 4x12 AMD Opteron cores (AMD Opteron(tm) Processor 6176) and one with 2x8 Xeon cores (HT disabled; Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz). On both machines, Linux has trouble booting with recent kernels. The system hangs during kernel initialization, before or just as the initramfs starts initializing the hardware. The last thing displayed is a stack trace like this:
CPU: 31 PID: 0 Comm: swapper/31 Tainted: G D 3.11.6-hardened #11
Hardware name: Supermicro X9DRT-HF+/X9DRT-HF+, BIOS 3.00 07/08/2013
task: ffff880854695500 ti: ffff880854695a28 task.ti: ffff880854695a28
RIP: 0010:[<ffffffff8100a82e>] [<ffffffff8100a82e>] default_idle+0x6/0xe
RSP: 0000:ffff8808546b3ec8 EFLAGS: 00000286
RAX: ffffffff8100a828 RBX: ffff880854695a28 RCX: 00000000ffffffff
RDX: 0100000000000000 RSI: 0000000000000000 RDI: ffff88107fdec690
RBP: ffff8808546b3ec8 R08: 0000000000000000 R09: ffff880854695500
R10: ffff880854695500 R11: 0000000000000001 R12: ffff880854695a28
R13: ffff880854695a28 R14: ffff880854695a28 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88107fde0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000002b43256a960 CR3: 00000000016b5000 CR4: 00000000000607f0
Stack:
ffff8808546b3ed8 ffffffff8100aec9 ffff8808546b3f10 ffffffff8109ce25
334ab55852ec7aef 000000000000001f ffffffff8102d6c0 0000000000000000
0000000000000000 ffff8808546b3f48 ffffffff810276e0 ffff8808546b3f28
Call Trace:
[<ffffffff8100aec9>] arch_cpu_idle+0x20/0x2b
[<ffffffff8109ce25>] cpu_startup_entry+0xed/0x138
[<ffffffff8102d6c0>] ? flat_init_apic_ldr+0x80/0x80
[<ffffffff810276e0>] start_secondary+0x2c9/0x2f8
I compiled the kernel myself, and it boots fine with nolapic, but then only one core is used. The RHEL6 kernel also seems to work fine, so I suspect it carries some patches that make things work. Building a more recent kernel with the RHEL6 kernel config file yields the same problems. On the Xeon machine, things improved after disabling Hyperthreading completely: the machine now boots successfully at least 4 out of 5 times, and when it boots, multicore works just fine. However, I'm not sure what to do about the AMD machine.
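To check how many cores the kernel actually brought up after a boot with a given option (for example, to confirm that nolapic leaves only one core), the list in /sys/devices/system/cpu/online can be read and expanded. The following is only a minimal sketch: the function names parse_cpu_list and online_cpus are made up for illustration, and it assumes the usual sysfs list format such as "0-31" or "0-3,8".

```python
from pathlib import Path
from typing import List


def parse_cpu_list(text: str) -> List[int]:
    """Expand a kernel CPU list such as "0-3,8,10-11" into individual CPU ids."""
    cpus: List[int] = []
    for part in text.strip().split(","):
        if not part:
            # Empty input (or a stray comma) contributes no CPUs.
            continue
        if "-" in part:
            # A range "lo-hi" is inclusive on both ends.
            lo, hi = part.split("-", 1)
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def online_cpus(path: str = "/sys/devices/system/cpu/online") -> List[int]:
    """Read the sysfs file listing the CPUs that are currently online."""
    return parse_cpu_list(Path(path).read_text())


if __name__ == "__main__":
    cpus = online_cpus()
    print(f"{len(cpus)} CPUs online: {cpus}")
```

Running this once after each boot attempt gives a quick way to compare what different kernels or boot options actually enabled.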
So to sum it up:
Gentoo kernels 3.6 - 3.11 won't reliably boot those machines unless you reduce the number of cores (e.g. via nolapic).
RHEL6 kernel (which is 2.6.32) boots just fine.
A 3.x kernel built with the RHEL6 kernel config doesn't work either.
The problem is not distribution-specific (apart from the kernel being used).
The stack traces are printed every minute or so; the kernel seems to be stuck in an endless loop.
Still, a recent kernel is needed for various reasons.
So the question is:
What does the RHEL6 kernel do that vanilla or Gentoo kernels don't?
Is there a boot option that might lead to a reliable boot with all the cores enabled?
Best,
Adam