Given a large Linux HPC cluster with hundreds or thousands of nodes: what are your best practices for getting the best possible LINPACK benchmark (HPL) result to submit to the Top500 supercomputer list?
To give you an idea of what kind of answers I would appreciate, here are some sub-questions:
How do you tune the parameters (N, NB, P, Q, memory alignment, etc.) in the HPL.dat file, without spending too much time trying every possible permutation (especially with large problem sizes N)?
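For context, here is the back-of-the-envelope sizing I have been using so far. It is only a sketch with made-up node counts and memory sizes, based on the common rule of thumb of devoting roughly 80% of aggregate memory to the N×N double-precision matrix and rounding N down to a multiple of NB:

    # hpl_sizing.py -- rough starting values for HPL.dat (all figures hypothetical)
    import math

    nodes          = 512              # assumption: number of compute nodes
    mem_per_node   = 24 * 1024**3     # assumption: bytes of RAM per node
    cores_per_node = 8                # assumption: cores per node
    nb             = 192              # typical block size; worth sweeping (e.g. 128-256)

    # ~80% of aggregate memory for the N x N matrix of 8-byte doubles,
    # with N rounded down to a multiple of NB.
    n = int(math.sqrt(0.80 * nodes * mem_per_node / 8)) // nb * nb

    # P x Q process grid: P <= Q and as close to square as possible.
    procs = nodes * cores_per_node
    p = int(math.sqrt(procs))
    while procs % p:
        p -= 1
    print("N = %d, NB = %d, P = %d, Q = %d" % (n, nb, p, procs // p))

I then sweep NB and a few P×Q shapes in short runs at small N before committing to the full problem size, but I'd like to hear better approaches.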
Are there any Top500 submission rules to be aware of? What is allowed, what isn't?
Which MPI product, which version? Does it make a difference?
Do you use any special host ordering in your MPI machine file?
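To illustrate what I mean, a hypothetical Open MPI machine file (host names invented); I have been wondering whether keeping hosts that share a switch adjacent in the file helps:

    # machinefile -- slots = cores per node; suppose node001-node004 hang off
    # switch 1, node005+ off switch 2, etc. (all hypothetical)
    node001 slots=8
    node002 slots=8
    node003 slots=8
    node004 slots=8
    node005 slots=8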
Do you use CPU pinning?
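For example, the Open MPI invocation I am currently experimenting with (binary name and machine file are placeholders):

    # Bind one rank per core and print the resulting bindings for verification
    mpirun --bind-to core --map-by core --report-bindings \
           -machinefile machinefile ./xhpl

(With Intel MPI I would set I_MPI_PIN=1 instead; on NUMA boxes, numactl --cpunodebind/--membind is another option.)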
How do you configure your interconnect? Which interconnect?
Which BLAS package do you use for which CPU model? (Intel MKL, AMD ACML, GotoBLAS2, etc.)
How do you prepare for the big run on all nodes? Do you start with small runs on a subset of nodes and then scale up? Is it really necessary to do one big LINPACK run on all of the nodes, or is extrapolation allowed?
How do you optimize for the latest Intel/AMD CPUs? Hyperthreading? NUMA?
Is it worth it to recompile the software stack, or do you use precompiled binaries? Which settings?
Which compiler optimizations, which compiler? (What about profile-based compilation?)
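As a concrete (and possibly naive) starting point, this is the kind of thing I currently put into HPL's Make.&lt;arch&gt; file when building with GCC; the flags are my guesses, not a recommendation:

    # Excerpt from a hypothetical Make.Linux_GCC
    CC      = mpicc
    CCFLAGS = $(HPL_DEFS) -O3 -funroll-loops -march=native
    LINKER  = mpicc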
How do you get the best result given only a limited amount of time for the benchmark run? (You could block a huge cluster forever.)
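Related: for planning I estimate the wall-clock time from HPL's operation count, roughly 2/3·N³ + 2·N² floating-point operations, divided by peak performance times an assumed efficiency. A sketch with invented figures:

    # hpl_runtime.py -- rough wall-clock estimate (all figures hypothetical)
    n          = 1000000      # assumption: problem size N
    rpeak      = 500e12       # assumption: theoretical peak in flop/s
    efficiency = 0.75         # assumption: fraction of peak HPL achieves

    flops   = (2.0 / 3.0) * n**3 + 2.0 * n**2
    seconds = flops / (rpeak * efficiency)
    print("~%.1f hours" % (seconds / 3600.0))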
How do you prepare the individual nodes (stopping system daemons, freeing memory, etc.)?
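The bare minimum I currently do on each node beforehand, as root; treat it as a sketch, since daemon names vary by distribution:

    # Flush dirty pages and drop the page cache to maximize free memory
    sync && echo 3 > /proc/sys/vm/drop_caches
    # Stop nonessential services (examples only)
    service crond stop
    service irqbalance stop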
How do you deal with hardware faults that can ruin a huge run?
Are there any must-read documents or websites on this topic? For example, I would love to hear some background stories from current Top500 systems about how they did their LINPACK runs.
I deliberately don't want to mention concrete hardware details or discuss hardware recommendations, because I don't want to limit the answers. However, feel free to mention hints for specific CPU models.