Search Results

Search found 5 results on 1 pages for 'linpack'.

Page 1/1 | 1 

  • How to get the best LINPACK result and conquer the Top500?

    - by knweiss
    Given a large Linux HPC cluster with hundreds/thousands of nodes. What are your best practices to get the best possible LINPACK benchmark (HPL) result to submit for the Top500 supercomputer list? To give you an idea what kind of answers I would appreciate here are some sub-questions (with links): How to you tune the parameters (N, NB, P, Q, memory-alignment, etc) for the HPL.dat file (without spending too much time trying each possible permutation - esp with large problem sizes N)? Are there any Top500 submission rules to be aware of? What is allowed, what isn't? Which MPI product, which version? Does it make a difference? Any special host order in your MPI machine file? Do you use CPU pinning? How to you configure your interconnect? Which interconnect? Which BLAS package do you use for which CPU model? (Intel MKL, AMD ACML, GotoBLAS2, etc.) How do you prepare for the big run (on all nodes)? Start with small runs on a subset of nodes and then scale up? Is it really necessary to run LINPACK with a big run on all of the nodes (or is extrapolation allowed)? How do you optimize for the latest Intel/AMD CPUs? Hyperthreading? NUMA? Is it worth it to recompile the software stack or do you use precompiled binaries? Which settings? Which compiler optimizations, which compiler? (What about profile-based compilation?) How to get the best result given only a limited amount of time to do the benchmark run? (You can block a huge cluster forever) How do you prepare the individual nodes (stopping system daemons, freeing memory, etc)? How do you deal with hardware faults (ruining a huge run)? Are there any must-read documents or websites about this topic? E.g. I would love to hear about some background stories of some of the current Top500 systems and how they did their LINPACK benchmark. I deliberately don't want to mention concrete hardware details or discuss hardware recommendations because I don't want to limit the answers. However, feel free to mention hints e.g. for specific CPU models.

    Read the article

  • Error while compiling Cuda Accelerated Linpack hpl_2.0_FERMI

    - by ghostrustam
    I use Ubuntu 11.04 x86_64 CUDA 4.0 OpenMpi 1.4stable MKL When I compile, I get this error: ar r -L/home/limksadmin/hpl-2.0_FERMI_v13/lib/CUDA/libhpl.a HPL_dlacpy.o HPL_dlatcpy.o HPL_fprintf.o HPL_warn.o HPL_abort.o HPL_dlaprnt.o HPL_dlange.o HPL_dlamch.o ar: -L/home/limksadmin/hpl-2.0_FERMI_v13/lib/CUDA/libhpl.a: No such file or directory make[2]: *** [lib.grd] Error 9 make[2]: Leaving directory `/home/limksadmin/hpl-2.0_FERMI_v13/src/auxil/CUDA' make[1]: *** [build_src] Error 2 make[1]: Leaving directory `/home/limksadmin/hpl-2.0_FERMI_v13' make: *** [build] Error 2 Make.CUDA: LAdir = /opt/intel/mkl/lib/intel64 LAlib = -L $(TOPdir)/src/cuda -ldgemm -L/usr/local/cuda/lib64 -lcuda -lcudart -lcublas -L$(LAdir) -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 MPdir = /usr/local/mpi/openmpi MPinc = -I$(MPdir)/include MPlib = -L$(MPdir)/lib/libmpi.so CC = /usr/local/mpi/openmpi/bin/mpicc What could be the problem?

    Read the article

  • About the K computer

    - by nospam(at)example.com (Joerg Moellenkamp)
    Okay ? after getting yet another mail because of the new #1 on the Top500 list, I want to add some comments from my side: Yes, the system is using SPARC processor. And that is great news for a SPARC fan like me. It is using the SPARC VIIIfx processor from Fujitsu clocked at 2 GHz. No, it isn't the only one. Most people are saying there are two in the Top500 list using SPARC (#77 JAXA and #1 K) but in fact there are three. The Tianhe-1 (#2 on the Top500 list) super computer contains 2048 Galaxy "FT-1000" 1 GHz 8-core processors. Don't know it? The FeiTeng-1000 ? this proc is a 8 core, 8 threads per core, 1 ghz processor made in China. And it's SPARC based. By the way ? this sounds really familiar to me ? perhaps the people just took the opensourced UltraSPARC-T2 design, because some of the parameters sound just to similar. However it looks like that Tianhe-1 is using the SPARCs as input nodes and not as compute notes. No, I don't see it as the next M-series processor. Simple reason: You can't create SMP systems out of them ? it simply hasn't the functionality to do so. Even when there are multiple CPUs on a single board, they are not connected like an SMP/NUMA machine to a shared memory machine ? they are connected with the cluster interconnect (in this case the Tofu interconnect) and work like a large cluster. Yes, it has a lot of oomph in Linpack ? however I assume a lot came from the extensions to the SPARCv9 standard. No, Linpack has no relevance for any commercial workload ? Linpack is such a special load, that even some HPC people are arguing that it isn't really a good benchmark for HPC. It's embarrassingly parallel, it can work with relatively small interconnects compared to the interconnects in SMP systems (however we get in spheres SMP interconnects where a few years ago). Amdahl isn't hitting that hard when running Linpack. Yes, it's a good move to use SPARC. At some time in the last 10 years, there was an interesting twist in perception: SPARC was considered as proprietary architecture and x86 was the open architecture. However it's vice versa ? try to create a x86 clone and you have a lot of intellectual property problems, create a SPARC clone and you have to spend 100 bucks or so to get the specification from the SPARC Foundation and develop your own SPARC processor. Fujitsu is doing this for a long time now. So they had their own processor, their own know-how. So why was SPARC a good choice? Well ? essentially Fujitsu can do what they want with their core as it is their core, for example adding the extensions to the SPARCv9 chipset ? getting Intel to create extensions to x86 to help you with your product is a little bit harder. So Fujitsu could do they needed to do with their processor in order to create such a supercomputer. No, the K is really using no FPGA or GPU as accelerators. The K is really using the CPU at doing this job. Yes, it has a significantly enhanced FPU capable to execute 8 instructions in parallel. No, it doesn't run Solaris. Yes, it uses Linux. No, it doesn't hurt me ... as my colleague Roland Rambau (he knows a lot about HPC) said once to me ... it doesn't matter which OS is staying out of the way of the workload in HPC.

    Read the article

  • mystified by qr.Q(): what is an orthonormal matrix in "compact" form?

    - by gappy
    R has a qr() function, which performs QR decomposition using either LINPACK or LAPACK (in my experience, the latter is 5% faster). The main object returned is a matrix "qr" that contains in the upper triangular matrix R (i.e. R=qr[upper.tri(qr)]). So far so good. The lower triangular part of qr contains Q "in compact form". One can extract Q from the qr decomposition by using qr.Q(). I would like to find the inverse of qr.Q(). In other word, I do have Q and R, and would like to put them in a "qr" object. R is trivial but Q is not. The goal is to apply to it qr.solve(), which is much faster than solve() on large systems.

    Read the article

  • Is there any way to limit the turbo boost speed / intensity on i7 lap?

    - by Anonymous
    I've just got a used i7 laptop, one of these overheating pavilions from HP with quad cores. And I really want to find a compromise between the temp and performance. If I use linpack, or some other heavy benchmark, the temp easily gets to 95+, and having a TJ of 100 Degrees, for a 2630QM model, it really gets me throttling, that no cooling pad or even an industrial fan could solve. I figured later that it is due to turbo boost, and if I set my power settings to use 99% of the CPU instead of 100%, and it seems to disable the turbo boost, so the temp gets better. But then again it loses quite a bit of performance. The regular clock is 2GHz, and in turbo boost it gets to 2.6Ghz, but I just wonder if I could limit it to around 2.3Ghz, that would be a real nice thing. Also there is another question I've hard time getting answer to. It seems to me that clocks are very quickly boosting up to max even when not needed, eg, it's ok if the CPU has 0% load, the clocks get to their 800MHz, but even if it gets to about 5% it quickly jumps to a max and even popping up turbo, which seems very strange to me. So I wonder if there is any way to adjust the sensitivity of the Speed Step feature. I believe it would be more logical to demand increased clock if it hits let's say 50% load. I do understand that most of these features are probably hardwired somewhere in the CPU itself or the MB, which has no tuning options just like on many laptops. But I would appreciate if you could recommend some thing, or some software. Thanks

    Read the article

1