I'm trying to understand limits to parallelization on a 48-core system (4xAMD Opteron 6348, 2.8 Ghz, 12 cores per CPU). I wrote this tiny OpenMP code to test the speedup in what I thought would be the best possible situation (the task is embarrassingly parallel):
// Compile with: gcc scaling.c -std=c99 -fopenmp -O3
#include <stdio.h>
#include <stdint.h>
int main(){
const uint64_t umin=1;
const uint64_t umax=10000000000LL;
double sum=0.;
#pragma omp parallel for reduction(+:sum)
for(uint64_t u=umin; u<umax; u++)
sum+=1./u/u;
printf("%e\n", sum);
}
I was surprised to find that the scaling is highly nonlinear. It takes about 2.9s for the code to run with 48 threads, 3.1s with 36 threads, 3.7s with 24 threads, 4.9s with 12 threads, and 57s for the code to run with 1 thread.
Unfortunately I have to say that there is one process running on the computer using 100% of one core, so that might be affecting it. It's not my process, so I can't end it to test the difference, but somehow I doubt that's making the difference between a 19~20x speedup and the ideal 48x speedup.
To make sure it wasn't an OpenMP issue, I ran two copies of the program at the same time with 24 threads each (one with umin=1, umax=5000000000, and the other with umin=5000000000, umax=10000000000). In that case both copies of the program finish after 2.9s, so it's exactly the same as running 48 threads with a single instance of the program.
What's preventing linear scaling with this simple program?