I have some multithreaded c++ code with the following structure:
do_thread_specific_work();
update_shared_variables();
//checkpoint A
do_thread_specific_work_not_modifying_shared_variables();
//checkpoint B
do_thread_specific_work_requiring_all_threads_have_updated_shared_variables();
What follows checkpoint B is work that could have started if all threads have reached only checkpoint A, hence my notion of a "soft barrier".
Typically multithreading libraries only provide "hard barriers" in which all threads must reach some point before any can continue. Obviously a hard barrier could be used at checkpoint B.
Using a soft barrier can lead to better execution time, especially since the work between checkpoints A and B may not be load-balanced between the threads (i.e. 1 slow thread who has reached checkpoint A but not B could be causing all the others to wait at the barrier just before checkpoint B).
I've tried using atomics to synchronize things and I know with 100% certainty that is it NOT guaranteed to work. For example using openmp syntax, before the parallel section start with:
shared_thread_counter = num_threads; //known at compile time
#pragma omp flush
Then at checkpoint A:
#pragma omp atomic
shared_thread_counter--;
Then at checkpoint B (using polling):
#pragma omp flush
while (shared_thread_counter > 0) {
usleep(1); //can be removed, but better to limit memory bandwidth
#pragma omp flush
}
I've designed some experiments in which I use an atomic to indicate that some operation before it is finished. The experiment would work with 2 threads most of the time but consistently fail when I have lots of threads (like 20 or 30). I suspect this is because of the caching structure of modern CPUs. Even if one thread updates some other value before doing the atomic decrement, it is not guaranteed to be read by another thread in that order. Consider the case when the other value is a cache miss and the atomic decrement is a cache hit.
So back to my question, how to CORRECTLY implement this "soft barrier"? Is there any built-in feature that guarantees such functionality? I'd prefer openmp but I'm familiar with most of the other common multithreading libraries.
As a workaround right now, I'm using a hard barrier at checkpoint B and I've restructured my code to make the work between checkpoint A and B automatically load-balancing between the threads (which has been rather difficult at times).
Thanks for any advice/insight :)