Description
I observe non-deterministic output when calling cblas_sgemv concurrently from two different threads, each of which is bound to a different copy of an OpenMP runtime in the same process. I built OpenBLAS 0.3.5 with USE_OPENMP=1, USE_THREAD=1, NUM_PARALLEL=2, and NUM_THREADS=40. Each copy of the OpenMP runtime has 20 threads in its thread pool, and I call 'openblas_set_num_threads' with 20 once before making any other OpenBLAS calls (with 2 copies of the OpenMP runtime this gives 40 threads in total, the same as NUM_THREADS). Everything is deterministic if I ensure that one thread completes its call to cblas_sgemv before the second thread makes its call to cblas_sgemv. However, if both threads call cblas_sgemv concurrently, each with its own OpenMP runtime, the results are non-deterministic. My guess is that one or more global variables somewhere in OpenBLAS are subject to a race when there are concurrent invocations that each use OpenMP, but so far I've been unable to find them in the source.
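A check of this kind could be sketched roughly as follows. This is an illustrative harness, not the code used for this report: the names `sgemv_fn`, `naive_sgemv`, and `count_concurrent_mismatches` are hypothetical, the kernel is a plain-C stand-in rather than OpenBLAS, and plain pthreads are used, so it does not model the two separate OpenMP runtimes described above. It computes one serial baseline, then has two threads call the kernel repeatedly at the same time and counts results that are not bitwise identical to the baseline.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical kernel signature: y = A * x, A is m x n, row-major. */
typedef void (*sgemv_fn)(int m, int n, const float *a, const float *x, float *y);

/* Stand-in kernel (NOT OpenBLAS): a fixed-order serial matrix-vector product. */
static void naive_sgemv(int m, int n, const float *a, const float *x, float *y) {
    for (int i = 0; i < m; i++) {
        float acc = 0.0f;
        for (int j = 0; j < n; j++) acc += a[(size_t)i * n + j] * x[j];
        y[i] = acc;
    }
}

typedef struct {
    sgemv_fn fn;
    int m, n, iters;
    const float *a, *x, *expected;
    int mismatches; /* -1 if the worker could not allocate its output */
} worker_t;

/* Each worker calls the kernel `iters` times into its own output buffer
   and compares every result bitwise against the serial baseline. */
static void *worker(void *arg) {
    worker_t *w = arg;
    size_t bytes = (size_t)w->m * sizeof(float);
    float *y = malloc(bytes);
    if (!y) { w->mismatches = -1; return NULL; }
    w->mismatches = 0;
    for (int it = 0; it < w->iters; it++) {
        w->fn(w->m, w->n, w->a, w->x, y);
        if (memcmp(y, w->expected, bytes) != 0) w->mismatches++;
    }
    free(y);
    return NULL;
}

/* Returns the total number of mismatching results across both threads,
   or -1 if setup (allocation or thread creation) failed. */
int count_concurrent_mismatches(sgemv_fn fn, int m, int n, int iters) {
    assert(m > 0 && n > 0 && iters > 0);
    size_t mn = (size_t)m * n;
    float *a = malloc(mn * sizeof *a);
    float *x = malloc((size_t)n * sizeof *x);
    float *expected = malloc((size_t)m * sizeof *expected);
    int result = -1;
    if (a && x && expected) {
        /* Deterministic, exactly representable inputs. */
        for (size_t i = 0; i < mn; i++) a[i] = (float)((int)(i * 7 % 13) - 6) * 0.125f;
        for (int j = 0; j < n; j++) x[j] = (float)((j * 5 % 11) - 5) * 0.25f;
        fn(m, n, a, x, expected); /* serial baseline, no concurrency */

        worker_t w[2];
        pthread_t t[2];
        int started = 0;
        for (int k = 0; k < 2; k++) {
            w[k] = (worker_t){ fn, m, n, iters, a, x, expected, 0 };
            if (pthread_create(&t[k], NULL, worker, &w[k]) != 0) break;
            started++;
        }
        for (int k = 0; k < started; k++) pthread_join(t[k], NULL);
        if (started == 2 && w[0].mismatches >= 0 && w[1].mismatches >= 0)
            result = w[0].mismatches + w[1].mismatches;
    }
    free(a); free(x); free(expected);
    return result;
}
```

To point this at OpenBLAS, `naive_sgemv` would be replaced by a small wrapper that includes `cblas.h` and calls `cblas_sgemv(CblasRowMajor, CblasNoTrans, m, n, 1.0f, a, n, x, 1, 0.0f, y, 1)`. A non-zero return value then indicates that at least one concurrent call produced output that differs from the serial baseline. The threads are not started in lockstep, so the overlap between calls relies on `iters` being large enough.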
Note that you can't reproduce this with any of the standard OpenMP runtimes (e.g. the ones shipped by GCC, LLVM, or Intel), as they all have global variables that prevent multiple copies of the runtime from existing concurrently in the same process. I'm using the OpenMP implementation provided by the Realm runtime, which does support concurrent copies of the OpenMP runtime in the same process:
https://github.com/StanfordLegion/legion/blob/stable/runtime/realm/openmp/openmp_api.cc
I realize this is a very unusual use case for you to consider, but I'm hoping that it's an easy fix to just get rid of one or a few global variables somewhere.