Commit a399d004 breaks proper functioning #2155
Which version of OpenBLAS are you using? The commit you flagged was during what we now know to be a phase where the thread memory management code was rather fragile, and that version of the code should not even be used by default now (you need to compile with USE_TLS=1 if you want it). (#1742 essentially reverted the default memory.c to its 0.3.0 form (a8002e2) and moved its TLS-based rewrite to the "else" branch of an ifdef.)
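For reference, a minimal build sketch of the two variants described above (the make targets are standard OpenBLAS usage, but treat the exact flag behaviour per version as an assumption):

```shell
# Default build: uses the "traditional" memory.c restored by #1742
make clean && make

# Opt in to the non-default TLS-based memory management rewrite
make clean && make USE_TLS=1
```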
The errors appeared when using "latest" last week, but then I checked out tagged releases: v0.3.0 was correct, but all releases v0.3.1 through v0.3.6 show the same error behaviour. After this I simply bisected to find the commit that introduced the erroneous behaviour. I've tried several times: the commit before a399d00 gives no problems, this one does, and from what I've seen all commits after it keep the errors. In the Makefile.rule
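The bisection described above can be reproduced with git bisect; a sketch (the commit and tag names are taken from this thread, the rebuild/retest step is a placeholder for the actual Kaldi job):

```shell
git bisect start
git bisect bad v0.3.1        # first tagged release showing the errors
git bisect good v0.3.0       # last known-good release
# At each step git checks out a candidate commit; rebuild OpenBLAS,
# rerun the failing job several times (the failure is intermittent), then:
git bisect good              # or: git bisect bad
# git eventually reports a399d00 as the first bad commit
git bisect reset
```

Because the failure only shows up in roughly 10% of runs, each "good" verdict should be based on many repetitions, or the bisection will converge on the wrong commit.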
I've seen this before; it happens in about 10% of the test runs I do.
Yes, the code from a399d00 should not even get compiled. Most likely the problem comes from one of the few other changes to the "traditional" version of memory.c - probably some locking code removal or similar (over)optimization that (re)introduced races. Are you running it with OpenMP or without?
There is a magic number of 50 memory areas allowed.
... tied to the maximum number of threads that OpenBLAS was built for, or (in common.h) |
I am thinking of an off-by-one, so that in the supposedly now-default case the next cell is reached...
@martin-frbg I am running without OpenMP, as far as I can tell. @brada4 Dan Povey @danpovey found this max-number-of-threads line, too, on the Kaldi mailing list. In my test I run on a 6-core machine (each core having 2 threads), 6 concurrent jobs. Dan says that a job uses "multiple threads on startup"; I don't know if these are threads within Kaldi or BLAS threads, I suspect the former. I am compiling OpenBLAS with
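If the extra start-up threads came from OpenBLAS itself rather than Kaldi, they could be pinned down at run time; OPENBLAS_NUM_THREADS is a real OpenBLAS environment variable, though whether it is relevant to this particular failure is an assumption, and the job script name below is a placeholder:

```shell
# Restrict each of the 6 concurrent jobs to one OpenBLAS thread,
# so thread counts do not multiply across job-level parallelism
export OPENBLAS_NUM_THREADS=1
./run-kaldi-job.sh    # placeholder for the actual Kaldi invocation
```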
@davidavdav, by default those programs will use 8 threads for a start-up job (IIRC).
So you are calling a single-threaded OpenBLAS from several threads in parallel? This scenario looks a lot like #2126 - if you are on (almost) current develop already you could try setting USE_LOCKING=1.
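A build sketch for that suggestion (USE_THREAD=0 for a single-threaded library and USE_LOCKING=1 for thread-safe buffer management are documented OpenBLAS build options, but their availability on a given checkout - develop only at the time of this thread - is an assumption):

```shell
# Single-threaded OpenBLAS that is still safe to call
# concurrently from multiple application threads
git clone https://github.com/xianyi/OpenBLAS
cd OpenBLAS
make USE_THREAD=0 USE_LOCKING=1
```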
Kaldi's tools building scripts allow a choice in building OpenBLAS with [...]; I always opt for [...]. Now I understand that during start-up of a Kaldi process, some constants are computed from stored models, and this happens multi-threaded. I've recompiled OpenBLAS v0.3.6 with
Another take at locking: if you do not set NUM_THREADS during the build, it takes the number of cores of the build system for the thread count, and MAX(ncores*2, 50) for the number of memory areas.
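A tiny sketch of that sizing rule as I read it (the variable names here are illustrative, not the actual macros in common.h):

```shell
#!/bin/bash
# Illustrative only: MAX(ncores*2, 50) memory areas when NUM_THREADS is unset
ncores=6                                   # the 6-core machine from this thread
areas=$(( ncores * 2 > 50 ? ncores * 2 : 50 ))
echo "memory areas: $areas"
```

With 6 cores this yields 50, i.e. the floor of 50 dominates until the build machine has more than 25 cores.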
v0.3.6 is too old to understand the new USE_LOCKING option, you would need to build a snapshot of the current develop branch - which I (mis)understood you were already using based on your comment above about using "latest" last week. |
Sorry, the problems occurred when checking out latest, but inspecting the reflog this was probably 3f427c0, 5 weeks ago (which is probably when I started using pykaldi, with its own Kaldi, with its own OpenBLAS; this was probably never renewed because the Kaldi Makefile changes OpenBLAS [...]). So now with the really latest 26411ac, and [...] Thanks! I suppose I can close this issue, and submit a PR for Kaldi's Makefile.
@davidavdav |
In a setup with Kaldi (automatic speech recognition) with a pykaldi wrapper I've run into random errors related to OpenBLAS. By bisection of the commit history I found that this regression was introduced in a399d00.
On the Kaldi mailing list there are some more details. Because of the complex setup, it is hard to create an MWE, also because of the random nature of the occurrence of the errors.
The errors, and the circumstances under which they occur, appear to have these ingredients/characteristics:
I will try to make these circumstances more explicit with some more experiments, e.g., trying non-concurrent tasks or single thread jobs.