"import numpy" leaks - reported by valgrind #18271

Closed
bstaletic opened this issue Jan 31, 2021 · 25 comments

Comments

@bstaletic

bstaletic commented Jan 31, 2021

Reproducing code example:

PYTHONMALLOC=malloc valgrind --leak-check=full --show-leak-kinds=definite --suppressions=pybind11/tests/valgrind-python.supp --gen-suppressions=all python3.9-dbg -c "import numpy; print(numpy.__version__)"

Error message:

The output is fairly long, so I made a gist: https://gist.github.com/bstaletic/061ea8912ed5bc4e238363a120f70a1c
The suppression file is in the gist as well.

NumPy/Python version information:

Python is 3.9.1, debug version, with assertions
NumPy is 1.20.0, but these leaks were present on 1.19.x as well. I haven't checked earlier versions, because 1.19.x is when pybind11 incorporated valgrind into its CI.

@seberg
Member

seberg commented Jan 31, 2021

@bstaletic Yes, NumPy leaks memory on import, but only once. Can you explain whether this is a problem for you? I do check numpy with valgrind, but there are certain things during Python and NumPy import that are simply designed in a way that they must "leak" (exactly once).
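To see this "leaks exactly once" behaviour without valgrind, a rough sketch using the stdlib's tracemalloc can quantify how much memory a one-time import retains (here `decimal` stands in for numpy, since the module must not have been imported yet):

```python
# Sketch: quantify the memory a one-time import retains, using only the
# stdlib. 'decimal' stands in for numpy; any not-yet-imported module works.
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

import decimal  # noqa: F401 -- the one-time import under observation

after = tracemalloc.take_snapshot()
stats = after.compare_to(before, "lineno")
retained = sum(s.size_diff for s in stats)
print(f"retained after first import: {retained} bytes")
```

A second `import decimal` in the same process is a no-op, so repeating the measurement reports (close to) zero: the allocations happen exactly once.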

@bstaletic
Author

It's not exactly a problem, as pybind11 can keep suppressing the valgrind-reported errors. I assumed you weren't aware of these and thus opened the bug report.
What actually led me to open this report was pybind/pybind11#2837: by switching to the latest version of NumPy, pybind11 ran into a new leak. It isn't shown in the gist, but the details are all in that PR. I'm not sure where it's coming from, but this is the new stack trace that valgrind reports.

==12516== 80 bytes in 1 blocks are indirectly lost in loss record 2,861 of 4,401
==12516==    at 0x4C31ECB: malloc (vg_replace_malloc.c:307)
==12516==    by 0x467566: _PyMem_RawMalloc (obmalloc.c:99)
==12516==    by 0x468168: PyObject_Malloc (obmalloc.c:685)
==12516==    by 0x52CE7B: _PyObject_GC_Alloc (gcmodule.c:2225)
==12516==    by 0x52CE47: _PyObject_GC_Malloc (gcmodule.c:2252)
==12516==    by 0x52CFB3: _PyObject_GC_NewVar (gcmodule.c:2281)
==12516==    by 0x473BD1: tuple_alloc (tupleobject.c:92)
==12516==    by 0x473CA4: _PyTuple_FromArray (tupleobject.c:420)
==12516==    by 0x42DE4D: _PyObject_MakeTpCall (call.c:165)
==12516==    by 0x42EDBC: _PyObject_VectorcallTstate (abstract.h:116)
==12516==    by 0x42EDBC: _PyObject_CallFunctionVa (call.c:542)
==12516==    by 0x42FA4F: PyObject_CallFunction (call.c:564)
==12516==    by 0x4F648C: PyImport_Import (import.c:2073)

To be clear, if you say this is fine and expected, I'm fine with that.

@seberg
Member

seberg commented Jan 31, 2021

There were always some leaks that seemed fine, but now I notice that all of these are in Cython code generated for numpy.random (and not in NumPy proper). To be honest, I stopped checking import-time leaks, but I suppose there are fewer nowadays (at least fewer "definite" ones that valgrind notices; there are definitely more leaks :)).

I think you are right; we should actually check whether Cython generates incorrect reference counting or something similar.

That last one could be a serious refcounting bug somewhere. I ran valgrind over NumPy 1.20 a while ago, but there were a few changes after that, and the last refcount check was a while back, I guess. I should plan to do more full runs again soon, before 1.20.1, just to be sure nothing happened in the meantime.

Can you give more information on that leak (e.g. what code is run when it is triggered)? Tracking an incorrectly refcounted tuple seems very hard otherwise (unless I get lucky and the NumPy test suite finds it as well).
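In case it helps, one blunt pure-Python way to hunt for a stray tuple keeping something alive is to scan the GC's tracked objects. This is only an illustrative sketch (the `marker` and `holder` objects are made up, not the actual NumPy leak):

```python
import gc

# A sentinel plus a tuple that keeps it alive, standing in for a tuple
# whose refcount never drops to zero. The list element keeps the tuple
# GC-tracked, so gc.get_objects() will always see it.
marker = object()
holder = ([], marker)

# Scan every GC-tracked object for tuples referencing the sentinel; in a
# real hunt you would look for tuples referencing a suspect module or
# function instead of a sentinel.
suspects = [t for t in gc.get_objects()
            if isinstance(t, tuple) and any(x is marker for x in t)]
print(f"{len(suspects)} suspect tuple(s) found")
```

Once a suspect tuple is found, `gc.get_referrers(t)` narrows down what (if anything) still holds it.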

@seberg
Member

seberg commented Jan 31, 2021

@bstaletic one small thing, in case you hit this inside one of the many tests: my little pytest plugin, https://pypistats.org/packages/pytest-valgrind, could help with finding exactly which test is failing. (The GitHub repo is in a slightly newer state, but it shouldn't matter much.)

@YannickJadoul
Contributor

Thanks for looking into this, @seberg!

To add to @bstaletic's story: a month ago we managed to run Valgrind on pybind11 and clean it of all reported leaks, and we suppressed the leaks reported on import numpy as "these aren't ours". But now our tests failed after updating NumPy (because of the one extra leak), and we thought we should finally take the time to report upstream.

Here's how I reproduced the new leak that @bstaletic reported: PYTHONMALLOC=malloc valgrind --leak-check=full --show-leak-kinds=definite,indirect --suppressions=pybind11/tests/valgrind-python.supp --gen-suppressions=all python3.9-dbg -c "import numpy, sys; print(numpy.__version__); print(sys.version); print(sys.abiflags)"
Full output here: numpy-leaks-by-valgrind.txt

Suppressing the error with our current suppression file, this one remains:

$ PYTHONMALLOC=malloc valgrind --leak-check=full --show-leak-kinds=definite,indirect --suppressions=pybind11/tests/valgrind-python.supp --suppressions=pybind11/tests/valgrind-numpy-scipy.supp --gen-suppressions=all python3.9-dbg -c "import numpy, sys; print(numpy.__version__); print(sys.version); print(sys.abiflags)"
==12620== Memcheck, a memory error detector
==12620== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==12620== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==12620== Command: python3.9-dbg -c import\ numpy,\ sys;\ print(numpy.__version__);\ print(sys.version);\ print(sys.abiflags)
==12620== 
1.20.0
3.9.1 (default, Dec  8 2020, 03:24:52) 
[GCC 7.5.0]
d
==12620== 
==12620== HEAP SUMMARY:
==12620==     in use at exit: 3,340,248 bytes in 20,455 blocks
==12620==   total heap usage: 316,728 allocs, 296,273 frees, 47,691,463 bytes allocated
==12620== 
==12620== 80 bytes in 1 blocks are indirectly lost in loss record 2,861 of 4,401
==12620==    at 0x4C31ECB: malloc (vg_replace_malloc.c:307)
==12620==    by 0x467566: _PyMem_RawMalloc (obmalloc.c:99)
==12620==    by 0x468168: PyObject_Malloc (obmalloc.c:685)
==12620==    by 0x52CE7B: _PyObject_GC_Alloc (gcmodule.c:2225)
==12620==    by 0x52CE47: _PyObject_GC_Malloc (gcmodule.c:2252)
==12620==    by 0x52CFB3: _PyObject_GC_NewVar (gcmodule.c:2281)
==12620==    by 0x473BD1: tuple_alloc (tupleobject.c:92)
==12620==    by 0x473CA4: _PyTuple_FromArray (tupleobject.c:420)
==12620==    by 0x42DE4D: _PyObject_MakeTpCall (call.c:165)
==12620==    by 0x42EDBC: _PyObject_VectorcallTstate (abstract.h:116)
==12620==    by 0x42EDBC: _PyObject_CallFunctionVa (call.c:542)
==12620==    by 0x42FA4F: PyObject_CallFunction (call.c:564)
==12620==    by 0x4F648C: PyImport_Import (import.c:2073)
==12620== 
{
   <insert_a_suppression_name_here>
   Memcheck:Leak
   match-leak-kinds: indirect
   fun:malloc
   fun:_PyMem_RawMalloc
   fun:PyObject_Malloc
   fun:_PyObject_GC_Alloc
   fun:_PyObject_GC_Malloc
   fun:_PyObject_GC_NewVar
   fun:tuple_alloc
   fun:_PyTuple_FromArray
   fun:_PyObject_MakeTpCall
   fun:_PyObject_VectorcallTstate
   fun:_PyObject_CallFunctionVa
   fun:PyObject_CallFunction
   fun:PyImport_Import
}
==12620== LEAK SUMMARY:
==12620==    definitely lost: 0 bytes in 0 blocks
==12620==    indirectly lost: 80 bytes in 1 blocks
==12620==      possibly lost: 2,854,181 bytes in 17,746 blocks
==12620==    still reachable: 377,931 bytes in 1,131 blocks
==12620==         suppressed: 108,056 bytes in 1,577 blocks
==12620== Reachable blocks (those to which a pointer was found) are not shown.
==12620== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==12620== 
==12620== For lists of detected and suppressed errors, rerun with: -s
==12620== ERROR SUMMARY: 4007 errors from 4007 contexts (suppressed: 189 from 189)

@YannickJadoul
Contributor

Oh, perhaps I should specify one more thing: we were holding back on more general rules (e.g., just the stack up to fun:tuple_alloc), because there doesn't seem to be a reference to NumPy in there, and we didn't want to potentially mask leaks not caused by NumPy.

Apart from that, we're perfectly happy. The main goal is to know what to suppress such that we can have Valgrind run in CI for pybind11.
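(For context, a more general rule would lean on valgrind's `...` frame wildcard, roughly as in the sketch below, with a made-up suppression name. Since no `fun:` frame mentions NumPy, such a rule could indeed mask leaks from elsewhere, which is exactly the concern above:)

```
{
   numpy-import-indirect-tuple-hypothetical
   Memcheck:Leak
   match-leak-kinds: indirect
   fun:malloc
   ...
   fun:tuple_alloc
   ...
   fun:PyImport_Import
}
```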

@seberg
Member

seberg commented Jan 31, 2021

Honestly, I don't use any suppression files at all, so :). The only reason we don't have a CI run in NumPy is that there are some tests that always have to leak, and running most of the test suite with pytest-valgrind (it's slower, because it checks for leaks after every test and not just at exit) takes about 24 hours...

It's great that you guys do these checks! I am a little stumped, working off master, but:

sebastian@seberg-x1 ~/forks/numpy-master
 % PYTHONMALLOC=malloc valgrind --leak-check=full --show-leak-kinds=definite python3.9-dbg runtests.py --python
==14334== Memcheck, a memory error detector
==14334== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==14334== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==14334== Command: python3.9-dbg runtests.py --python
==14334== 
Building, see build.log...
==14335== Warning: invalid file descriptor 1024 in syscall close()
==14335== Warning: invalid file descriptor 1025 in syscall close()
==14335== Warning: invalid file descriptor 1026 in syscall close()
==14335== Warning: invalid file descriptor 1027 in syscall close()
==14335==    Use --log-fd=<number> to select an alternative log fd.
==14335== Warning: invalid file descriptor 1028 in syscall close()
==14335== Warning: invalid file descriptor 1029 in syscall close()
Build OK
Enabling display of all warnings
Python 3.9.1+ (default, Jan 20 2021, 14:49:22) 
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> import numpy, sys; print(numpy.__version__); print(sys.version); print(sys.abiflags)
1.21.0.dev0+586.gf55f5d455
3.9.1+ (default, Jan 20 2021, 14:49:22) 
[GCC 10.2.1 20210110]
d
>>> 
now exiting InteractiveConsole...
==14334== 
==14334== HEAP SUMMARY:
==14334==     in use at exit: 3,570,926 bytes in 20,808 blocks
==14334==   total heap usage: 374,075 allocs, 353,267 frees, 55,757,282 bytes allocated
==14334== 
==14334== LEAK SUMMARY:
==14334==    definitely lost: 0 bytes in 0 blocks
==14334==    indirectly lost: 0 bytes in 0 blocks
==14334==      possibly lost: 2,974,833 bytes in 19,418 blocks
==14334==    still reachable: 596,093 bytes in 1,390 blocks
==14334==         suppressed: 0 bytes in 0 blocks
==14334== Reachable blocks (those to which a pointer was found) are not shown.
==14334== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==14334== 
==14334== For lists of detected and suppressed errors, rerun with: -s
==14334== ERROR SUMMARY: 4228 errors from 4228 contexts (suppressed: 0 from 0)

Although let me try without the interactive shell... Maybe that does something weird.

@seberg
Member

seberg commented Jan 31, 2021

Yeah, same running from the shell directly; no reported leaks with just import (and version printing). Maybe the cython version matters?

>>> import cython
>>> cython.__version__
'0.29.21'

@bstaletic
Author

one small thing, in case you hit this inside one of the many tests: my little pytest plugin, https://pypistats.org/packages/pytest-valgrind, could help with finding exactly which test is failing.

We actually stumbled across that plugin when we were first setting up valgrind with pybind11. It looked interesting, but so far we have decided not to rely on it, because of its inability to report errors that happen at startup or shutdown. It would certainly have come in handy a few times by now, but habits, I guess. I'm used to just commenting out a bunch of tests until I pinpoint where the error is coming from, which is the caveman method.

@seberg
Member

seberg commented Jan 31, 2021

I did that first when cleaning out NumPy: manual bisecting (and in the end, once you hit the individual test level, it is often necessary in any case). Then I wrote the plugin, thinking it is a bit better ;).
The plugin just flushes all initial errors right now; I'm happy to change that if it helps you. Just an offer if you think it might be useful. Aside from narrowing down tests, it doesn't actually do much.
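(For readers following along, the plugin's invocation is roughly the following sketch; these flags are from memory, so check the pytest-valgrind README for the exact form, and the log-file path is arbitrary:)

```shell
# Run the test suite under valgrind, with the plugin matching valgrind's
# per-test output back to individual tests via the shared log file.
PYTHONMALLOC=malloc valgrind --show-leak-kinds=definite \
    --log-file=valgrind-output.log \
    python -m pytest --valgrind --valgrind-log=valgrind-output.log
```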

@YannickJadoul
Contributor

Yeah, same running from the shell directly; no reported leaks with just import (and version printing). Maybe the cython version matters?

>>> import cython
>>> cython.__version__
'0.29.21'
$ python3.9-dbg
Python 3.9.1 (default, Dec  8 2020, 03:24:52) 
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cython
>>> cython.__version__
'0.26.1'

That seems to be the same?

Besides, I believe I really just installed the wheel from PyPI. I'm also using deadsnakes' python3.9-dbg (which I suppose has a bunch of Debian patches), and so is our CI. Can that be a factor playing into this?

@YannickJadoul
Contributor

The plugin just flushes all initial errors right now, I happy to change that if it helps you. Just an offer if you think it might be useful. Aside from narrowing down tests, it doesn't actually do much.

I must admit I was intrigued by the plugin :-) Given that we're now running on CI, it should be easy enough to spot the test that changed and (probably) causes the leak, but I'd be happy to give that a try!

But yes, as @bstaletic says, pybind11 has a bunch of internals so we should definitely check the first import of a pybind11 module (outside of a pytest test function, I'm afraid) and the final cleanup on interpreter shutdown.

@bstaletic
Author

Besides, I believe I really just installed the wheel from PyPI. I'm also using deadsnakes' python3.9-dbg (which I suppose has a bunch of Debian patches), and so is our CI. Can that be a factor playing into this?

I can repro with my custom built, non-patched, CPython.

  1. python -m venv v
  2. . ./v/bin/activate
  3. which pip - just to confirm everything's as expected. Virtual environments make me paranoid, after staring at CI so much.
  4. PYTHONMALLOC=malloc valgrind --leak-check=full --show-leak-kinds=definite,indirect python -c import\ numpy
  5. deactivate

Here's the summary:

==6236== LEAK SUMMARY:
==6236==    definitely lost: 1,936 bytes in 19 blocks
==6236==    indirectly lost: 520 bytes in 8 blocks
==6236==      possibly lost: 2,958,818 bytes in 19,291 blocks
==6236==    still reachable: 404,409 bytes in 1,201 blocks
==6236==         suppressed: 0 bytes in 0 blocks
==6236== Reachable blocks (those to which a pointer was found) are not shown.

That includes the new PyImport_Import one.

@seberg
Member

seberg commented Jan 31, 2021

Thanks @bstaletic, I can reproduce it with the wheel. I have not yet tried the 1.20.x branch, but master has not diverged a whole lot.
@mattip what cython version did we use for the wheel build? I still feel the cython modules are the most likely cause.

@seberg
Member

seberg commented Jan 31, 2021

Never mind, I can reproduce the leak on the 1.20 branch, just not on the master branch (I am super surprised, because I would think the generated Cython code is pretty much unchanged; so maybe I have to triple-check).

Not sure what relevant change happened in between, but at least I can reproduce it now. I also doubt it is a big issue (since it is probably on import only). But I would like it if you guys didn't even need a numpy suppression file :).

@YannickJadoul
Contributor

But I would like it if you guys wouldn't even need a numpy suppression file :).

That would be amazing! But if you just tell us "these are the leaks you need to live with and you don't need to report similar ones in the future", we're also totally OK with that :-)

@seberg
Member

seberg commented Jan 31, 2021

@YannickJadoul, OK: I am very certain there is no need to worry about them. But I suspect they will go away again (whether due to a Cython upgrade or otherwise).

For now, I have figured out that if I compile NumPy with a debug version of Python (which also sets some debug paths in NumPy itself), it doesn't show up. But if I compile with plain python3.9, it does show up.
So the "difference" between master and 1.20 was actually just me compiling with python3.9-dbg runtests.py before. I have tried with the Cython 3 pre-release, and it doesn't make a difference.

For today, I will stop looking into this. I hope there is some small, simple fix in Cython that would make this go away, so will look into it a bit more in the next days and hopefully find it. As this appears in module init, I am not worried that it is problematic.

@YannickJadoul
Contributor

Thanks for confirming. We'll stick to the current suppression files, then, and hopefully at some point in the future we can try without them and find a beautiful world free of suppression files ;-)

@YannickJadoul
Contributor

Thanks for all your effort, @seberg!!

@seberg
Member

seberg commented Feb 1, 2021

OK, I had another look. The reason I couldn't reproduce it at first is that it is related to Cython's CYTHON_COMPILING_IN_LIMITED_API (I assume the debug build disables it; a PyPy build would as well). Not compiling with that flag ensures that valgrind cannot consider the memory "definitely lost", because the module is never "cleaned up" and a pointer to it is kept until the end.
On the upside: that proves the leak is really just one tuple, and it doesn't really matter much.

Now I find it a bit strange that only one tuple is lost per file, and I do not really see where a DECREF might be missing (unless we actually have an import error that gets eaten later). I wonder whether Python's tuple free-list could be involved, but I expect it should be cleared out correctly (and even if not, it should probably not show up as lost memory in valgrind).

@seberg
Member

seberg commented Feb 1, 2021

I have convinced myself there is nothing actionable in NumPy, so I opened a Cython issue, cython/cython#3994, just in case. My hope is that with some new Python or Cython version this just goes away. I feel this has to do with module cleanup, and I am not even certain Python itself is fully cleaned up in that regard yet.

@seberg seberg closed this as completed Feb 1, 2021
@bstaletic
Author

@seberg Thanks for investigating!

I feel this has to do with module cleanup, and I am not even certain Python itself is cleaned up in that regard yet.

CPython definitely isn't clean, but there are efforts to clean it up in some future version. Maybe one day...

@YannickJadoul
Contributor

Thanks again, @seberg!

@seberg
Member

seberg commented Apr 15, 2021

As pointed out on the cython issue, you can get rid of these "not really leaks" by running cython with --cleanup=3 (or similar), i.e. the cython generate_cleanup_code option that I was not aware of. (The easiest way I found to try it is patching tools/cythonize.py, which is a bit annoying though.)
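For reference, the flag is passed when Cython generates the C sources; a sketch (the .pyx filename here is illustrative, and the exact semantics of each cleanup level are described under Cython's generate_cleanup_code option):

```shell
# Emit module-level cleanup code so module globals are released at
# interpreter shutdown, instead of being reported as leaked by valgrind.
cython --cleanup 3 mtrand.pyx
```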

@YannickJadoul
Contributor

Thank you for the update, @seberg!

I think you're right in what you state in cython/cython#3994 ;-)

(although I guess they are likely happier to add some suppressions than to patch numpy).

Great to have pinpointed the actual issue, though!
