Skip to content

GH-118095: Make invalidating and clearing executors memory safe #118459

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
May 1, 2024

Conversation

markshannon
Copy link
Member

@markshannon markshannon commented May 1, 2024

This fixes a crash when an executor forms a reference cycle with itself.
This bug has been present for a while, but was only exposed by #118420.

Also fixes another latent bug in _Py_Executors_InvalidateDependency, where clearing one executor could clear another leaving the iteration variable pointing to freed memory.

The three main changes amongst the main changes are:

  • Separating the two operations of detaching an executor from a code object and clearing executors. Freeing a code object should detach the executors, but not necessarily clear them.
  • Making sure that an executor is not freed in the middle of clearing it
  • Fixing the use-after-free bug in _Py_Executors_InvalidateDependency
  • Only convert RESUME to RESUME_CHECK: Converting ENTER_EXECUTOR to RESUME_CHECK would loose the executor.

@markshannon
Copy link
Member Author

There are no new tests as I can't see a way to test this, especially the out-of-memory checks.

const _PyExecutorObject *side = executor->exits[i].executor;
executor->exits[i].temperature = initial_unreachable_backoff_counter();
if (side != cold) {
executor->exits[i].executor = cold;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the executor owns the reference to exits[i].executor when that is not equal to the cold exit, but not when it is equal. That is potentially confusing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It owns a reference in both cases, it is just that cold exit executors are immortal, so we can be lose with the refcounting.

_Py_ExecutorClear(exec);
unlink_executor(exec);
if (no_memory) {
exec->vm_data.valid = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's really ok to just skip clearing exec in this case?

Copy link
Member Author

@markshannon markshannon May 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We never execute an executor with valid == 0, so it does what says and invalidates the dependent executor.

It might delay later re-optimizations and defer reclaiming some objects.
We've hit a memory error at this point, so the program is probably dying anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've hit a memory error at this point, so the program is probably dying anyway.

That feels like a dangerous assumption -- an app may have a gigabyte array of temp data somewhere that it can free and retry.

@markshannon
Copy link
Member Author

The 32 bit Windows Jit build failure is probably unrelated as it has been flaky lately.

@markshannon markshannon merged commit f6fab21 into python:main May 1, 2024
52 of 53 checks passed
@bedevere-bot
Copy link

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot ARM64 MacOS M1 Refleaks NoGIL 3.x has failed when building commit f6fab21.

What do you need to do:

  1. Don't panic.
  2. Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
  3. Go to the page of the buildbot that failed (https://buildbot.python.org/all/#builders/1368/builds/882) and take a look at the build logs.
  4. Check if the failure is related to this commit (f6fab21) or if it is a false positive.
  5. If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/all/#builders/1368/builds/882

Failed tests:

  • test_threading_local
  • test_threading
  • test.test_concurrent_futures.test_thread_pool

Failed subtests:

  • test_interrupt_main_subthread - test.test_threading.InterruptMainTests.test_interrupt_main_subthread
  • test_free_reference - test.test_concurrent_futures.test_thread_pool.ThreadPoolExecutorTest.test_free_reference

Summary of the results of the build (if available):

==

Click to see traceback logs
Traceback (most recent call last):
  File "/Users/ec2-user/buildbot/buildarea/3.x.itamaro-macos-arm64-aws.macos-with-brew.refleak.nogil/build/Lib/test/test_concurrent_futures/executor.py", line 131, in test_free_reference
    self.assertIsNone(wr())
    ~~~~~~~~~~~~~~~~~^^^^^^
AssertionError: <test.test_concurrent_futures.executor.MyObject object at 0x20006070090> is not None


Traceback (most recent call last):
  File "/Users/ec2-user/buildbot/buildarea/3.x.itamaro-macos-arm64-aws.macos-with-brew.refleak.nogil/build/Lib/test/test_threading.py", line 2035, in test_interrupt_main_subthread
    t.start()
    ~~~~~~~^^
  File "/Users/ec2-user/buildbot/buildarea/3.x.itamaro-macos-arm64-aws.macos-with-brew.refleak.nogil/build/Lib/threading.py", line 977, in start
    self._started.wait()  # Will set ident and native_id
    ~~~~~~~~~~~~~~~~~~^^
  File "/Users/ec2-user/buildbot/buildarea/3.x.itamaro-macos-arm64-aws.macos-with-brew.refleak.nogil/build/Lib/threading.py", line 656, in wait
    with self._cond:
    ...<3 lines>...
        return signaled
  File "/Users/ec2-user/buildbot/buildarea/3.x.itamaro-macos-arm64-aws.macos-with-brew.refleak.nogil/build/Lib/threading.py", line 307, in __exit__
    return self._lock.__exit__(*args)
           ~~~~~~~~~~~~~~~~~~~^^^^^^^
RuntimeError: release unlocked lock


Traceback (most recent call last):
  File "/Users/ec2-user/buildbot/buildarea/3.x.itamaro-macos-arm64-aws.macos-with-brew.refleak.nogil/build/Lib/threading.py", line 659, in wait
    signaled = self._cond.wait(timeout)
               ~~~~~~~~~~~~~~~^^^^^^^^^
  File "/Users/ec2-user/buildbot/buildarea/3.x.itamaro-macos-arm64-aws.macos-with-brew.refleak.nogil/build/Lib/threading.py", line 368, in wait
    self._acquire_restore(saved_state)
    ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
  File "/Users/ec2-user/buildbot/buildarea/3.x.itamaro-macos-arm64-aws.macos-with-brew.refleak.nogil/build/Lib/threading.py", line 315, in _acquire_restore
    def _acquire_restore(self, x):
    
KeyboardInterrupt

no_memory = true;
exec->vm_data.valid = 0;
}
}
if (is_invalidation) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any calls where this flag is passed as 0, so maybe the flag argument is no longer needed?

@markshannon markshannon deleted the safe-executor-dealloc branch August 6, 2024 10:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants