Skip to content

Avoid memory leaks when destroying TLS values on MacOS X #28661

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from

Conversation

ranma42
Copy link
Contributor

@ranma42 ranma42 commented Sep 25, 2015

Depends on #28631.
Fixes the memory fault in #28129 (and makes the problem cause explicit: stdout is being initialised during the thread destruction).

Before this is merged I would like to investigate how the mac-specific hack in destroy_value is related to this (in particular if it is still needed).

Some guidance about what is the best way to use TLS variables in a robust way (in particular in drop implementations) would be very appreciated.
From the reference manual I infer that state should always be checked if the function can be called from a site in which the TLS destructor has already run.
At least in library code this might become very common, so I guess we might either have an idiomatic way to do it (and provide an example in the docs) or even have a safe_with method which provides an Option, so that the not-available (already destroyed/being destroyed) case always needs to be explicitly handled.

Move the panic logging code to a function separate from `on_panic` and
simplify the code to decide whether the backtrace should be logged.
Move the panic handling logic from the `unwind` module to `panicking`
and use a panic counter to distinguish between normal state, panics
and double panics.
The double-panic `abort` is run after the logging code, to provide
feedback in case of a double-panic. This means that if the panic
logging fails with a panic, the `abort` might never be reached.

This should not normally occur, but if the `on_panic` function detects
more than 2 panics, aborting *before* logging makes panic handling
somewhat more robust, as it avoids an infinite recursion, which would
eventually crash the process, but also make the problem harder to
debug.

This handles the FIXME about what to do if the thread printing panics.
The `destroy_value` function should only be used within the `imp`
module, it should not be exported.
The MacOS X TLS implementation does not try tro handle the case in
which new thread termination callbacks are registered (using
`_tlv_atexit`) while the thread is already terminating.

This can cause memory leaks and double-frees, as pointed out in

Disallowing the registration of new destructors fixes this issue.
As additional bonus, it should also make it possible to simplify
`run_dtors` in the Linux implementation of TLS.
Since LOCAL_STDERR is a TLS variable, it is possible that it has
already been destroyed when the `on_panic` function is invoked.

Instead of re-panicking on its access, protect the panic logging code
checking the TLS variable status (as the code is already implicitly
doing with `thread_info::current_thread`).
@rust-highfive
Copy link
Contributor

r? @brson

(rust_highfive has picked a reviewer for you, use r? to override)

@alexcrichton
Copy link
Member

Can this separate out the commits from #28631? It looks like they're not functionally needed and it seems like unrelated functionality perhaps? Would at least make review easier!

if intrinsics::needs_drop::<T>() && self.dtor_running.get() {
return None
if intrinsics::needs_drop::<T>() {
if DTOR_RUNNING.with(|running| { running.get() }) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately this change may be a little too-backwards-incompatible to work out well. I worry that this is "fixing" the problem by papering over some other underlying cause. One example of the breakage is the travis logs indicate a failing test which otherwise passes here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change I posted takes a pretty extreme stance: if we are destroying TLS, we are not guaranteed that any TLS is usable.
Conversely the failing test checks that never-used TLS variable can be initialised during the destruction of TLS.

Should we allow never-used TLS variables to be initialised while TLS is being destroyed?
This could be fixed by using the same approach in MacOS X as in Linux (i.e. subscribing a single destructor manager, then keeping a list of dtors to be run).

I'm not sure whether this guarantee is very convenient, as only the variables that have never been used are known to be useable. The other TLS variables available during the teardown of TLS depend on the order in which they are destroyed, which looks very fragile to me.

@alexcrichton
Copy link
Member

Oh I see what the required piece is for the dependence on the other PR! While certainly a good change to check whether it's destroyed, I'm a little confused what triggered a panic in the first place? The example from #28129 I'd expect to not panic...

@ranma42
Copy link
Contributor Author

ranma42 commented Sep 26, 2015

This makes the issue with #28129 somewhat more obvious: http://is.gd/ZO6da8
The problem is that the LOCAL_STDOUT TLS variable is used from the drop implementation (indirectly, it is used because of print!).

In the original code, it was only used in drop, hence everything worked just fine on Linux (never-used TLS variable used during a drop of a TLS variable), but it broke on MacOSX because it registered a new _tlv_atexit callback while already running TLS destruction.
In the new code the LOCAL_STDOUT variable is initialised and destroyed before drop is called, hence the print! in drop "expectedly" fails.

As you can see, it is very hard to write code using TLS variables inside drop in a reliable manner.
Did you expect the example from #28129 and the one I linked to have a different panic/abort behaviour?.
The fragility of TLS usage in on_panic (the need to protect both the thread_info::current and LOCAL_STDERR by checking their state) and the hard-to-predict behaviour make me hope that there is room for improvement in this respect.

This is why I was wondering if instead of guaranteeing that never-used TLS variable can be used during TLS teardown a more robust approach could be to have a safe_with (or something like that) method on TLS variables that is guaranteed to be callable during TLS teardown without ever causing panics.
I would even argue that most functions should use something like this, otherwise calling them from drop implementations is going to be a Bad Thing.
In the current state, panics would happen depending on whether they have already been destroyed.
Post-PR it would still be bad, as they would panic if already destroyed or never-used.

@ranma42
Copy link
Contributor Author

ranma42 commented Sep 26, 2015

I might even go as far saying that the (possibly somewhat inconvenient) robust/safe way to constrain the spec is: no TLS variable is accessible through with during TLS teardown; any call to with will panic, even for already-initialised, not-yet-destroyed variables.

The destruction of TLS variable would still be possible (self would be available in drop).
From my point of view, the safe_with method could simply be a way to enforce this behaviour.
Otherwise, we could try to ensure (at least in std) that all the with calls are protected by a state check (this would not be needed only for code which we know to be unreachable from TLS dtors).
Providing the same kind of guarantee on external code is currently impossible, since state is unstable.
Moreover, I believe that even if it was stable, it would be very hard to guarantee that the check is not forgotten wherever it is appropriate.

Uhm... I realised that my comments might look more like material for a discussion.
Shall I close this PR and open an RFC?

@alexcrichton
Copy link
Member

Whether or not this should have an RFC depends on how much of an invasive change this turns out to be. New methods probably want to go through an RFC, but minor tweaks to existing ones are fine to leave around.

Thanks for the explanation about where the panic comes from, makes sense! I would personally be wary of a safe_with-sort-of-method because that was the intended purpose of the state accessor where the state of TLS can be queried to avoid panicking in situations (such as when we're already panicking).

I'm worried about blanket disallowing reinitilization during destruction mainly because there doesn't seem to be a technical reason to do so other than "weird stuff happens otherwise", and I'd prefer to have a better understand of what's happening under the hood before moving forward. My intention in opening the issue was to just document what was happening and track behavior like the corruption on OSX. I don't have a strong opinion one way or another how it should be handled, and I'm personally ok with weird behavior so long as there's no unsafety (followed up by documenting how to avoid weird behavior).

@alexcrichton alexcrichton assigned alexcrichton and unassigned brson Sep 28, 2015
@alexcrichton
Copy link
Member

Closing due to inactivity, but feel free to submit an RFC or reopen this as a new PR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants