Avoid memory leaks when destroying TLS values on MacOS X #28661

ranma42 · 2015-09-25T15:36:38Z

Depends on #28631.
Fixes the memory fault in #28129 (and makes the problem cause explicit: stdout is being initialised during the thread destruction).

Before this is merged I would like to investigate how the mac-specific hack in destroy_value is related to this (in particular if it is still needed).

Some guidance about what is the best way to use TLS variables in a robust way (in particular in drop implementations) would be very appreciated.
From the reference manual I infer that state should always be checked if the function can be called from a site in which the TLS destructor has already run.
At least in library code this might become very common, so I guess we might either have an idiomatic way to do it (and provide an example in the docs) or even have a safe_with method which provides an Option, so that the not-available (already destroyed/being destroyed) case always needs to be explicitly handled.

Move the panic logging code to a function separate from `on_panic` and simplify the code to decide whether the backtrace should be logged.

Move the panic handling logic from the `unwind` module to `panicking` and use a panic counter to distinguish between normal state, panics and double panics.

The double-panic `abort` is run after the logging code, to provide feedback in case of a double-panic. This means that if the panic logging fails with a panic, the `abort` might never be reached. This should not normally occur, but if the `on_panic` function detects more than 2 panics, aborting *before* logging makes panic handling somewhat more robust, as it avoids an infinite recursion, which would eventually crash the process, but also make the problem harder to debug. This handles the FIXME about what to do if the thread printing panics.

The `destroy_value` function should only be used within the `imp` module, it should not be exported.

The MacOS X TLS implementation does not try tro handle the case in which new thread termination callbacks are registered (using `_tlv_atexit`) while the thread is already terminating. This can cause memory leaks and double-frees, as pointed out in Disallowing the registration of new destructors fixes this issue. As additional bonus, it should also make it possible to simplify `run_dtors` in the Linux implementation of TLS.

Since LOCAL_STDERR is a TLS variable, it is possible that it has already been destroyed when the `on_panic` function is invoked. Instead of re-panicking on its access, protect the panic logging code checking the TLS variable status (as the code is already implicitly doing with `thread_info::current_thread`).

rust-highfive · 2015-09-25T15:37:33Z

r? @brson

(rust_highfive has picked a reviewer for you, use r? to override)

alexcrichton · 2015-09-25T17:56:01Z

Can this separate out the commits from #28631? It looks like they're not functionally needed and it seems like unrelated functionality perhaps? Would at least make review easier!

alexcrichton · 2015-09-25T17:58:04Z

src/libstd/thread/local.rs

-            if intrinsics::needs_drop::<T>() && self.dtor_running.get() {
-                return None
+            if intrinsics::needs_drop::<T>() {
+                if DTOR_RUNNING.with(|running| { running.get() }) {


Unfortunately this change may be a little too-backwards-incompatible to work out well. I worry that this is "fixing" the problem by papering over some other underlying cause. One example of the breakage is the travis logs indicate a failing test which otherwise passes here.

The change I posted takes a pretty extreme stance: if we are destroying TLS, we are not guaranteed that any TLS is usable.
Conversely the failing test checks that never-used TLS variable can be initialised during the destruction of TLS.

Should we allow never-used TLS variables to be initialised while TLS is being destroyed?
This could be fixed by using the same approach in MacOS X as in Linux (i.e. subscribing a single destructor manager, then keeping a list of dtors to be run).

I'm not sure whether this guarantee is very convenient, as only the variables that have never been used are known to be useable. The other TLS variables available during the teardown of TLS depend on the order in which they are destroyed, which looks very fragile to me.

alexcrichton · 2015-09-25T18:02:26Z

Oh I see what the required piece is for the dependence on the other PR! While certainly a good change to check whether it's destroyed, I'm a little confused what triggered a panic in the first place? The example from #28129 I'd expect to not panic...

ranma42 · 2015-09-26T07:50:44Z

This makes the issue with #28129 somewhat more obvious: http://is.gd/ZO6da8
The problem is that the LOCAL_STDOUT TLS variable is used from the drop implementation (indirectly, it is used because of print!).

In the original code, it was only used in drop, hence everything worked just fine on Linux (never-used TLS variable used during a drop of a TLS variable), but it broke on MacOSX because it registered a new _tlv_atexit callback while already running TLS destruction.
In the new code the LOCAL_STDOUT variable is initialised and destroyed before drop is called, hence the print! in drop "expectedly" fails.

As you can see, it is very hard to write code using TLS variables inside drop in a reliable manner.
Did you expect the example from #28129 and the one I linked to have a different panic/abort behaviour?.
The fragility of TLS usage in on_panic (the need to protect both the thread_info::current and LOCAL_STDERR by checking their state) and the hard-to-predict behaviour make me hope that there is room for improvement in this respect.

This is why I was wondering if instead of guaranteeing that never-used TLS variable can be used during TLS teardown a more robust approach could be to have a safe_with (or something like that) method on TLS variables that is guaranteed to be callable during TLS teardown without ever causing panics.
I would even argue that most functions should use something like this, otherwise calling them from drop implementations is going to be a Bad Thing.
In the current state, panics would happen depending on whether they have already been destroyed.
Post-PR it would still be bad, as they would panic if already destroyed or never-used.

ranma42 · 2015-09-26T08:03:18Z

I might even go as far saying that the (possibly somewhat inconvenient) robust/safe way to constrain the spec is: no TLS variable is accessible through with during TLS teardown; any call to with will panic, even for already-initialised, not-yet-destroyed variables.

The destruction of TLS variable would still be possible (self would be available in drop).
From my point of view, the safe_with method could simply be a way to enforce this behaviour.
Otherwise, we could try to ensure (at least in std) that all the with calls are protected by a state check (this would not be needed only for code which we know to be unreachable from TLS dtors).
Providing the same kind of guarantee on external code is currently impossible, since state is unstable.
Moreover, I believe that even if it was stable, it would be very hard to guarantee that the check is not forgotten wherever it is appropriate.

Uhm... I realised that my comments might look more like material for a discussion.
Shall I close this PR and open an RFC?

alexcrichton · 2015-09-28T16:50:35Z

Whether or not this should have an RFC depends on how much of an invasive change this turns out to be. New methods probably want to go through an RFC, but minor tweaks to existing ones are fine to leave around.

Thanks for the explanation about where the panic comes from, makes sense! I would personally be wary of a safe_with-sort-of-method because that was the intended purpose of the state accessor where the state of TLS can be queried to avoid panicking in situations (such as when we're already panicking).

I'm worried about blanket disallowing reinitilization during destruction mainly because there doesn't seem to be a technical reason to do so other than "weird stuff happens otherwise", and I'd prefer to have a better understand of what's happening under the hood before moving forward. My intention in opening the issue was to just document what was happening and track behavior like the corruption on OSX. I don't have a strong opinion one way or another how it should be handled, and I'm personally ok with weird behavior so long as there's no unsafety (followed up by documenting how to avoid weird behavior).

alexcrichton · 2015-10-21T21:17:02Z

Closing due to inactivity, but feel free to submit an RFC or reopen this as a new PR!

ranma42 added 6 commits September 24, 2015 23:45

Separate panic logging code

44d1b14

Move the panic logging code to a function separate from `on_panic` and simplify the code to decide whether the backtrace should be logged.

Explicitly count the number of panics

c7b8490

Move the panic handling logic from the `unwind` module to `panicking` and use a panic counter to distinguish between normal state, panics and double panics.

Restrict visibility of destroy_value

cef1477

The `destroy_value` function should only be used within the `imp` module, it should not be exported.

rust-highfive assigned brson Sep 25, 2015

alexcrichton reviewed Sep 25, 2015
View reviewed changes

alexcrichton assigned alexcrichton and unassigned brson Sep 28, 2015

alexcrichton closed this Oct 21, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid memory leaks when destroying TLS values on MacOS X #28661

Avoid memory leaks when destroying TLS values on MacOS X #28661

ranma42 commented Sep 25, 2015

rust-highfive commented Sep 25, 2015

alexcrichton commented Sep 25, 2015

alexcrichton Sep 25, 2015

ranma42 Sep 26, 2015

alexcrichton commented Sep 25, 2015

ranma42 commented Sep 26, 2015

ranma42 commented Sep 26, 2015

alexcrichton commented Sep 28, 2015

alexcrichton commented Oct 21, 2015

Avoid memory leaks when destroying TLS values on MacOS X #28661

Avoid memory leaks when destroying TLS values on MacOS X #28661

Conversation

ranma42 commented Sep 25, 2015

rust-highfive commented Sep 25, 2015

alexcrichton commented Sep 25, 2015

alexcrichton Sep 25, 2015

Choose a reason for hiding this comment

ranma42 Sep 26, 2015

Choose a reason for hiding this comment

alexcrichton commented Sep 25, 2015

ranma42 commented Sep 26, 2015

ranma42 commented Sep 26, 2015

alexcrichton commented Sep 28, 2015

alexcrichton commented Oct 21, 2015