-
-
Notifications
You must be signed in to change notification settings - Fork 592
Description
Dear ThePhD,
While working with sol::load_result, I hit an amazing combination of what I believe is a little sol bug and a crazy compiler bug in GCC7.5 (sigh). I wouldn't have spent that much time with it, but this is the standard Ubuntu 18.04 compiler and is kind of mainstream... I think the sol bug is easily fixable, but I am providing info on the GCC7.5 bug as well because maybe you have an idea how to address it in sol better, and maybe there are other places affected but I was fortunate to not hit any of them.
Long story short: I think sol::load_result copy constructor and copy assignment operator should be deleted instead of defaulted. It appears that the class has custom move semantics which set the "popcount" to the moved class to 0, whereas the copy semantics doesn't, which results in things being double popped from the stack in very subtle and hard to debug ways, and memory corruption which is very difficult to debug later.
Long story and example below - I am sure you enjoy the occasional C++ craziness during late evening hours, I certainly spent few days digging into this 🤦 . Consider following code snippet:
// Fool GCC7.5 to think that we have true branching in below function
bool condition = false;
std::optional<sol::load_result> LoadScript(sol::state& state)
{
sol::load_result result = state.load("");
if(condition)
{
// This branch makes the RVO non-trivial, and triggers the GCC7.5 bug with copying of load_result
return std::nullopt;
}
return result;
}
sol::state state;
std::optional<sol::load_result> result = LoadScript(state);
// do something with result if result has a value, e.g. cast to function, call function etc.
This is built with C++17 standard set, as the optional implies. On GCC9 the code works perfectly fine. On GCC7.5 it also works, however adding more code around which involves Lua stack makes it crash. I removed that code because this illustrates the problem in a minimal way, and you can see it by adding "printf's" in the constructors, the destructor and the move constructor/operator and observe what happens, if you don't believe my analysis.
So what happens? You probably noticed that the above code doesn't really have an explicit copy anywhere. The function LoadScript performs return value optimization for result, since it is a local variable that doesn't have to be copied and can be moved instead. The difference is within the "how is the RVO being compiled". Here is the behavior for three different setups I tried out with respective behavior:
GCC9 + vanilla sol3.2.1
Behavior: compiler performs RVO for optional result in all branches, thus result is never copied and code works well.
GCC7 + vanilla sol3.2.1
result gets copied into std::optional without RVO, because GCC7 doesn't recognize that sol::load_result is the template type in the std::optional, thus it invokes the copy constructor when putting result's value into the optional return value -> bad things happen later
GCC7 + sol3.2.1 with patch which deletes copy ctor/assignment of load_result
Now that's the best part. I would have expected that deleting the copy constructor would cause compiler errors, because GCC7.5 now can't copy result into std::optional without std::moving it explicitly in the return statement, and everything is rainbows . Surprisingly, the code compiles fine. Moreover, the lua stack is not corrupt now - because the copy constructor is never invoked. The behavior, though, is that std::optional::has_value() now returns false, meaning that the "return result" statement didn't do anything, silently. I think that's a compiler bug in GCC7.5, and I have no idea how to make it cause a compiler error... Moreover, I can't explain how the compiler even knows how to construct std::optional from sol::load_result, since none of the std::optional constructors accepts an lvalue...
Any ideas?
While at it, I also figured out that one of the private members of load_result is not used anywhere and can be probably removed - returncount. I can prepare a PR for that as well?
Context:
Sol Version: 3.2.1
GCC version: 7.5 (standard package on Ubuntu18.04 with latest updates)
Lua 5.1.1 (standard lua)
Debug vs. Release - both trigger the problem