Calling coro::sync_wait with coro::ring_buffer::consume returns default constructed objects for complex return values #242

ripose-jp · 2024-01-21T22:21:51Z

When calling coro::sync_wait() with the result of coro::ring_buffer::consume() it is possible for coro::sync_wait() to return a default constructed object instead of the actual object. This only seems to happen with complex objects as simple structs and basic types don't exhibit this behavior.

From what I can tell, this happens somewhere in tl_expected.hpp after returning from ring_buffer::consume_operation::await_return().

I've only ever been able to reproduce this bug using a protobuf message that contains a oneof field, so if this bug report seems highly specific, I apologize.

Here's a minimal example along with some build scripts to simplify the process of testing:

main.cpp

#include <coro/coro.hpp>

#include <iostream>

#include "example.pb.h"

int main(void)
{
    coro::ring_buffer<Example, 1> buffer;
    const auto foo = [&buffer] () -> coro::task<void>
    {
        Example data;
        Message *msg = data.mutable_message();
        msg->set_id(1);
        msg->set_text("Hello world!");
        std::cout << "Inside the coroutine\n";
        std::cout << "Message Case: " << static_cast<int>(data.Messages_case()) << '\n';
        std::cout << "ID: " << data.message().id() << '\n';
        std::cout << "Message: " << data.message().text() << '\n';
        co_await buffer.produce(std::move(data));
    };
    coro::sync_wait(foo());
    auto result = coro::sync_wait(buffer.consume());
    if (!result)
    {
        std::cout << "Consume failed!\n";
        return 0;
    }
    auto data = std::move(*result);
    std::cout << "Outside the coroutine\n";
    std::cout << "Message Case: " << static_cast<int>(data.Messages_case()) << '\n';
    std::cout << "ID: " << data.message().id() << '\n';
    std::cout << "Message: " << data.message().text() << '\n';
    return 0;
}

example.proto

syntax = "proto3";

message Message
{
    uint32 id = 1;
    string text = 2;
};

message Example
{
    oneof Messages
    {
        Message message = 1;
    }
}

CMakeLists.txt

cmake_minimum_required(VERSION 3.25)

project(
    libcoro-sync-wait-default-construct
    VERSION 1.0.0
    LANGUAGES CXX
)

find_package(Protobuf REQUIRED)

include(FetchContent)
set(LIBCORO_BUILD_EXAMPLES OFF)
set(LIBCORO_BUILD_TESTS OFF)
set(LIBCORO_FEATURE_TLS OFF)
set(LIBCORO_FEATURE_NETWORKING OFF)
FetchContent_Declare(
    libcoro
    GIT_REPOSITORY https://github.com/jbaldwin/libcoro.git
    GIT_TAG        0b65aa2b3a2b1227b6c74b59882f1afc95c91dcf
)
FetchContent_MakeAvailable(libcoro)

protobuf_generate_cpp(PROTO_SRCS PROTO_HDRS example.proto)
add_executable(main main.cpp ${PROTO_SRCS} ${PROTO_HDRS})
target_compile_features(main PRIVATE cxx_std_20)
target_compile_options(main PRIVATE -Wall -Wextra -Werror)
target_include_directories(main PRIVATE ${CMAKE_CURRENT_BINARY_DIR})
target_link_libraries(main PRIVATE libcoro protobuf::libprotobuf)

The text was updated successfully, but these errors were encountered:

jbaldwin · 2024-01-22T18:45:03Z

I've got a PR open to upgrade tl::expected to the latest version as a start, it looks like it has some bug fixes, but nothing specifically calls out the behavior you are seeing.

https://github.com/TartanLlama/expected/releases

jbaldwin · 2024-01-22T19:28:56Z

TEST_CASE("ring_buffer issue-242 default constructed complex objects on consume", "[ring_buffer]")
{
    struct message
    {
        uint32_t id;
        std::string text;
    };

    struct example
    {
        std::optional<message> msg;
    };

    coro::ring_buffer<example, 1> buffer;

    const auto foo = [&buffer]() -> coro::task<void>
    {
        example data{};
        data.msg = {message{ .id = 1, .text = "Hello World!" }};
        std::cout << "Inside the coroutine\n";
        std::cout << "ID: " << data.msg.value().id << "\n";
        std::cout << "Text: " << data.msg.value().text << "\n";
        co_await buffer.produce(std::move(data));
        co_return;
    };

    coro::sync_wait(foo());
    auto result = coro::sync_wait(buffer.consume());
    REQUIRE(result);

    auto data = std::move(*result);
    REQUIRE(data.msg.has_value());
    REQUIRE(data.msg.value().id == 1);
    REQUIRE(data.msg.value().text == "Hello World!");
    std::cout << "Outside the coroutine\n";
    std::cout << "ID: " << data.msg.value().id << "\n";
    std::cout << "Text: " << data.msg.value().text << "\n";
}

This also seems to be reproducing the issue without protobuf, it fails on REQUIRE(data.msg.has_value())

jbaldwin · 2024-01-22T22:42:44Z

In sync_wait.hpp the sync_wait_task_promise::return_value function:

    auto return_value(return_type&& value) noexcept
    {
        m_return_value = std::addressof(value);
        return final_suspend();
    }

It takes a pointer to the return object instead of moving/taking ownership. Later when result() is called it then does std::move(*m_return_value). This seems to be the problem and causes the value to be default constructed instead of moving the original item out. I'll push a PR and I'd like you to test that it fixes your use case too if you don't mind.

ripose-jp · 2024-01-23T04:27:53Z

No problem. Just tag me when you push it. Thanks for figuring this out.

jbaldwin · 2024-01-23T22:36:58Z

Its going to take a little more time, the fix isn't as easy as I had anticipated, I know the root cause but fixing it is having some cascading issues that were already fixed on coro::task. I'm hoping to possibly port those fixes as well, or just possibly leverage coro::task for coro::sync_wait moving forward.

ripose-jp · 2024-01-24T03:11:22Z

No problem. There's no pressure to fix this. I just found a bug and though it was worth reporting. It's not currently affecting anything I'm working on.

…lt constructed objects for complex return values There appears to be a bug in tl::expected with the usage of coro::sync_wait when the return_type is not trivially moveable. It causes the destructor of the users return_type to be called before the move constructor when calling the coroutine's promise.result() function. It is worth noting in the coro::task class this exact same method of returning the promise's result is used and works fine, but it doesn't use tl::expected, which likely points to a bug in that library. To work around this bug having a temporary value in the promise.result() call appears to make the compiler generate two move constructor calls, which is less than idea, but at least the users data isn't destroyed before it is returned. Closes #242

jbaldwin · 2024-02-02T19:59:09Z

@ripose-jp I've added a PR that I think resolves the issue, the bug does appear to be in tl::expected and its usage around coro::sync_wait

https://github.com/jbaldwin/libcoro/pull/244/files#diff-3048842fe7373a54d457291bfc4c40ba9e84492042a6a087b17263d32bf3fce1R191

This is the important part of the PR if you want to take a look.

…lt constructed objects for complex return values There appears to be a bug in tl::expected with the usage of coro::sync_wait when the return_type is not trivially moveable. It causes the destructor of the users return_type to be called before the move constructor when calling the coroutine's promise.result() function. It is worth noting in the coro::task class this exact same method of returning the promise's result is used and works fine, but it doesn't use tl::expected, which likely points to a bug in that library. To work around this bug having a temporary value in the promise.result() call appears to make the compiler generate two move constructor calls, which is less than idea, but at least the users data isn't destroyed before it is returned. Closes #242

jbaldwin · 2024-02-03T16:56:01Z

After thinking about the current solution over night I'm not sure I want to merge this, but it does seem relatively clear its a bug in tl::expected? There is an identical usage of that line (that doesn't work) in coro::task for how it handles its return value and it doesn't have this problem.

ripose-jp · 2024-02-03T20:44:03Z

I'm personally not worried about your PR since it works with my example. This behavior is only only useful in unit testing for me since any production code dealing with coro::ring_buffer is going to be a coroutine. Having to do an extra move is unfortunate, but it's worth it if it produces correct behavior in my opinion.

If you're worried about it, replacing consume operations with coro::task<std::optional<T>> or something similar would be completely reasonable as well.

Fixing tl::expected works too, but since it seems abandoned by the developer, it would probably fall to you to keep it working. This might be more trouble than just avoiding the edge cases where it breaks.

jbaldwin · 2024-02-04T16:44:47Z

Hmm, maybe its time to just roll a coro::expected and use something like std::variant<T, E> internally with a simple interface to give it the same api as c++23's std::expected? Would drop the dependency and also as you noted tl::expected doesn't look particularly well maintained anymore. I dislike using std::optional<T> directly since it doesn't tell the user why the consume operation failed, even though there is really only 1 reason for why it failed at the moment. But as a user you'd need to read the docs to understand std::nullopt means you shut it down (which hopefully you know you did that...).

There is the option of upgrading to c++23 for std::expected but I'm not sure I want to do that yet, I figured this project should be at most 1 major version behind the latest C++ standard, and probably support two LTS releases on the OS distros. I should probably document that somewhere as well.

ripose-jp · 2024-02-04T22:10:57Z

coro::expected would be fine for my use case since I'm not in the position to upgrade to a C++23 compatible compiler for what I'm working on. Of course, its entirely your choice to move to C++23 when you choose, but if it happened soon, I would probably have to maintain my own C++20 compatible version for what I'm working on.

That said, I don't know what the difficulty of maintaining a std::expected compatible object would be, so maybe just using coro::task more generally around the code base would be easier than replacing tl::expected.

jbaldwin · 2024-02-04T22:35:43Z

Yeah I tried replacing the sync_task with just coro::task and it's more difficult than it seems. coro::task might need a way to support the initial suspend and final suspend awaiters generically to make it work properly, it's the one piece that is different from sync_task. On a side note Tlthis also gave me the thought to do something similar in coro::task_container in the cleanup task by using a final awaiter.

Fwiw I'm not planning on upgrading to c++23 anytime soon, would be 3+ years minimum for the next major standard drop, so it isn't a realistic solution today, just was mentioning it.

I think I'll take a stab at coro::expected and see how difficult it is for the singular ring buffer usage.

I appreciate the feedback and thoughts.

ripose-jp · 2024-02-04T22:44:17Z

I feel silly for not thinking of this earlier, but I just tested my example with C++23's std::expected and it exhibits the same behavior as tl::expected on GCC 13.1.0. It might be safe to rule tl::expected out as the culprit here unless both happen to be broken somehow.

jbaldwin · 2024-02-04T22:52:30Z

Hmm that is interesting. Good idea, I'll play around with it as well.

That does point to some kind of bug in libcoro then even though up until now it seems like it was tl::expected. Also confused why the same code in coro::task seems fine.

jbaldwin · 2024-02-07T19:33:07Z

So I tried it with std::optional<Element> and it fails similarly. It might be an issue with how coro::ring_buffer<e, n> is handling the objects?

jbaldwin · 2024-02-09T17:46:18Z

Ok I finally got smart and ran the test through valgrind, and it is a use after free bug.

Call sync_wait(awaitable)
coroutine executes and finishes
execute the return std::move(task).promise().result() line this is in theory moving the promise's return value but is probably just a temporary r-value, nothing "concrete".
sync_wait() returns and destructs all locals, this includes the ~task and the ~promise
Since the promise is destructed it also deletes the return value, we can see the destructor get called here
We then see a move operator= call on the return value type but it reads from the already free'ed memory location on the promise, this gives you the "empty" object in your example.

This probably works on simple types that fit in a register since its not calling the destuctor on the promise?

The solution with the double move gets the return value off the promise so it isn't destructed on the sync_wait() return (step 4) and allows the compiler to see that its being returned.

ripose-jp · 2024-02-11T22:38:57Z

I ran it with asan and it agrees with Valgrind about the use-after-free. I'd say you should merge your double move solution since correctness is going to be more important than performance. The only other way I can think of to get out of this is allocating return values in the heap since this bug doesn't affect trivially copyable types, but that comes with its own set of drawbacks.

…lt constructed objects for complex return values There appears to be a bug in tl::expected with the usage of coro::sync_wait when the return_type is not trivially moveable. It causes the destructor of the users return_type to be called before the move constructor when calling the coroutine's promise.result() function. It is worth noting in the coro::task class this exact same method of returning the promise's result is used and works fine, but it doesn't use tl::expected, which likely points to a bug in that library. To work around this bug having a temporary value in the promise.result() call appears to make the compiler generate two move constructor calls, which is less than idea, but at least the users data isn't destroyed before it is returned. Closes #242

jbaldwin · 2024-02-12T00:23:26Z

Agreed, I'm going to adjust some of the comments since it isn't a tl:: expected bug and then I'll get it merged.

…lt constructed objects for complex return values Running the test that replicates the bug via valgrind or asan it shows that the compiler is calling sync_wait()'s promise's destructor before moving the complex object return value. This causes the object to be destructed and then the final move out is on deleted (use after free) memory causing the object to be in a bad empty state. To resolve this for now a double move has been introduced to move the complex object off the promise object and onto the sync_wait() function callstack. This seems to keep the object alive and not be destructed when sync_wait() finally returns. Other solutions could be to heap allocate the promise() object or even the return_type but this is probably more expensive than a double move, but will vary on use case. Closes #242

…lt constructed objects for complex return values (#244) Running the test that replicates the bug via valgrind or asan it shows that the compiler is calling sync_wait()'s promise's destructor before moving the complex object return value. This causes the object to be destructed and then the final move out is on deleted (use after free) memory causing the object to be in a bad empty state. To resolve this for now a double move has been introduced to move the complex object off the promise object and onto the sync_wait() function callstack. This seems to keep the object alive and not be destructed when sync_wait() finally returns. Other solutions could be to heap allocate the promise() object or even the return_type but this is probably more expensive than a double move, but will vary on use case. Closes #242

* CMake: Keep git hooks as optional (jbaldwin#234) * Keep git hooks as optional since they are not required for using the library, only for development. * Cmake LIBCORO_RUN_GITCONFIG=ON|OFF option * Introduce concepts::io_executor (jbaldwin#238) * Introduce concepts::io_scheduler The dns resolver hard links against coro::io_scheduler via coro::task_container<coro::io_scheduler>. Since the concepts::executor was introduced a while ago to make it possible to pass in different executors (or even user defined executors) the resolver has apparently been forgotten about. Lets introduce an io_executor (needs poll()) and make the resolver use an agnostic concepts::executor. The other reason for doing this is that the windows build is failing when LIBCORO_FEATURE_PLATFORM=OFF and building as a SHARED library since the coro::io_scheduler is sneaking in via the resolver class but then ultimately isn't compiled since its excluded from the build. Closes jbaldwin#237 * Remove LIBCORO_FEATURE_PLAFTORM It really is identical with LIBCORO_FEATURE_NETWORKING at this point, it just wasn't so obvious. * Add shared library option (jbaldwin#236) * It generates static only or shared libraries according to LIBCORO_BUILD_SHARED_LIBS option via the `cmake` `BUILD_SHARED_LIBS` to propagate to all libraries built by the project. * When generating shared library in Windows, it will produce bin/libcoro.dll and lib/libcoro.lib (exposed symbols to load in the dll) * The CMake definition `CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS` does not export static variables * To export semaphore's variables a new header file is created, the `coro/export.hpp`. It's generated by `cmake` using the method `generate_export_header`, and should manage `__declspec`. * The ctest does not include the DLL folder by default, so it is configured by using `set_tests_properties` * Fix use after free in coro::task_container (jbaldwin#240) In coro::task_container, this is potential for a use after free bug to occur between the final co_return of make_cleanup_task and another thread executing gc_internal. This can happen because there is a delay between the std::scoped_lock being destroyed upon the co_return of make_cleanup_task and its final suspend. This leads to a race condition where another thread can acquire the lock and initiate a garbage collection, where that cleanup task is destroyed before it's final suspend. This leads to a use after free. This commit fixes the issue by reworking how tasks get garbage collected. Instead of unconditionally destroying tasks scheduled for deletion, the gc_internal calls is_ready on the coro::task first. This function will only be true in the cases where the task has entered final suspend, has been default constructed, or has been destroyed. This means a task can get scheduled for deletion, but not be deleted when gc_internal is called. This allows time for the task to enter final suspend, which avoids the use after free bug. To make this implementation easier, several aspects of the implementation were changed. m_task_indexes was removed in favor of a queue which keeps track of free indices into m_tasks. This makes things easier since it removes the hard constraint of keeping used indices behind m_free_pos and free indices at or after. m_tasks_to_delete was then changed to a list in order to allow for constant time deletion of elements in place. This is helpful because it is possible for tasks to be queue for deletion, but not yet safe to delete. This means unconditionally clearing the m_tasks_to_delete data structure is no longer viable. * Upgrade tl::expected to 1.1 (jbaldwin#243) * Creating a CI workflow for macos, (jbaldwin#249) - Created ci-macos.yml - tweaked CMake to deal with clang17 location * Fix for higher version of C++ usage, Clang support, and some build policy change. (jbaldwin#246) * Enable test/example build only if project is top level. LIBCORO_FEATURE_NETWORKING and LIBCORO_FEATURE_TLS macro should set on Linux environment only. * Make compilable above the C++20 mode. * Missing include: <exception> * Update CMakeLists.txt * Calling coro::sync_wait with coro::ring_buffer::consume returns default constructed objects for complex return values (jbaldwin#244) Running the test that replicates the bug via valgrind or asan it shows that the compiler is calling sync_wait()'s promise's destructor before moving the complex object return value. This causes the object to be destructed and then the final move out is on deleted (use after free) memory causing the object to be in a bad empty state. To resolve this for now a double move has been introduced to move the complex object off the promise object and onto the sync_wait() function callstack. This seems to keep the object alive and not be destructed when sync_wait() finally returns. Other solutions could be to heap allocate the promise() object or even the return_type but this is probably more expensive than a double move, but will vary on use case. Closes jbaldwin#242 * Release v0.11 (jbaldwin#250) Closes jbaldwin#241 * Revert cmake PROJECT_IS_TOP_LEVEL (jbaldwin#252) This only appears to work with cmake >= 3.25 which isn't ready for most of the CI runs, so it will be reverted for now and re-worked moving forward. * Release v0.11.1 hotfix (jbaldwin#253) * Added missing header '<atomic>' when compiling with clang and c++23 (jbaldwin#254) * CI explicitly run CMAKE_CXX_STANDARD=20|23 (jbaldwin#257) CI Added explicit 20/23 standards for: * ci-fedora * ci-macos * ci-ubuntu * ci-windows Did not add 23 builds for * ci-opensuse * ci_emscripten Closes jbaldwin#256 * Initial try to include Conan in the CI Signed-off-by: Uilian Ries <[email protected]> --------- Signed-off-by: Uilian Ries <[email protected]> Co-authored-by: Josh Baldwin <[email protected]> Co-authored-by: ripose-jp <[email protected]> Co-authored-by: Bruno Nicoletti <[email protected]> Co-authored-by: LEE KYOUNGHEON <[email protected]>

jbaldwin mentioned this issue Jan 22, 2024

Upgrade tl::expected to 1.1 #243

Merged

jbaldwin mentioned this issue Feb 2, 2024

Calling coro::sync_wait with coro::ring_buffer::consume returns defau… #244

Merged

jbaldwin self-assigned this Feb 2, 2024

jbaldwin closed this as completed in #244 Feb 12, 2024

jbaldwin mentioned this issue Jan 24, 2025

Destructor called more than once #286

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Calling coro::sync_wait with coro::ring_buffer::consume returns default constructed objects for complex return values #242

Calling coro::sync_wait with coro::ring_buffer::consume returns default constructed objects for complex return values #242

ripose-jp commented Jan 21, 2024 •

edited

Loading

jbaldwin commented Jan 22, 2024 •

edited

Loading

jbaldwin commented Jan 22, 2024

jbaldwin commented Jan 22, 2024

ripose-jp commented Jan 23, 2024

jbaldwin commented Jan 23, 2024

ripose-jp commented Jan 24, 2024

jbaldwin commented Feb 2, 2024 •

edited

Loading

jbaldwin commented Feb 3, 2024

ripose-jp commented Feb 3, 2024

jbaldwin commented Feb 4, 2024

ripose-jp commented Feb 4, 2024

jbaldwin commented Feb 4, 2024

ripose-jp commented Feb 4, 2024

jbaldwin commented Feb 4, 2024

jbaldwin commented Feb 7, 2024

jbaldwin commented Feb 9, 2024

ripose-jp commented Feb 11, 2024

jbaldwin commented Feb 12, 2024

Calling coro::sync_wait with coro::ring_buffer::consume returns default constructed objects for complex return values #242

Calling coro::sync_wait with coro::ring_buffer::consume returns default constructed objects for complex return values #242

Comments

ripose-jp commented Jan 21, 2024 • edited Loading

jbaldwin commented Jan 22, 2024 • edited Loading

jbaldwin commented Jan 22, 2024

jbaldwin commented Jan 22, 2024

ripose-jp commented Jan 23, 2024

jbaldwin commented Jan 23, 2024

ripose-jp commented Jan 24, 2024

jbaldwin commented Feb 2, 2024 • edited Loading

jbaldwin commented Feb 3, 2024

ripose-jp commented Feb 3, 2024

jbaldwin commented Feb 4, 2024

ripose-jp commented Feb 4, 2024

jbaldwin commented Feb 4, 2024

ripose-jp commented Feb 4, 2024

jbaldwin commented Feb 4, 2024

jbaldwin commented Feb 7, 2024

jbaldwin commented Feb 9, 2024

ripose-jp commented Feb 11, 2024

jbaldwin commented Feb 12, 2024

ripose-jp commented Jan 21, 2024 •

edited

Loading

jbaldwin commented Jan 22, 2024 •

edited

Loading

jbaldwin commented Feb 2, 2024 •

edited

Loading