Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Calling coro::sync_wait with coro::ring_buffer::consume returns default constructed objects for complex return values #242

Closed
ripose-jp opened this issue Jan 21, 2024 · 18 comments · Fixed by #244
Assignees

Comments

@ripose-jp
Copy link
Contributor

ripose-jp commented Jan 21, 2024

When calling coro::sync_wait() with the result of coro::ring_buffer::consume() it is possible for coro::sync_wait() to return a default constructed object instead of the actual object. This only seems to happen with complex objects as simple structs and basic types don't exhibit this behavior.

From what I can tell, this happens somewhere in tl_expected.hpp after returning from ring_buffer::consume_operation::await_return().

I've only ever been able to reproduce this bug using a protobuf message that contains a oneof field, so if this bug report seems highly specific, I apologize.


Here's a minimal example along with some build scripts to simplify the process of testing:

main.cpp

#include <coro/coro.hpp>

#include <iostream>

#include "example.pb.h"

int main(void)
{
    coro::ring_buffer<Example, 1> buffer;
    const auto foo = [&buffer] () -> coro::task<void>
    {
        Example data;
        Message *msg = data.mutable_message();
        msg->set_id(1);
        msg->set_text("Hello world!");
        std::cout << "Inside the coroutine\n";
        std::cout << "Message Case: " << static_cast<int>(data.Messages_case()) << '\n';
        std::cout << "ID: " << data.message().id() << '\n';
        std::cout << "Message: " << data.message().text() << '\n';
        co_await buffer.produce(std::move(data));
    };
    coro::sync_wait(foo());
    auto result = coro::sync_wait(buffer.consume());
    if (!result)
    {
        std::cout << "Consume failed!\n";
        return 0;
    }
    auto data = std::move(*result);
    std::cout << "Outside the coroutine\n";
    std::cout << "Message Case: " << static_cast<int>(data.Messages_case()) << '\n';
    std::cout << "ID: " << data.message().id() << '\n';
    std::cout << "Message: " << data.message().text() << '\n';
    return 0;
}

example.proto

syntax = "proto3";

message Message
{
    uint32 id = 1;
    string text = 2;
};

message Example
{
    oneof Messages
    {
        Message message = 1;
    }
}

CMakeLists.txt

cmake_minimum_required(VERSION 3.25)

project(
    libcoro-sync-wait-default-construct
    VERSION 1.0.0
    LANGUAGES CXX
)

find_package(Protobuf REQUIRED)

include(FetchContent)
set(LIBCORO_BUILD_EXAMPLES OFF)
set(LIBCORO_BUILD_TESTS OFF)
set(LIBCORO_FEATURE_TLS OFF)
set(LIBCORO_FEATURE_NETWORKING OFF)
FetchContent_Declare(
    libcoro
    GIT_REPOSITORY https://github.com/jbaldwin/libcoro.git
    GIT_TAG        0b65aa2b3a2b1227b6c74b59882f1afc95c91dcf
)
FetchContent_MakeAvailable(libcoro)

protobuf_generate_cpp(PROTO_SRCS PROTO_HDRS example.proto)
add_executable(main main.cpp ${PROTO_SRCS} ${PROTO_HDRS})
target_compile_features(main PRIVATE cxx_std_20)
target_compile_options(main PRIVATE -Wall -Wextra -Werror)
target_include_directories(main PRIVATE ${CMAKE_CURRENT_BINARY_DIR})
target_link_libraries(main PRIVATE libcoro protobuf::libprotobuf)
@jbaldwin
Copy link
Owner

jbaldwin commented Jan 22, 2024

I've got a PR open to upgrade tl::expected to the latest version as a start, it looks like it has some bug fixes, but nothing specifically calls out the behavior you are seeing.

https://github.com/TartanLlama/expected/releases

@jbaldwin
Copy link
Owner

TEST_CASE("ring_buffer issue-242 default constructed complex objects on consume", "[ring_buffer]")
{
    struct message
    {
        uint32_t id;
        std::string text;
    };

    struct example
    {
        std::optional<message> msg;
    };

    coro::ring_buffer<example, 1> buffer;

    const auto foo = [&buffer]() -> coro::task<void>
    {
        example data{};
        data.msg = {message{ .id = 1, .text = "Hello World!" }};
        std::cout << "Inside the coroutine\n";
        std::cout << "ID: " << data.msg.value().id << "\n";
        std::cout << "Text: " << data.msg.value().text << "\n";
        co_await buffer.produce(std::move(data));
        co_return;
    };

    coro::sync_wait(foo());
    auto result = coro::sync_wait(buffer.consume());
    REQUIRE(result);

    auto data = std::move(*result);
    REQUIRE(data.msg.has_value());
    REQUIRE(data.msg.value().id == 1);
    REQUIRE(data.msg.value().text == "Hello World!");
    std::cout << "Outside the coroutine\n";
    std::cout << "ID: " << data.msg.value().id << "\n";
    std::cout << "Text: " << data.msg.value().text << "\n";
}

This also seems to be reproducing the issue without protobuf, it fails on REQUIRE(data.msg.has_value())

@jbaldwin
Copy link
Owner

In sync_wait.hpp the sync_wait_task_promise::return_value function:

    auto return_value(return_type&& value) noexcept
    {
        m_return_value = std::addressof(value);
        return final_suspend();
    }

It takes a pointer to the return object instead of moving/taking ownership. Later when result() is called it then does std::move(*m_return_value). This seems to be the problem and causes the value to be default constructed instead of moving the original item out. I'll push a PR and I'd like you to test that it fixes your use case too if you don't mind.

@ripose-jp
Copy link
Contributor Author

No problem. Just tag me when you push it. Thanks for figuring this out.

@jbaldwin
Copy link
Owner

Its going to take a little more time, the fix isn't as easy as I had anticipated, I know the root cause but fixing it is having some cascading issues that were already fixed on coro::task. I'm hoping to possibly port those fixes as well, or just possibly leverage coro::task for coro::sync_wait moving forward.

@ripose-jp
Copy link
Contributor Author

No problem. There's no pressure to fix this. I just found a bug and though it was worth reporting. It's not currently affecting anything I'm working on.

jbaldwin added a commit that referenced this issue Feb 2, 2024

Verified

This commit was signed with the committer’s verified signature.
jbaldwin Josh Baldwin
…lt constructed objects for complex return values

There appears to be a bug in tl::expected with the usage of
coro::sync_wait when the return_type is not trivially moveable. It
causes the destructor of the users return_type to be called before the
move constructor when calling the coroutine's promise.result() function.

It is worth noting in the coro::task class this exact same method of
returning the promise's result is used and works fine, but it doesn't
use tl::expected, which likely points to a bug in that library.

To work around this bug having a temporary value in the promise.result()
call appears to make the compiler generate two move constructor calls,
which is less than idea, but at least the users data isn't destroyed
before it is returned.

Closes #242
@jbaldwin
Copy link
Owner

jbaldwin commented Feb 2, 2024

@ripose-jp I've added a PR that I think resolves the issue, the bug does appear to be in tl::expected and its usage around coro::sync_wait

https://github.com/jbaldwin/libcoro/pull/244/files#diff-3048842fe7373a54d457291bfc4c40ba9e84492042a6a087b17263d32bf3fce1R191

This is the important part of the PR if you want to take a look.

jbaldwin added a commit that referenced this issue Feb 2, 2024

Verified

This commit was signed with the committer’s verified signature.
jbaldwin Josh Baldwin
…lt constructed objects for complex return values

There appears to be a bug in tl::expected with the usage of
coro::sync_wait when the return_type is not trivially moveable. It
causes the destructor of the users return_type to be called before the
move constructor when calling the coroutine's promise.result() function.

It is worth noting in the coro::task class this exact same method of
returning the promise's result is used and works fine, but it doesn't
use tl::expected, which likely points to a bug in that library.

To work around this bug having a temporary value in the promise.result()
call appears to make the compiler generate two move constructor calls,
which is less than idea, but at least the users data isn't destroyed
before it is returned.

Closes #242
jbaldwin added a commit that referenced this issue Feb 2, 2024

Verified

This commit was signed with the committer’s verified signature.
jbaldwin Josh Baldwin
…lt constructed objects for complex return values

There appears to be a bug in tl::expected with the usage of
coro::sync_wait when the return_type is not trivially moveable. It
causes the destructor of the users return_type to be called before the
move constructor when calling the coroutine's promise.result() function.

It is worth noting in the coro::task class this exact same method of
returning the promise's result is used and works fine, but it doesn't
use tl::expected, which likely points to a bug in that library.

To work around this bug having a temporary value in the promise.result()
call appears to make the compiler generate two move constructor calls,
which is less than idea, but at least the users data isn't destroyed
before it is returned.

Closes #242
jbaldwin added a commit that referenced this issue Feb 2, 2024

Verified

This commit was signed with the committer’s verified signature.
jbaldwin Josh Baldwin
…lt constructed objects for complex return values

There appears to be a bug in tl::expected with the usage of
coro::sync_wait when the return_type is not trivially moveable. It
causes the destructor of the users return_type to be called before the
move constructor when calling the coroutine's promise.result() function.

It is worth noting in the coro::task class this exact same method of
returning the promise's result is used and works fine, but it doesn't
use tl::expected, which likely points to a bug in that library.

To work around this bug having a temporary value in the promise.result()
call appears to make the compiler generate two move constructor calls,
which is less than idea, but at least the users data isn't destroyed
before it is returned.

Closes #242
@jbaldwin jbaldwin self-assigned this Feb 2, 2024
@jbaldwin
Copy link
Owner

jbaldwin commented Feb 3, 2024

After thinking about the current solution over night I'm not sure I want to merge this, but it does seem relatively clear its a bug in tl::expected? There is an identical usage of that line (that doesn't work) in coro::task for how it handles its return value and it doesn't have this problem.

@ripose-jp
Copy link
Contributor Author

I'm personally not worried about your PR since it works with my example. This behavior is only only useful in unit testing for me since any production code dealing with coro::ring_buffer is going to be a coroutine. Having to do an extra move is unfortunate, but it's worth it if it produces correct behavior in my opinion.

If you're worried about it, replacing consume operations with coro::task<std::optional<T>> or something similar would be completely reasonable as well.

Fixing tl::expected works too, but since it seems abandoned by the developer, it would probably fall to you to keep it working. This might be more trouble than just avoiding the edge cases where it breaks.

@jbaldwin
Copy link
Owner

jbaldwin commented Feb 4, 2024

Hmm, maybe its time to just roll a coro::expected and use something like std::variant<T, E> internally with a simple interface to give it the same api as c++23's std::expected? Would drop the dependency and also as you noted tl::expected doesn't look particularly well maintained anymore. I dislike using std::optional<T> directly since it doesn't tell the user why the consume operation failed, even though there is really only 1 reason for why it failed at the moment. But as a user you'd need to read the docs to understand std::nullopt means you shut it down (which hopefully you know you did that...).

There is the option of upgrading to c++23 for std::expected but I'm not sure I want to do that yet, I figured this project should be at most 1 major version behind the latest C++ standard, and probably support two LTS releases on the OS distros. I should probably document that somewhere as well.

@ripose-jp
Copy link
Contributor Author

coro::expected would be fine for my use case since I'm not in the position to upgrade to a C++23 compatible compiler for what I'm working on. Of course, its entirely your choice to move to C++23 when you choose, but if it happened soon, I would probably have to maintain my own C++20 compatible version for what I'm working on.

That said, I don't know what the difficulty of maintaining a std::expected compatible object would be, so maybe just using coro::task more generally around the code base would be easier than replacing tl::expected.

@jbaldwin
Copy link
Owner

jbaldwin commented Feb 4, 2024

Yeah I tried replacing the sync_task with just coro::task and it's more difficult than it seems. coro::task might need a way to support the initial suspend and final suspend awaiters generically to make it work properly, it's the one piece that is different from sync_task. On a side note Tlthis also gave me the thought to do something similar in coro::task_container in the cleanup task by using a final awaiter.

Fwiw I'm not planning on upgrading to c++23 anytime soon, would be 3+ years minimum for the next major standard drop, so it isn't a realistic solution today, just was mentioning it.

I think I'll take a stab at coro::expected and see how difficult it is for the singular ring buffer usage.

I appreciate the feedback and thoughts.

@ripose-jp
Copy link
Contributor Author

I feel silly for not thinking of this earlier, but I just tested my example with C++23's std::expected and it exhibits the same behavior as tl::expected on GCC 13.1.0. It might be safe to rule tl::expected out as the culprit here unless both happen to be broken somehow.

@jbaldwin
Copy link
Owner

jbaldwin commented Feb 4, 2024

Hmm that is interesting. Good idea, I'll play around with it as well.

That does point to some kind of bug in libcoro then even though up until now it seems like it was tl::expected. Also confused why the same code in coro::task seems fine.

@jbaldwin
Copy link
Owner

jbaldwin commented Feb 7, 2024

So I tried it with std::optional<Element> and it fails similarly. It might be an issue with how coro::ring_buffer<e, n> is handling the objects?

@jbaldwin
Copy link
Owner

jbaldwin commented Feb 9, 2024

Ok I finally got smart and ran the test through valgrind, and it is a use after free bug.

  1. Call sync_wait(awaitable)
  2. coroutine executes and finishes
  3. execute the return std::move(task).promise().result() line this is in theory moving the promise's return value but is probably just a temporary r-value, nothing "concrete".
  4. sync_wait() returns and destructs all locals, this includes the ~task and the ~promise
  5. Since the promise is destructed it also deletes the return value, we can see the destructor get called here
  6. We then see a move operator= call on the return value type but it reads from the already free'ed memory location on the promise, this gives you the "empty" object in your example.

This probably works on simple types that fit in a register since its not calling the destuctor on the promise?

The solution with the double move gets the return value off the promise so it isn't destructed on the sync_wait() return (step 4) and allows the compiler to see that its being returned.

@ripose-jp
Copy link
Contributor Author

I ran it with asan and it agrees with Valgrind about the use-after-free. I'd say you should merge your double move solution since correctness is going to be more important than performance. The only other way I can think of to get out of this is allocating return values in the heap since this bug doesn't affect trivially copyable types, but that comes with its own set of drawbacks.

jbaldwin added a commit that referenced this issue Feb 11, 2024
…lt constructed objects for complex return values

There appears to be a bug in tl::expected with the usage of
coro::sync_wait when the return_type is not trivially moveable. It
causes the destructor of the users return_type to be called before the
move constructor when calling the coroutine's promise.result() function.

It is worth noting in the coro::task class this exact same method of
returning the promise's result is used and works fine, but it doesn't
use tl::expected, which likely points to a bug in that library.

To work around this bug having a temporary value in the promise.result()
call appears to make the compiler generate two move constructor calls,
which is less than idea, but at least the users data isn't destroyed
before it is returned.

Closes #242
@jbaldwin
Copy link
Owner

Agreed, I'm going to adjust some of the comments since it isn't a tl:: expected bug and then I'll get it merged.

jbaldwin added a commit that referenced this issue Feb 12, 2024

Verified

This commit was signed with the committer’s verified signature.
jbaldwin Josh Baldwin
…lt constructed objects for complex return values

Running the test that replicates the bug via valgrind or asan it shows
that the compiler is calling sync_wait()'s promise's destructor before
moving the complex object return value. This causes the object to be
destructed and then the final move out is on deleted (use after free)
memory causing the object to be in a bad empty state.

To resolve this for now a double move has been introduced to move the
complex object off the promise object and onto the sync_wait() function
callstack. This seems to keep the object alive and not be destructed
when sync_wait() finally returns.

Other solutions could be to heap allocate the promise() object or even
the return_type but this is probably more expensive than a double move,
but will vary on use case.

Closes #242
jbaldwin added a commit that referenced this issue Feb 12, 2024

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
…lt constructed objects for complex return values (#244)

Running the test that replicates the bug via valgrind or asan it shows
that the compiler is calling sync_wait()'s promise's destructor before
moving the complex object return value. This causes the object to be
destructed and then the final move out is on deleted (use after free)
memory causing the object to be in a bad empty state.

To resolve this for now a double move has been introduced to move the
complex object off the promise object and onto the sync_wait() function
callstack. This seems to keep the object alive and not be destructed
when sync_wait() finally returns.

Other solutions could be to heap allocate the promise() object or even
the return_type but this is probably more expensive than a double move,
but will vary on use case.

Closes #242
uilianries added a commit to uilianries/libcoro that referenced this issue Feb 15, 2024

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
* CMake: Keep git hooks as optional (jbaldwin#234)

* Keep git hooks as optional since they are not required for using the library, only for development.
* Cmake LIBCORO_RUN_GITCONFIG=ON|OFF option

* Introduce concepts::io_executor (jbaldwin#238)

* Introduce concepts::io_scheduler

The dns resolver hard links against coro::io_scheduler via coro::task_container<coro::io_scheduler>. Since the concepts::executor was introduced a while ago to make it possible to pass in different executors (or even user defined executors) the resolver has apparently been forgotten about. Lets introduce an io_executor (needs poll()) and make the resolver use an agnostic concepts::executor.

The other reason for doing this is that the windows build is failing when LIBCORO_FEATURE_PLATFORM=OFF and building as a SHARED library since the coro::io_scheduler is sneaking in via the resolver class but then ultimately isn't compiled since its excluded from the build.

Closes jbaldwin#237

* Remove LIBCORO_FEATURE_PLAFTORM

It really is identical with LIBCORO_FEATURE_NETWORKING at this point, it
just wasn't so obvious.

* Add shared library option (jbaldwin#236)

* It generates static only or shared libraries according to LIBCORO_BUILD_SHARED_LIBS option via the `cmake` `BUILD_SHARED_LIBS` to propagate to all libraries built by the project.
* When generating shared library in Windows, it will produce bin/libcoro.dll and lib/libcoro.lib (exposed symbols to load in the dll)
* The CMake definition `CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS` does not export static variables
  * To export semaphore's variables a new header file is created, the `coro/export.hpp`. It's generated by `cmake` using the method `generate_export_header`, and should manage `__declspec`.
* The ctest does not include the DLL folder by default, so it is configured by using `set_tests_properties`

* Fix use after free in coro::task_container (jbaldwin#240)

In coro::task_container, this is potential for a use after free bug to
occur between the final co_return of make_cleanup_task and another
thread executing gc_internal. This can happen because there is a delay
between the std::scoped_lock being destroyed upon the co_return of
make_cleanup_task and its final suspend. This leads to a race condition
where another thread can acquire the lock and initiate a garbage
collection, where that cleanup task is destroyed before it's final
suspend. This leads to a use after free.

This commit fixes the issue by reworking how tasks get garbage
collected. Instead of unconditionally destroying tasks scheduled for
deletion, the gc_internal calls is_ready on the coro::task first. This
function will only be true in the cases where the task has entered
final suspend, has been default constructed, or has been destroyed. This
means a task can get scheduled for deletion, but not be deleted when
gc_internal is called. This allows time for the task to enter final
suspend, which avoids the use after free bug.

To make this implementation easier, several aspects of the
implementation were changed. m_task_indexes was removed in favor of
a queue which keeps track of free indices into m_tasks. This makes
things easier since it removes the hard constraint of keeping used
indices behind m_free_pos and free indices at or after.
m_tasks_to_delete was then changed to a list in order to allow for
constant time deletion of elements in place. This is helpful because it
is possible for tasks to be queue for deletion, but not yet safe to
delete. This means unconditionally clearing the m_tasks_to_delete data
structure is no longer viable.

* Upgrade tl::expected to 1.1 (jbaldwin#243)

* Creating a CI workflow for macos, (jbaldwin#249)

- Created ci-macos.yml
  - tweaked CMake to deal with clang17 location

* Fix for higher version of C++ usage, Clang support, and some build policy change. (jbaldwin#246)

* Enable test/example build only if project is top level. LIBCORO_FEATURE_NETWORKING and LIBCORO_FEATURE_TLS macro should set on Linux environment only.

* Make compilable above the C++20 mode.

* Missing include: <exception>

* Update CMakeLists.txt

* Calling coro::sync_wait with coro::ring_buffer::consume returns default constructed objects for complex return values (jbaldwin#244)

Running the test that replicates the bug via valgrind or asan it shows
that the compiler is calling sync_wait()'s promise's destructor before
moving the complex object return value. This causes the object to be
destructed and then the final move out is on deleted (use after free)
memory causing the object to be in a bad empty state.

To resolve this for now a double move has been introduced to move the
complex object off the promise object and onto the sync_wait() function
callstack. This seems to keep the object alive and not be destructed
when sync_wait() finally returns.

Other solutions could be to heap allocate the promise() object or even
the return_type but this is probably more expensive than a double move,
but will vary on use case.

Closes jbaldwin#242

* Release v0.11 (jbaldwin#250)

Closes jbaldwin#241

* Revert cmake PROJECT_IS_TOP_LEVEL (jbaldwin#252)

This only appears to work with cmake >= 3.25 which isn't ready for most
of the CI runs, so it will be reverted for now and re-worked moving
forward.

* Release v0.11.1 hotfix (jbaldwin#253)

* Added missing header '<atomic>' when compiling with clang and c++23 (jbaldwin#254)

* CI explicitly run CMAKE_CXX_STANDARD=20|23 (jbaldwin#257)

CI Added explicit 20/23 standards for:
* ci-fedora
* ci-macos
* ci-ubuntu
* ci-windows

Did not add 23 builds for
* ci-opensuse
* ci_emscripten

Closes jbaldwin#256

* Initial try to include Conan in the CI

Signed-off-by: Uilian Ries <[email protected]>

---------

Signed-off-by: Uilian Ries <[email protected]>
Co-authored-by: Josh Baldwin <[email protected]>
Co-authored-by: ripose-jp <[email protected]>
Co-authored-by: Bruno Nicoletti <[email protected]>
Co-authored-by: LEE KYOUNGHEON <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants