Fix use after free in coro::task_container #240

Merged
merged 2 commits into jbaldwin:main from the fix_task_container_use_after_free branch
Jan 21, 2024

Conversation

ripose-jp
Contributor

In coro::task_container, there is potential for a use-after-free bug to occur between the final co_return of make_cleanup_task and another thread executing gc_internal. This can happen because there is a delay between the std::scoped_lock being destroyed upon the co_return of make_cleanup_task and the coroutine's final suspend. This creates a race condition in which another thread can acquire the lock and initiate a garbage collection that destroys the cleanup task before its final suspend, resulting in a use after free.
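
Below is a minimal sketch of where that window opens, assuming a simplified cleanup-style coroutine; the names and body here are illustrative and not libcoro's actual make_cleanup_task.

#include <coro/coro.hpp>

#include <mutex>

// Illustrative only: the lock is a local of the coroutine, so it is destroyed
// when co_return runs, before the coroutine reaches its final suspend point.
static coro::task<void> cleanup_like(std::mutex &m)
{
    std::scoped_lock lk{m};
    // ... mark this task's slot as ready for deletion ...
    co_return;
    // At co_return the coroutine's locals (including lk) are destroyed, which
    // releases the mutex, but the coroutine has not yet reached
    // final_suspend(). In that window another thread can take the mutex, run
    // gc_internal, and destroy this still-executing coroutine frame: a use
    // after free.
}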

This commit fixes the issue by reworking how tasks get garbage collected. Instead of unconditionally destroying tasks scheduled for deletion, gc_internal calls is_ready on each coro::task first. This function returns true only when the task has entered its final suspend, has been default constructed, or has been destroyed. This means a task can be scheduled for deletion but not actually deleted when gc_internal is called, which gives the task time to reach its final suspend and avoids the use-after-free bug.
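
Roughly, the deferred deletion looks like the sketch below; this is a simplified illustration rather than the actual gc_internal implementation, and the function and parameter names are made up.

#include <coro/coro.hpp>

#include <cstddef>
#include <list>
#include <vector>

// Illustrative only: destroy a task only once is_ready() reports it has
// reached final suspend; anything not ready stays queued for the next pass.
static void gc_sketch(std::list<std::size_t> &tasks_to_delete,
                      std::vector<coro::task<void>> &tasks)
{
    for (auto it = tasks_to_delete.begin(); it != tasks_to_delete.end();)
    {
        if (tasks[*it].is_ready())
        {
            tasks[*it] = coro::task<void>{}; // safe to destroy the task now
            it = tasks_to_delete.erase(it);  // constant-time erase in place
        }
        else
        {
            ++it; // not yet at final suspend; revisit on the next gc pass
        }
    }
}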

To make this easier to implement, several aspects of the implementation were changed. m_task_indexes was removed in favor of a queue that keeps track of free indices into m_tasks. This simplifies things by removing the hard constraint that used indices stay before m_free_pos and free indices at or after it. m_tasks_to_delete was then changed to a list to allow constant-time deletion of individual elements in place. This matters because a task can be queued for deletion but not yet be safe to delete, so unconditionally clearing the m_tasks_to_delete data structure is no longer viable.
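
The reworked bookkeeping is sketched below as a standalone illustration; the real task_container holds more state (and locking), and the member names here simply mirror the ones discussed above.

#include <coro/coro.hpp>

#include <cstddef>
#include <list>
#include <queue>
#include <vector>

// Illustrative only: a queue of free slot indices replaces the old
// m_task_indexes/m_free_pos scheme, and a list of indices pending deletion
// lets entries be erased individually once their task is safe to destroy.
struct task_slots_sketch
{
    std::vector<coro::task<void>> m_tasks;           // task storage slots
    std::queue<std::size_t>       m_free_indexes;    // recycled slot indices
    std::list<std::size_t>        m_tasks_to_delete; // slots awaiting deletion

    auto acquire_slot() -> std::size_t
    {
        if (!m_free_indexes.empty())
        {
            auto i = m_free_indexes.front();
            m_free_indexes.pop();
            return i; // reuse a previously freed slot
        }
        m_tasks.emplace_back();
        return m_tasks.size() - 1; // grow storage when no slot is free
    }

    auto release_slot(std::size_t i) -> void
    {
        m_tasks[i] = coro::task<void>{}; // drop the finished task
        m_free_indexes.push(i);          // the slot can now be reused
    }
};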


Here's some example code that can trigger the use-after-free bug. Since it's a race condition, it's non-deterministic, so it may take some time before the bug occurs.

#include <coro/coro.hpp>

#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

static constexpr size_t DUMMY_TASK_COUNT{10};

static coro::task<void> do_dummy_work()
{
    // Allocate and fill ~100 MiB so each task does some real work before returning.
    constexpr size_t SIZE_DOUBLE_WORDS{100 * 1024 * 1024 / sizeof(uint64_t)};

    [[maybe_unused]] std::vector<uint64_t> buffer;
    buffer.resize(SIZE_DOUBLE_WORDS);
    std::iota(std::begin(buffer), std::end(buffer), 0);
    co_return;
}

static coro::task<void> run_async(
    std::shared_ptr<coro::thread_pool> &tp,
    coro::task_container<coro::thread_pool> &tc)
{
    co_await tp->schedule();

    while (true)
    {
        // Keep launching batches of tasks and garbage collecting them so the
        // race between a cleanup task's co_return and gc_internal can occur.
        for (size_t i{0}; i < DUMMY_TASK_COUNT; ++i)
        {
            tc.start(do_dummy_work());
        }
        co_await tc.garbage_collect_and_yield_until_empty();
        std::cout << "Dummy work iteration complete\n";
    }
}

int main(void)
{
    std::shared_ptr<coro::thread_pool> tp = std::make_shared<coro::thread_pool>(
        coro::thread_pool::options{ .thread_count = DUMMY_TASK_COUNT, }
    );
    coro::task_container tc{tp};
    coro::sync_wait(run_async(tp, tc));
    return 0;
}

@ripose-jp force-pushed the fix_task_container_use_after_free branch from e6daff7 to 00efd0f on January 18, 2024 03:25
@jbaldwin
Owner

I'll take a deep dive into this as soon as I can. The description of the use after free makes sense, and thank you for pushing a detailed PR with a solution to the problem.

@jbaldwin jbaldwin merged commit 0b65aa2 into jbaldwin:main Jan 21, 2024
uilianries added a commit to uilianries/libcoro that referenced this pull request Feb 15, 2024
* CMake: Keep git hooks as optional (jbaldwin#234)

* Keep git hooks as optional since they are not required for using the library, only for development.
* CMake LIBCORO_RUN_GITCONFIG=ON|OFF option

* Introduce concepts::io_executor (jbaldwin#238)

* Introduce concepts::io_scheduler

The DNS resolver hard links against coro::io_scheduler via coro::task_container<coro::io_scheduler>. Since concepts::executor was introduced a while ago to make it possible to pass in different executors (or even user-defined executors), the resolver has apparently been forgotten about. Let's introduce an io_executor (which needs poll()) and make the resolver use an agnostic concepts::executor.

The other reason for doing this is that the Windows build is failing when LIBCORO_FEATURE_PLATFORM=OFF and building as a SHARED library, since coro::io_scheduler is sneaking in via the resolver class but then ultimately isn't compiled because it's excluded from the build.

Closes jbaldwin#237

* Remove LIBCORO_FEATURE_PLATFORM

It really is identical to LIBCORO_FEATURE_NETWORKING at this point; it just wasn't so obvious.

* Add shared library option (jbaldwin#236)

* It generates static-only or shared libraries according to the LIBCORO_BUILD_SHARED_LIBS option, which is propagated via CMake's `BUILD_SHARED_LIBS` to all libraries built by the project.
* When generating a shared library on Windows, it will produce bin/libcoro.dll and lib/libcoro.lib (the exported symbols needed to load the DLL)
* The CMake definition `CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS` does not export static variables
  * To export the semaphore's variables, a new header file, `coro/export.hpp`, is created. It's generated by `cmake` using the `generate_export_header` method and should manage `__declspec`.
* ctest does not include the DLL folder by default, so it is configured using `set_tests_properties`

* Fix use after free in coro::task_container (jbaldwin#240)

* Upgrade tl::expected to 1.1 (jbaldwin#243)

* Creating a CI workflow for macos, (jbaldwin#249)

- Created ci-macos.yml
  - Tweaked CMake to deal with the clang17 location

* Fix for higher version of C++ usage, Clang support, and some build policy change. (jbaldwin#246)

* Enable test/example builds only if the project is top level. The LIBCORO_FEATURE_NETWORKING and LIBCORO_FEATURE_TLS macros should be set only in Linux environments.

* Make the library compilable in C++ modes above C++20.

* Missing include: <exception>

* Update CMakeLists.txt

* Calling coro::sync_wait with coro::ring_buffer::consume returns default constructed objects for complex return values (jbaldwin#244)

Running the test that replicates the bug under valgrind or ASan shows that the compiler calls the destructor of sync_wait()'s promise before moving the complex return value out. This destructs the object, so the final move out reads deleted (use-after-free) memory, leaving the object in a bad, empty state.

To resolve this for now, a double move has been introduced to move the complex object off the promise object and onto the sync_wait() function's call stack. This seems to keep the object alive and prevents it from being destructed when sync_wait() finally returns.

Other solutions could be to heap allocate the promise object or even the return_type, but that is probably more expensive than a double move, though it will vary by use case.

Closes jbaldwin#242

* Release v0.11 (jbaldwin#250)

Closes jbaldwin#241

* Revert cmake PROJECT_IS_TOP_LEVEL (jbaldwin#252)

This only appears to work with CMake >= 3.25, which isn't available for most of the CI runs, so it will be reverted for now and reworked moving forward.

* Release v0.11.1 hotfix (jbaldwin#253)

* Added missing header '<atomic>' when compiling with clang and c++23 (jbaldwin#254)

* CI explicitly run CMAKE_CXX_STANDARD=20|23 (jbaldwin#257)

CI Added explicit 20/23 standards for:
* ci-fedora
* ci-macos
* ci-ubuntu
* ci-windows

Did not add 23 builds for
* ci-opensuse
* ci_emscripten

Closes jbaldwin#256

* Initial try to include Conan in the CI

Signed-off-by: Uilian Ries <[email protected]>

---------

Signed-off-by: Uilian Ries <[email protected]>
Co-authored-by: Josh Baldwin <[email protected]>
Co-authored-by: ripose-jp <[email protected]>
Co-authored-by: Bruno Nicoletti <[email protected]>
Co-authored-by: LEE KYOUNGHEON <[email protected]>