Fix use after free in coro::task_container #240

Merged
merged 2 commits into jbaldwin:main from the fix_task_container_use_after_free branch
Jan 21, 2024

Conversation

ripose-jp
Contributor

In coro::task_container, there is potential for a use-after-free bug to occur between the final co_return of make_cleanup_task and another thread executing gc_internal. This can happen because there is a delay between the std::scoped_lock being destroyed upon the co_return of make_cleanup_task and the coroutine's final suspend. This creates a race condition in which another thread can acquire the lock and initiate a garbage collection that destroys the cleanup task before its final suspend, resulting in a use after free.
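
Below is a minimal sketch of where that window opens, assuming a simplified cleanup-style coroutine; the names and body here are illustrative and not libcoro's actual make_cleanup_task.

#include <coro/coro.hpp>

#include <mutex>

// Illustrative only: the lock is a local of the coroutine, so it is destroyed
// when co_return runs, before the coroutine reaches its final suspend point.
static coro::task<void> cleanup_like(std::mutex &m)
{
    std::scoped_lock lk{m};
    // ... mark this task's slot as ready for deletion ...
    co_return;
    // At co_return the coroutine's locals (including lk) are destroyed, which
    // releases the mutex, but the coroutine has not yet reached
    // final_suspend(). In that window another thread can take the mutex, run
    // gc_internal, and destroy this still-executing coroutine frame: a use
    // after free.
}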

This commit fixes the issue by reworking how tasks get garbage collected. Instead of unconditionally destroying tasks scheduled for deletion, gc_internal calls is_ready on each coro::task first. This function returns true only when the task has entered its final suspend, has been default constructed, or has been destroyed. This means a task can be scheduled for deletion but not actually deleted when gc_internal is called, which gives the task time to reach its final suspend and avoids the use-after-free bug.
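
Roughly, the deferred deletion looks like the sketch below; this is a simplified illustration rather than the actual gc_internal implementation, and the function and parameter names are made up.

#include <coro/coro.hpp>

#include <cstddef>
#include <list>
#include <vector>

// Illustrative only: destroy a task only once is_ready() reports it has
// reached final suspend; anything not ready stays queued for the next pass.
static void gc_sketch(std::list<std::size_t> &tasks_to_delete,
                      std::vector<coro::task<void>> &tasks)
{
    for (auto it = tasks_to_delete.begin(); it != tasks_to_delete.end();)
    {
        if (tasks[*it].is_ready())
        {
            tasks[*it] = coro::task<void>{}; // safe to destroy the task now
            it = tasks_to_delete.erase(it);  // constant-time erase in place
        }
        else
        {
            ++it; // not yet at final suspend; revisit on the next gc pass
        }
    }
}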

To make this easier to implement, several aspects of the implementation were changed. m_task_indexes was removed in favor of a queue that keeps track of free indices into m_tasks. This simplifies things by removing the hard constraint that used indices stay before m_free_pos and free indices at or after it. m_tasks_to_delete was then changed to a list to allow constant-time deletion of individual elements in place. This matters because a task can be queued for deletion but not yet be safe to delete, so unconditionally clearing the m_tasks_to_delete data structure is no longer viable.
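
The reworked bookkeeping is sketched below as a standalone illustration; the real task_container holds more state (and locking), and the member names here simply mirror the ones discussed above.

#include <coro/coro.hpp>

#include <cstddef>
#include <list>
#include <queue>
#include <vector>

// Illustrative only: a queue of free slot indices replaces the old
// m_task_indexes/m_free_pos scheme, and a list of indices pending deletion
// lets entries be erased individually once their task is safe to destroy.
struct task_slots_sketch
{
    std::vector<coro::task<void>> m_tasks;           // task storage slots
    std::queue<std::size_t>       m_free_indexes;    // recycled slot indices
    std::list<std::size_t>        m_tasks_to_delete; // slots awaiting deletion

    auto acquire_slot() -> std::size_t
    {
        if (!m_free_indexes.empty())
        {
            auto i = m_free_indexes.front();
            m_free_indexes.pop();
            return i; // reuse a previously freed slot
        }
        m_tasks.emplace_back();
        return m_tasks.size() - 1; // grow storage when no slot is free
    }

    auto release_slot(std::size_t i) -> void
    {
        m_tasks[i] = coro::task<void>{}; // drop the finished task
        m_free_indexes.push(i);          // the slot can now be reused
    }
};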


Here's some example code that can trigger the use-after-free bug. Since it's a race condition, it's non-deterministic, so it may take some time before the bug occurs.

#include <coro/coro.hpp>

#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

static constexpr size_t DUMMY_TASK_COUNT{10};

static coro::task<void> do_dummy_work()
{
    // Allocate and fill ~100 MiB so each task does some real work before returning.
    constexpr size_t SIZE_DOUBLE_WORDS{100 * 1024 * 1024 / sizeof(uint64_t)};

    [[maybe_unused]] std::vector<uint64_t> buffer;
    buffer.resize(SIZE_DOUBLE_WORDS);
    std::iota(std::begin(buffer), std::end(buffer), 0);
    co_return;
}

static coro::task<void> run_async(
    std::shared_ptr<coro::thread_pool> &tp,
    coro::task_container<coro::thread_pool> &tc)
{
    co_await tp->schedule();

    while (true)
    {
        // Keep launching batches of tasks and garbage collecting them so the
        // race between a cleanup task's co_return and gc_internal can occur.
        for (size_t i{0}; i < DUMMY_TASK_COUNT; ++i)
        {
            tc.start(do_dummy_work());
        }
        co_await tc.garbage_collect_and_yield_until_empty();
        std::cout << "Dummy work iteration complete\n";
    }
}

int main(void)
{
    std::shared_ptr<coro::thread_pool> tp = std::make_shared<coro::thread_pool>(
        coro::thread_pool::options{ .thread_count = DUMMY_TASK_COUNT, }
    );
    coro::task_container tc{tp};
    coro::sync_wait(run_async(tp, tc));
    return 0;
}

@ripose-jp force-pushed the fix_task_container_use_after_free branch from e6daff7 to 00efd0f on January 18, 2024 03:25
@jbaldwin
Owner

I'll take a deep dive into this as soon as I can. The description of the use after free makes sense, and thank you for pushing a detailed PR with a solution to the problem.

@jbaldwin jbaldwin merged commit 0b65aa2 into jbaldwin:main Jan 21, 2024
uilianries added a commit to uilianries/libcoro that referenced this pull request Feb 15, 2024
* CMake: Keep git hooks as optional (jbaldwin#234)

* Keep git hooks as optional since they are not required for using the library, only for development.
* CMake LIBCORO_RUN_GITCONFIG=ON|OFF option

* Introduce concepts::io_executor (jbaldwin#238)

* Introduce concepts::io_scheduler

The DNS resolver hard links against coro::io_scheduler via coro::task_container<coro::io_scheduler>. Since concepts::executor was introduced a while ago to make it possible to pass in different executors (or even user-defined executors), the resolver has apparently been forgotten about. Let's introduce an io_executor (which needs poll()) and make the resolver use an agnostic concepts::executor.

The other reason for doing this is that the Windows build is failing when LIBCORO_FEATURE_PLATFORM=OFF and building as a SHARED library, since coro::io_scheduler is sneaking in via the resolver class but then ultimately isn't compiled because it's excluded from the build.

Closes jbaldwin#237

* Remove LIBCORO_FEATURE_PLATFORM

It really is identical to LIBCORO_FEATURE_NETWORKING at this point; it just wasn't so obvious.

* Add shared library option (jbaldwin#236)

* It generates static-only or shared libraries according to the LIBCORO_BUILD_SHARED_LIBS option, which is propagated via CMake's `BUILD_SHARED_LIBS` to all libraries built by the project.
* When generating a shared library on Windows, it will produce bin/libcoro.dll and lib/libcoro.lib (the exported symbols needed to load the DLL)
* The CMake definition `CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS` does not export static variables
  * To export the semaphore's variables, a new header file, `coro/export.hpp`, is created. It's generated by `cmake` using the `generate_export_header` method and should manage `__declspec`.
* ctest does not include the DLL folder by default, so it is configured using `set_tests_properties`

* Fix use after free in coro::task_container (jbaldwin#240)

* Upgrade tl::expected to 1.1 (jbaldwin#243)

* Creating a CI workflow for macos, (jbaldwin#249)

- Created ci-macos.yml
  - Tweaked CMake to deal with the clang17 location

* Fix for higher version of C++ usage, Clang support, and some build policy change. (jbaldwin#246)

* Enable test/example builds only if the project is top level. The LIBCORO_FEATURE_NETWORKING and LIBCORO_FEATURE_TLS macros should be set only in Linux environments.

* Make the library compilable in C++ modes above C++20.

* Missing include: <exception>

* Update CMakeLists.txt

* Calling coro::sync_wait with coro::ring_buffer::consume returns default constructed objects for complex return values (jbaldwin#244)

Running the test that replicates the bug under valgrind or ASan shows that the compiler calls the destructor of sync_wait()'s promise before moving the complex return value out. This destructs the object, so the final move out reads deleted (use-after-free) memory, leaving the object in a bad, empty state.

To resolve this for now, a double move has been introduced to move the complex object off the promise object and onto the sync_wait() function's call stack. This seems to keep the object alive and prevents it from being destructed when sync_wait() finally returns.

Other solutions could be to heap allocate the promise object or even the return_type, but that is probably more expensive than a double move, though it will vary by use case.

Closes jbaldwin#242

* Release v0.11 (jbaldwin#250)

Closes jbaldwin#241

* Revert cmake PROJECT_IS_TOP_LEVEL (jbaldwin#252)

This only appears to work with CMake >= 3.25, which isn't available for most of the CI runs, so it will be reverted for now and reworked moving forward.

* Release v0.11.1 hotfix (jbaldwin#253)

* Added missing header '<atomic>' when compiling with clang and c++23 (jbaldwin#254)

* CI explicitly run CMAKE_CXX_STANDARD=20|23 (jbaldwin#257)

CI Added explicit 20/23 standards for:
* ci-fedora
* ci-macos
* ci-ubuntu
* ci-windows

Did not add 23 builds for
* ci-opensuse
* ci_emscripten

Closes jbaldwin#256

* Initial try to include Conan in the CI

Signed-off-by: Uilian Ries <[email protected]>

---------

Signed-off-by: Uilian Ries <[email protected]>
Co-authored-by: Josh Baldwin <[email protected]>
Co-authored-by: ripose-jp <[email protected]>
Co-authored-by: Bruno Nicoletti <[email protected]>
Co-authored-by: LEE KYOUNGHEON <[email protected]>