
Fix race in MPSC algo #1812

Open

ccotter wants to merge 13 commits into NVIDIA:main from ccotter:mpsc-bug

Conversation

@ccotter
Contributor

@ccotter ccotter commented Feb 4, 2026

Relacy helped identify a race in the existing MPSC algo. I am having a hard time explaining exactly what's going on, but in the newly added unit test (the non-Relacy one), I am able to observe three different odd behaviors:

  • a consumer consuming the same element in an infinite loop, apparently due to the internal next pointers forming some sort of cycle
  • the consumer returning &__nil_!
  • the consumer never being able to consume a produced value (the node is lost)

With the non-Relacy unit test on the existing algo, if I insert a random sleep of 0-10 microseconds in push_back after __back_ is exchanged, I can observe one of the above behaviors nearly every single time. The most common was the first behavior.

The existing algo claims it came from Dmitry Vyukov's implementation, though one key difference is that the existing one uses an atomic pointer to a Node for the "nil" object, whereas Dmitry's embeds an actual Node object in the queue.

I re-implemented the version in stdexec exactly as it appears on Dmitry's website (which I had to dig up on archive.org), and it passes the newly added Relacy tests (which explore many thread interleavings) as well as the non-Relacy unit tests.
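
For reference, here is a minimal sketch of Dmitry's intrusive MPSC queue as I read it off the archived page, restated with std::atomic. The names and the specific memory orderings are mine (conservatively chosen), not necessarily what this PR uses:

```cpp
#include <atomic>

// Sketch of Dmitry Vyukov's intrusive MPSC node-based queue. Note the
// "nil"/stub sentinel is a real node embedded in the queue, not an
// atomic pointer as in the old stdexec code.
struct node {
  std::atomic<node*> next{nullptr};
};

struct mpsc_queue {
  node stub;                       // embedded sentinel node
  std::atomic<node*> head{&stub};  // producers exchange this
  node* tail{&stub};               // owned by the single consumer

  // Any number of producers may call push concurrently.
  void push(node* n) {
    n->next.store(nullptr, std::memory_order_relaxed);
    node* prev = head.exchange(n, std::memory_order_acq_rel);
    // Between the exchange above and the store below, the chain from
    // tail is transiently broken: the consumer cannot see past prev.
    // This is the window the random 0-10us sleep widens in the repro.
    prev->next.store(n, std::memory_order_release);
  }

  // Only the single consumer may call pop; returns nullptr if the
  // queue is empty or a producer is mid-push.
  node* pop() {
    node* t = tail;
    node* next = t->next.load(std::memory_order_acquire);
    if (t == &stub) {              // skip the stub when it is in front
      if (next == nullptr) return nullptr;
      tail = next;
      t = next;
      next = next->next.load(std::memory_order_acquire);
    }
    if (next != nullptr) { tail = next; return t; }
    if (t != head.load(std::memory_order_acquire))
      return nullptr;              // a producer is mid-push; retry later
    push(&stub);                   // re-insert stub so t is never the last node
    next = t->next.load(std::memory_order_acquire);
    if (next != nullptr) { tail = next; return t; }
    return nullptr;
  }
};
```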

I originally tracked down a bug in timed_thread_scheduler.cpp, where `STDEXEC_ASSERT(op->command_ == command_type::command_type::stop);` sometimes failed.

@copy-pr-bot

copy-pr-bot bot commented Feb 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ccotter
Contributor Author

ccotter commented Feb 4, 2026

cc @maikel - can you check this over, since I think you implemented the original version of the MPSC queue?

@ccotter
Contributor Author

ccotter commented Feb 4, 2026

Dmitry's website appears to have stopped working a few months ago, though the original link is on archive.org: https://web.archive.org/web/20221127081044/https://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue (Dmitry is also the author of Relacy).

@ccotter
Contributor Author

ccotter commented Feb 4, 2026

Also of note, I originally noticed the race/bug in timed_thread_scheduler while running the exec unit tests with a version of TSAN I am trying to upstream, which fuzzes the scheduler with random sleeps. I think this is another success of that approach: llvm/llvm-project#178836

@ccotter ccotter marked this pull request as draft February 4, 2026 17:04
@ccotter
Contributor Author

ccotter commented Feb 4, 2026

I caught up with @maikel. I'm converting this to a draft for now, as @maikel will take a closer look to see whether the existing algo can be fixed. We also suspect my version may be over-specified in its atomics (it appears to be correct, but could likely be more relaxed).

@ccotter
Contributor Author

ccotter commented Feb 4, 2026

@maikel as an experiment to help narrow down the race that leads to the corrupted queue state, I added a mutex in two places: https://gist.github.com/ccotter/37b0d3cb3ad88f8c4fea4b2adbb5dce7 ... Relacy says this queue is correct...
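
To illustrate the idea (a hypothetical reduction, not the gist's actual code): reusing the sketch types from the description above, guarding the producer's two-step publish with a mutex closes the window between the exchange and the next-pointer store. The gist guards two spots; this shows the shape at one of them:

```cpp
#include <atomic>
#include <mutex>

std::mutex debug_mutex;  // hypothetical name, for the experiment only

void locked_push(mpsc_queue& q, node* n) {
  n->next.store(nullptr, std::memory_order_relaxed);
  std::lock_guard<std::mutex> lock(debug_mutex);  // serialize the two steps
  node* prev = q.head.exchange(n, std::memory_order_acq_rel);
  prev->next.store(n, std::memory_order_release);
}
```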

@maikel
Collaborator

maikel commented Feb 4, 2026

Yes, after revisiting the old code I agree that the management of nil is the culprit here. We should simplify the implementation towards Dmitry's original one. The only thing I would check is whether your usage of memory orderings and atomics can be improved.

@ericniebler
Collaborator

/ok to test 0a9df03

maikel and others added 3 commits February 6, 2026 00:02
 - Temp workaround for relacy to compile in CI
 - Do not build Relacy tests with sanitizers by default
@ccotter ccotter marked this pull request as ready for review February 8, 2026 04:22
@ericniebler
Collaborator

/ok to test 7e4e3c6

@ericniebler
Collaborator

/ok to test b33534f

@ccotter
Contributor Author

ccotter commented Feb 11, 2026

My changes to build and run the Relacy tests are failing because TBB ends up needing std::shared_mutex. My local build does not run with TBB, so I didn't encounter this. I will soon have an upstream PR for Relacy adding support for std::shared_mutex, which should address this. If I can't get that merged in the next few days, I'll disable the Relacy tests in this PR for the time being.

 - Use STDEXEC::__std::atomic
 - Update compiler requirements for Relacy
 - Compile Relacy with TBB off
@ccotter
Contributor Author

ccotter commented Feb 12, 2026

Ok, I believe this should be good to go. Do you mind squashing this, as the individual commits are more noisy than valuable in the history?

Of note, while working on this I found two compiler crashes: one in GCC (not sure if the Bugzilla report is in yet) and one in Clang (llvm/llvm-project#181088)!

dvyukov pushed a commit to llvm/llvm-project that referenced this pull request Feb 16, 2026
This commit introduces an "adaptive delay" feature to the
ThreadSanitizer runtime to improve race detection by perturbing thread
schedules. At various synchronization points (atomic operations,
mutexes, and thread lifecycle events), the runtime may inject small
delays (spin loops, yields, or sleeps) to explore different thread
interleavings and expose data races that would otherwise occur only in
rare execution orders.

This change is inspired by prior work, which is discussed in more detail at https://discourse.llvm.org/t/rfc-tsan-implementing-a-fuzz-scheduler-for-tsan/80969. In short, https://reviews.llvm.org/D65383 was an earlier unmerged attempt at adding random delays. Feedback on the RFC led to the version in this commit, which aims to limit the amount of delay.

The adaptive delay feature uses a configurable time budget and tiered
sampling strategy to balance race exposure against performance impact.
It prioritizes high-value synchronization points with clear
happens-before relationships: relaxed atomics receive lightweight spin
delays with low sampling, synchronizing atomics (acquire / release /
seq_cst) receive moderate delays with higher sampling, and mutex and
thread lifecycle operations receive the longest delays with highest
sampling.

The feature is disabled by default and incurs minimal overhead when not
enabled. Nearly all checks are guarded by an inline check on a global
variable that is only set when enable_adaptive_delay=1. Microbenchmarks
with tight loops of atomic operations showed no meaningful performance
difference between an unmodified TSAN runtime and this version when
running with empty TSAN_OPTIONS.

An LLM assisted in writing portions of the adaptive delay logic,
including the TimeBudget class, tiering concept, address sampler, and
per-thread quota system. I reviewed the output and made amendments to
reduce duplication and simplify the behavior. I also replaced the LLM's
original double-based calculation logic with the integer-based Percent
class. The LLM also helped write unit test cases for Percent.

cc @dvyukov 

## Examples

I used the delay scheduler to find novel bugs that rarely or never
occurred with the unmodified TSAN runtime. Some of the bugs below were
found with earlier versions of the delay scheduler that I iterated on,
but with this most recent implementation in this PR, I can still find
the bugs far more reliably than with the standard TSAN runtime.

- A use-after-free in the [BlazingMQ](https://github.com/bloomberg/blazingmq) broker during ungraceful producer disconnect.
- Race in stdexec: NVIDIA/stdexec#1395
- Race in stdexec's MPSC queue: NVIDIA/stdexec#1812
- A few races in [BDE](https://github.com/bloomberg/bde) thread-enabled data structures/algorithms.
- The "Data race on variable a" test from https://ceur-ws.org/Vol-2344/paper9.pdf is more reliably reproduced with more aggressive adaptive scheduler options.

## Outstanding work

- The [RFC](https://discourse.llvm.org/t/rfc-tsan-implementing-a-fuzz-scheduler-for-tsan/80969) suggests moving the scheduler to sanitizer_common, so that ASAN can leverage this. This should be done (should it be done in this PR?).
- Missing interceptors for libdispatch
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Feb 16, 2026
@ccotter
Contributor Author

ccotter commented Feb 17, 2026

@ericniebler bump? This should be ready for review/testing.

Also ... the custom changes to TSAN I've been using to try to shake out other bugs have just landed in upstream Clang. If anyone has easy access to trunk builds, https://github.com/llvm/llvm-project/blob/main/clang/docs/ThreadSanitizer.rst#adaptive-delay describes the feature and its various flags.
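
For anyone who wants to try it, here is a toy racy program plus the run recipe. The flag name enable_adaptive_delay is the one from the commit message above; see the linked docs for the full set of options:

```cpp
// race.cpp -- a trivial data race, used here just to demonstrate the
// build/run recipe; TSAN catches this one even without adaptive delay.
//
// Build: clang++ -fsanitize=thread -g race.cpp -o race
// Run:   TSAN_OPTIONS=enable_adaptive_delay=1 ./race
#include <thread>

int shared = 0;  // plain int, written by two threads with no synchronization

int main() {
  std::thread t([] { shared = 1; });  // racing write #1
  shared = 2;                         // racing write #2
  t.join();
}
```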

@ericniebler
Collaborator

/ok to test 208fde6

@NVIDIA NVIDIA deleted a comment from copy-pr-bot bot Feb 17, 2026
@ericniebler
Collaborator

/ok to test 659ad65
