[IR Container] Phase 2.6 Concurrency & Thread Safety #5971

Open
mdavis36 wants to merge 1 commit into md/phase2-copy-move from md/phase2-thread-safety

Conversation


@mdavis36 mdavis36 commented Feb 18, 2026

Summary

Add std::shared_mutex to IrContainer for concurrent read access during parallel compilation, remove the kPhase2DisableParallelCompile serialization guard introduced in PR 1, and validate that the full test suite passes with parallel compilation re-enabled.

This is a future-proofing and defensive correctness change. Phase 2's makeFusion path does NOT share containers (each segment gets its own container via the default constructor), so parallel compilation is technically safe without the mutex. However, Phase 3 will change makeFusion to use the copy constructor (shared container), at which point multiple threads will write to the same IrContainer concurrently. The mutex must be in place before Phase 3 can enable that.

The Nested Call Problem and ContainerMutator

Five Fusion methods access IrContainer's internal fields directly, because statement registration was previously moved from IrContainer to Fusion:

registerVal()  → writes vals_up_, vals_, per_fusion_vals_
registerExpr() → writes exprs_up_, exprs_, per_fusion_exprs_   (calls removeExpr for SSA)
removeVal()    → writes vals_up_, vals_, per_fusion_vals_      (calls removeExpr)
removeExpr()   → writes exprs_up_, exprs_, per_fusion_exprs_
removeStatementsCreatedAfter() → calls all of the above

removeVal() calls removeExpr(), and registerExpr() also calls removeExpr(). Since std::shared_mutex is not recursive, acquiring unique_lock in both the outer and inner methods would deadlock.
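
For illustration, a minimal sketch (hypothetical class, not code from this PR) of why naive single-layer locking fails:

  #include <mutex>
  #include <shared_mutex>

  // Naive approach: every mutator takes its own unique_lock.
  class NaiveContainer {
   public:
    void removeExpr() {
      std::unique_lock lock(mutex_);  // exclusive lock
      // ... erase the expression ...
    }
    void removeVal() {
      std::unique_lock lock(mutex_);  // this thread now owns mutex_ ...
      removeExpr();                   // ... so the nested unique_lock deadlocks:
                                      // relocking a non-recursive mutex is UB
    }

   private:
    std::shared_mutex mutex_;
  };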

The solution is a two-layer locking architecture:

Layer 1: IrContainer public methods (self-locking)
  Read methods:  std::shared_lock(mutex_)    — concurrent reads OK
  Write methods: std::unique_lock(mutex_)   — exclusive access

Layer 2: Fusion methods that bypass IrContainer (ContainerMutator)
  Public method: acquires std::unique_lock(ir_container()->mutex_)
  Delegates to:  ContainerMutator static methods (lock-free, direct field access)
  Nested calls:  go through ContainerMutator → safe, already under lock

ContainerMutator is forward-declared in fusion.h (2 lines) and fully defined in fusion.cpp. This keeps the header clean and makes the locking architecture self-documenting: everything inside ContainerMutator assumes the lock is already held.
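
As a self-contained illustration of the pattern, here is a toy sketch with placeholder types (not the PR's actual code):

  #include <mutex>
  #include <set>
  #include <shared_mutex>

  class ToyContainer;

  // Layer 2: lock-free static methods; every caller must already hold a
  // unique_lock on the container's mutex_.
  struct ToyMutator {
    static void removeExpr(ToyContainer& c, int e);
    static void removeVal(ToyContainer& c, int v);
  };

  class ToyContainer {
    friend struct ToyMutator;

   public:
    // Layer 1: public methods lock exactly once, then delegate.
    void removeExpr(int e) {
      std::unique_lock lock(mutex_);
      ToyMutator::removeExpr(*this, e);
    }
    void removeVal(int v) {
      std::unique_lock lock(mutex_);
      ToyMutator::removeVal(*this, v);  // nested removeExpr runs under this lock
    }

   private:
    mutable std::shared_mutex mutex_;
    std::set<int> exprs_;
    std::set<int> vals_;
  };

  void ToyMutator::removeExpr(ToyContainer& c, int e) {
    c.exprs_.erase(e);  // direct field access, no locking
  }

  void ToyMutator::removeVal(ToyContainer& c, int v) {
    c.vals_.erase(v);
    removeExpr(c, v);  // safe: stays inside the lock-free layer
  }

A const read accessor in Layer 1 would take std::shared_lock(mutex_) instead, mirroring the read/write split above.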

Thread Safety Analysis

                        Phase 2                          Phase 3
                        ───────                          ───────
makeFusion behavior:    Default ctor + Fusion::copy      Copy ctor (shared container)
Container sharing:      No (each segment gets its own)   Yes (scalars reused)
Thread safety needed:   No (reads only on completeFusion) Yes (concurrent writes)

Dead Code Removal

Investigation revealed that IrContainer::copy() and IrContainer::swap() have zero call sites — all copy/move/swap semantics are handled at the Fusion level after previous work. Removing them eliminates ~45 lines of dead code and avoids complex dual-locking patterns.

Relationship to Phase 2

This PR completes the Phase 2 architectural work. With thread safety in place, the full shared-scalar infrastructure is ready for Phase 3.

CI Risk

Low-medium. This is the first CI run with parallel compilation re-enabled since PR #5961 serialized it. Any latent concurrency issues would surface here. The parallel compilation path doesn't share containers in Phase 2, so the mutex is defensive — but re-enabling parallelism exercises the full concurrent codegen pipeline.

@mdavis36
Collaborator Author

!test

@github-actions

github-actions bot commented Feb 18, 2026

Description

  • Add std::shared_mutex to IrContainer for thread-safe concurrent access

  • Implement ContainerMutator PIMPL pattern with lock-free static methods

  • Add shared_lock for reads and unique_lock for writes in all container methods

  • Remove kPhase2DisableParallelCompile guard to re-enable parallel compilation

  • Delete dead IrContainer::copy() and IrContainer::swap() methods

Changes walkthrough

Relevant files (Enhancement)

csrc/ir/container.h: Add shared_mutex and move method implementations (+14/-20)
  • Add #include <shared_mutex> and a mutable std::shared_mutex mutex_ member
  • Move method implementations from the header to the cpp file (unordered_exprs, vals, numExprs, numVals)
  • Add private helper method declarations (inContainerImpl, assertInContainerImpl)
  • Remove static copy() and swap() method declarations

csrc/ir/container.cpp: Add locking to all container methods (+49/-48)
  • Add std::shared_lock to all read-only methods (deterministic_vals, deterministic_exprs, etc.)
  • Add std::unique_lock to all write methods (addFusion, removeFusion, etc.)
  • Add locking to all container state access methods
  • Remove dead IrContainer::swap() and IrContainer::copy() implementations
  • Add inContainerImpl and assertInContainerImpl helper methods

csrc/fusion.h: Add ContainerMutator PIMPL forward declaration (+4/-0)
  • Add forward declaration of the ContainerMutator struct
  • Make ContainerMutator a friend struct for access to container internals

csrc/fusion.cpp: Implement ContainerMutator PIMPL with lock-free methods (+213/-198)
  • Add ContainerMutator struct with static lock-free methods (removeExpr, removeVal, registerVal, registerExpr, removeStatementsCreatedAfter)
  • Replace Fusion method implementations with unique_lock + delegation to ContainerMutator
  • Add std::unique_lock to Fusion::swap
  • Move all mutation logic to ContainerMutator static methods

csrc/runtime/fusion_kernel_runtime.cpp: Re-enable parallel compilation (+2/-5)
  • Remove the kPhase2DisableParallelCompile guard constant
  • Re-enable parallel compilation by removing the disable condition

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 No relevant tests
    ⚡ Recommended focus areas for review
    Lock ordering consistency

    The PR uses std::unique_lock in Fusion methods and std::shared_lock in IrContainer methods. Need to verify that the locking order is consistent across the codebase to prevent potential deadlocks, especially when Fusion methods call IrContainer methods that might also acquire locks.

    void Fusion::removeExpr(Expr* expr) {
      std::unique_lock lock(ir_container()->mutex_);
      ContainerMutator::removeExpr(this, expr);
    }
    
    void Fusion::removeVal(Val* val) {
      std::unique_lock lock(ir_container()->mutex_);
      ContainerMutator::removeVal(this, val);
    }
    
    void Fusion::removeStatementsCreatedAfter(
        int64_t num_exprs_before,
        int64_t num_vals_before) {
      std::unique_lock lock(ir_container()->mutex_);
      ContainerMutator::removeStatementsCreatedAfter(
          this, num_exprs_before, num_vals_before);
    }
    Thread safety of ContainerMutator delegation

    The ContainerMutator static methods are called while holding unique_lock on ir_container()->mutex_, but they access ir_container() internals directly. Need to ensure this pattern is safe and doesn't introduce race conditions when multiple threads are accessing different Fusion instances that share the same IrContainer.

    const std::unordered_map<Val*, int64_t> IrContainer::deterministic_vals_map()
        const noexcept {
      std::shared_lock lock(mutex_);
      std::unordered_map<Val*, int64_t> vals_map;
      int64_t count = 0;
      std::transform(
          vals_up_.begin(),
          vals_up_.end(),
          std::inserter(vals_map, vals_map.end()),
          [&count](const std::unique_ptr<Val>& val_up) {
            return std::make_pair(val_up.get(), count++);
          });
      return vals_map;
    }
    
    //! Return mapping from expression to integer id
    const std::unordered_map<Expr*, int64_t> IrContainer::deterministic_exprs_map()
        const noexcept {
      std::shared_lock lock(mutex_);
      std::unordered_map<Expr*, int64_t> exprs_map;
      int64_t count = 0;
      std::transform(
          exprs_up_.begin(),
          exprs_up_.end(),
          std::inserter(exprs_map, exprs_map.end()),
          [&count](const std::unique_ptr<Expr>& expr_up) {
            return std::make_pair(expr_up.get(), count++);
          });
      return exprs_map;
    }
    
    IrContainer::IrContainer() = default;
    
    IrContainer::~IrContainer() {
      clear();
    }
    
    void IrContainer::clear() noexcept {
      FUSER_PERF_SCOPE("IrContainer clear");
      vals_.clear();
      vals_up_.clear();
      exprs_.clear();
      exprs_up_.clear();
      val_type_name_map_.clear();
      expr_name_counter_ = 0;
      per_fusion_vals_.clear();
      per_fusion_exprs_.clear();
    }
    
    bool IrContainer::inContainer(const Statement* const_stmt) const {
      std::shared_lock lock(mutex_);
      return inContainerImpl(const_stmt);
    }
    
    bool IrContainer::inContainerImpl(const Statement* const_stmt) const {
      // We don't use dynamic_cast here because `const_stmt` may be an invalid
      // pointer. Specifically a pointer to a Statement owned by another container
      // that has been freed.
    
      // NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast)
      void* raw_ptr = const_cast<void*>(reinterpret_cast<const void*>(const_stmt));
      if (exprs_.count(reinterpret_cast<Expr*>(raw_ptr)) == 0 &&
          vals_.count(reinterpret_cast<Val*>(raw_ptr)) == 0) {
        return false;
      }
    
      NVF_ERROR(
          sharing_fusions_.count(const_stmt->container()) > 0,
          "Container claims to own stmt, but stmt disagrees.");
    
      // NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast)
      auto* stmt = const_cast<Statement*>(const_stmt);
      if (stmt->isExpr()) {
        NVF_ERROR(
            exprs_.find(stmt->as<Expr>()) != exprs_.end(),
            "Somehow container claims to and not to own an Expr.");
      }
      if (stmt->isVal()) {
        NVF_ERROR(
            vals_.find(stmt->as<Val>()) != vals_.end(),
            "Somehow container claims to and not to own an Val.");
      }
    
      return true;
    }
    
    void IrContainer::assertInContainerImpl(
        const Statement* stmt,
        const std::string& msg) const {
      NVF_CHECK(inContainerImpl(stmt), msg, " it was not found in the active container.");
    }
    
    const std::unordered_set<Expr*>& IrContainer::unordered_exprs() const noexcept {
      std::shared_lock lock(mutex_);
      return exprs_;
    }
    
    const std::unordered_set<Val*>& IrContainer::vals() const noexcept {
      std::shared_lock lock(mutex_);
      return vals_;
    }
    
    int64_t IrContainer::numExprs() const noexcept {
      std::shared_lock lock(mutex_);
      return std::ssize(exprs_);
    }
    
    int64_t IrContainer::numVals() const noexcept {
      std::shared_lock lock(mutex_);
      return std::ssize(vals_up_);
    }
    
    void IrContainer::addFusion(Fusion* fusion) {
      std::unique_lock lock(mutex_);
      sharing_fusions_.insert(fusion);
    }
    
    void IrContainer::removeFusion(Fusion* fusion) {
      std::unique_lock lock(mutex_);
      sharing_fusions_.erase(fusion);
    }
    
    void IrContainer::transferFusion(Fusion* from, Fusion* to) {
      std::unique_lock lock(mutex_);
      sharing_fusions_.erase(from);
      sharing_fusions_.insert(to);
    }
    
    size_t IrContainer::sharingCount() const {
      std::shared_lock lock(mutex_);
      return sharing_fusions_.size();
    }
    
    bool IrContainer::hasMultipleFusions() const {
      std::shared_lock lock(mutex_);
      return sharing_fusions_.size() > 1;
    }
    
    const std::unordered_set<Fusion*>& IrContainer::sharingFusions() const {
      std::shared_lock lock(mutex_);
      return sharing_fusions_;
    }
    
    const std::unordered_set<Val*>& IrContainer::valsOwnedBy(
        const Fusion* fusion) const {
      std::shared_lock lock(mutex_);
      static const std::unordered_set<Val*> empty;
      auto it = per_fusion_vals_.find(fusion);
      return it != per_fusion_vals_.end() ? it->second : empty;
    }
    
    const std::unordered_set<Expr*>& IrContainer::exprsOwnedBy(
        const Fusion* fusion) const {
      std::shared_lock lock(mutex_);
      static const std::unordered_set<Expr*> empty;
      auto it = per_fusion_exprs_.find(fusion);
      return it != per_fusion_exprs_.end() ? it->second : empty;
    }
    
    void IrContainer::transferStatementOwnership(
        const Fusion* from,
        const Fusion* to) {
      std::unique_lock lock(mutex_);
      auto vals_it = per_fusion_vals_.find(from);
      if (vals_it != per_fusion_vals_.end()) {
        auto& to_vals = per_fusion_vals_[to];
        to_vals.insert(vals_it->second.begin(), vals_it->second.end());
        per_fusion_vals_.erase(vals_it);
      }
    
      auto exprs_it = per_fusion_exprs_.find(from);
      if (exprs_it != per_fusion_exprs_.end()) {
        auto& to_exprs = per_fusion_exprs_[to];
        to_exprs.insert(exprs_it->second.begin(), exprs_it->second.end());
        per_fusion_exprs_.erase(exprs_it);
      }
    }
    
    void IrContainer::removeStatementsOwnedBy(const Fusion* fusion) {
      std::unique_lock lock(mutex_);
      auto vals_it = per_fusion_vals_.find(fusion);
      if (vals_it != per_fusion_vals_.end()) {
        for (auto it = vals_up_.begin(); it != vals_up_.end();) {
          if (vals_it->second.count(it->get()) > 0) {
            vals_.erase(it->get());
            it = vals_up_.erase(it);
          } else {
            ++it;
          }
        }
        per_fusion_vals_.erase(vals_it);
      }
    
      auto exprs_it = per_fusion_exprs_.find(fusion);
      if (exprs_it != per_fusion_exprs_.end()) {
        for (auto it = exprs_up_.begin(); it != exprs_up_.end();) {
    Parallel compilation re-enabling

    The kPhase2DisableParallelCompile guard is removed, re-enabling parallel compilation. This is a significant change that should be thoroughly tested with concurrent fusion compilation scenarios to ensure the mutex implementation provides adequate protection under load.

        if (num_groups == 1 ||
            isOptionDisabled(DisableOption::ParallelCompile)) {
          compileKernel(group_runtime_inputs, group_to_run);
        } else {
          // launch compileKernel thread here
          getThreadPool()->run([this,
                                &group_runtime_inputs,
                                group_to_run,
                                &thread_pool_error_message,
                                &thread_pool_error_message_mutex]() {
            FUSER_PERF_SCOPE("FusionKernelRuntime::compileFusionParallel");
            try {
              compileKernel(group_runtime_inputs, group_to_run);
            } catch (const std::exception& e) {
              // Set flag inside lambda so we can throw an exception after thread
              // pool completes its work.
              const std::lock_guard<std::mutex> lock(
                  thread_pool_error_message_mutex);
              std::stringstream ss;
              ss << thread_pool_error_message
                 << "\nError from segmentation group " << group_to_run->groupId()
                 << ": " << e.what() << "\n";
              thread_pool_error_message = ss.str();
            }
          });
        }
      }
    } catch (const std::exception& e) {
      // Before cleaning up unique_ptr-backed resources such as
      // SegmentedGroup, make sure all threads are done as they may
      // be still using the resources.
      getThreadPool()->waitWorkComplete();
      throw;
    }
    
    if (num_groups != 1 &&
        !isOptionDisabled(DisableOption::ParallelCompile)) {
      // Wait until all segments finish compiling
      getThreadPool()->waitWorkComplete();
      NVF_ERROR(

    Test failures

    • (Low, 1) Tensor numerical mismatches in nvFuser HopperMatmulTest suite

      Test Name H100 Source
      HopperMatmulTest.PingPongPersistent Link

    Add std::shared_mutex to IrContainer to protect shared mutable state
    during concurrent access from parallel compilation threads.
    
    - IrContainer public methods self-lock (shared_lock for reads,
      unique_lock for writes)
    - Fusion mutation methods (registerVal/Expr, removeVal/Expr,
      removeStatementsCreatedAfter) acquire unique_lock then delegate to
      lock-free ContainerMutator static methods, avoiding self-deadlock
      on nested calls (e.g., removeVal → removeExpr)
    - ContainerMutator is a PIMPL struct defined only in fusion.cpp,
      keeping lock-free impl methods out of the header
    - Remove kPhase2DisableParallelCompile guard, re-enabling parallel
      compilation now that the mutex is in place
    - Delete dead IrContainer::copy() and IrContainer::swap() methods
    @mdavis36 mdavis36 force-pushed the md/phase2-thread-safety branch from 2cac45c to 8fb976b on February 18, 2026 03:13
    @mdavis36 mdavis36 force-pushed the md/phase2-copy-move branch from 192fd55 to 35b7405 on February 18, 2026 03:13
    @mdavis36 mdavis36 changed the title from "[IR Container] Phase2 thread safety" to "[IR Container] Phase 2.6 Concurrency & Thread Safety" on Feb 18, 2026
    @mdavis36
    Collaborator Author

    !test

    @mdavis36 mdavis36 marked this pull request as ready for review February 18, 2026 06:38
    @greptile-apps
    Contributor

    greptile-apps bot commented Feb 18, 2026

    Greptile Summary

    This PR adds std::shared_mutex-based thread safety to IrContainer, introduces a ContainerMutator PIMPL pattern to avoid deadlocks from nested lock acquisitions in Fusion mutation methods, removes ~45 lines of dead IrContainer::copy()/IrContainer::swap() code, and re-enables parallel compilation by removing the kPhase2DisableParallelCompile serialization guard.

    • Two-layer locking architecture: Public IrContainer methods self-lock with shared_lock (reads) or unique_lock (writes). Fusion mutation methods (registerVal, registerExpr, removeVal, removeExpr, removeStatementsCreatedAfter) acquire unique_lock once, then delegate to ContainerMutator static methods that access fields lock-free. This avoids deadlocks from nested calls like removeVal → removeExpr.
    • Dead code removal: IrContainer::copy() and IrContainer::swap() had zero call sites and are removed cleanly.
    • Parallel compilation re-enabled: The kPhase2DisableParallelCompile guard is removed — Phase 2 doesn't share containers between threads, so parallel compilation is safe.
    • Phase 3 concern: Several reference-returning accessors (vals(), unordered_exprs(), sharingFusions(), valsOwnedBy(), exprsOwnedBy()) hold the lock only for the function scope, meaning callers hold unguarded references after the lock releases. This is safe in Phase 2 but will need revisiting in Phase 3 when containers are shared across threads.

    Confidence Score: 4/5

    • This PR is safe to merge for Phase 2 — the mutex is defensive since containers aren't shared across threads yet.
    • The two-layer locking architecture is well-designed and correctly avoids deadlocks from nested calls. Dead code removal is verified (zero call sites). The only concern is that reference-returning accessors will need revisiting for Phase 3, but the PR description acknowledges the phased approach. Re-enabling parallel compilation is low-risk since Phase 2 doesn't share containers.
    • csrc/ir/container.cpp — reference-returning accessors with short-lived locks will need Phase 3 attention.

    Important Files Changed

    csrc/fusion.cpp: Introduces the ContainerMutator PIMPL struct for lock-free mutation under a caller-held unique_lock and replaces inline Fusion methods with delegating two-line wrappers. Logic is correctly preserved from the original implementations.
    csrc/fusion.h: Adds forward declaration and friend for ContainerMutator. Minimal header change, clean PIMPL pattern.
    csrc/ir/container.cpp: Adds shared_mutex locking to all public accessors and unique_lock to all mutators. Removes dead IrContainer::copy() and IrContainer::swap(). Introduces inContainerImpl()/assertInContainerImpl() lock-free helpers. Reference-returning accessors have a lock-lifetime mismatch that will need Phase 3 attention.
    csrc/ir/container.h: Declares mutable std::shared_mutex mutex_, moves accessor implementations out-of-line, removes dead copy()/swap() declarations, adds private lock-free helpers.
    csrc/runtime/fusion_kernel_runtime.cpp: Removes the kPhase2DisableParallelCompile guard, re-enabling parallel compilation. Clean removal with no functional risk in Phase 2.

    Flowchart

    %%{init: {'theme': 'neutral'}}%%
    flowchart TD
        subgraph "Layer 1: Public API (self-locking)"
            A["Fusion::registerVal()"] -->|"unique_lock(mutex_)"| CM_RV["ContainerMutator::registerVal()"]
            B["Fusion::registerExpr()"] -->|"unique_lock(mutex_)"| CM_RE["ContainerMutator::registerExpr()"]
            C["Fusion::removeVal()"] -->|"unique_lock(mutex_)"| CM_RMV["ContainerMutator::removeVal()"]
            D["Fusion::removeExpr()"] -->|"unique_lock(mutex_)"| CM_RME["ContainerMutator::removeExpr()"]
            E["Fusion::removeStatementsCreatedAfter()"] -->|"unique_lock(mutex_)"| CM_RSA["ContainerMutator::removeStatementsCreatedAfter()"]
        end
    
        subgraph "Layer 2: Lock-free internals (ContainerMutator)"
            CM_RE -->|"nested call (no lock)"| CM_RME
            CM_RMV -->|"nested call (no lock)"| CM_RME
            CM_RME -->|"direct access"| IC["IrContainer fields: vals_up_, vals_, exprs_up_, exprs_, per_fusion_*"]
            CM_RV -->|"direct access"| IC
            CM_RSA -->|"direct access"| IC
        end
    
        subgraph "Read Path (shared_lock)"
            R1["IrContainer::vals()"] -->|"shared_lock(mutex_)"| IC
            R2["IrContainer::unordered_exprs()"] -->|"shared_lock(mutex_)"| IC
            R3["IrContainer::inContainer()"] -->|"shared_lock(mutex_)"| IMPL["inContainerImpl()"]
            IMPL --> IC
        end
    

    Last reviewed commit: 8fb976b

    Contributor

    @greptile-apps greptile-apps bot left a comment

    5 files reviewed, 2 comments

    Comment on lines +133 to +141
    const std::unordered_set<Expr*>& IrContainer::unordered_exprs() const noexcept {
      std::shared_lock lock(mutex_);
      return exprs_;
    }

    const std::unordered_set<Val*>& IrContainer::vals() const noexcept {
      std::shared_lock lock(mutex_);
      return vals_;
    }

    Reference outlives lock scope

    unordered_exprs(), vals(), sharingFusions(), valsOwnedBy(), and exprsOwnedBy() all return const& to internal data, but the shared_lock is released when the function returns. This means callers hold an unprotected reference to the data structure, and any concurrent writer (under unique_lock) could mutate or invalidate it.

    This is safe in Phase 2 (containers are not shared across threads), but in Phase 3 where multiple threads write to the same IrContainer, this pattern will produce data races. Consider documenting this as a known limitation to address in Phase 3, or returning by value for these accessors (at the cost of a copy).
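
    For reference, a minimal sketch of the return-by-value alternative mentioned above (valsSnapshot is a hypothetical name, not part of this PR):

    // Hypothetical snapshot accessor: copies the set while the shared_lock is
    // held, so the caller never observes a concurrent mutation.
    std::unordered_set<Val*> IrContainer::valsSnapshot() const {
      std::shared_lock lock(mutex_);
      return vals_;  // the returned copy is constructed before the lock releases
    }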

    @greptile-apps
    Contributor

    greptile-apps bot commented Feb 18, 2026

    Additional Comments (1)

    csrc/ir/container.cpp
    clear() does not acquire the mutex

    IrContainer::clear() modifies all internal data structures (vals_, exprs_, per_fusion_vals_, etc.) without acquiring mutex_. While clear() is only called from IrContainer::~IrContainer() and Fusion::copy() (which is called in the copy constructor/assignment), there is no lock protection if a concurrent thread holds a shared lock on the same container. Since clear() is protected, this may be intentional (caller guarantees exclusive access), but it's worth a brief comment to document that assumption for Phase 3 safety.
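
    A sketch of how that assumption could be recorded directly in clear() (comment added; body copied from the excerpt above, not part of this PR):

    void IrContainer::clear() noexcept {
      // Precondition: the caller guarantees exclusive access to this container.
      // clear() is protected and reached only from ~IrContainer() and
      // Fusion::copy(), so mutex_ is intentionally not acquired here.
      FUSER_PERF_SCOPE("IrContainer clear");
      vals_.clear();
      vals_up_.clear();
      exprs_.clear();
      exprs_up_.clear();
      val_type_name_map_.clear();
      expr_name_counter_ = 0;
      per_fusion_vals_.clear();
      per_fusion_exprs_.clear();
    }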


    @mdavis36
    Collaborator Author

    !test
