
Refactor aicpu_build_graph: ring buffers, explicit deps, scope-end publish #333

Merged
ChaoWao merged 1 commit into hw-native-sys:main from hw-native-sys-bot:refactor-aicpu-build-graph on Mar 20, 2026

Conversation


@hw-native-sys-bot hw-native-sys-bot commented Mar 20, 2026

Summary

Replace the old aicpu_build_graph runtime (fixed Task array, mutex scheduling, handshake dispatch, raw uint64_t args) with PTO2's ring buffer infrastructure while keeping explicit dependency management (no TensorMap).

  • Ring buffers: HeapRing, TaskRing, DepListPool for concurrent build+execute
  • Register-based dispatch: DATA_MAIN_BASE/COND protocol (same as tensormap_and_ringbuffer)
  • Scope-end batch publish: Tasks invisible until scope_end, enabling non-atomic dep wiring
  • Explicit add_dependency: No TensorMap lookup/insert overhead
  • PTOParam/Tensor args: Typed tensor args with HeapRing output allocation
  • dep_pool reclaim: Periodic reclamation of dependency pool entries from retired tasks
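
To make the "concurrent build+execute" idea above concrete, here is a minimal sketch of the kind of lock-free ring the infrastructure relies on. This is illustrative only: the names and layout are hypothetical stand-ins, not the actual HeapRing/TaskRing code. The producer (orchestrator) and consumer (scheduler) coordinate through monotonically increasing head/tail counters, so no mutex is needed on the submit path; capacity is assumed to be a power of two.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define RING_CAP 256  /* must be a power of two */

/* Single-producer/single-consumer ring: the orchestrator pushes task
 * slots while the scheduler pops them concurrently. head/tail only grow;
 * the index is taken modulo capacity via a mask. */
typedef struct {
    _Atomic size_t head;          /* next slot to consume */
    _Atomic size_t tail;          /* next slot to produce */
    unsigned long  slot[RING_CAP];
} Ring;

static int ring_push(Ring *r, unsigned long v) {
    size_t t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&r->head, memory_order_acquire);
    if (t - h == RING_CAP) return 0;              /* full */
    r->slot[t & (RING_CAP - 1)] = v;
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
    return 1;
}

static int ring_pop(Ring *r, unsigned long *out) {
    size_t h = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (h == t) return 0;                         /* empty */
    *out = r->slot[h & (RING_CAP - 1)];
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return 1;
}
```

The release store on `tail` pairs with the acquire load in `ring_pop`, so a popped slot's contents are always visible to the consumer before its index is.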

New API

```c
PTO2TaskId t0 = pto2_rt_submit_aic_task(rt, kernel_id, params);
PTO2TaskId t1 = pto2_rt_submit_aiv_task(rt, kernel_id, params);
pto2_rt_add_dependency(rt, t0, t1);
PTO2_SCOPE(rt) { /* submit + wire deps, batch publish at scope_end */ }
```
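
The scope-end publish semantics can be sketched with a toy stand-in for the runtime. Everything below is a stub written for illustration (the counters, `scope_begin`/`scope_end`, and the macro expansion are assumptions, not the real implementation): it only shows why dependency wiring inside a scope can be non-atomic, because none of the submitted tasks are visible to the scheduler until the scope closes.

```c
#include <assert.h>

/* Toy runtime: just enough state to show submission, wiring, and the
 * batch publish at scope_end. The real pto2_rt_* calls live in the
 * aicpu_build_graph runtime; these stubs only count events. */
typedef int PTO2TaskId;
typedef struct { const void *params; } PTO2Params;
typedef struct {
    int next_id;
    int published;   /* tasks visible to the scheduler */
    int pending;     /* tasks submitted in the open scope */
    int deps;        /* dependency edges wired so far */
} PTO2Runtime;

static PTO2TaskId pto2_rt_submit_aiv_task(PTO2Runtime *rt, int kid, PTO2Params p) {
    (void)kid; (void)p;
    rt->pending++;
    return rt->next_id++;        /* id handed back before publication */
}
static void pto2_rt_add_dependency(PTO2Runtime *rt, PTO2TaskId from, PTO2TaskId to) {
    (void)from; (void)to;
    rt->deps++;                  /* safe non-atomically: tasks not yet visible */
}
static void scope_begin(PTO2Runtime *rt) { rt->pending = 0; }
static void scope_end(PTO2Runtime *rt)   { rt->published += rt->pending; rt->pending = 0; }

/* Hypothetical expansion of PTO2_SCOPE: run the body once, then publish. */
#define PTO2_SCOPE(rt) \
    for (int _s = (scope_begin(rt), 0); _s == 0; _s = (scope_end(rt), 1))

/* 4-task diamond, as in the migrated vector_example: a -> b, a -> c,
 * b -> d, c -> d. */
static void build_diamond(PTO2Runtime *rt) {
    PTO2Params p = {0};
    PTO2_SCOPE(rt) {
        PTO2TaskId a = pto2_rt_submit_aiv_task(rt, /*kernel_id=*/0, p);
        PTO2TaskId b = pto2_rt_submit_aiv_task(rt, 1, p);
        PTO2TaskId c = pto2_rt_submit_aiv_task(rt, 2, p);
        PTO2TaskId d = pto2_rt_submit_aiv_task(rt, 3, p);
        pto2_rt_add_dependency(rt, a, b);
        pto2_rt_add_dependency(rt, a, c);
        pto2_rt_add_dependency(rt, b, d);
        pto2_rt_add_dependency(rt, c, d);
    }   /* all four tasks become visible here, in one batch */
}
```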

Examples migrated

  • vector_example: 4-task diamond DAG (sim-tested PASS)
  • bgemm: 128-task tiled matmul with K-accumulation chain (sim-tested PASS)
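
The bgemm K-accumulation chain mentioned above has a simple dependency shape, sketched here with hypothetical tile/step counts (16 tiles x 8 K-steps = 128 tasks; the real example's tiling may differ). For each output tile, the K partial-sum tasks form a chain where step k waits on step k-1, so partial accumulations into the same tile never race:

```c
#include <assert.h>
#include <stddef.h>

#define TILES  16   /* output tiles (hypothetical) */
#define KSTEPS 8    /* K partial sums per tile -> 16*8 = 128 tasks */

typedef struct { int from, to; } Edge;

/* Fills `edges` with the chain dependencies that explicit
 * add_dependency calls would wire, and returns the edge count. */
static size_t wire_k_chains(Edge *edges) {
    size_t n = 0;
    for (int tile = 0; tile < TILES; tile++) {
        for (int k = 1; k < KSTEPS; k++) {
            int prev = tile * KSTEPS + (k - 1);   /* task id of step k-1 */
            int cur  = tile * KSTEPS + k;
            edges[n++] = (Edge){ prev, cur };     /* cur waits on prev */
        }
    }
    return n;   /* TILES * (KSTEPS - 1) edges */
}
```

Chains for different tiles share no edges, so the scheduler is free to run them in parallel while each tile's accumulation stays ordered.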

New tests

  • paged_attention: Per-block paged attention (batch=256, 64 KV blocks/batch)
  • paged_attention_unroll: Unrolled variant with grouped blocks

Benchmark updated

  • tools/benchmark_rounds.sh: added -r/--runtime flag for runtime selection
  • benchmark-pr skill: auto-detects affected runtimes

Performance (Paged Attention Unroll, 10 rounds, Ascend 910)

| Case  | Config                 | aicpu_build_graph | tensormap_and_ringbuffer | Delta |
|-------|------------------------|-------------------|--------------------------|-------|
| Case1 | b256, h16, d128, bs128 | 1395 us           | 1417 us                  | -1.6% |
| Case2 | b64, h64, d128, bs64   | 716 us            | 764 us                   | -6.3% |

The improvement comes from eliminating TensorMap lookup/insert in the orchestration submit path.

Test plan

  • vector_example simulation test PASS
  • bgemm simulation test PASS
  • paged_attention hardware test PASS (batch=256, all 524288 elements matched)
  • paged_attention_unroll Case1+Case2 hardware test PASS
  • tensormap_and_ringbuffer examples unaffected (no regression)
  • CI: ./ci.sh -p a2a3sim — 12/12 PASS
  • CI: ./ci.sh -p a2a3 -d 4-7 --parallel — 21/21 PASS

🤖 Generated with Claude Code

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly overhauls the aicpu_build_graph runtime by integrating the advanced PTO2 architecture. The core purpose is to enhance performance and scalability through optimized task orchestration and resource management. This involves a fundamental shift in how tasks are defined, dependencies are managed, and intermediate buffers are handled, moving towards a more explicit and efficient device-side graph building process.

Highlights

  • Runtime Refactor to PTO2 Infrastructure: The aicpu_build_graph runtime has been refactored to utilize PTO2's ring buffer infrastructure, replacing the older fixed Task array, mutex scheduling, and raw uint64_t arguments with a more efficient and modern approach.
  • Explicit Dependency Management: The new design employs explicit dependency management via pto2_rt_add_dependency, eliminating the overhead of TensorMap lookups and inserts in the orchestration submit path, leading to performance improvements.
  • Scope-End Batch Publishing: Tasks are now batch-published at scope_end using PTO2_SCOPE, making them invisible until the scope concludes. This enables non-atomic dependency wiring within a scope and improves scheduling efficiency.
  • New API and Examples: A new API (pto2_rt_submit_aic_task, pto2_rt_submit_aiv_task, pto2_rt_add_dependency, PTO2_SCOPE) has been introduced. Existing examples like vector_example and bgemm have been migrated, and a new paged_attention_unroll test has been ported to the aicpu_build_graph runtime.
  • Performance Improvements: Benchmarks show performance gains, particularly in scenarios with higher task submission density, due to the elimination of TensorMap overhead.

poursoul previously approved these changes Mar 20, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request is a significant and impressive refactoring of the aicpu_build_graph runtime. It replaces the old implementation with the new PTO2 ring buffer infrastructure, introducing explicit dependency management, scope-based batch publishing of tasks, and a register-based dispatch protocol. This eliminates the overhead of TensorMap lookups, leading to performance improvements as shown in the benchmarks. The code is well-structured, introducing new components like a decoupled orchestrator and scheduler, a lock-free MPMC ready queue, and robust ring buffer implementations with deadlock detection. The changes are extensive, touching everything from example orchestration logic and kernels to the core runtime implementation on host, AICPU, and AICore. The addition of a new, complex paged_attention_unroll test case demonstrates the capabilities of the new runtime. My review found a few minor areas for improvement, mainly related to code clarity and cleanup.

@hw-native-sys-bot hw-native-sys-bot force-pushed the refactor-aicpu-build-graph branch 5 times, most recently from 278420b to d5c7387 on March 20, 2026 at 06:55

Refactor aicpu_build_graph: ring buffers, explicit deps, scope-end publish

Replace the old aicpu_build_graph runtime (fixed Task array, mutex
scheduling, handshake dispatch, raw uint64_t args) with PTO2's ring
buffer infrastructure while keeping explicit dependency management
(no TensorMap).

- Ring buffers: HeapRing, TaskRing, DepListPool for concurrent build+execute
- Register-based dispatch: DATA_MAIN_BASE/COND protocol
- Scope-end batch publish: tasks invisible until scope_end
- Explicit add_dependency: no TensorMap lookup/insert overhead
- PTOParam/Tensor args: typed tensor args with HeapRing output allocation
- dep_pool reclaim: periodic reclamation prevents pool exhaustion

Examples migrated: vector_example, bgemm
New tests: paged_attention, paged_attention_unroll
Benchmark: added -r/--runtime flag to benchmark_rounds.sh

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@hw-native-sys-bot hw-native-sys-bot force-pushed the refactor-aicpu-build-graph branch from d5c7387 to 01ada96 on March 20, 2026 at 07:14
@ChaoWao ChaoWao merged commit a21caa1 into hw-native-sys:main Mar 20, 2026
5 checks passed
@hw-native-sys-bot hw-native-sys-bot deleted the refactor-aicpu-build-graph branch March 22, 2026 00:59