Refactor aicpu_build_graph: ring buffers, explicit deps, scope-end publish by hw-native-sys-bot · Pull Request #333 · hw-native-sys/simpler

hw-native-sys-bot · 2026-03-20T01:10:58Z

Summary

Replace the old aicpu_build_graph runtime (fixed Task array, mutex scheduling, handshake dispatch, raw uint64_t args) with PTO2's ring buffer infrastructure while keeping explicit dependency management (no TensorMap).

Ring buffers: HeapRing, TaskRing, DepListPool for concurrent build+execute
Register-based dispatch: DATA_MAIN_BASE/COND protocol (same as tensormap_and_ringbuffer)
Scope-end batch publish: Tasks invisible until scope_end, enabling non-atomic dep wiring
Explicit add_dependency: No TensorMap lookup/insert overhead
PTOParam/Tensor args: Typed tensor args with HeapRing output allocation
dep_pool reclaim: Periodic reclamation of dependency pool entries from retired tasks
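
As a rough illustration of the ring-plus-reclaim idea in the list above, here is a minimal single-threaded sketch. The names `ToyTaskRing`, `try_alloc`, `retire`, and `reclaim` are hypothetical and not taken from this PR; the actual TaskRing and DepListPool are concurrent and very likely differ in detail. The sketch assumes slots can only be reclaimed in allocation order, which is why the reclaim step matters: one old unretired task keeps every later slot occupied.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical toy model of a fixed-capacity task ring (not the PR's code).
// head_ is the next id to hand out, tail_ is the oldest id not yet reclaimed.
class ToyTaskRing {
public:
    explicit ToyTaskRing(std::size_t capacity) : retired_(capacity, false) {}

    // Hand out the next task id, or nullopt when every slot is still in use.
    std::optional<std::uint64_t> try_alloc() {
        if (head_ - tail_ == retired_.size()) return std::nullopt;
        std::uint64_t id = head_++;
        retired_[id % retired_.size()] = false;  // fresh slot, not retired yet
        return id;
    }

    // Mark a live task as finished. The slot is not reusable until reclaim().
    void retire(std::uint64_t id) {
        assert(id >= tail_ && id < head_);
        retired_[id % retired_.size()] = true;
    }

    // Advance tail_ past the contiguous run of retired tasks.
    // Returns how many slots were freed.
    std::size_t reclaim() {
        std::size_t freed = 0;
        while (tail_ < head_ && retired_[tail_ % retired_.size()]) {
            ++tail_;
            ++freed;
        }
        return freed;
    }

    std::size_t in_flight() const { return static_cast<std::size_t>(head_ - tail_); }

private:
    std::vector<bool> retired_;
    std::uint64_t head_ = 0;
    std::uint64_t tail_ = 0;
};
```

In this toy model, retiring task 1 while task 0 is still running frees nothing; both slots come back only once task 0 retires and `reclaim()` runs again.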

New API

PTO2TaskId t0 = pto2_rt_submit_aic_task(rt, kernel_id, params);
PTO2TaskId t1 = pto2_rt_submit_aiv_task(rt, kernel_id, params);
pto2_rt_add_dependency(rt, t0, t1);
PTO2_SCOPE(rt) { /* submit + wire deps, batch publish at scope_end */ }
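
To make the scope-end publish idea concrete, the following is a self-contained toy model, not the PR's runtime. `ToyRuntime`, `ToyTaskId`, `submit`, `scope_begin`, `scope_end`, `run_all`, and `run_diamond` are all hypothetical names, and the argument order `add_dependency(producer, consumer)` is an assumption made for this sketch. It only handles dependencies between tasks of the same scope. The point it tries to show: because nothing is visible to the scheduler before `scope_end`, dependency counters can be plain integers while the graph is being wired.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

using ToyTaskId = std::size_t;  // hypothetical stand-in for a task handle

class ToyRuntime {
    struct Task {
        std::vector<ToyTaskId> successors;  // tasks waiting on this one
        std::size_t pending = 0;            // unfinished predecessors
    };

public:
    void scope_begin() { scope_start_ = tasks_.size(); }

    // Stage a task. It stays invisible to the ready queue until scope_end().
    ToyTaskId submit() {
        tasks_.push_back(Task{});
        return tasks_.size() - 1;
    }

    // Plain (non-atomic) wiring is safe here because no staged task can run yet.
    void add_dependency(ToyTaskId producer, ToyTaskId consumer) {
        assert(producer >= scope_start_ && consumer >= scope_start_);
        tasks_[producer].successors.push_back(consumer);
        ++tasks_[consumer].pending;
    }

    // Batch publish: every staged task without predecessors becomes ready.
    void scope_end() {
        for (ToyTaskId id = scope_start_; id < tasks_.size(); ++id)
            if (tasks_[id].pending == 0) ready_.push_back(id);
        scope_start_ = tasks_.size();
    }

    std::size_t ready_count() const { return ready_.size(); }

    // Drain the ready queue in FIFO order, releasing successors as tasks finish.
    std::vector<ToyTaskId> run_all() {
        std::vector<ToyTaskId> order;
        while (!ready_.empty()) {
            ToyTaskId id = ready_.front();
            ready_.pop_front();
            order.push_back(id);
            for (ToyTaskId succ : tasks_[id].successors)
                if (--tasks_[succ].pending == 0) ready_.push_back(succ);
        }
        return order;
    }

private:
    std::vector<Task> tasks_;
    std::deque<ToyTaskId> ready_;
    std::size_t scope_start_ = 0;
};

// Four-task diamond: a feeds b and c, which both feed d.
std::vector<ToyTaskId> run_diamond(ToyRuntime& rt) {
    rt.scope_begin();
    ToyTaskId a = rt.submit(), b = rt.submit(), c = rt.submit(), d = rt.submit();
    rt.add_dependency(a, b);
    rt.add_dependency(a, c);
    rt.add_dependency(b, d);
    rt.add_dependency(c, d);
    assert(rt.ready_count() == 0);  // nothing published before scope_end
    rt.scope_end();
    return rt.run_all();
}
```

Calling `run_diamond` on a fresh `ToyRuntime` executes the tasks in the order 0, 1, 2, 3: only the root is ready at publish time, and the sink runs after both middle tasks have finished.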

Examples migrated

vector_example: 4-task diamond DAG (sim-tested PASS)
bgemm: 128-task tiled matmul with K-accumulation chain (sim-tested PASS)

New tests

paged_attention: Per-block paged attention (batch=256, 64 KV blocks/batch)
paged_attention_unroll: Unrolled variant with grouped blocks

Benchmark updated

tools/benchmark_rounds.sh: added -r/--runtime flag for runtime selection
benchmark-pr skill: auto-detects affected runtimes

Performance (Paged Attention Unroll, 10 rounds, Ascend 910)

Case	Config	aicpu_build_graph	tensormap_and_ringbuffer	Delta
Case1	b256, h16, d128, bs128	1395 us	1417 us	-1.6%
Case2	b64, h64, d128, bs64	716 us	764 us	-6.3%

The improvement comes from eliminating TensorMap lookup/insert in the orchestration submit path.
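
The Delta column is consistent with a relative change computed against tensormap_and_ringbuffer as the baseline, rounded to one decimal place. That reading is an inference from the numbers, not something the PR states. A small check, with the hypothetical helpers `delta_percent` and `round1`:

```cpp
#include <cassert>
#include <cmath>

// Relative change of candidate versus baseline, in percent.
// Negative means the candidate is faster (lower latency).
double delta_percent(double candidate_us, double baseline_us) {
    return (candidate_us - baseline_us) / baseline_us * 100.0;
}

// Round to one decimal place, matching the precision used in the table.
double round1(double x) { return std::round(x * 10.0) / 10.0; }
```

With the table's values, `round1(delta_percent(1395, 1417))` gives -1.6 and `round1(delta_percent(716, 764))` gives -6.3, matching Case1 and Case2.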

Test plan

vector_example simulation test PASS
bgemm simulation test PASS
paged_attention hardware test PASS (batch=256, all 524288 elements matched)
paged_attention_unroll Case1+Case2 hardware test PASS
tensormap_and_ringbuffer examples unaffected (no regression)
CI: ./ci.sh -p a2a3sim — 12/12 PASS
CI: ./ci.sh -p a2a3 -d 4-7 --parallel — 21/21 PASS

🤖 Generated with Claude Code

gemini-code-assist · 2026-03-20T01:12:27Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly overhauls the aicpu_build_graph runtime by integrating the advanced PTO2 architecture. The core purpose is to enhance performance and scalability through optimized task orchestration and resource management. This involves a fundamental shift in how tasks are defined, dependencies are managed, and intermediate buffers are handled, moving towards a more explicit and efficient device-side graph building process.

Highlights

Runtime Refactor to PTO2 Infrastructure: The aicpu_build_graph runtime has been refactored to utilize PTO2's ring buffer infrastructure, replacing the older fixed Task array, mutex scheduling, and raw uint64_t arguments with a more efficient and modern approach.
Explicit Dependency Management: The new design employs explicit dependency management via pto2_rt_add_dependency, eliminating the overhead of TensorMap lookups and inserts in the orchestration submit path, leading to performance improvements.
Scope-End Batch Publishing: Tasks are now batch-published at scope_end using PTO2_SCOPE, making them invisible until the scope concludes. This enables non-atomic dependency wiring within a scope and improves scheduling efficiency.
New API and Examples: A new API (pto2_rt_submit_aic_task, pto2_rt_submit_aiv_task, pto2_rt_add_dependency, PTO2_SCOPE) has been introduced. Existing examples like vector_example and bgemm have been migrated, and a new paged_attention_unroll test has been ported to the aicpu_build_graph runtime.
Performance Improvements: Benchmarks show performance gains, particularly in scenarios with higher task submission density, due to the elimination of TensorMap overhead.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page. Here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request is a significant and impressive refactoring of the aicpu_build_graph runtime. It replaces the old implementation with the new PTO2 ring buffer infrastructure, introducing explicit dependency management, scope-based batch publishing of tasks, and a register-based dispatch protocol. This eliminates the overhead of TensorMap lookups, leading to performance improvements as shown in the benchmarks. The code is well-structured, introducing new components like a decoupled orchestrator and scheduler, a lock-free MPMC ready queue, and robust ring buffer implementations with deadlock detection. The changes are extensive, touching everything from example orchestration logic and kernels to the core runtime implementation on host, AICPU, and AICore. The addition of a new, complex paged_attention_unroll test case demonstrates the capabilities of the new runtime. My review found a few minor areas for improvement, mainly related to code clarity and cleanup.
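
The review mentions a lock-free MPMC ready queue, but its code is not shown on this page. For readers unfamiliar with the pattern, below is a sketch of one common bounded design, following Dmitry Vyukov's widely used sequence-number scheme. It is not the PR's implementation; `ToyReadyQueue`, `try_push`, and `try_pop` are hypothetical names, and the capacity is assumed to be a power of two of at least 2.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Bounded multi-producer/multi-consumer queue of task ids (illustrative only).
// Each cell carries a sequence number that says whose turn it is to use it.
class ToyReadyQueue {
    struct Cell {
        std::atomic<std::size_t> seq;
        std::uint64_t value;
    };

public:
    explicit ToyReadyQueue(std::size_t capacity_pow2)
        : mask_(capacity_pow2 - 1), cells_(new Cell[capacity_pow2]) {
        assert(capacity_pow2 >= 2 && (capacity_pow2 & mask_) == 0);
        for (std::size_t i = 0; i < capacity_pow2; ++i)
            cells_[i].seq.store(i, std::memory_order_relaxed);
    }

    // Returns false when the queue is full.
    bool try_push(std::uint64_t task_id) {
        std::size_t pos = tail_.load(std::memory_order_relaxed);
        for (;;) {
            Cell& cell = cells_[pos & mask_];
            std::size_t seq = cell.seq.load(std::memory_order_acquire);
            auto diff = static_cast<std::intptr_t>(seq) - static_cast<std::intptr_t>(pos);
            if (diff == 0) {
                // Cell is free for this position; try to claim it.
                if (tail_.compare_exchange_weak(pos, pos + 1, std::memory_order_relaxed)) {
                    cell.value = task_id;
                    cell.seq.store(pos + 1, std::memory_order_release);  // now poppable
                    return true;
                }
                // CAS failure reloaded pos; retry with the new position.
            } else if (diff < 0) {
                return false;  // cell still holds an unconsumed item: full
            } else {
                pos = tail_.load(std::memory_order_relaxed);  // lost a race
            }
        }
    }

    // Returns false when the queue is empty.
    bool try_pop(std::uint64_t& out) {
        std::size_t pos = head_.load(std::memory_order_relaxed);
        for (;;) {
            Cell& cell = cells_[pos & mask_];
            std::size_t seq = cell.seq.load(std::memory_order_acquire);
            auto diff = static_cast<std::intptr_t>(seq) - static_cast<std::intptr_t>(pos + 1);
            if (diff == 0) {
                if (head_.compare_exchange_weak(pos, pos + 1, std::memory_order_relaxed)) {
                    out = cell.value;
                    // Hand the cell back to producers one lap later.
                    cell.seq.store(pos + mask_ + 1, std::memory_order_release);
                    return true;
                }
            } else if (diff < 0) {
                return false;  // nothing published at this position yet: empty
            } else {
                pos = head_.load(std::memory_order_relaxed);
            }
        }
    }

private:
    const std::size_t mask_;
    std::unique_ptr<Cell[]> cells_;
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};
```

The design choice worth noting is that producers and consumers only contend on one counter each (`tail_` or `head_`), and the per-cell sequence number handles the handoff, so no lock is held while a task id is written or read.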

src/a2a3/runtime/aicpu_build_graph/host/runtime_compile_info.cpp

src/a2a3/runtime/aicpu_build_graph/runtime/data_type.h

examples/a2a3/aicpu_build_graph/vector_example/kernels/orchestration/orchestration.cpp

…blish

Replace the old aicpu_build_graph runtime (fixed Task array, mutex scheduling, handshake dispatch, raw uint64_t args) with PTO2's ring buffer infrastructure while keeping explicit dependency management (no TensorMap).

- Ring buffers: HeapRing, TaskRing, DepListPool for concurrent build+execute
- Register-based dispatch: DATA_MAIN_BASE/COND protocol
- Scope-end batch publish: tasks invisible until scope_end
- Explicit add_dependency: no TensorMap lookup/insert overhead
- PTOParam/Tensor args: typed tensor args with HeapRing output allocation
- dep_pool reclaim: periodic reclamation prevents pool exhaustion

Examples migrated: vector_example, bgemm
New tests: paged_attention, paged_attention_unroll
Benchmark: added -r/--runtime flag to benchmark_rounds.sh

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

poursoul previously approved these changes Mar 20, 2026

gemini-code-assist bot reviewed Mar 20, 2026

hw-native-sys-bot dismissed poursoul’s stale review via 6dbe361 March 20, 2026 01:26

hw-native-sys-bot force-pushed the refactor-aicpu-build-graph branch 5 times, most recently from 278420b to d5c7387 Compare March 20, 2026 06:55

hw-native-sys-bot force-pushed the refactor-aicpu-build-graph branch from d5c7387 to 01ada96 Compare March 20, 2026 07:14

ChaoWao approved these changes Mar 20, 2026

jvjhfhg approved these changes Mar 20, 2026

ChaoWao merged commit a21caa1 into hw-native-sys:main Mar 20, 2026
5 checks passed

hw-native-sys-bot deleted the refactor-aicpu-build-graph branch March 22, 2026 00:59

ChaoWao merged 1 commit into hw-native-sys:main from hw-native-sys-bot:refactor-aicpu-build-graph