sync(a2a3): unify scheduling queues and optimize core allocation performance #327

Songyangyang18 wants to merge 1 commit into hw-native-sys:main
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly refactors the scheduling and resource allocation mechanisms to enhance performance and flexibility. By unifying the scheduling queues and removing the traditional cluster-based resource constraints, it streamlines task dispatch and improves the utilization of AIC and AIV cores. The changes also introduce targeted optimizations for core allocation, leading to a more efficient and robust system.
Warning: Gemini encountered an error creating the review. You can try again by commenting /gemini review.

/gemini review
Code Review
This pull request successfully unifies the scheduling queues and optimizes core allocation performance, aligning with the stated objectives. The changes effectively remove multi-resource shape classification, merge independent ready queues into a single global queue, and introduce fast-path core allocation methods. The code refactoring for PTO2ResourceShape and related functions is clean and directly addresses the goal of simplifying the scheduling model. The introduction of take_last_idle() and dispatch_from_idle_pool() provides the intended performance benefits. Overall, the changes are well-implemented and improve the system's efficiency and resource utilization.
```cpp
PTO2TaskSlotState* get_ready_task(PTO2ResourceShape) {
    return ready_queues[0].pop();
```
The get_ready_task(PTO2ResourceShape) overload now takes a PTO2ResourceShape parameter but explicitly ignores it. This can be misleading for callers who might expect shape-based filtering. If this overload is no longer needed, consider removing it. If it's kept for backward compatibility, a comment explaining its behavior (e.g., "Legacy adapter, always returns from unified queue") would improve clarity.
```cpp
PTO2TaskSlotState* get_ready_task() {
    return ready_queues[0].pop();
}

// Legacy adapter, always returns from unified queue
PTO2TaskSlotState* get_ready_task(PTO2ResourceShape) {
    return ready_queues[0].pop();
}
```

```cpp
void dispatch_from_idle_pool(
    Runtime* runtime, CoreStateTracker& tracker, int32_t* executing_reg_task_ids,
    CoreType core_type, PTO2TaskSlotState& slot_state,
    PTO2SubtaskSlot subslot
#if PTO2_PROFILING
    , bool profiling_enabled, int32_t thread_idx
#endif
) {
    CoreTypeTracker& ct = tracker.by_type[static_cast<int32_t>(core_type)];
    int32_t core_id = ct.take_last_idle();
    if (core_id < 0) return;

    PTO2DispatchPayload& payload = s_pto2_payload_per_core[core_id];
    PTO2TaskDescriptor& task = *slot_state.task;
    int32_t slot_idx = static_cast<int32_t>(subslot);
    build_pto2_payload(payload, task.kernel_id[slot_idx], *slot_state.payload);
    executing_subslot_by_core_[core_id] = subslot;
    executing_slot_state_by_core_[core_id] = &slot_state;
#if PTO2_PROFILING
    if (profiling_enabled) {
        dispatch_timestamps_[core_id] = get_sys_cnt_aicpu();
        if (core_dispatch_counts_[core_id] >= PLATFORM_PROF_BUFFER_SIZE) {
            perf_aicpu_switch_buffer(runtime, core_id, thread_idx);
            core_dispatch_counts_[core_id] = 0;
        }
        core_dispatch_counts_[core_id]++;
    }
#endif
    dispatch_seq_by_core_[core_id]++;
    uint32_t reg_task_id = dispatch_seq_by_core_[core_id] & TASK_ID_MASK;
    while (reg_task_id == AICORE_IDLE_TASK_ID ||
           (reg_task_id + 1) == AICORE_EXIT_SIGNAL) {
        dispatch_seq_by_core_[core_id]++;
        reg_task_id = dispatch_seq_by_core_[core_id] & TASK_ID_MASK;
    }
    write_reg(core_id_to_reg_addr_[core_id], RegId::DATA_MAIN_BASE, static_cast<uint64_t>(reg_task_id));

    tracker.core_idle[core_id] = false;
    executing_reg_task_ids[core_id] = reg_task_id;
}
```
The dispatch_from_idle_pool method uses ct.take_last_idle() to select a core. While this provides an O(1) allocation, it always picks the last available core from the idle array. Depending on how cores are added to this array, this might not always be the most optimal choice in terms of cache locality or other performance characteristics if multiple idle cores are available. This is a design trade-off for simplicity and O(1) performance, but it's worth noting that a more sophisticated core selection strategy (e.g., based on recent usage or physical proximity) could potentially yield further performance gains in specific scenarios, though it would likely increase complexity.
Key Changes