sync(a2a3): unify scheduling queues and optimize core allocation performance #327

Songyangyang18 wants to merge 1 commit into hw-native-sys:main
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly refactors the scheduling and resource allocation mechanisms to enhance performance and flexibility. By unifying the scheduling queues and removing the traditional cluster-based resource constraints, it streamlines task dispatch and improves the utilization of AIC and AIV cores. The changes also introduce targeted optimizations for core allocation, leading to a more efficient and robust system.
Warning: Gemini encountered an error creating the review. You can try again by commenting /gemini review.

/gemini review
Code Review
This pull request successfully unifies the scheduling queues and optimizes core allocation performance, aligning with the stated objectives. The changes effectively remove multi-resource shape classification, merge independent ready queues into a single global queue, and introduce fast-path core allocation methods. The code refactoring for PTO2ResourceShape and related functions is clean and directly addresses the goal of simplifying the scheduling model. The introduction of take_last_idle() and dispatch_from_idle_pool() provides the intended performance benefits. Overall, the changes are well-implemented and improve the system's efficiency and resource utilization.
```cpp
PTO2TaskSlotState* get_ready_task(PTO2ResourceShape) {
    return ready_queues[0].pop();
```
The get_ready_task(PTO2ResourceShape) overload now takes a PTO2ResourceShape parameter but explicitly ignores it. This can be misleading for callers who might expect shape-based filtering. If this overload is no longer needed, consider removing it. If it's kept for backward compatibility, a comment explaining its behavior (e.g., "Legacy adapter, always returns from unified queue") would improve clarity.
```cpp
PTO2TaskSlotState* get_ready_task() {
    return ready_queues[0].pop();
}

// Legacy adapter, always returns from unified queue
PTO2TaskSlotState* get_ready_task(PTO2ResourceShape) {
    return ready_queues[0].pop();
}
```

```cpp
void dispatch_from_idle_pool(
    Runtime* runtime, CoreStateTracker& tracker, int32_t* executing_reg_task_ids,
    CoreType core_type, PTO2TaskSlotState& slot_state,
    PTO2SubtaskSlot subslot
#if PTO2_PROFILING
    , bool profiling_enabled, int32_t thread_idx
#endif
) {
    CoreTypeTracker& ct = tracker.by_type[static_cast<int32_t>(core_type)];
    int32_t core_id = ct.take_last_idle();
    if (core_id < 0) return;

    PTO2DispatchPayload& payload = s_pto2_payload_per_core[core_id];
    PTO2TaskDescriptor& task = *slot_state.task;
    int32_t slot_idx = static_cast<int32_t>(subslot);
    build_pto2_payload(payload, task.kernel_id[slot_idx], *slot_state.payload);
    executing_subslot_by_core_[core_id] = subslot;
    executing_slot_state_by_core_[core_id] = &slot_state;
#if PTO2_PROFILING
    if (profiling_enabled) {
        dispatch_timestamps_[core_id] = get_sys_cnt_aicpu();
        if (core_dispatch_counts_[core_id] >= PLATFORM_PROF_BUFFER_SIZE) {
            perf_aicpu_switch_buffer(runtime, core_id, thread_idx);
            core_dispatch_counts_[core_id] = 0;
        }
        core_dispatch_counts_[core_id]++;
    }
#endif
    dispatch_seq_by_core_[core_id]++;
    uint32_t reg_task_id = dispatch_seq_by_core_[core_id] & TASK_ID_MASK;
    while (reg_task_id == AICORE_IDLE_TASK_ID ||
           (reg_task_id + 1) == AICORE_EXIT_SIGNAL) {
        dispatch_seq_by_core_[core_id]++;
        reg_task_id = dispatch_seq_by_core_[core_id] & TASK_ID_MASK;
    }
    write_reg(core_id_to_reg_addr_[core_id], RegId::DATA_MAIN_BASE, static_cast<uint64_t>(reg_task_id));

    tracker.core_idle[core_id] = false;
    executing_reg_task_ids[core_id] = reg_task_id;
}
```
The dispatch_from_idle_pool method uses ct.take_last_idle() to select a core. While this provides an O(1) allocation, it always picks the last available core from the idle array. Depending on how cores are added to this array, this might not always be the most optimal choice in terms of cache locality or other performance characteristics if multiple idle cores are available. This is a design trade-off for simplicity and O(1) performance, but it's worth noting that a more sophisticated core selection strategy (e.g., based on recent usage or physical proximity) could potentially yield further performance gains in specific scenarios, though it would likely increase complexity.
Key Changes