Fix: remove DEV_ALWAYS init cost from profiling timestamps #338
The first DEV_ALWAYS call on AICPU has ~50us of initialization overhead. When profiling timestamps were printed inline via DEV_ALWAYS, this overhead inflated the measured orch_start→end interval.
- Capture timestamps in variables before work, print together after
- Move orch_stage_end capture before transition_requested_ store
- Fix aicpu_build_graph log format (Thread=%d → Thread %d:) to match benchmark_rounds.sh parser
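The capture-then-print change in the first bullet can be illustrated with a minimal sketch. This is not the PR's code: `FakeDevice`, its `log` method (a stand-in for DEV_ALWAYS), and the fake microsecond clock are hypothetical, and the 50-tick init cost simply mirrors the ~50us figure quoted above.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical stand-in for the device runtime. The first log() call pays a
// one-time initialization cost, modeled by advancing a fake clock by 50 us.
struct FakeDevice {
    uint64_t now_us = 0;
    bool log_initialized = false;

    uint64_t timestamp() const { return now_us; }

    // Stand-in for DEV_ALWAYS (assumed behavior, not the real macro).
    void log(const char* tag, uint64_t value) {
        if (!log_initialized) {
            now_us += 50;  // assumed one-time init overhead
            log_initialized = true;
        }
        std::printf("%s %llu\n", tag, static_cast<unsigned long long>(value));
    }

    void do_work(uint64_t cost_us) { now_us += cost_us; }
};

// Before: logging inline places the first-call init cost inside the interval.
uint64_t measure_inline(FakeDevice& dev, uint64_t work_us) {
    uint64_t start = dev.timestamp();
    dev.log("orch_start", start);  // first log call: init cost lands here
    dev.do_work(work_us);
    uint64_t end = dev.timestamp();
    dev.log("orch_end", end);
    return end - start;
}

// After: capture both timestamps in variables, then print them together.
uint64_t measure_deferred(FakeDevice& dev, uint64_t work_us) {
    uint64_t start = dev.timestamp();
    dev.do_work(work_us);
    uint64_t end = dev.timestamp();
    dev.log("orch_start", start);  // init cost now falls outside the interval
    dev.log("orch_end", end);
    return end - start;
}
```

With 10 us of simulated work on a fresh `FakeDevice`, `measure_inline` reports 60 us while `measure_deferred` reports 10 us; the difference is exactly the modeled init cost.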
Summary of Changes
This pull request significantly enhances the accuracy of performance profiling within the AICPU runtime by addressing an issue where an initial …
Code Review
This pull request effectively addresses an issue where logging overhead was skewing profiling measurements. The changes correctly refactor the code to capture timestamps before work and log them after, which will lead to more accurate profiling. Moving a logging call after a critical store operation is also a good improvement to reduce latency. The standardization of log formats is a welcome addition for consistency and easier parsing. The changes are well-implemented and consistent across the modified files.
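The ordering described here (capture the timestamp, perform the critical store, log last) might look roughly like the sketch below. The names `orch_stage_end` and `transition_requested_` come from the PR description; the `Orchestrator` type, the atomic flag type, the fake clock, and the `events` trace are assumptions made purely for illustration.

```cpp
#include <atomic>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical orchestrator; only the two identifiers named in the PR
// (orch_stage_end, transition_requested_) are taken from the source.
struct Orchestrator {
    std::atomic<bool> transition_requested_{false};  // type is an assumption
    std::vector<std::string> events;                 // ordering trace for illustration
    uint64_t fake_clock = 0;

    uint64_t timestamp() { return ++fake_clock; }

    // Stand-in for a potentially slow logging call such as DEV_ALWAYS.
    void log(const std::string& line) { events.push_back("log:" + line); }

    uint64_t finish_stage() {
        // 1. Capture the end timestamp before anything slow happens.
        uint64_t orch_stage_end = timestamp();
        events.push_back("capture");

        // 2. Publish the transition request; other threads may be waiting on it.
        transition_requested_.store(true, std::memory_order_release);
        events.push_back("store");

        // 3. Only now pay for logging, so it cannot delay the store.
        log("orch_stage_end=" + std::to_string(orch_stage_end));
        return orch_stage_end;
    }
};
```

The point of the ordering is that any logging latency is paid after the store has been published, so threads waiting on the flag are not held up by it.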
Summary
- The first DEV_ALWAYS call on AICPU has ~50us initialization overhead that was inflating profiling measurements
- Move orch_stage_end capture before transition_requested_ store to avoid DEV_ALWAYS delaying the critical store
- Fix aicpu_build_graph log format (Thread=%d → Thread %d:) to match benchmark_rounds.sh parser

Both aicpu_build_graph (3 locations) and tensormap_and_ringbuffer (4 locations) are fixed.

Benchmark (tensormap_and_ringbuffer, device 10, 10 rounds trimmed avg)
Short-latency examples show ~50-80us improvement consistent with removing the init overhead from the measured interval. Longer examples show negligible change as expected.
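For readers unfamiliar with the "trimmed avg" used for these numbers, a minimal sketch follows. The PR does not say how many rounds are trimmed, so dropping one sample from each end is an assumption, and `trimmed_mean` is a hypothetical helper rather than anything from benchmark_rounds.sh.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <stdexcept>
#include <vector>

// Mean of the samples after discarding the `trim_each_side` smallest and
// largest values. The default of 1 is an assumption, not taken from the PR.
double trimmed_mean(std::vector<double> samples, std::size_t trim_each_side = 1) {
    if (samples.size() <= 2 * trim_each_side) {
        throw std::invalid_argument("not enough samples to trim");
    }
    std::sort(samples.begin(), samples.end());
    auto first = samples.begin() + static_cast<std::ptrdiff_t>(trim_each_side);
    auto last = samples.end() - static_cast<std::ptrdiff_t>(trim_each_side);
    double sum = std::accumulate(first, last, 0.0);
    return sum / static_cast<double>(last - first);
}
```

Trimming keeps a single outlier round, such as one that happens to absorb a one-time cost, from dominating the reported average.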
Testing