Conversation

Review updated until commit 0c32eb6

Description

Relevant files:
- Enhancement
- Configuration changes
- Tests
PR Reviewer Guide

Here are some key observations to aid the review process:
- 🧪 PR contains tests
- ⚡ Recommended focus areas for review: Potential Division by Zero

!test
Greptile Summary

This PR implements a TMA (Tensor Memory Accelerator) transpose auto-scheduler for GPU kernels. The implementation adds a new scheduler that uses TMA load/store operations with swizzled shared memory to optimize transpose operations.

Key Changes:

Issues Found:

Confidence Score: 4/5

Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[TransposeScheduler::computeHeuristics] --> B{TmaTranspose enabled?}
    B -->|Yes| C[transpose::tma::getTransposeHeuristics]
    B -->|No| D[transpose::non_tma::getTransposeHeuristics]
    C --> E{Returns valid params?}
    E -->|Yes| F[Use TMA scheduler]
    E -->|No| D
    D --> G[Return non-TMA params]
    F --> H[scheduleTranspose with TMA]
    H --> I[Step 1: TMA Tiling<br/>Split and reorder for BIDx, tile_2, tile_1]
    I --> J[Step 2: Schedule TMA Store<br/>Set Bulk parallel on tile dims]
    J --> K[Step 3: Schedule Input Shared Memory<br/>Apply XOR swizzle for TMA load]
    K --> L[Step 4: Schedule Register TVs<br/>Split by chunks_per_thread and elements_per_chunk<br/>Set TIDx and Unroll parallelization]
    L --> M[inlineMost]
    C --> N[Calculate Heuristics]
    N --> O[Compute max_input_dtype_size and n_input]
    O --> P[tile_size2 = 128 / max_input_dtype_size]
    P --> Q{n_input == 1?}
    Q -->|Yes| R[tile_size1 = tile_size2 * 2<br/>target_bdimx = 256]
    Q -->|No| S[tile_size1 = tile_size2<br/>target_bdimx = 128]
    R --> T[chunks_per_thread = tile_size1 * 8 / target_bdimx]
    S --> T
    T --> U[elements_per_chunk = 16 / max_input_dtype_size]
```
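The heuristic arithmetic in the flowchart can be sketched as a standalone function. This is a hypothetical reconstruction for illustration, not the PR's actual code: the struct and function names mirror the flowchart, but the real nvFuser implementation differs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the heuristic arithmetic shown in the
// flowchart; field names mirror the PR's parameters.
struct TmaTransposeParams {
  int64_t tile_size1 = 0;
  int64_t tile_size2 = 0;
  int64_t chunks_per_thread = 0;
  int64_t elements_per_chunk = 0;
};

TmaTransposeParams computeTmaHeuristics(
    int64_t max_input_dtype_size, int64_t n_input) {
  TmaTransposeParams p;
  // One tile row spans 128 bytes of shared memory.
  p.tile_size2 = 128 / max_input_dtype_size;
  // With a single input, use a taller tile and more threads per block.
  p.tile_size1 = (n_input == 1) ? p.tile_size2 * 2 : p.tile_size2;
  const int64_t target_bdimx = (n_input == 1) ? 256 : 128;
  // bdimx = tile_size1 * 8 / chunks_per_thread, so for the usual
  // dtype sizes (1..16 bytes) chunks_per_thread lands in [1, 8].
  p.chunks_per_thread = p.tile_size1 * 8 / target_bdimx;
  // Each chunk is a 16-byte (128-bit) vectorized access.
  p.elements_per_chunk = 16 / max_input_dtype_size;
  return p;
}
```

For example, a 4-byte dtype with a single input yields `tile_size2 = 32`, `tile_size1 = 64`, `chunks_per_thread = 2`, and `elements_per_chunk = 4`.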
Last reviewed commit: 0c32eb6 |
Additional Comments (2)

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

!test
```cpp
// the inner most dim, range is [1, 8]
// bdimx = tile_size1 * 8 / chunks_per_thread
const int64_t target_bdimx = (n_input == 1) ? 256 : 128;
tparams->chunks_per_thread = tparams->tile_size1 * 8 / target_bdimx;
```
Potential division by zero if `target_bdimx` were zero, though that cannot happen here since it is only ever assigned the constants 256 or 128.
```cpp
// the inner most dim, range is [1, 8]
// bdimx = tile_size1 * 8 / chunks_per_thread
const int64_t target_bdimx = (n_input == 1) ? 256 : 128;
tparams->chunks_per_thread = tparams->tile_size1 * 8 / target_bdimx;
```
If `tile_size1` is 0 (which could happen if `max_input_dtype_size > 128`), `chunks_per_thread` will also be 0, causing issues with the split at line 217.
```suggestion
NVF_CHECK(tparams->tile_size1 > 0, "tile_size1 must be positive");
tparams->chunks_per_thread = tparams->tile_size1 * 8 / target_bdimx;
```
```cpp
if (max_output_dims == 0 && max_input_dims == 0) {
  return;
}
```
Consider adding validation or logging when both `max_output_dims` and `max_input_dims` are 0 before the early return.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
The heuristics are a basic first version; current performance numbers are in this doc.