Change bias initialization from 'embed' to 'heads' #371

Open: csgoogle wants to merge 4 commits into main from fixbiassharding

csgoogle (Collaborator) commented Apr 6, 2026

Fix the bias sharding axis: the bias should be sharded along the output axis ('heads') rather than the input axis ('embed').
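A minimal JAX sketch of the idea, not the code from this PR. It assumes a dense layer whose kernel has logical axes `('embed', 'heads')`; the mesh axis name `tensor`, the `LOGICAL_TO_MESH` rules, and the helper `spec_for` are hypothetical and only illustrate why the bias, whose length equals the kernel's output dimension, should carry the output axis name.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# One-dimensional device mesh. The axis name "tensor" is an assumption.
devices = np.array(jax.devices())
mesh = Mesh(devices, axis_names=("tensor",))

# Hypothetical logical-axis rules: the output axis is split across devices,
# the input axis is left replicated.
LOGICAL_TO_MESH = {"embed": None, "heads": "tensor"}


def spec_for(logical_axes):
    """Translate logical axis names into a PartitionSpec (illustrative helper)."""
    return P(*(LOGICAL_TO_MESH[name] for name in logical_axes))


embed_dim = 8
out_dim = 16 * len(devices)  # divisible by the number of devices

kernel_axes = ("embed", "heads")  # kernel shape: (embed_dim, out_dim)
bias_axes_wrong = ("embed",)      # labels the bias with the input axis
bias_axes_fixed = ("heads",)      # labels the bias with the output axis

kernel = jax.device_put(
    jnp.ones((embed_dim, out_dim)),
    NamedSharding(mesh, spec_for(kernel_axes)),
)
bias = jax.device_put(
    jnp.zeros((out_dim,)),
    NamedSharding(mesh, spec_for(bias_axes_fixed)),
)

x = jnp.ones((2, embed_dim))
y = x @ kernel + bias  # bias is laid out like the kernel's output columns
```

Under these assumed rules, `bias_axes_wrong` would give the bias a layout that does not match the kernel's output columns, so the addition may need extra communication once the mesh has more than one device.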

Results

| Metric | main | fixbiassharding | Δ |
| --- | --- | --- | --- |
| Compile time | 1913.9s | 1906.4s | -7.5s |
| Inference time | 1656.4s | 1642.1s | -14.3s (-0.9%) |

Notes

  • No difference observed with tp=1 configs. The improvement only surfaces when tensor parallelism is active, because the corrected axes reduce parameter all-gather overhead in MLP layers
  • Primary motivation for this change is correctness: incorrect sharding axes can cause OOM or numerical issues at other parallelism configs
  • Larger gains expected at tp=4 or tp=8 where parameter communication is a larger fraction of step time

Video Quality Comparison

| Branch | Video |
| --- | --- |
| main | main.mp4 |
| fixbiassharding | fixbiassharding.mp4 |

PSNR/SSIM (frame-by-frame, 81 frames):

| Metric | Mean | Min | Max |
| --- | --- | --- | --- |
| PSNR (dB) | 19.37 | 18.83 | 20.17 |
| SSIM | 0.7884 | 0.7654 | 0.8043 |

The low PSNR/SSIM reflects floating-point differences that accumulate across 50 denoising steps when the sharding layout changes (bfloat16 with different collective patterns); the videos are visually identical.
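For reference, a frame-by-frame PSNR can be sketched as follows. This is an illustration under stated assumptions, not the script used for the numbers above: it assumes both videos are already decoded into arrays of shape `(frames, height, width, channels)` with 8-bit pixel values, and the function name `psnr_per_frame` is hypothetical.

```python
import jax.numpy as jnp


def psnr_per_frame(video_a, video_b, max_val=255.0):
    """Return one PSNR value (in dB) per frame.

    Both inputs are assumed to have shape (frames, height, width, channels).
    Identical frames have zero error and therefore yield infinity.
    """
    a = video_a.astype(jnp.float32)
    b = video_b.astype(jnp.float32)
    # Mean squared error over all pixels and channels of each frame.
    mse = jnp.mean((a - b) ** 2, axis=(1, 2, 3))
    return 10.0 * jnp.log10(max_val**2 / mse)
```

Mean, min, and max over the returned vector then correspond to the three columns reported in the table.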

csgoogle requested a review from entrpn as a code owner on April 6, 2026, 10:09
github-actions bot commented Apr 6, 2026
