UPSTREAM PR #1318: chore: replace rand and srand at the library level (#76)
Overview

Analysis of stable-diffusion.cpp compared 49,765 functions across two versions, identifying 107 modified functions, 18 new functions, and 0 removed functions. The changes stem from a single commit replacing C-style rand/srand with C++ random number generation for improved thread safety and reproducibility.

Binaries Analyzed:

Overall performance impact is negligible; power consumption changes under 0.2% indicate effective performance neutrality despite individual function variations.

Function Analysis

- std::vector::end() (build.bin.sd-cli): Throughput time increased 306.67% (59.77 ns → 243.07 ns, +183.30 ns). Response time increased 223.91% (81.86 ns → 265.16 ns, +183.30 ns). This STL function regression appears compiler-driven, likely from disabled inlining. While called frequently (411 uses), the absolute impact remains modest.
- std::vector<sd_lora_t>::end() (build.bin.sd-server): Throughput time improved 75.41% (243.07 ns → 59.78 ns, -183.29 ns). Response time improved 69.44% (263.94 ns → 80.65 ns, -183.29 ns). Compiler optimizations improved this LoRA parameter iteration function.
- ggml_threadpool_params_default (build.bin.sd-cli): Throughput time improved 58.40% (217.48 ns → 90.47 ns, -127.01 ns). Response time improved 45.46% (279.79 ns → 152.59 ns, -127.20 ns). GGML submodule optimizations reduced threadpool initialization overhead.
- ggml_compute_forward_map_custom3 (build.bin.sd-server): Throughput time improved 35.05% (219.25 ns → 142.41 ns, -76.84 ns). Response time improved 32.91% (233.99 ns → 156.98 ns, -77.01 ns). Custom operation handling benefits from the more efficient RNG implementation.
- apply_binary_op (build.bin.sd-cli): Throughput time improved 6.15% (1286.26 ns → 1207.13 ns, -79.13 ns). Response time improved 4.26% (2362.80 ns → 2262.11 ns, -100.69 ns). This frequently called tensor addition operation shows a modest but meaningful improvement.

Other analyzed functions showed mixed compiler-driven changes in STL operations (string construction, regex handling, vector reallocation) ranging from -50% to +113%, but absolute impacts remained under 100 ns per call.

Additional Findings

Core ML inference operations (matrix multiplication, convolution, attention) remain unchanged. Performance variations are predominantly compiler artifacts affecting peripheral functions (initialization, CLI parsing, memory management) rather than inference hot paths. The RNG replacement achieves its thread-safety and reproducibility goals without compromising computational efficiency, as confirmed by near-zero net power consumption changes.

🔎 Full breakdown: Loci Inspector
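The analysis above refers to replacing the global rand()/srand() state with C++ <random> facilities. A minimal sketch of that pattern (the class name Rng and its methods are hypothetical, not taken from the PR) gives each user an explicit, independently seeded engine, so reseeding inside the library cannot perturb the application's random sequence:

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical sketch: an engine owned per instance instead of the
// process-global state behind rand()/srand().
class Rng {
public:
    explicit Rng(uint64_t seed) : engine_(seed) {}

    // Replacement for the common `rand() % n` idiom, without modulo bias.
    uint32_t next_below(uint32_t n) {
        std::uniform_int_distribution<uint32_t> dist(0, n - 1);
        return dist(engine_);
    }

    // Replacement for srand(): reseeds only this instance.
    void reseed(uint64_t seed) { engine_.seed(seed); }

private:
    std::mt19937_64 engine_;  // deterministic for a given seed, unlike rand()
};
```

Because each engine is local, two instances built with the same seed reproduce the same sequence, which is the reproducibility property the commit targets.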
Force-pushed from dd19ab8 to 98460a7.
Note: Source pull request: leejet/stable-diffusion.cpp#1318
These functions have global state, so they could interfere with application behavior. It would arguably be more correct to use std::random_device, but that seemed a bit overkill for this.
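The trade-off mentioned here can be sketched as two seeding strategies (both helper names are illustrative, not from the PR): std::random_device draws nondeterministic entropy on most platforms, while a clock-based seed mirrors the lighter srand(time(NULL)) idiom being replaced.

```cpp
#include <chrono>
#include <cstdint>
#include <random>

// "More correct" option: seed from the platform entropy source.
inline uint64_t seed_from_device() {
    std::random_device rd;  // nondeterministic on most implementations
    return (static_cast<uint64_t>(rd()) << 32) | rd();
}

// Lighter option: seed from the clock, analogous to srand(time(NULL)).
inline uint64_t seed_from_clock() {
    return static_cast<uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
}
```

Either seed feeds a deterministic engine such as std::mt19937_64, so reproducibility is preserved whenever the caller records and reuses the seed.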