test: 2.3x faster test suite via direct MPI execution and reduced timesteps #1331

Open

sbryngelson wants to merge 28 commits into MFlowCode:master from sbryngelson:threading

Conversation


@sbryngelson (Member) commented Mar 20, 2026

Summary

This PR speeds up the MFC test suite by 1.2-2.4x across the CI runners (see the table below) through two main optimizations, and fixes several pre-existing bugs surfaced by the new test execution path.

Test suite speedup

  • Halve timesteps: Reduce Nt from 50 to 25, cutting simulation time for all 560+ tests. Golden files regenerated.
  • Direct MPI execution: Instead of spawning ./mfc.sh run per test (which re-bootstraps Python, re-parses case files, generates shell scripts, runs syscheck), the test runner now generates .inp files in-memory and calls the MPI launcher directly. Per-test overhead drops from ~12s to ~2s.

Self-hosted runner test execution times (excluding SLURM queue)

| Runner | Master | PR | Speedup |
| --- | --- | --- | --- |
| Phoenix CPU | 65m | 29m | 2.2x |
| Phoenix GPU-ACC | 71m | 29m | 2.4x |
| Phoenix GPU-OMP | 69m | 28m | 2.4x |
| Frontier CPU | 35m | 19m | 1.8x |
| Frontier GPU-ACC (avg shard) | 11m | 8m | 1.4x |
| Frontier GPU-OMP (avg shard) | 11m | 7m | 1.6x |
| Frontier-AMD GPU-OMP (avg shard) | 12m | 10m | 1.2x |
| Frontier-AMD CPU | 30m | 18m | 1.6x |

Bug fixes

Bypassing the batch templates for direct MPI execution required replicating several system-specific behaviors. This surfaced (and fixed) pre-existing issues:

  • s_axis optional args (m_boundary_common.fpp): pb_in/mv_in were non-optional in s_axis() but passed as absent optionals from s_populate_variables_buffers in post_process. CCE dereferenced the null descriptor causing a segfault on 3D cylindrical post-process tests. Added optional attribute and present() guards.
  • Stack limit (case.py): Set soft stack limit to hard limit via resource.setrlimit at module load, matching ulimit -s unlimited in batch templates. Fixes FPE in viscous cylindrical tests on CCE.
  • macOS setrlimit: Wrapped in try/except for macOS where the kernel restricts stack limit changes.
  • GPU srun flags (mpi_config.gpu_flags): Added per-template gpu_flags list for systems that need explicit --gpus-per-task/--gpu-bind in srun (tuo, santis). Frontier service partition provides GPUs without GRES, so no srun GPU flags needed there.
  • GPU environment variables (mpi_config.gpu_env): Added gpu_env dict for env vars conditional on GPU builds (e.g., MPICH_GPU_SUPPORT_ENABLED=1 for Frontier CCE). CPU builds explicitly set MPICH_GPU_SUPPORT_ENABLED=0 to override inherited values. Frontier-AMD (flang) does not set this variable as it uses AFAR/ROCm GPU-aware MPI instead of Cray GTL.
  • System-specific env vars: Moved HSA_XNACK, FI_CXI_*, NV_ACC_*, NVCOMPILER_ACC_* into mpi_config.env in respective templates (tuo, santis) so the direct test runner inherits them.
  • srun topology: Use --nodes 1 --ntasks-per-node N instead of --ntasks N for srun, matching template dispatch and providing correct node topology for GPU binding.
  • RDMA golden files: Regenerated RDMA MPI golden files that were missed during the Nt=50→25 regeneration (they require --rdma-mpi flag to generate).
  • GPU build detection: Fixed bool(ARG("gpu")) returning True for --no-gpu (since bool("no") is True in Python).
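The stack-limit fix above can be sketched as follows. This is a minimal illustration of the idea (raise the soft limit to the hard limit at import time, tolerate macOS refusing the change), not MFC's exact code in case.py:

```python
# Mirror `ulimit -s unlimited` from the batch templates: set the soft
# stack limit to the hard limit when the module is loaded. The
# try/except covers macOS, where the kernel restricts RLIMIT_STACK
# changes and setrlimit can raise.
import resource

try:
    _soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
    resource.setrlimit(resource.RLIMIT_STACK, (_hard, _hard))
except (ValueError, OSError):
    pass  # e.g. macOS refuses to raise the stack limit
```

Running this at module load means every subprocess the test runner spawns inherits the raised limit, which is what the CCE viscous cylindrical binaries need.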

Test plan

  • All 560+ tests pass on Frontier CCE GPU (OpenACC) — verified locally on service partition
  • All 5 previously-failing 3D cylindrical tests pass (post-process segfault + viscous FPE)
  • 3D RDMA MPI 2-rank test passes on Frontier GPU
  • CPU tests pass on Frontier (no GTL error)
  • macOS CI passes (setrlimit try/except)
  • Full CI green: GitHub (ubuntu, macOS), Phoenix (CPU, GPU-ACC, GPU-OMP), Frontier (CPU, GPU-ACC, GPU-OMP), Frontier-AMD (CPU, GPU-OMP)

sbryngelson and others added 7 commits March 20, 2026 12:58
Instead of spawning `./mfc.sh run` per test (which re-bootstraps the
Python toolchain, re-parses the case file, generates a shell script,
runs syscheck, etc.), the test runner now:
- Generates .inp files directly from the in-memory parameter dict
- Calls the MPI launcher on pre_process/simulation binaries directly
Each .mako template now defines a `mpi_config` dict (binary, flags,
env) that both the template and the test runner read from, keeping
MPI configuration in one place per system.
Full 1D suite (156 tests, -j 10): 3m33s -> 1m03s (3.4x faster)
Full test suite (551 tests, -j 10): 29m26s -> 21m27s (27% less time)
Per-test overhead: ~12s -> ~2s for simple tests
- Reduce Nt from 50 to 25 (halves all test timesteps)
- Reduce example test timestep cap from 50 to 25
- Cap IBM STL 3D grid from 99x49x49 to 29x29x29
- Regenerate all golden files
Full suite (551 tests, -j 10): 21m27s -> 12m54s
The coverage cache builder had its own hardcoded MPI launcher detection
that ignored the -c <computer> passthrough. On systems where flux is
required (e.g. Tuolumne), it would silently use mpirun instead.
Now uses _get_mpi_config() and _mpi_cmd() from case.py, consistent
with TestCase.run().
- _extract_mpi_config: use 'mpi_config =' pattern to avoid matching
  usage sites; wrap ast.literal_eval in try/except for malformed
  templates; use find() instead of index() to avoid ValueError
- TestCase.run(): add try/except for TimeoutExpired, SubprocessError,
  OSError around subprocess.run(); add 3600s timeout
- _get_mpi_config: defensive 'ARG("--") or []' for None safety
- test.py: add missing returncode check for --test-all post_process
  path (pre-existing bug)
The helpers.mako mpi_cmd hardcoded 'flux run' but tuo.mako's
mpi_config flags already include 'run'. Let flags carry the
subcommand, matching how _mpi_cmd in Python handles it.
post_process calls s_populate_variables_buffers without the optional
pb_in/mv_in arguments, but the QBMM guard (qbmm .and. .not. polytropic)
could still evaluate to true, causing access to an absent optional
argument. This is undefined behavior caught by gfortran -fcheck=bounds.

Add present(pb_in) to all 6 call sites of s_qbmm_extrapolation.
Post_process computes the number of save files from t_stop/t_save.
With t_step_stop=25, the simulation produces fewer saves than the
original t_stop implies, causing 'lustre_N.dat is missing' errors.
Clamp t_stop = t_save so post_process only reads saves that exist.
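The mismatch is simple arithmetic; a toy illustration with made-up values (not from any real case file):

```python
# post_process expects t_stop/t_save save files, but the shortened run
# produces fewer, so the later saves are missing. Clamping t_stop to
# t_save leaves exactly one save interval (indices 0 and 1), which a
# run of any length produces.
t_save = 0.04
t_stop_original = 0.4
expected_saves = round(t_stop_original / t_save)  # post_process looks for 10 saves
t_stop_clamped = t_save
assert round(t_stop_clamped / t_save) == 1        # now only saves 0 and 1 are read
```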
The test_all path runs post_process as a best-effort smoke test.
On master, post_process failures (e.g. missing restart files for
adaptive-dt examples) are silently ignored. Match this behavior
to avoid surfacing pre-existing post_process issues.
Adaptive-dt examples use t_stop/t_save instead of t_step_stop.
With large adaptive steps, save indices can be non-consecutive,
causing post_process to abort on missing restart files.

Clamp t_stop = t_save so only save indices 0 and 1 are produced
(always consecutive). Regenerate golden files for affected tests.

Also re-add returncode check for --test-all post_process pass,
now that the underlying issue is fixed.
Missed this adaptive-dt example in the previous golden file
regeneration.
@sbryngelson sbryngelson marked this pull request as ready for review March 21, 2026 03:05
@codecov

codecov bot commented Mar 21, 2026

Codecov Report

❌ Patch coverage is 0% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 44.73%. Comparing base (6120810) to head (aa261e5).
⚠️ Report is 1 commit behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/common/m_boundary_common.fpp | 0.00% | 1 Missing and 4 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1331      +/-   ##
==========================================
- Coverage   45.01%   44.73%   -0.28%     
==========================================
  Files          70       70              
  Lines       20562    20590      +28     
  Branches     1962     1962              
==========================================
- Hits         9255     9210      -45     
- Misses      10179    10254      +75     
+ Partials     1128     1126       -2     

☔ View full report in Codecov by Sentry.
@sbryngelson sbryngelson marked this pull request as draft March 21, 2026 14:04
Spencer Bryngelson added 3 commits March 21, 2026 14:38
Two root causes for 5 failing 3D Cylindrical tests on Frontier CCE:

1. Post-process segfault (non-viscous tests 301B9153, 128954AD):
   s_axis() declared pb_in/mv_in as non-optional, but they are passed
   as absent optionals from s_populate_variables_buffers in post_process.
   CCE dereferences the null descriptor; gfortran tolerates it. Added
   'optional' attribute and present() guards, matching peer routines
   like s_symmetry and s_periodic.

2. Simulation FPE (viscous tests 07C33719, 939D6718, 09623DE3):
   The direct MPI execution path in the test runner bypasses the batch
   template which sets 'ulimit -s unlimited'. CCE-compiled viscous
   cylindrical binaries overflow the default stack. Set the soft stack
   limit to the hard limit at module load via resource.setrlimit.
@sbryngelson sbryngelson marked this pull request as ready for review March 21, 2026 21:44
Spencer Bryngelson added 8 commits March 21, 2026 19:21
The direct MPI execution path was missing GPU binding flags that
the batch templates add outside of mpi_config. For multi-rank GPU
tests (e.g. RDMA MPI), this caused Bus errors on Frontier because
srun didn't bind GPUs to tasks.

Add --gpus-per-task/--gpu-bind to _mpi_cmd for srun, jsrun, and
flux when GPUs are in use, matching what each template does.

Also regenerate RDMA MPI golden files (FA4D8FEF, 1B300F28,
2C9844EF) which were missed when Nt was halved from 50 to 25,
since they require --rdma-mpi to be generated.
The Frontier templates use '--nodes 1 --ntasks-per-node N' for srun,
not '--ntasks N'. With bare --ntasks, --gpu-bind closest lacks the
node topology needed for correct GPU binding, causing Bus errors
on multi-rank GPU tests (3D RDMA MPI).
The Frontier template sets MPICH_GPU_SUPPORT_ENABLED=1 for GPU runs,
but this was outside mpi_config so the direct test runner never set
it. Without it, GPU-aware MPI (RDMA) cannot handle device pointers,
causing Bus errors on multi-rank GPU tests.
The direct MPI execution path was missing GPU binding flags that
the batch templates add outside of mpi_config. For multi-rank GPU
tests (e.g. RDMA MPI), this caused Bus errors on Frontier because
srun didn't bind GPUs to tasks.

Add --gpus-per-task/--gpu-bind to _mpi_cmd for srun, jsrun, and
flux when GPUs are in use, matching what each template does.

Use ARG('gpu') (build flag) instead of the gpus device set to
determine GPU mode. On Frontier, no specific GPU IDs are passed
(-g), but the build is GPU-enabled (--gpu acc). MPICH_GPU_SUPPORT
and srun GPU flags must be set based on the build configuration,
not detected device IDs.

Also regenerate RDMA MPI golden files (FA4D8FEF, 1B300F28,
2C9844EF) which were missed when Nt was halved from 50 to 25,
since they require --rdma-mpi to be generated.
The direct MPI execution path was missing GPU binding flags and
environment variables that the batch templates set outside of
mpi_config. For multi-rank GPU tests (e.g. RDMA MPI), this caused
Bus errors because srun lacked GPU task binding and MPICH lacked
GPU support enablement.

Add gpu_flags list to mpi_config in templates that need explicit
srun/flux GPU flags (frontier, frontier_amd, tuo, santis). Systems
that handle GPU binding at the SBATCH level (delta, bridges2, oscar,
phoenix, hipergator) don't need gpu_flags.

Move environment variables (MPICH_GPU_SUPPORT_ENABLED, HSA_XNACK,
FI_CXI_*, NV_ACC_*, NVCOMPILER_ACC_*) into mpi_config.env in each
template so the test runner inherits them per-system. All env vars
are benign for CPU-only runs.

Also use --nodes 1 --ntasks-per-node N for srun (matching templates)
instead of bare --ntasks N, which lacked node topology for GPU bind.
Pre_process and post_process are CPU-only at runtime even in GPU
builds. Passing --gpus-per-task to their srun calls consumes GPU
GRES slots unnecessarily, causing 'Invalid generic resource'
errors when running parallel tests with --test-all on allocations
with limited GPU GRES (e.g. Frontier service partition with -n 8).
Frontier service partition provides GPUs to all tasks without GRES.
Adding --gpus-per-task caused 'Invalid generic resource' errors
when running parallel --test-all tests. Empty gpu_flags for Frontier
lets the allocation handle GPU access. Tuo and Santis keep their
gpu_flags as their schedulers require explicit GPU binding.
MPICH_GPU_SUPPORT_ENABLED=1 in mpi_config.env was applied
unconditionally, breaking CPU test runs on Frontier where the GTL
library is not linked. The template conditionally sets it based on
gpu_enabled, so the test runner must do the same.

Add gpu_env dict to mpi_config for env vars that should only be set
for GPU builds. Move MPICH_GPU_SUPPORT_ENABLED from env to gpu_env
in frontier, frontier_amd, and tuo templates.
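Taken together, the commits above describe a per-template `mpi_config` dict with separate `flags`/`gpu_flags` and `env`/`gpu_env` fields. A minimal sketch of that shape, with illustrative values (the real frontier.mako contents may differ) and hypothetical helper names:

```python
# Field names follow the PR description; values are illustrative.
mpi_config = {
    "binary": "srun",
    # Explicit node topology, matching the batch templates; a bare
    # --ntasks N lacks the per-node layout that --gpu-bind needs.
    "flags": ["--nodes", "1", "--ntasks-per-node"],
    # Frontier: the allocation provides GPUs without GRES, so no srun
    # GPU flags; tuo/santis would list --gpus-per-task/--gpu-bind here.
    "gpu_flags": [],
    # Always applied; 0 overrides a value inherited from a GPU build.
    "env": {"MPICH_GPU_SUPPORT_ENABLED": "0"},
    # Applied only for GPU builds, overriding env.
    "gpu_env": {"MPICH_GPU_SUPPORT_ENABLED": "1"},
}

def mpi_cmd(cfg: dict, ntasks: int, gpu: bool) -> list:
    # Compose the launcher invocation; GPU flags only for GPU builds.
    cmd = [cfg["binary"], *cfg["flags"], str(ntasks)]
    if gpu:
        cmd += cfg["gpu_flags"]
    return cmd

def mpi_env(cfg: dict, gpu: bool) -> dict:
    # Base env first, then GPU-only overrides for GPU builds.
    env = dict(cfg["env"])
    if gpu:
        env.update(cfg["gpu_env"])
    return env
```

Keeping both the batch template and the test runner reading from this one dict is what keeps the two execution paths in sync per system.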
Spencer Bryngelson added 3 commits March 22, 2026 20:41
The parent environment can inherit MPICH_GPU_SUPPORT_ENABLED=1 from
a previous GPU build on the same CI runner. The template explicitly
sets it to 0 for CPU builds, so mpi_config.env must do the same to
override any inherited value. gpu_env then overrides back to 1 for
GPU builds.
bool('no') is True in Python, so --no-gpu was incorrectly treated
as a GPU build. Check against explicit non-GPU values instead.
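The pitfall behind this fix is plain Python truthiness; a tiny illustration (`is_gpu_build` and its accepted values are hypothetical, not MFC's actual check):

```python
# Any non-empty string is truthy, so bool("no") is True: the value
# stored for --no-gpu looked like a GPU build under bool(ARG("gpu")).
assert bool("no") is True

def is_gpu_build(gpu_arg) -> bool:
    # Compare against explicit non-GPU values instead of truthiness
    # (illustrative list of "off" values).
    return gpu_arg not in (None, False, "", "no", "none")
```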
AMD flang on Frontier uses AFAR/ROCm GPU-aware MPI, not Cray's GTL.
The module loading script explicitly unsets MPICH_GPU_SUPPORT_ENABLED.
Setting it via gpu_env causes 'GTL library is not linked' errors.
Only frontier.mako (CCE) needs this variable.
@sbryngelson
Member Author

@danieljvickers @wilfonba @anandrdbz @ChrisZYJ

Curious about your opinions on this PR. All new golden files. Tests are much faster now: 3D tests made somewhat smaller, half the time steps, and direct MPI execution rather than the overhead of hundreds of calls to ./mfc.sh run.

Cuts the test suite cost by ~3x.

@wilfonba
Contributor

> @danieljvickers @wilfonba @anandrdbz @ChrisZYJ
>
> Curious about your opinion on this PR. All new goldenfiles. Tests are much faster now. 3D tests made somewhat smaller, 1/2 the time steps, and direct MPI execution rather than overhead from hundreds of calls to ./mfc.sh run.
>
> Cuts the test suite cost by ~3x.

How much impact does doing only the direct MPI execution, without changing all the golden files (so keeping the time steps the same), have? If the time-steps are reduced some of the tolerances may also need to be reduced to keep the same sensitivity to bugs.

@sbryngelson
Member Author

> > @danieljvickers @wilfonba @anandrdbz @ChrisZYJ
> > Curious about your opinion on this PR. All new goldenfiles. Tests are much faster now. 3D tests made somewhat smaller, 1/2 the time steps, and direct MPI execution rather than overhead from hundreds of calls to ./mfc.sh run.
> > Cuts the test suite cost by ~3x.
>
> How much impact does doing only the direct MPI execution, without changing all the golden files (so keeping the time steps the same), have? If the time-steps are reduced some of the tolerances may also need to be reduced to keep the same sensitivity to bugs.

Yeah, I noticed this too. Reducing the overhead mostly helps the 1D cases; the 3D ones are dominated by runtime, not startup.
