A multi-agent environment (PettingZoo/Gym) for hospital lab automation
| Pillar | Goal |
|---|---|
| Environment | Pip-installable, standard multi-agent API (PettingZoo AEC or parallel). |
| Trust skeleton | Roles/permissions, signed actions, hash-chained audit log, invariants, reason codes. |
| Benchmarks | Tasks (throughput_sla, adversarial_disruption, insider_key_misuse, coord_scale, coord_risk) and baselines (scripted, MARL, LLM). The golden suite defines correctness; regression means passing the suite. Safety/throughput trade-offs are measurable. |
| Coordination | Pluggable coordination methods; coord_scale (scale stress) and coord_risk (under injection). Method–risk matrix and coordination security pack with gate thresholds; SOTA and method-class comparison. |
| Security & safety | Security attack suite (prompt injection, tool, memory, detector, coordination-under-attack); risk register bundle with evidence and gaps; coverage gate (required_bench); safety case. Evidence bundles and verify-release chain for auditability. |
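The hash-chained audit log named in the trust skeleton can be sketched as follows. This is a minimal illustration under assumed names (the `AuditLog` class and its fields are hypothetical, not the package's actual API): each entry commits to the previous entry's hash, so tampering with any entry breaks verification for the whole chain after it.

```python
import hashlib
import json


class AuditLog:
    """Append-only log where each entry commits to the previous entry's hash.

    Tampering with any stored entry invalidates the chain from that point on.
    """

    def __init__(self):
        self.entries = []
        self._last_hash = "0" * 64  # genesis value

    def append(self, action: dict) -> dict:
        record = {"action": action, "prev_hash": self._last_hash}
        payload = json.dumps(record, sort_keys=True).encode()
        record["hash"] = hashlib.sha256(payload).hexdigest()
        self._last_hash = record["hash"]
        self.entries.append(record)
        return record

    def verify(self) -> bool:
        prev = "0" * 64
        for record in self.entries:
            if record["prev_hash"] != prev:
                return False
            payload = json.dumps(
                {"action": record["action"], "prev_hash": record["prev_hash"]},
                sort_keys=True,
            ).encode()
            if hashlib.sha256(payload).hexdigest() != record["hash"]:
                return False
            prev = record["hash"]
        return True


log = AuditLog()
log.append({"agent": "ops_1", "op": "load_sample"})
log.append({"agent": "ops_2", "op": "run_qc"})
print(log.verify())  # True
log.entries[0]["action"]["op"] = "tampered"
print(log.verify())  # False: chain broken at the tampered entry
```

The real log additionally carries signed actions and reason codes; the chaining idea is the same.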
Principles
- Golden scenarios drive development — Correctness is defined as passing the golden suite; the suite is the specification for regression. It does not cover all failure modes; gaps imply gaps in assured behavior.
- Policy is data — Invariants, tokens, reason codes, catalogue, and zones live in versioned files under policy/.
- No silent failure — Missing hooks or invalid data fail loudly with reason codes.
- Evidence over claims — Security and safety are evidenced by the attack suite, coordination security pack, and risk register; required_bench cells must be covered or explicitly waived.
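To illustrate the policy-is-data principle, a versioned policy file might look like the sketch below. The file name, keys, and reason code are hypothetical; the real schemas live under policy/ and are checked by labtrust validate-policy.

```yaml
# policy/invariants.example.yaml -- hypothetical illustration, not a real schema
version: 0.1
invariants:
  - id: no_unsigned_actions
    description: Every action must carry a valid signature for its role.
    on_violation:
      reason_code: RC_UNSIGNED_ACTION
      effect: block
```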
System and threat model: Systems and threat model.
Limitation — Passing all sim tests and gates does not imply production safety. Production adds distribution shift, real adversaries, key/ops failures, and environment drift. Use sim for development and regression; production assurance is the integrator's responsibility.
| I want to... | First step |
|---|---|
| Run benchmarks only | pip install labtrust-gym[env,plots] then labtrust quick-eval |
| Add my coordination method (or task) | Extension development + entry_points; see examples/extension_example |
| Fork and customize policy | Forker guide and labtrust forker-quickstart |
| Use as a library without forking | Extension development + --profile + extension_packages in a lab profile |
| Run the full security suite | labtrust run-security-suite; needs .[env]; use --skip-system-level when env is not installed |
Stable surface for extensions: Public API.
From PyPI (env + plots for benchmarks and quick-eval):

```shell
pip install labtrust-gym[env,plots]
labtrust --version
labtrust quick-eval
```

Runs one episode each of throughput_sla, adversarial_disruption, and multi_site_stat with scripted baselines; the summary and logs land under ./labtrust_runs/.
From source (development):

```shell
git clone https://github.com/fraware/LabTrust-Gym.git
cd LabTrust-Gym
pip install -e ".[dev]"
labtrust validate-policy
pytest -q
```

Full stack (benchmarks, studies, plots):

```shell
pip install -e ".[dev,env,plots]"
labtrust run-benchmark --task throughput_sla --episodes 5 --out results.json
labtrust reproduce --profile minimal
```

New to the repo? See the Forker guide and Quick demos for customizing and running commands end-to-end.
Extending without forking
- Option A — Fork and customize via partner overlay and policy. Forker guide.
- Option B — Install labtrust-gym and ship your own pip package (domains, tasks, coordination methods, etc. via register_* or entry_points; --profile and extension_packages). Extension development – Option B.
Optional extras
| Extra | Purpose |
|---|---|
| [env] | PettingZoo/Gymnasium (benchmarks and full security suite including coord_pack_ref) |
| [plots] | Matplotlib and Pillow (study figures, data tables) |
| [llm_openai] | OpenAI live backend (openai_live) |
| [llm_anthropic] | Anthropic live backend (anthropic_live) |
| [marl] | Stable-Baselines3 (PPO train/eval) |
| [marl_hpo] | Optuna (HPO for PPO) |
| [docs] | MkDocs + mkdocstrings |
Full security suite (including coord_pack_ref) requires [env]; use --skip-system-level when env is not installed.
Benchmarks run in one of three modes: deterministic | llm_offline | llm_live (Live LLM). Defaults are offline (no network, no API cost).
```mermaid
flowchart LR
    Run["Run benchmark"]
    Run --> D["deterministic (default)"]
    Run --> O["llm_offline"]
    Run --> L["llm_live + --allow-network"]
    D --> NoNet["No network"]
    O --> NoNet
    L --> Net["Network / API"]
```
| Mode | Network | Agents | Use case |
|---|---|---|---|
| deterministic | No | Scripted only | CI, regression, reproduce, paper artifact (default) |
| llm_offline | No | LLM interface, deterministic backend only | Offline LLM evaluation, no API calls |
| llm_live | Yes (opt-in) | Live OpenAI/Ollama | Interactive or cost-accepting runs; requires --allow-network |
Set mode with --pipeline-mode; for live LLM add --allow-network or LABTRUST_ALLOW_NETWORK=1.
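The opt-in network gate can be illustrated with a short sketch. The function name is hypothetical; the mode names and the LABTRUST_ALLOW_NETWORK variable are the ones documented above.

```python
import os

MODES = {"deterministic", "llm_offline", "llm_live"}


def network_allowed(mode: str, allow_network_flag: bool = False) -> bool:
    """Only llm_live may touch the network, and only with explicit opt-in."""
    if mode not in MODES:
        raise ValueError(f"unknown pipeline mode: {mode}")
    if mode != "llm_live":
        return False  # deterministic and llm_offline never use the network
    return allow_network_flag or os.environ.get("LABTRUST_ALLOW_NETWORK") == "1"


print(network_allowed("deterministic"))  # False
print(network_allowed("llm_live", allow_network_flag=True))  # True
```

The default is deliberately closed: forgetting the flag degrades to an error or offline behavior, never to silent network use.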
```shell
labtrust quick-eval
```

Output: a markdown summary (throughput, violations, blocked counts) and logs under ./labtrust_runs/quick_eval_<timestamp>/. Use --seed and --out-dir to customize.
Canonical demos: labtrust forker-quickstart, labtrust quick-eval, labtrust run-summary --run <dir>, labtrust run-official-pack (add --include-coordination-pack for coordination and security evidence). Quick demos lists "if you want to see X, run Y."
Example agents: Example experiments; agents and configs in examples/. Optional notebook examples/quick_eval.ipynb (requires .[env,plots]). External agent:

```shell
labtrust eval-agent --agent 'examples.external_agent_demo:SafeNoOpAgent' --task throughput_sla --episodes 2 --out out.json
```

Put CLI outputs in labtrust_runs/ or --out. Exit codes, minimal smoke args, and output paths: CLI output contract. Commands are smoke-tested in tests/test_cli_smoke_matrix.py.
| Command | Description |
|---|---|
| validate-policy | Validate policy YAML/JSON. --domain <domain_id> merges base + policy/domains/<domain_id>/; --partner <id> for overlay. |
| forker-quickstart | One-command forker: validate-policy, coordination pack, lab report, risk register export. Forker guide. |
| Command | Description |
|---|---|
| quick-eval | One episode each of throughput_sla, adversarial_disruption, multi_site_stat; summary + logs under ./labtrust_runs/. |
| run-benchmark | Run tasks (throughput_sla, stat_insertion, qc_cascade, adversarial_disruption, multi_site_stat, insider_key_misuse, coord_scale, coord_risk). Requires --task, --out. Options: --episodes, --seed, --coord-method, --injection, --scale, --timing, --llm-backend, --llm-agents, --always-step-timing, --approval-hook. Agent-centric: --agent-driven, --multi-agentic; optional --use-parallel-multi-agentic. Live LLM, Scale limits. |
| run-summary | One-line stats for a run dir. --run <dir>, --format json. |
| eval-agent | Benchmark with external agent (e.g. examples.external_agent_demo:SafeNoOpAgent or PPO via LABTRUST_PPO_MODEL and labtrust_gym.baselines.marl.ppo_agent:PPOAgent). |
| bench-smoke | One episode per task (throughput_sla, stat_insertion, qc_cascade). |
| determinism-report | Run twice; assert v0.2 metrics and episode log hash. Requires --task, --episodes, --seed, --out. |
| train-ppo, eval-ppo | PPO train/eval (.[marl]). Writes train_config.json. Optional HPO: .[marl_hpo]. MARL baselines. |
| Command | Description |
|---|---|
| export-receipts | Receipt.v0.1 and EvidenceBundle.v0.1 from episode log. |
| export-fhir | HL7 FHIR R4 Bundle from receipts (data-absent-reason, no placeholder IDs). FHIR export. |
| validate-fhir | Validate bundle codes: --bundle <path> --terminology <value_set_json> [--strict]. FHIR export. |
| verify-bundle | Verify one EvidenceBundle.v0.1. --strict-fingerprints for coordination, memory, rbac, tool_registry. |
| verify-release | Verify release: EvidenceBundles, risk register, RELEASE_MANIFEST hashes. --strict-fingerprints for releases. Trust verification. |
| build-release-manifest | Write RELEASE_MANIFEST.v0.1.json into --release-dir. Run after export-risk-register; then verify-release. |
| ui-export | UI-ready zip (index, events, receipts_index, reason_codes). UI data contract. |
| Command | Description |
|---|---|
| run-security-suite | Smoke/full; writes SECURITY/attack_results.json. Options include --agent-driven-mode (e.g. single). |
| safety-case | Generate SAFETY_CASE/. Risk register. |
| run-official-pack | Official pack (baselines, coordination, security, safety, transparency). --out <dir>, --seed-base, --include-coordination-pack for coordination_pack/ and lab report. Official benchmark pack. |
| Command | Description |
|---|---|
| export-risk-register | RiskRegisterBundle.v0.1 to --out; --runs (repeatable) for evidence dirs. Gaps as first-class. Risk register. |
| build-risk-register-bundle | Same bundle to explicit path. |
| validate-coverage | Required_bench evidenced or waived. --strict to fail on missing. |
| Command | Description |
|---|---|
| run-coordination-study | Scale × method × injection; summary_coord.csv, pareto.md, SOTA leaderboard. Coordination studies. |
| run-coordination-security-pack | Regression pack. --out, --matrix-preset (hospital_lab, hospital_lab_full, full_matrix, exploratory_*). pack_results/, pack_summary.csv, pack_gate.md. Security attack suite. |
| summarize-coordination | SOTA leaderboard, method-class comparison. |
| recommend-coordination-method | COORDINATION_DECISION.v0.1.json from run dir. |
| build-coordination-matrix | CoordinationMatrix v0.1 from llm_live run. |
| run-study | Study from spec (--spec, --out). |
| make-plots | Figures and data tables from study run. |
| Command | Description |
|---|---|
| reproduce | Minimal/full results + figures (--profile minimal or full). |
| package-release | Release artifact: receipts, FHIR, MANIFEST, BENCHMARK_CARD. --profile paper_v0.1 for paper-ready. Paper provenance. |
| generate-official-baselines | Core tasks with official baselines. Registry: benchmarks/baseline_registry.v0.1.yaml. |
| summarize-results | summary_v0.2.csv, summary_v0.3.csv, summary.md (bounded memory). Metrics contract. |
| serve | HTTP server (auth, rate limits). Security controls. |
| Path | Description |
|---|---|
| policy/ | YAML/JSON: schemas, emits, invariants, tokens, reason_codes, zones, catalogue, coordination, golden, official, llm, partners, risks (risk_registry, waivers, required_bench_plan.v0.1). labtrust validate-policy. |
| src/labtrust_gym/ | Package: config, engine/, envs/, baselines/, benchmarks/, policy/, security/, studies/, export/, online/, runner/, cli/. |
| tests/ | Pytest: golden suite, policy, benchmarks, coordination, risk_injections, studies, export, online, CLI smoke (test_cli_smoke_matrix.py). |
| benchmarks/ | Baseline registry, official baselines (v0.1, v0.2). |
| examples/ | Example agents (external_agent_demo, scripted_ops_agent, llm_agent_mock_demo, etc.). |
| docs/ | MkDocs: architecture, benchmarks, coordination, contracts, getting started, security, LLM, MARL. Forker guide. docs/assets/ — repo logo (Logo.png). |
| scripts/ | run_hospital_lab_full_pipeline.py (orchestrator; --include-coordination-pack, --providers), check_llm_backends_live.py, quickstart, run_required_bench_matrix, extract_paper_claims_snapshot, build_release_fixture, build_viewer_data_from_release, run_external_reviewer_checks. |
| tests/fixtures/ui_fixtures/ | Minimal results, episode log, evidence bundle for offline UI. |
Cite using CITATION.cff.
| Action | Command / reference |
|---|---|
| Reproduce | labtrust reproduce --profile minimal — Reproduce. |
| Release artifact | labtrust package-release --profile minimal --out /tmp/labtrust_release. Paper-ready: --profile paper_v0.1 — Paper provenance. |
| Research and audit | Paper-ready artifact + verify-release — Quick demos, Paper provenance. |
| Standardized evaluation | Benchmark card, official baselines v0.2 — Use cases and impact. |
| Official baselines | v0.2 in benchmarks/baselines_official/v0.2/. Regenerate: labtrust generate-official-baselines --out benchmarks/baselines_official/v0.2/ --episodes 3 --seed 123 --force. Compare: labtrust summarize-results --in benchmarks/baselines_official/v0.2/results/ your_results.json --out /tmp/compare. |
| Cite | CITATION.cff or LabTrust-Gym: a multi-agent environment for hospital lab automation (pathology lab / blood sciences) with a trust skeleton. https://github.com/fraware/LabTrust-Gym. |
Apache-2.0.
