A multi-agent environment (PettingZoo/Gym) for hospital lab automation
| Pillar | Goal |
|---|---|
| Environment | Pip-installable, standard multi-agent API (PettingZoo AEC or parallel). |
| Trust skeleton | Roles/permissions, signed actions, hash-chained audit log, invariants, reason codes. |
| Benchmarks | Tasks (throughput_sla, adversarial_disruption, insider_key_misuse, coord_scale, coord_risk) and baselines (scripted, MARL, LLM). The golden suite defines correctness; regression means passing the suite. Safety/throughput trade-offs are measurable. |
| Coordination | Pluggable coordination methods; coord_scale (scale stress) and coord_risk (under injection). Method–risk matrix and coordination security pack with gate thresholds; SOTA and method-class comparison. |
| Security & safety | Security attack suite (prompt injection, tool, memory, detector, coordination-under-attack); risk register bundle with evidence and gaps; coverage gate (required_bench); safety case. Evidence bundles and verify-release chain for auditability. |
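The hash-chained audit log named in the trust skeleton can be sketched as follows. This is a minimal illustration under assumed names (the `AuditLog` class and its fields are hypothetical, not the package's actual API): each entry commits to the previous entry's hash, so tampering with any entry breaks verification for the whole chain after it.

```python
import hashlib
import json


class AuditLog:
    """Append-only log where each entry commits to the previous entry's hash.

    Tampering with any stored entry invalidates the chain from that point on.
    """

    def __init__(self):
        self.entries = []
        self._last_hash = "0" * 64  # genesis value

    def append(self, action: dict) -> dict:
        record = {"action": action, "prev_hash": self._last_hash}
        payload = json.dumps(record, sort_keys=True).encode()
        record["hash"] = hashlib.sha256(payload).hexdigest()
        self._last_hash = record["hash"]
        self.entries.append(record)
        return record

    def verify(self) -> bool:
        prev = "0" * 64
        for record in self.entries:
            if record["prev_hash"] != prev:
                return False
            payload = json.dumps(
                {"action": record["action"], "prev_hash": record["prev_hash"]},
                sort_keys=True,
            ).encode()
            if hashlib.sha256(payload).hexdigest() != record["hash"]:
                return False
            prev = record["hash"]
        return True


log = AuditLog()
log.append({"agent": "ops_1", "op": "load_sample"})
log.append({"agent": "ops_2", "op": "run_qc"})
print(log.verify())  # True
log.entries[0]["action"]["op"] = "tampered"
print(log.verify())  # False: chain broken at the tampered entry
```

The real log additionally carries signed actions and reason codes; the chaining idea is the same.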
Principles
- Golden scenarios drive development — Correctness is defined as passing the golden suite; the suite is the specification for regression. It does not cover all failure modes; gaps imply gaps in assured behavior.
- Policy is data — Invariants, tokens, reason codes, catalogue, and zones live in versioned files under policy/.
- No silent failure — Missing hooks or invalid data fail loudly with reason codes.
- Evidence over claims — Security and safety are evidenced by the attack suite, coordination security pack, and risk register; required_bench cells must be covered or explicitly waived.
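To illustrate the policy-is-data principle, a versioned policy file might look like the sketch below. The file name, keys, and reason code are hypothetical; the real schemas live under policy/ and are checked by labtrust validate-policy.

```yaml
# policy/invariants.example.yaml -- hypothetical illustration, not a real schema
version: 0.1
invariants:
  - id: no_unsigned_actions
    description: Every action must carry a valid signature for its role.
    on_violation:
      reason_code: RC_UNSIGNED_ACTION
      effect: block
```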
System and threat model: Systems and threat model.
Limitation — Passing all sim tests and gates does not imply production safety. Production adds distribution shift, real adversaries, key/ops failures, and environment drift. Use sim for development and regression; production assurance is the integrator's responsibility.
| I want to... | First step |
|---|---|
| Run benchmarks only | pip install labtrust-gym[env,plots] then labtrust quick-eval |
| Add my coordination method (or task) | Extension development + entry_points; see examples/extension_example |
| Fork and customize policy | Forker guide and labtrust forker-quickstart |
| Use as a library without forking | Extension development + --profile + extension_packages in a lab profile |
| Run the full security suite | labtrust run-security-suite; needs .[env]; use --skip-system-level when env is not installed |
Stable surface for extensions: Public API.
From PyPI (env + plots for benchmarks and quick-eval):

```shell
pip install labtrust-gym[env,plots]
labtrust --version
labtrust quick-eval
```

Runs one episode each of throughput_sla, adversarial_disruption, and multi_site_stat with scripted baselines; the summary and logs land under ./labtrust_runs/.
From source (development):

```shell
git clone https://github.com/fraware/LabTrust-Gym.git
cd LabTrust-Gym
pip install -e ".[dev]"
labtrust validate-policy
pytest -q
```

Full stack (benchmarks, studies, plots):

```shell
pip install -e ".[dev,env,plots]"
labtrust run-benchmark --task throughput_sla --episodes 5 --out results.json
labtrust reproduce --profile minimal
```

New to the repo? See the Forker guide and Quick demos for customizing and running commands end-to-end.
Extending without forking
- Option A — Fork and customize via partner overlay and policy. Forker guide.
- Option B — Install labtrust-gym and ship your own pip package (domains, tasks, coordination methods, etc. via register_* or entry_points; --profile and extension_packages). Extension development – Option B.
Optional extras
| Extra | Purpose |
|---|---|
| [env] | PettingZoo/Gymnasium (benchmarks and full security suite including coord_pack_ref) |
| [plots] | Matplotlib and Pillow (study figures, data tables) |
| [llm_openai] | OpenAI live backend (openai_live) |
| [llm_anthropic] | Anthropic live backend (anthropic_live) |
| [marl] | Stable-Baselines3 (PPO train/eval) |
| [marl_hpo] | Optuna (HPO for PPO) |
| [docs] | MkDocs + mkdocstrings |
Full security suite (including coord_pack_ref) requires [env]; use --skip-system-level when env is not installed.
Benchmarks run in one of three modes: deterministic | llm_offline | llm_live (Live LLM). Defaults are offline (no network, no API cost).
```mermaid
flowchart LR
    Run["Run benchmark"]
    Run --> D["deterministic (default)"]
    Run --> O["llm_offline"]
    Run --> L["llm_live + --allow-network"]
    D --> NoNet["No network"]
    O --> NoNet
    L --> Net["Network / API"]
```
| Mode | Network | Agents | Use case |
|---|---|---|---|
| deterministic | No | Scripted only | CI, regression, reproduce, paper artifact (default) |
| llm_offline | No | LLM interface, deterministic backend only | Offline LLM evaluation, no API calls |
| llm_live | Yes (opt-in) | Live OpenAI/Ollama | Interactive or cost-accepting runs; requires --allow-network |
Set mode with --pipeline-mode; for live LLM add --allow-network or LABTRUST_ALLOW_NETWORK=1.
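The opt-in network gate can be illustrated with a short sketch. The function name is hypothetical; the mode names and the LABTRUST_ALLOW_NETWORK variable are the ones documented above.

```python
import os

MODES = {"deterministic", "llm_offline", "llm_live"}


def network_allowed(mode: str, allow_network_flag: bool = False) -> bool:
    """Only llm_live may touch the network, and only with explicit opt-in."""
    if mode not in MODES:
        raise ValueError(f"unknown pipeline mode: {mode}")
    if mode != "llm_live":
        return False  # deterministic and llm_offline never use the network
    return allow_network_flag or os.environ.get("LABTRUST_ALLOW_NETWORK") == "1"


print(network_allowed("deterministic"))  # False
print(network_allowed("llm_live", allow_network_flag=True))  # True
```

The default is deliberately closed: forgetting the flag degrades to an error or offline behavior, never to silent network use.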
```shell
labtrust quick-eval
```

Output: a markdown summary (throughput, violations, blocked counts) and logs under ./labtrust_runs/quick_eval_<timestamp>/. Use --seed and --out-dir to customize.
Canonical demos: labtrust forker-quickstart, labtrust quick-eval, labtrust run-summary --run <dir>, labtrust run-official-pack (add --include-coordination-pack for coordination and security evidence). Quick demos lists "if you want to see X, run Y."
Example agents: Example experiments; agents and configs in examples/. Optional notebook examples/quick_eval.ipynb (requires .[env,plots]). External agent:

```shell
labtrust eval-agent --agent 'examples.external_agent_demo:SafeNoOpAgent' --task throughput_sla --episodes 2 --out out.json
```

Put CLI outputs in labtrust_runs/ or --out. Exit codes, minimal smoke args, and output paths: CLI output contract. Commands are smoke-tested in tests/test_cli_smoke_matrix.py.
| Command | Description |
|---|---|
| validate-policy | Validate policy YAML/JSON. --domain <domain_id> merges base + policy/domains/<domain_id>/; --partner <id> for overlay. |
| forker-quickstart | One-command forker: validate-policy, coordination pack, lab report, risk register export. Forker guide. |
| Command | Description |
|---|---|
| quick-eval | One episode each of throughput_sla, adversarial_disruption, multi_site_stat; summary + logs under ./labtrust_runs/. |
| run-benchmark | Run tasks (throughput_sla, stat_insertion, qc_cascade, adversarial_disruption, multi_site_stat, insider_key_misuse, coord_scale, coord_risk). Requires --task, --out. Options: --episodes, --seed, --coord-method, --injection, --scale, --timing, --llm-backend, --llm-agents, --always-step-timing, --approval-hook. Agent-centric: --agent-driven, --multi-agentic; optional --use-parallel-multi-agentic. Live LLM, Scale limits. |
| run-summary | One-line stats for a run dir. --run <dir>, --format json. |
| eval-agent | Benchmark with external agent (e.g. examples.external_agent_demo:SafeNoOpAgent or PPO via LABTRUST_PPO_MODEL and labtrust_gym.baselines.marl.ppo_agent:PPOAgent). |
| bench-smoke | One episode per task (throughput_sla, stat_insertion, qc_cascade). |
| determinism-report | Run twice; assert v0.2 metrics and episode log hash. Requires --task, --episodes, --seed, --out. |
| train-ppo, eval-ppo | PPO train/eval (.[marl]). Writes train_config.json. Optional HPO: .[marl_hpo]. MARL baselines. |
| Command | Description |
|---|---|
| export-receipts | Receipt.v0.1 and EvidenceBundle.v0.1 from episode log. |
| export-fhir | HL7 FHIR R4 Bundle from receipts (data-absent-reason, no placeholder IDs). FHIR export. |
| validate-fhir | Validate bundle codes: --bundle <path> --terminology <value_set_json> [--strict]. FHIR export. |
| verify-bundle | Verify one EvidenceBundle.v0.1. --strict-fingerprints for coordination, memory, rbac, tool_registry. |
| verify-release | Verify release: EvidenceBundles, risk register, RELEASE_MANIFEST hashes. --strict-fingerprints for releases. Trust verification. |
| build-release-manifest | Write RELEASE_MANIFEST.v0.1.json into --release-dir. Run after export-risk-register; then verify-release. |
| ui-export | UI-ready zip (index, events, receipts_index, reason_codes). UI data contract. |
| Command | Description |
|---|---|
| run-security-suite | Smoke/full; writes SECURITY/attack_results.json. Options include --agent-driven-mode (e.g. single). |
| safety-case | Generate SAFETY_CASE/. Risk register. |
| run-official-pack | Official pack (baselines, coordination, security, safety, transparency). --out <dir>, --seed-base, --include-coordination-pack for coordination_pack/ and lab report. Official benchmark pack. |
| Command | Description |
|---|---|
| export-risk-register | RiskRegisterBundle.v0.1 to --out; --runs (repeatable) for evidence dirs. Gaps as first-class. Risk register. |
| build-risk-register-bundle | Same bundle to explicit path. |
| validate-coverage | Required_bench evidenced or waived. --strict to fail on missing. |
| Command | Description |
|---|---|
| run-coordination-study | Scale × method × injection; summary_coord.csv, pareto.md, SOTA leaderboard. Coordination studies. |
| run-coordination-security-pack | Regression pack. --out, --matrix-preset (hospital_lab, hospital_lab_full, full_matrix, exploratory_*). pack_results/, pack_summary.csv, pack_gate.md. Security attack suite. |
| summarize-coordination | SOTA leaderboard, method-class comparison. |
| recommend-coordination-method | COORDINATION_DECISION.v0.1.json from run dir. |
| build-coordination-matrix | CoordinationMatrix v0.1 from llm_live run. |
| run-study | Study from spec (--spec, --out). |
| make-plots | Figures and data tables from study run. |
| Command | Description |
|---|---|
| reproduce | Minimal/full results + figures (--profile minimal or full). |
| package-release | Release artifact: receipts, FHIR, MANIFEST, BENCHMARK_CARD. --profile paper_v0.1 for paper-ready. Paper provenance. |
| generate-official-baselines | Core tasks with official baselines. Registry: benchmarks/baseline_registry.v0.1.yaml. |
| summarize-results | summary_v0.2.csv, summary_v0.3.csv, summary.md (bounded memory). Metrics contract. |
| serve | HTTP server (auth, rate limits). Security controls. |
| Path | Description |
|---|---|
| policy/ | YAML/JSON: schemas, emits, invariants, tokens, reason_codes, zones, catalogue, coordination, golden, official, llm, partners, risks (risk_registry, waivers, required_bench_plan.v0.1). labtrust validate-policy. |
| src/labtrust_gym/ | Package: config, engine/, envs/, baselines/, benchmarks/, policy/, security/, studies/, export/, online/, runner/, cli/. |
| tests/ | Pytest: golden suite, policy, benchmarks, coordination, risk_injections, studies, export, online, CLI smoke (test_cli_smoke_matrix.py). |
| benchmarks/ | Baseline registry, official baselines (v0.1, v0.2). |
| examples/ | Example agents (external_agent_demo, scripted_ops_agent, llm_agent_mock_demo, etc.). |
| docs/ | MkDocs: architecture, benchmarks, coordination, contracts, getting started, security, LLM, MARL. Forker guide. docs/assets/ — repo logo (Logo.png). |
| scripts/ | run_hospital_lab_full_pipeline.py (orchestrator; --include-coordination-pack, --providers), check_llm_backends_live.py, quickstart, run_required_bench_matrix, extract_paper_claims_snapshot, build_release_fixture, build_viewer_data_from_release, run_external_reviewer_checks. |
| tests/fixtures/ui_fixtures/ | Minimal results, episode log, evidence bundle for offline UI. |
Cite using CITATION.cff.
| Action | Command / reference |
|---|---|
| Reproduce | labtrust reproduce --profile minimal — Reproduce. |
| Release artifact | labtrust package-release --profile minimal --out /tmp/labtrust_release. Paper-ready: --profile paper_v0.1 — Paper provenance. |
| Research and audit | Paper-ready artifact + verify-release — Quick demos, Paper provenance. |
| Standardized evaluation | Benchmark card, official baselines v0.2 — Use cases and impact. |
| Official baselines | v0.2 in benchmarks/baselines_official/v0.2/. Regenerate: labtrust generate-official-baselines --out benchmarks/baselines_official/v0.2/ --episodes 3 --seed 123 --force. Compare: labtrust summarize-results --in benchmarks/baselines_official/v0.2/results/ your_results.json --out /tmp/compare. |
| Cite | CITATION.cff or LabTrust-Gym: a multi-agent environment for hospital lab automation (pathology lab / blood sciences) with a trust skeleton. https://github.com/fraware/LabTrust-Gym. |
Apache-2.0.
