From 20ad93e844b8fd3010ccdc35c690aecda45ef920 Mon Sep 17 00:00:00 2001
From: Rekha Thottan
Date: Mon, 23 Mar 2026 07:56:31 -0700
Subject: [PATCH 1/3] docs: Update Agent Health docs with dashboard details and unhide trace visualization

- Unhide trace visualization page from sidebar
- Add dashboard overview description with leaderboard details
- Expand experiment reports section with report contents
- Add section linking Agent Health to Observability Stack
---
 .../content/docs/agent-health/evaluations/experiments.md | 6 ++++--
 .../src/content/docs/agent-health/getting-started.md | 9 +++++----
 .../src/content/docs/agent-health/index.md | 6 ++++++
 3 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/docs/starlight-docs/src/content/docs/agent-health/evaluations/experiments.md b/docs/starlight-docs/src/content/docs/agent-health/evaluations/experiments.md
index 5c37ceb5..09ec3cf5 100644
--- a/docs/starlight-docs/src/content/docs/agent-health/evaluations/experiments.md
+++ b/docs/starlight-docs/src/content/docs/agent-health/evaluations/experiments.md
@@ -54,7 +54,7 @@ Create experiments with multiple runs using different agents or models. The UI p
 
 ## Generating reports
 
-Generate downloadable reports from experiment results:
+Generate downloadable reports from experiment results for sharing or tracking progress over time.
 
 ```bash
 # HTML report (default)
@@ -67,4 +67,6 @@ npx @opensearch-project/agent-health report -b "My Benchmark" -f pdf -o report.p
 npx @opensearch-project/agent-health report -b "My Benchmark" -f json --stdout
 ```
 
-Reports include judge reasoning, accuracy scores, and improvement suggestions for each test case.
+Reports include a summary of each run (agent, model, pass rate, average accuracy), a per-test-case comparison table across runs, the judge's reasoning and improvement suggestions for each evaluation, and full trajectory steps showing what the agent did.
+
+Use `--runs` to include specific runs, or omit it to include all runs in the experiment.
diff --git a/docs/starlight-docs/src/content/docs/agent-health/getting-started.md b/docs/starlight-docs/src/content/docs/agent-health/getting-started.md
index 05999ad7..675715fc 100644
--- a/docs/starlight-docs/src/content/docs/agent-health/getting-started.md
+++ b/docs/starlight-docs/src/content/docs/agent-health/getting-started.md
@@ -76,10 +76,11 @@ Sample data IDs start with `demo-` prefix and are read-only.
 
 ![Agent Health Dashboard](/docs/images/agent-health/dashboard.png)
 
-The main dashboard displays:
-- Active experiments and their status
-- Recent evaluation runs
-- Quick statistics on pass/fail rates
+Agent Health opens to the Leaderboard Overview — an at-a-glance view of agent performance across all experiments, pre-loaded with sample data.
+
+The top section shows a performance trend chart tracking metrics over time. Use the dropdowns to switch between pass rate, cost, tokens, or latency, and adjust the time range (7 days, 30 days, or all time).
+
+The bottom section is a sortable metrics table showing every experiment × agent combination with columns for run count, pass rate, latency, and cost. Click any column header to sort. Click an experiment or agent name to filter the trend chart to just that selection — active filters appear which can be cleared as required. Each row links to the experiment’s detailed runs view.
 
 ## Run your first evaluation
 
diff --git a/docs/starlight-docs/src/content/docs/agent-health/index.md b/docs/starlight-docs/src/content/docs/agent-health/index.md
index 9fb2bbf4..c56811b7 100644
--- a/docs/starlight-docs/src/content/docs/agent-health/index.md
+++ b/docs/starlight-docs/src/content/docs/agent-health/index.md
@@ -36,6 +36,12 @@ Opens http://localhost:4001 with pre-loaded sample data for exploration.
 
 Agent Health uses a client-server architecture where all clients (UI, CLI) access storage through a unified HTTP API. The server handles agent communication via pluggable connectors and proxies LLM judge calls to AWS Bedrock.
 
+## Agent Health and the Observability Stack
+
+Agent Health is a UI and CLI-based evaluation tool for scoring agent quality through LLM judge comparison, running experiments, and generating reports. By default it stores test cases, experiments, runs, and evaluation results as local files.
+
+When pointed at an OpenSearch cluster, including the one running in the [Observability Stack](/docs/get-started/overview/), Agent Health stores test cases, experiments, runs, and evaluation results in OpenSearch indices instead of local files. If the same cluster is receiving OpenTelemetry traces through the stack pipeline, Agent Health can also read those traces and display them alongside evaluation results, connecting what the agent did with how well it performed.
+
 ## Supported connectors
 
 | Connector | Protocol | Description |

From dfb2763a6f072324d76e33ff95bbb047a0dd2b5d Mon Sep 17 00:00:00 2001
From: Rekha Thottan
Date: Mon, 23 Mar 2026 08:05:06 -0700
Subject: [PATCH 2/3] fix

---
 .../src/content/docs/agent-health/getting-started.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/starlight-docs/src/content/docs/agent-health/getting-started.md b/docs/starlight-docs/src/content/docs/agent-health/getting-started.md
index 675715fc..80e7ba6f 100644
--- a/docs/starlight-docs/src/content/docs/agent-health/getting-started.md
+++ b/docs/starlight-docs/src/content/docs/agent-health/getting-started.md
@@ -80,7 +80,7 @@ Agent Health opens to the Leaderboard Overview — an at-a-glance view of agent
 
 The top section shows a performance trend chart tracking metrics over time. Use the dropdowns to switch between pass rate, cost, tokens, or latency, and adjust the time range (7 days, 30 days, or all time).
 
-The bottom section is a sortable metrics table showing every experiment × agent combination with columns for run count, pass rate, latency, and cost. Click any column header to sort. Click an experiment or agent name to filter the trend chart to just that selection — active filters appear which can be cleared as required. Each row links to the experiment’s detailed runs view.
+The bottom section is a sortable metrics table showing every experiment and agent combination with columns for run count, pass rate, latency, and cost. Click any column header to sort. Click an experiment or agent name to filter the trend chart to just that selection — active filters appear which can be cleared as required. Each row links to the experiment’s detailed runs view.
 
 ## Run your first evaluation
 

From 7346063cf639314b2504c499bbc12846666d4288 Mon Sep 17 00:00:00 2001
From: Rekha Thottan
Date: Mon, 23 Mar 2026 09:31:59 -0700
Subject: [PATCH 3/3] docs: Add see it in action video section to Agent Health overview

---
 docs/starlight-docs/src/content/docs/agent-health/index.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/docs/starlight-docs/src/content/docs/agent-health/index.md b/docs/starlight-docs/src/content/docs/agent-health/index.md
index c56811b7..3d53ea09 100644
--- a/docs/starlight-docs/src/content/docs/agent-health/index.md
+++ b/docs/starlight-docs/src/content/docs/agent-health/index.md
@@ -16,6 +16,10 @@ npx @opensearch-project/agent-health@latest
 
 Opens http://localhost:4001 with pre-loaded sample data for exploration.
 
+## See it in action
+
+
+
 ## Who uses Agent Health
 
 - **AI teams** building autonomous agents (RCA, customer support, data analysis)