Agent Systems Evaluation: Monolithic vs Ensemble

Empirical comparison of two agent architectures for document synthesis:

  • Monolithic Agent: Single LLM approach (fast, simple)
  • Ensemble Agent: Multi-agent system with recursive orchestration (higher quality, iterative refinement)

Evaluation uses MLflow tracking, LLM-as-a-judge scoring, and NLP metrics (BERTScore, ROUGE).
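
For orientation, this is roughly what those NLP metrics compute; a minimal sketch assuming the commonly used bert-score and rouge-score packages, not necessarily the repo's exact calls:

# Illustrative only: compares a generated summary against a reference text.
from bert_score import score as bert_score
from rouge_score import rouge_scorer

candidate = "The paper proposes a multi-agent synthesis pipeline."
reference = "The authors introduce a pipeline of cooperating agents for synthesis."

# BERTScore: token-level similarity in contextual embedding space
P, R, F1 = bert_score([candidate], [reference], lang="en")
print(f"BERTScore F1: {F1.item():.3f}")

# ROUGE: n-gram (rouge1) and longest-common-subsequence (rougeL) overlap
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print(scorer.score(reference, candidate))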

Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Configure Environment

cp .env.example .env
# Defaults are configured for local Ollama

3. Pull Model

ollama pull qwen2.5:7b

4. Run Test Evaluation

python evaluate.py --test
# Fast test with 1 paper, completes in ~5-10 minutes
# Use --agents-model=gemini to test with Google Gemini instead of Ollama

5. View Results

mlflow ui
# Open http://localhost:5000

Usage

Test Mode (Recommended First Run)

python evaluate.py --test
# Or with Gemini:
python evaluate.py --test --agents-model=gemini
  • Processes 1 paper, 1 task
  • Completes in ~5-10 minutes
  • Verifies setup works

Full Evaluation

python evaluate.py
# Or with Gemini:
python evaluate.py --agents-model=gemini
  • Processes all 10 papers, 3 tasks
  • Takes 1-2 hours
  • Generates complete comparison

Model Selection

# Use local Ollama (default, free)
python evaluate.py --agents-model=ollama

# Use Google Gemini (requires API key)
python evaluate.py --agents-model=gemini
  • Judges always use Gemini for consistency
  • MLflow experiment names get a _gemini suffix when Gemini agents are used
  • See CLI_USAGE.md for a detailed usage guide

Resume from Cache

If interrupted, simply rerun the command. Already-processed documents load from data/cache/ instantly.

Clear Cache

rm -rf data/cache/summaries/* data/cache/ensemble_summaries/*

Project Structure

agent-systems-eval/
├── README.md                 # Public documentation
├── MASTER_README.md          # Comprehensive private docs
├── monolithic.py             # Single LLM agent
├── ensemble.py               # Multi-agent ensemble (CrewAI Flows)
├── evaluate.py               # MLflow evaluation framework
├── utils.py                  # Shared utilities
├── rate_limits.py            # API rate limiter
├── llm/                      # LLM client abstraction
│   ├── ollama.py             # Ollama implementation
│   ├── gemini.py             # Gemini implementation
│   └── factory.py            # Client factory
├── data/
│   ├── source_documents/     # PDF inputs
│   ├── tasks/                # Task definitions
│   └── cache/                # Cached summaries
└── mlruns/                   # MLflow tracking data
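
The llm/ package hides the provider behind a common interface. A generic sketch of the factory pattern it implies, with stand-in classes rather than the repo's actual implementations:

# Generic client-factory sketch; class bodies are placeholders, not llm/ code.
from typing import Protocol

class LLMClient(Protocol):
    def complete(self, prompt: str) -> str: ...

class OllamaClient:
    def complete(self, prompt: str) -> str:
        return "..."  # would call the local Ollama server here

class GeminiClient:
    def complete(self, prompt: str) -> str:
        return "..."  # would call the Gemini API here

def make_client(provider: str) -> LLMClient:
    # mirrors --agents-model {ollama,gemini}
    clients = {"ollama": OllamaClient, "gemini": GeminiClient}
    return clients[provider]()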

Key Features

  • Two Agent Architectures: Compare monolithic vs multi-agent approaches
  • Recursive Orchestration: Ensemble uses CrewAI Flows for iterative refinement
  • MLflow Tracking: Complete experiment management and comparison
  • LLM-as-a-Judge: Automated quality evaluation (groundedness, adherence, completeness)
  • NLP Metrics: BERTScore and ROUGE for quantitative analysis
  • Map-Reduce Processing: Efficient handling of large documents with caching (see the sketch after this list)
  • PDF Support: Processes academic papers directly
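
The map-reduce step and the cache layout combine roughly as follows; a minimal sketch, not the repo's actual code, with complete() standing in for whichever LLM client is configured:

# Illustrative map-reduce summarization with on-disk caching.
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("data/cache/summaries")

def complete(prompt: str) -> str:
    return prompt[:200]  # stand-in; replace with a real LLM client call

def map_reduce_summary(text: str, chunk_size: int = 8000) -> str:
    key = hashlib.sha256(text.encode()).hexdigest()
    cached = CACHE_DIR / f"{key}.json"
    if cached.exists():  # resume-from-cache: processed docs load instantly
        return json.loads(cached.read_text())["summary"]
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    parts = [complete(f"Summarize:\n{c}") for c in chunks]                # map
    summary = complete("Combine into one summary:\n" + "\n".join(parts))  # reduce
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cached.write_text(json.dumps({"summary": summary}))
    return summary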

Expected Results

Monolithic Agent:

  • ✅ Fast (~5-15 seconds per task)
  • ✅ Low token usage
  • ✅ Good for straightforward synthesis

Ensemble Agent:

  • ✅ Higher quality scores (~15-25% improvement)
  • ✅ Adaptive iteration (orchestrator decides when ready; see the sketch after this list)
  • ✅ Full iteration history logged
  • ⚠️ Slower (~2-3x latency)
  • ⚠️ Higher token usage (negligible cost when running locally with Ollama)
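
The adaptive iteration above is the core of the ensemble design. The real implementation is a CrewAI Flow in ensemble.py, but the pattern reduces to a draft/critique/revise loop, roughly:

# Minimal sketch of the iterate-until-ready pattern; the function arguments
# are stand-ins for the CrewAI agents in ensemble.py.
def refine(task: str, draft_fn, critique_fn, max_iters: int = 3) -> str:
    draft = draft_fn(task)
    for _ in range(max_iters):
        verdict = critique_fn(task, draft)  # orchestrator decides when ready
        if verdict["ready"]:
            break
        draft = draft_fn(f"{task}\nRevise using feedback: {verdict['feedback']}")
    return draft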

Configuration

CLI Arguments

  • --agents-model {ollama,gemini}: Choose model provider for agents (default: ollama)
  • -t, --test: Run in test mode (1 paper, 1 task)

Run python evaluate.py --help for all options.

Environment Variables

All of the following are read from .env:

Required (defaults shown):

  • OLLAMA_MODEL: Model name for Ollama agents (default: qwen2.5:7b)
  • OLLAMA_NUM_CTX: Context window (default: 32768)
  • CREWAI_MODEL: Model for Ensemble when using Ollama (default: openai/qwen2.5:7b)
  • JUDGE_MODEL: MLflow judge - always Gemini (default: gemini:/gemini-2.5-flash-lite)

Required for Gemini agents (--agents-model=gemini):

  • GEMINI_API_KEY: Your Gemini API key

Note: When using --agents-model=gemini, agents are hardcoded to gemini-2.5-flash-lite (and the gemini/gemini-2.5-flash-lite form for CrewAI), overriding OLLAMA_MODEL and CREWAI_MODEL.
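
Putting the defaults together, a minimal .env for local Ollama agents might look like this (the key value is a placeholder):

OLLAMA_MODEL=qwen2.5:7b
OLLAMA_NUM_CTX=32768
CREWAI_MODEL=openai/qwen2.5:7b
JUDGE_MODEL=gemini:/gemini-2.5-flash-lite
# Required when using --agents-model=gemini:
GEMINI_API_KEY=your-key-here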

See .env.example for full configuration.

Adding Custom Tasks

  1. Add PDF/text files to data/source_documents/
  2. Edit data/tasks/synthesis_tasks.json (a full-file example follows this list):
{
  "task_id": "custom_1",
  "task_description": "Your task description...",
  "expected_elements": ["Element 1", "Element 2"]
}
  3. Run python evaluate.py
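
The snippet in step 2 shows a single task object; presumably synthesis_tasks.json holds a list of them, along the lines of:

[
  {
    "task_id": "custom_1",
    "task_description": "Your task description...",
    "expected_elements": ["Element 1", "Element 2"]
  },
  {
    "task_id": "custom_2",
    "task_description": "Another task...",
    "expected_elements": ["Element A"]
  }
]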

Documentation

  • README.md (this file): Quick start and essential usage
  • CLI_USAGE.md: Command-line interface guide with all options and examples
  • MASTER_README.md: Comprehensive private documentation with:
    • Detailed architecture and implementation notes
    • Troubleshooting and debugging guides
    • Performance optimization tips
    • Technical deep dives

Requirements

  • Python 3.10+
  • Ollama (for local inference)
  • See requirements.txt for Python dependencies

Optional: Google Gemini API key (if using --agents-model=gemini)

License

MIT License

Contributing

Contributions welcome! Please submit a Pull Request.
