
infra: agent backend cannot scale beyond a few concurrent users #65

@GeneralJerel

Description

Problem

The agent backend runs on a single uvicorn worker process with an in-memory checkpointer on a 512MB Render starter instance. This is a global bottleneck — not per-user. All concurrent users share the same event loop, the same memory pool, and the same 200-thread checkpoint limit.

Currently the app breaks at ~3 concurrent connections (#63). At production scale (100-1000 users), it would be effectively unusable.

Architecture bottlenecks

1. Single worker process

  • uvicorn runs with 1 worker (default) — all requests share one Python event loop
  • Each GPT-5.4 visualization call takes 10-30s
  • LangGraph has synchronous sections that block the event loop
  • Throughput: ~2-6 visualization requests/minute
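A common mitigation for the blocking sections (independent of adding workers) is to push synchronous work off the event loop. A minimal stdlib sketch, where `slow_sync_section` stands in for a synchronous LangGraph section (function names are illustrative, not from this codebase):

```python
import asyncio
import time

def slow_sync_section(payload: str) -> str:
    # Stands in for a synchronous LangGraph section that would otherwise
    # block the single shared event loop for its full duration.
    time.sleep(0.2)
    return payload.upper()

async def handle_request(payload: str) -> str:
    # asyncio.to_thread moves the blocking call onto a worker thread,
    # so other coroutines (other users' requests) keep making progress.
    return await asyncio.to_thread(slow_sync_section, payload)

async def main() -> None:
    start = time.monotonic()
    results = await asyncio.gather(*(handle_request(f"req{i}") for i in range(5)))
    elapsed = time.monotonic() - start
    # Five 0.2 s blocking calls overlap instead of serializing to ~1.0 s.
    print(results, round(elapsed, 1))

asyncio.run(main())
```

This doesn't change per-request latency (the model call still takes 10-30 s), but it stops one slow request from freezing every other connection on the loop.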

2. In-memory checkpointer (BoundedMemorySaver)

  • All conversation state stored in RAM — shared global pool of 200 threads
  • FIFO eviction: after 200 conversations across ALL users, oldest threads are silently deleted
  • Users lose conversation context mid-session with no error
  • Not thread-safe — designed for single-process async only
  • On 512MB starter plan, memory pressure builds well before 200 threads

3. No backpressure or error surfacing

  • When the backend is saturated, requests hang silently — no timeout, no error, no retry
  • Frontend shows no indication that the agent is overloaded
  • Health check at /health returns 200 even when the event loop is blocked
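One way to make the health check honest is to have it measure event-loop lag rather than return an unconditional 200. A framework-agnostic sketch (the 0.25 s threshold is illustrative):

```python
import asyncio
import time

async def loop_lag(sample_delay: float = 0.05) -> float:
    """Measure event-loop lag: schedule a short sleep and see how late it fires."""
    start = time.monotonic()
    await asyncio.sleep(sample_delay)
    return (time.monotonic() - start) - sample_delay

async def healthz() -> tuple[int, dict]:
    # A /health handler could return 503 when the loop is badly behind,
    # instead of reporting healthy while requests hang.
    lag = await loop_lag()
    if lag > 0.25:
        return 503, {"status": "overloaded", "loop_lag_s": round(lag, 3)}
    return 200, {"status": "ok", "loop_lag_s": round(lag, 3)}
```

Caveat: if the loop is fully blocked the health coroutine never runs at all, so an external request timeout on the health probe is still needed as a backstop.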

Scale projections

| Concurrent users | Behavior |
| --- | --- |
| 1-5 | Works fine |
| 10-20 | Noticeable latency; requests queue |
| 50+ | Requests time out; SSE connections drop |
| 100+ | Effectively down; health checks fail; Render restarts the service |

Proposed solution

Phase 1 — Quick wins (config changes only)

  • Add --workers 4 to uvicorn startCommand in render.yaml — multiplies throughput ~4x
  • Upgrade agent service from starter (512MB) to standard (1GB+) in render.yaml
  • Enable rate limiting (RATE_LIMIT_ENABLED=true) with reasonable limits (e.g. 20 req/min per IP)
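The Phase 1 changes might look like this in `render.yaml` (a sketch: service name and module path are illustrative, and field names follow Render's blueprint spec):

```yaml
services:
  - type: web
    name: agent            # illustrative service name
    plan: standard         # was: starter (512MB)
    startCommand: uvicorn main:app --host 0.0.0.0 --port $PORT --workers 4
    envVars:
      - key: RATE_LIMIT_ENABLED
        value: "true"
```

One interaction to note: with multiple workers, each process gets its own in-memory `BoundedMemorySaver`, so a user's follow-up request can land on a worker that has never seen their thread. That makes Phase 2 (a shared persistent checkpointer) effectively a prerequisite for correctness, not just durability.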

Phase 2 — Persistent checkpointer

  • Replace BoundedMemorySaver with PostgreSQL or SQLite async checkpointer
  • Conversation state survives restarts and doesn't consume RAM
  • No more silent thread eviction — threads persist until explicitly cleaned up
  • Render already supports managed Postgres — can add as a service in render.yaml

Phase 3 — Error handling and backpressure

  • Add frontend timeout — show error after ~30s of no response instead of hanging forever
  • Add backend concurrency limit — return 503 "busy" when at capacity rather than queuing indefinitely
  • Add connection health monitoring — detect dropped SSE connections and surface to user
  • Reuse thread IDs per browser tab (sessionStorage) to avoid creating unnecessary threads
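The "503 when at capacity" behavior can be sketched framework-agnostically with a semaphore gate (capacity and timeout values are illustrative):

```python
import asyncio

MAX_CONCURRENT = 8                    # illustrative capacity limit
_slots = asyncio.Semaphore(MAX_CONCURRENT)

async def with_backpressure(handler, *args):
    """Run handler if a slot is free; otherwise fail fast with 503
    instead of queuing the request indefinitely."""
    if _slots.locked():               # every slot is taken right now
        return 503, {"error": "agent busy, retry shortly"}
    async with _slots:
        try:
            # Bounded wait replaces the current hang-forever behavior.
            return 200, await asyncio.wait_for(handler(*args), timeout=30.0)
        except asyncio.TimeoutError:
            return 504, {"error": "agent timed out"}

async def demo_handler(x: int) -> int:
    await asyncio.sleep(0.01)
    return x * 2

status, body = asyncio.run(with_backpressure(demo_handler, 21))
print(status, body)
```

An explicit 503/504 gives the frontend something concrete to show the user and a clear signal for retry-with-backoff, instead of a silently hung request.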

Phase 4 — Horizontal scaling

  • Use Gunicorn with uvicorn workers for proper process management
  • Verify Render auto-scaling (1-3 instances) works correctly with persistent checkpointer
  • Add Redis or Postgres for shared state across instances
  • Load test at target concurrency (100+ users) to validate
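The Gunicorn step might look like this as a `render.yaml` start command (a sketch; module path and worker counts are illustrative, `uvicorn.workers.UvicornWorker` is Uvicorn's documented Gunicorn worker class):

```yaml
startCommand: >
  gunicorn main:app
  --workers 4
  --worker-class uvicorn.workers.UvicornWorker
  --timeout 120
  --graceful-timeout 30
```

Gunicorn adds worker supervision (restarting crashed or wedged workers) that bare `uvicorn --workers` does not provide.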

Key files

  • apps/agent/main.py — uvicorn config, BoundedMemorySaver(max_threads=200)
  • apps/agent/src/bounded_memory_saver.py — FIFO eviction logic
  • render.yaml — Render service config (starter plan, no worker config)
  • apps/app/src/app/api/copilotkit/route.ts — Frontend → agent connection
