
Fix DB connection exhaustion (max_client_conn) by reducing pool sizes#497

Merged
iscekic merged 3 commits into main from fix/reduce-db-connection-pool-sizes on Feb 24, 2026

Conversation

iscekic (Contributor) commented on Feb 24, 2026

Summary

Fixes `no more connections allowed (max_client_conn)` errors caused by the aggregate connection count across all services exceeding the Supabase PgBouncer limit (3000).

  • CF workers (cloud-agent, cloud-agent-next, git-token-service): Replace pg.Pool(max:100) with fresh pg.Client per operation, matching the Hyperdrive-recommended pattern already used by kiloclaw and webhook-agent-ingest. Hyperdrive handles connection pooling at the infrastructure level — using a driver-level pool on top is an anti-pattern that holds unnecessary connections open.
  • session-ingest: Reduce pool max from 5 to 1 (Hyperdrive pools for us).
  • Vercel Next.js app (src/lib/drizzle.ts): Reduce pool max from 100 to 10 per pool, and apply POSTGRES_MAX_QUERY_TIME as statement_timeout (was validated to exist but never actually enforced on the pool).
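The Pool-to-Client migration for the CF workers can be sketched roughly as follows. This is a minimal sketch, not the workers' actual code: the `DbClient` interface and `withClient` helper are hypothetical names, and in the real services the factory would construct a `pg.Client` from the Hyperdrive connection string (e.g. `() => new Client({ connectionString: env.HYPERDRIVE.connectionString })`).

```typescript
// Minimal interface covering the pg.Client surface the sketch needs.
interface DbClient {
  connect(): Promise<void>;
  query(sql: string): Promise<unknown>;
  end(): Promise<void>;
}

// One fresh client per operation: Hyperdrive pools at the infrastructure
// level, so the worker opens a single connection, runs the operation, and
// always closes it — instead of holding a driver-level pool of 100 open.
async function withClient<T>(
  makeClient: () => DbClient,
  op: (client: DbClient) => Promise<T>,
): Promise<T> {
  const client = makeClient();
  await client.connect();
  try {
    return await op(client);
  } finally {
    // Always release the connection, even if the operation throws.
    await client.end();
  }
}
```

The `finally`-based cleanup mirrors the `client.end()` pattern the review below confirms for the real workers.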

Root Cause

At peak, the Supabase PgBouncer dashboard showed 2715/3000 pooler connections. The aggregate max across services was:

| Source | Old pool max | New |
| --- | --- | --- |
| Vercel primary pool | 100 × N instances | 10 × N |
| Vercel replica pool | 100 × N instances | 10 × N |
| cloud-agent | 100 × M isolates (via Hyperdrive) | 1 per query (Client) |
| cloud-agent-next | 100 × M isolates (via Hyperdrive) | 1 per query (Client) |
| git-token-service | 100 × M isolates (via Hyperdrive) | 1 per query (Client) |
| session-ingest | 5 per request (via Hyperdrive) | 1 per request |

Evidence from Axiom

Axiom logs from the incident window (2026-02-24 10:00–11:30 UTC) confirm the diagnosis:

Traffic volume: ~110,000 requests/minute (~1,800/sec) steady-state across 20+ Vercel regions (fra1, bom1, cdg1, arn1, sin1, cpt1, iad1, dxb1, lhr1, hkg1, sfo1, gru1, syd1, pdx1, cle1, etc.).

Error burst at 10:15–10:25 UTC: ~20,000 5xx errors in 10 minutes, across all routes — consistent with a database-level connection exhaustion (not an endpoint-specific bug):

| Route | 5xx errors (10:15–10:25) |
| --- | --- |
| /api/openrouter/[...path] | 5,716 |
| /api/upload-cli-session-blob-v2 | 4,608 |
| /api/profile/balance | 4,048 |
| /api/fim/completions | 1,473 |
| /api/trpc/[trpc] | 512 |
| All other routes | ~3,600 |

Long-running requests pin connections: /api/openrouter/[...path] had 80,313 requests >5s in the 30-minute window around the incident, with avg duration 24.5s and max 800s (Vercel function timeout). These LLM streaming requests keep function instances warm, each holding a max:100 pool open to PgBouncer.

Connection math: Dozens of warm Vercel instances across 20+ regions × 200 connections per instance (100 primary + 100 replica) easily exceeds the 3000 PgBouncer max_client_conn. Adding CF workers with their own max:100 pools via Hyperdrive pushes it further.
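To make the connection math concrete: the pool sizes below are the real before/after values from this PR, but the instance and isolate counts are illustrative assumptions (the actual warm-instance count varies with traffic).

```typescript
// Illustrative aggregate connection counts. Pool maxima are from this PR;
// instance/isolate counts are hypothetical round numbers for the sketch.
const vercelInstances = 30; // assumed warm Vercel instances across 20+ regions
const workerIsolates = 5;   // assumed warm isolates per CF worker service
const PGBOUNCER_MAX_CLIENT_CONN = 3000;

const before =
  vercelInstances * (100 + 100) + // primary + replica pools, max 100 each
  3 * workerIsolates * 100 +      // cloud-agent, cloud-agent-next, git-token-service
  5;                              // session-ingest pool max 5

const after =
  vercelInstances * (10 + 10) + // Vercel pools reduced to max 10 each
  3 * workerIsolates * 1 +      // one Client per in-flight operation
  1;                            // session-ingest pool max 1

console.log({ before, after, limit: PGBOUNCER_MAX_CLIENT_CONN });
```

Even with these modest assumed counts, the old configuration lands well above the 3000-connection limit while the new one sits far below it.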

Verification

After deploy, monitor:

  1. Supabase dashboard — pooler connections should drop significantly from the ~2715 baseline
  2. Sentry — no more max_client_conn errors
  3. Vercel/Axiom logs — watch for statement_timeout errors (would indicate previously-slow queries that now hit the 20s guard)

Fixes KILOCODE-WEB-B5R

CF workers (cloud-agent, cloud-agent-next, git-token-service): replace
pg.Pool(max:100) with fresh pg.Client per operation, matching the
Hyperdrive-recommended pattern already used by kiloclaw and
webhook-agent-ingest. Hyperdrive handles connection pooling at the
infrastructure level.

session-ingest: reduce pool max from 5 to 1 (Hyperdrive pools for us).

Vercel Next.js app: reduce pool max from 100 to 10 per pool, and apply
the POSTGRES_MAX_QUERY_TIME as statement_timeout (was validated but
never actually used).

Fixes KILOCODE-WEB-B5R
iscekic self-assigned this on Feb 24, 2026
kiloconnect bot (Contributor) commented on Feb 24, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

The PR correctly migrates Cloudflare worker services (cloud-agent, cloud-agent-next, git-token-service) from pg.Pool to per-operation pg.Client, matching the Hyperdrive-recommended pattern of creating a fresh connection per query/transaction. Each client is properly cleaned up via client.end() in finally blocks, preventing connection leaks.

For session-ingest, reducing max from 5 to 1 is appropriate since Hyperdrive handles pooling at the infrastructure level.

The statement_timeout: 10_000 option is correctly preserved in the Client constructor — contrary to the existing Copilot comments, statement_timeout IS a valid pg.ClientConfig option (it's sent as a connection parameter to PostgreSQL).
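A sketch of what that Client configuration looks like (hypothetical helper name, not the PR's exact code; per the review above, node-postgres sends `statement_timeout` as a connection parameter to PostgreSQL):

```typescript
// Shape of the pg.ClientConfig fields this PR relies on.
interface ClientConfigSketch {
  connectionString: string;
  statement_timeout: number; // ms; server aborts any statement running longer
}

// Hypothetical factory; in the workers the connection string would come
// from the Hyperdrive binding rather than being passed in directly.
function makeClientConfig(connectionString: string): ClientConfigSketch {
  return {
    connectionString,
    statement_timeout: 10_000, // the 10s guard restored by the follow-up commit
  };
}
```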

Files Reviewed (4 files)
  • cloud-agent/src/db/node-postgres.ts - Pool → Client migration with statement_timeout and proper cleanup
  • cloud-agent-next/src/db/node-postgres.ts - Pool → Client migration (identical to cloud-agent)
  • cloudflare-git-token-service/src/db/database.ts - Pool → Client migration with statement_timeout
  • cloudflare-session-ingest/src/db/kysely.ts - Pool max reduced from 5 to 1

Restores the 10s statement_timeout that was present on the old Pool
configs but dropped during the Pool-to-Client migration.
Copilot AI left a comment

Pull request overview

This PR addresses database connection exhaustion issues by reducing connection pool sizes across multiple services. The root cause was aggregate connection counts exceeding the Supabase PgBouncer limit (~3000 connections), leading to no more connections allowed (max_client_conn) errors.

Changes:

  • Reduced Vercel Next.js app pool sizes from 100 to 10 per pool (primary and replica)
  • Migrated Cloudflare Workers (cloud-agent, cloud-agent-next, git-token-service) from Pool to Client pattern to align with Hyperdrive best practices
  • Reduced session-ingest pool size from 5 to 1 to match Hyperdrive pattern

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/lib/drizzle.ts | Reduced pool max from 100 to 10 for both primary and replica pools; attempted to add statement_timeout configuration |
| cloudflare-session-ingest/src/db/kysely.ts | Reduced pool max from 5 to 1 with documentation explaining Hyperdrive pooling |
| cloudflare-git-token-service/src/db/database.ts | Migrated from Pool (max: 100) to fresh Client per query with proper connection cleanup |
| cloud-agent/src/db/node-postgres.ts | Migrated from Pool (max: 100) to fresh Client per operation with uppercase SQL transaction commands |
| cloud-agent-next/src/db/node-postgres.ts | Migrated from Pool (max: 100) to fresh Client per operation with uppercase SQL transaction commands |


iscekic (Author) commented on Feb 24, 2026

Removed the src/lib/drizzle.ts (Vercel Next.js app) changes from this PR — those have been split out into a separate PR #518 to keep the Cloudflare worker changes isolated here.

iscekic requested a review from RSO on February 24, 2026
iscekic merged commit 5d85e86 into main on Feb 24, 2026
12 checks passed
iscekic deleted the fix/reduce-db-connection-pool-sizes branch on February 24, 2026