Fix DB connection exhaustion (max_client_conn) by reducing pool sizes #497
Conversation
- CF workers (`cloud-agent`, `cloud-agent-next`, `git-token-service`): replace `pg.Pool(max: 100)` with a fresh `pg.Client` per operation, matching the Hyperdrive-recommended pattern already used by `kiloclaw` and `webhook-agent-ingest`. Hyperdrive handles connection pooling at the infrastructure level.
- `session-ingest`: reduce pool `max` from 5 to 1 (Hyperdrive pools for us).
- Vercel Next.js app: reduce pool `max` from 100 to 10 per pool, and apply `POSTGRES_MAX_QUERY_TIME` as `statement_timeout` (it was validated but never actually used).

Fixes KILOCODE-WEB-B5R
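The Pool-to-Client migration described above boils down to open, use, close per operation, leaving all pooling to Hyperdrive. A minimal sketch of that shape (the `withClient` helper and `DbClient` interface are illustrative stand-ins, not the actual service code):

```typescript
// Minimal stand-in for the pg.Client surface this sketch needs.
interface DbClient {
  connect(): Promise<void>
  query(sql: string): Promise<unknown>
  end(): Promise<void>
}

// Open a fresh client, run one operation, and always close it, so no
// driver-level pool holds connections open between requests; Hyperdrive
// pools at the infrastructure level instead.
async function withClient<T>(
  makeClient: () => DbClient,
  op: (client: DbClient) => Promise<T>,
): Promise<T> {
  const client = makeClient()
  await client.connect()
  try {
    return await op(client)
  } finally {
    await client.end() // release the Hyperdrive connection promptly
  }
}
```

In the real workers, `makeClient` would be `() => new pg.Client(hyperdriveConnectionString)`; the interface above just keeps the sketch self-contained.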
Code Review Summary
Status: No Issues Found | Recommendation: Merge

The PR correctly migrates the Cloudflare worker services. Files Reviewed (4 files)
Restores the 10s statement_timeout that was present on the old Pool configs but dropped during the Pool-to-Client migration.
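As context for this review note: node-postgres accepts a `statement_timeout` option (in milliseconds) directly in the Pool/Client config, so preserving the old timeout alongside the reduced `max` is a one-line change. A hedged sketch of what the reduced Vercel pool config could look like (variable and env-var names are illustrative; the real config lives in `src/lib/drizzle.ts`):

```typescript
import { Pool } from "pg"

// Illustrative only. POSTGRES_MAX_QUERY_TIME is assumed to be milliseconds.
const statementTimeoutMs = Number(process.env.POSTGRES_MAX_QUERY_TIME ?? 10_000)

const primaryPool = new Pool({
  connectionString: process.env.POSTGRES_URL, // assumed env var name
  max: 10, // was 100; dozens of warm instances x 200 connections blew past max_client_conn
  statement_timeout: statementTimeoutMs, // previously validated but never applied
})
```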
Pull request overview
This PR addresses database connection exhaustion issues by reducing connection pool sizes across multiple services. The root cause was aggregate connection counts exceeding the Supabase PgBouncer limit (~3000 connections), leading to `no more connections allowed (max_client_conn)` errors.
Changes:
- Reduced Vercel Next.js app pool sizes from 100 to 10 per pool (primary and replica)
- Migrated Cloudflare Workers (cloud-agent, cloud-agent-next, git-token-service) from Pool to Client pattern to align with Hyperdrive best practices
- Reduced session-ingest pool size from 5 to 1 to match Hyperdrive pattern
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/lib/drizzle.ts | Reduced pool max from 100 to 10 for both primary and replica pools; attempted to add statement_timeout configuration |
| cloudflare-session-ingest/src/db/kysely.ts | Reduced pool max from 5 to 1 with documentation explaining Hyperdrive pooling |
| cloudflare-git-token-service/src/db/database.ts | Migrated from Pool (max:100) to fresh Client per query pattern with proper connection cleanup |
| cloud-agent/src/db/node-postgres.ts | Migrated from Pool (max:100) to fresh Client per operation with uppercase SQL transaction commands |
| cloud-agent-next/src/db/node-postgres.ts | Migrated from Pool (max:100) to fresh Client per operation with uppercase SQL transaction commands |
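The "fresh Client per operation with uppercase SQL transaction commands" rows above imply a transaction wrapper: open a client, issue `BEGIN`/`COMMIT`/`ROLLBACK`, always close. A sketch against a minimal client interface (a stand-in for `pg.Client`; the wrapper name is illustrative, not the actual file contents):

```typescript
// Minimal stand-in for the pg.Client surface this sketch needs.
interface SqlClient {
  connect(): Promise<void>
  query(sql: string): Promise<unknown>
  end(): Promise<void>
}

// Run `op` inside a transaction on a fresh client. Uppercase transaction
// commands, rollback on error, and an unconditional end() so the
// Hyperdrive connection is never held by an idle driver-level pool.
async function withTransaction<T>(
  makeClient: () => SqlClient,
  op: (client: SqlClient) => Promise<T>,
): Promise<T> {
  const client = makeClient()
  await client.connect()
  try {
    await client.query("BEGIN")
    const result = await op(client)
    await client.query("COMMIT")
    return result
  } catch (err) {
    await client.query("ROLLBACK")
    throw err
  } finally {
    await client.end() // always release the connection back to Hyperdrive
  }
}
```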
Removed the …
Summary

Fixes `no more connections allowed (max_client_conn)` errors caused by aggregate connection counts across all services exceeding the Supabase PgBouncer limit (3000).

- CF workers (`cloud-agent`, `cloud-agent-next`, `git-token-service`): Replace `pg.Pool(max: 100)` with a fresh `pg.Client` per operation, matching the Hyperdrive-recommended pattern already used by `kiloclaw` and `webhook-agent-ingest`. Hyperdrive handles connection pooling at the infrastructure level; using a driver-level pool on top is an anti-pattern that holds unnecessary connections open.
- `session-ingest`: Reduce pool `max` from 5 to 1 (Hyperdrive pools for us).
- Vercel Next.js app (`src/lib/drizzle.ts`): Reduce pool `max` from 100 to 10 per pool, and apply `POSTGRES_MAX_QUERY_TIME` as `statement_timeout` (it was validated to exist but never actually enforced on the pool).

Root Cause
At peak, the Supabase PgBouncer dashboard showed 2715/3000 pooler connections. The aggregate `max` across services was:

| Service | `max` |
|---|---|
| Vercel Next.js app | 100 primary + 100 replica, per instance |
| cloud-agent | 100 |
| cloud-agent-next | 100 |
| git-token-service | 100 |
| session-ingest | 5 |

Evidence from Axiom
Axiom logs from the incident window (2026-02-24 10:00–11:30 UTC) confirm the diagnosis:
Traffic volume: ~110,000 requests/minute (~1,800/sec) steady-state across 20+ Vercel regions (fra1, bom1, cdg1, arn1, sin1, cpt1, iad1, dxb1, lhr1, hkg1, sfo1, gru1, syd1, pdx1, cle1, etc.).
Error burst at 10:15–10:25 UTC: ~20,000 5xx errors in 10 minutes, across all routes — consistent with a database-level connection exhaustion (not an endpoint-specific bug):
- `/api/openrouter/[...path]`
- `/api/upload-cli-session-blob-v2`
- `/api/profile/balance`
- `/api/fim/completions`
- `/api/trpc/[trpc]`

Long-running requests pin connections: `/api/openrouter/[...path]` had 80,313 requests >5s in the 30-minute window around the incident, with avg duration 24.5s and max 800s (the Vercel function timeout). These LLM streaming requests keep function instances warm, each holding a `max: 100` pool open to PgBouncer.

Connection math: Dozens of warm Vercel instances across 20+ regions × 200 connections per instance (100 primary + 100 replica) easily exceed the 3000 PgBouncer `max_client_conn` limit. Adding CF workers with their own `max: 100` pools via Hyperdrive pushes it further.

Verification
After deploy, monitor:
- `max_client_conn` errors
- `statement_timeout` errors (would indicate previously-slow queries that now hit the 20s guard)

Fixes KILOCODE-WEB-B5R
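As a closing sanity check, the connection math from the root-cause section can be reproduced back-of-envelope. The warm-instance count below is an assumption standing in for "dozens"; the per-pool maxes are the pre-PR configuration:

```typescript
// Pre-PR per-service pool maxes.
const perVercelInstance = 100 + 100 // primary + replica pools per warm instance
const warmVercelInstances = 15      // assumed; "dozens" across 20+ regions
const cfWorkerPools = 3 * 100       // cloud-agent, cloud-agent-next, git-token-service
const sessionIngestPool = 5

const maxClientConn = 3000 // Supabase PgBouncer limit

const aggregate =
  warmVercelInstances * perVercelInstance + cfWorkerPools + sessionIngestPool
// 15 * 200 + 300 + 5 = 3305, already past the 3000 limit even at a
// modest warm-instance count
```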