
Fix DB connection exhaustion (max_client_conn) by reducing pool sizes#497

Merged
iscekic merged 3 commits into main from fix/reduce-db-connection-pool-sizes on Feb 24, 2026

Conversation

iscekic (Contributor) commented on Feb 24, 2026

Summary

Fixes `no more connections allowed (max_client_conn)` errors caused by the aggregate connection count across all services exceeding the Supabase PgBouncer limit (3000).

  • CF workers (cloud-agent, cloud-agent-next, git-token-service): Replace pg.Pool(max:100) with fresh pg.Client per operation, matching the Hyperdrive-recommended pattern already used by kiloclaw and webhook-agent-ingest. Hyperdrive handles connection pooling at the infrastructure level — using a driver-level pool on top is an anti-pattern that holds unnecessary connections open.
  • session-ingest: Reduce pool max from 5 to 1 (Hyperdrive pools for us).
  • Vercel Next.js app (src/lib/drizzle.ts): Reduce pool max from 100 to 10 per pool, and apply POSTGRES_MAX_QUERY_TIME as statement_timeout (was validated to exist but never actually enforced on the pool).
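The Pool-to-Client migration for the CF workers can be sketched roughly as follows. This is a minimal sketch, not the workers' actual code: the `DbClient` interface and `withClient` helper are hypothetical names, and in the real services the factory would construct a `pg.Client` from the Hyperdrive connection string (e.g. `() => new Client({ connectionString: env.HYPERDRIVE.connectionString })`).

```typescript
// Minimal interface covering the pg.Client surface the sketch needs.
interface DbClient {
  connect(): Promise<void>;
  query(sql: string): Promise<unknown>;
  end(): Promise<void>;
}

// One fresh client per operation: Hyperdrive pools at the infrastructure
// level, so the worker opens a single connection, runs the operation, and
// always closes it — instead of holding a driver-level pool of 100 open.
async function withClient<T>(
  makeClient: () => DbClient,
  op: (client: DbClient) => Promise<T>,
): Promise<T> {
  const client = makeClient();
  await client.connect();
  try {
    return await op(client);
  } finally {
    // Always release the connection, even if the operation throws.
    await client.end();
  }
}
```

The `finally`-based cleanup mirrors the `client.end()` pattern the review below confirms for the real workers.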

Root Cause

At peak, the Supabase PgBouncer dashboard showed 2715/3000 pooler connections. The aggregate max across services was:

| Source | Old pool max | New |
| --- | --- | --- |
| Vercel primary pool | 100 × N instances | 10 × N |
| Vercel replica pool | 100 × N instances | 10 × N |
| cloud-agent | 100 × M isolates (via Hyperdrive) | 1 per query (Client) |
| cloud-agent-next | 100 × M isolates (via Hyperdrive) | 1 per query (Client) |
| git-token-service | 100 × M isolates (via Hyperdrive) | 1 per query (Client) |
| session-ingest | 5 per request (via Hyperdrive) | 1 per request |

Evidence from Axiom

Axiom logs from the incident window (2026-02-24 10:00–11:30 UTC) confirm the diagnosis:

Traffic volume: ~110,000 requests/minute (~1,800/sec) steady-state across 20+ Vercel regions (fra1, bom1, cdg1, arn1, sin1, cpt1, iad1, dxb1, lhr1, hkg1, sfo1, gru1, syd1, pdx1, cle1, etc.).

Error burst at 10:15–10:25 UTC: ~20,000 5xx errors in 10 minutes, across all routes — consistent with a database-level connection exhaustion (not an endpoint-specific bug):

| Route | 5xx errors (10:15–10:25) |
| --- | --- |
| /api/openrouter/[...path] | 5,716 |
| /api/upload-cli-session-blob-v2 | 4,608 |
| /api/profile/balance | 4,048 |
| /api/fim/completions | 1,473 |
| /api/trpc/[trpc] | 512 |
| All other routes | ~3,600 |

Long-running requests pin connections: /api/openrouter/[...path] had 80,313 requests >5s in the 30-minute window around the incident, with avg duration 24.5s and max 800s (Vercel function timeout). These LLM streaming requests keep function instances warm, each holding a max:100 pool open to PgBouncer.

Connection math: Dozens of warm Vercel instances across 20+ regions × 200 connections per instance (100 primary + 100 replica) easily exceeds the 3000 PgBouncer max_client_conn. Adding CF workers with their own max:100 pools via Hyperdrive pushes it further.
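To make the connection math concrete: the pool sizes below are the real before/after values from this PR, but the instance and isolate counts are illustrative assumptions (the actual warm-instance count varies with traffic).

```typescript
// Illustrative aggregate connection counts. Pool maxima are from this PR;
// instance/isolate counts are hypothetical round numbers for the sketch.
const vercelInstances = 30; // assumed warm Vercel instances across 20+ regions
const workerIsolates = 5;   // assumed warm isolates per CF worker service
const PGBOUNCER_MAX_CLIENT_CONN = 3000;

const before =
  vercelInstances * (100 + 100) + // primary + replica pools, max 100 each
  3 * workerIsolates * 100 +      // cloud-agent, cloud-agent-next, git-token-service
  5;                              // session-ingest pool max 5

const after =
  vercelInstances * (10 + 10) + // Vercel pools reduced to max 10 each
  3 * workerIsolates * 1 +      // one Client per in-flight operation
  1;                            // session-ingest pool max 1

console.log({ before, after, limit: PGBOUNCER_MAX_CLIENT_CONN });
```

Even with these modest assumed counts, the old configuration lands well above the 3000-connection limit while the new one sits far below it.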

Verification

After deploy, monitor:

  1. Supabase dashboard — pooler connections should drop significantly from the ~2715 baseline
  2. Sentry — no more max_client_conn errors
  3. Vercel/Axiom logs — watch for statement_timeout errors (would indicate previously-slow queries that now hit the 20s guard)

Fixes KILOCODE-WEB-B5R

CF workers (cloud-agent, cloud-agent-next, git-token-service): replace
pg.Pool(max:100) with fresh pg.Client per operation, matching the
Hyperdrive-recommended pattern already used by kiloclaw and
webhook-agent-ingest. Hyperdrive handles connection pooling at the
infrastructure level.

session-ingest: reduce pool max from 5 to 1 (Hyperdrive pools for us).

Vercel Next.js app: reduce pool max from 100 to 10 per pool, and apply
the POSTGRES_MAX_QUERY_TIME as statement_timeout (was validated but
never actually used).

Fixes KILOCODE-WEB-B5R
iscekic self-assigned this on Feb 24, 2026
kiloconnect bot (Contributor) commented on Feb 24, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

The PR correctly migrates Cloudflare worker services (cloud-agent, cloud-agent-next, git-token-service) from pg.Pool to per-operation pg.Client, matching the Hyperdrive-recommended pattern of creating a fresh connection per query/transaction. Each client is properly cleaned up via client.end() in finally blocks, preventing connection leaks.

For session-ingest, reducing max from 5 to 1 is appropriate since Hyperdrive handles pooling at the infrastructure level.

The statement_timeout: 10_000 option is correctly preserved in the Client constructor — contrary to the existing Copilot comments, statement_timeout IS a valid pg.ClientConfig option (it's sent as a connection parameter to PostgreSQL).
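A sketch of what that Client configuration looks like (hypothetical helper name, not the PR's exact code; per the review above, node-postgres sends `statement_timeout` as a connection parameter to PostgreSQL):

```typescript
// Shape of the pg.ClientConfig fields this PR relies on.
interface ClientConfigSketch {
  connectionString: string;
  statement_timeout: number; // ms; server aborts any statement running longer
}

// Hypothetical factory; in the workers the connection string would come
// from the Hyperdrive binding rather than being passed in directly.
function makeClientConfig(connectionString: string): ClientConfigSketch {
  return {
    connectionString,
    statement_timeout: 10_000, // the 10s guard restored by the follow-up commit
  };
}
```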

Files Reviewed (4 files)
  • cloud-agent/src/db/node-postgres.ts - Pool → Client migration with statement_timeout and proper cleanup
  • cloud-agent-next/src/db/node-postgres.ts - Pool → Client migration (identical to cloud-agent)
  • cloudflare-git-token-service/src/db/database.ts - Pool → Client migration with statement_timeout
  • cloudflare-session-ingest/src/db/kysely.ts - Pool max reduced from 5 to 1

Restores the 10s statement_timeout that was present on the old Pool
configs but dropped during the Pool-to-Client migration.
Copilot AI left a comment

Pull request overview

This PR addresses database connection exhaustion issues by reducing connection pool sizes across multiple services. The root cause was aggregate connection counts exceeding the Supabase PgBouncer limit (~3000 connections), leading to no more connections allowed (max_client_conn) errors.

Changes:

  • Reduced Vercel Next.js app pool sizes from 100 to 10 per pool (primary and replica)
  • Migrated Cloudflare Workers (cloud-agent, cloud-agent-next, git-token-service) from Pool to Client pattern to align with Hyperdrive best practices
  • Reduced session-ingest pool size from 5 to 1 to match Hyperdrive pattern

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/lib/drizzle.ts | Reduced pool max from 100 to 10 for both primary and replica pools; attempted to add statement_timeout configuration |
| cloudflare-session-ingest/src/db/kysely.ts | Reduced pool max from 5 to 1 with documentation explaining Hyperdrive pooling |
| cloudflare-git-token-service/src/db/database.ts | Migrated from Pool (max: 100) to fresh Client per query with proper connection cleanup |
| cloud-agent/src/db/node-postgres.ts | Migrated from Pool (max: 100) to fresh Client per operation with uppercase SQL transaction commands |
| cloud-agent-next/src/db/node-postgres.ts | Migrated from Pool (max: 100) to fresh Client per operation with uppercase SQL transaction commands |


iscekic (Author) commented on Feb 24, 2026

Removed the src/lib/drizzle.ts (Vercel Next.js app) changes from this PR — those have been split out into a separate PR #518 to keep the Cloudflare worker changes isolated here.

iscekic requested a review from RSO on February 24, 2026
iscekic merged commit 5d85e86 into main on Feb 24, 2026
12 checks passed
iscekic deleted the fix/reduce-db-connection-pool-sizes branch on February 24, 2026