
chore: migrate to golang tracer instead of openmetrics in prometheus #3959

Open
mark-vass-konghq wants to merge 4 commits into main from chore/migrate-to-golang-tracer

Conversation

@mark-vass-konghq
Contributor

@mark-vass-konghq mark-vass-konghq commented Mar 17, 2026

Migrate to golang tracer instead of openmetrics in prometheus.

Summary by CodeRabbit

  • Refactor

    • Simplified telemetry initialization and runtime metrics lifecycle; runtime metrics collector is now initialized separately and stored on Application structs.
  • Chores

    • Added config toggle to disable Prometheus default collectors (defaults to false).
    • Updated startup and cleanup wiring across services to align with the new telemetry lifecycle.
  • Style

    • Standardized structured error logging to log errors as generic values under "error".

@mark-vass-konghq mark-vass-konghq self-assigned this Mar 17, 2026
@mark-vass-konghq mark-vass-konghq requested a review from a team as a code owner March 17, 2026 13:59
@mark-vass-konghq mark-vass-konghq added the kind/refactor Code refactor, cleanup or minor improvement label Mar 17, 2026
@coderabbitai
Contributor

coderabbitai bot commented Mar 17, 2026

📝 Walkthrough

Walkthrough

Converted the runtime metrics collector into a lightweight placeholder with a new constructor; removed it from the TelemetryHandler signature; added config to optionally unregister Prometheus default Go/Process collectors; updated DI/wiring and minor logging refinements across services.

Changes

Cohort / File(s) Summary
Telemetry core & config
app/common/telemetry.go, app/config/telemetry.go
Reworked runtime metrics lifecycle: RuntimeMetricsCollector is now an empty type with NewRuntimeMetricsCollector(meterProvider, logger) (RuntimeMetricsCollector, error); NewTelemetryHandler no longer accepts the runtime collector; added PrometheusExportersMetricsTelemetryConfig.DisableDefaultCollectors and conditional unregistration of Prometheus default collectors; adjusted Prometheus wiring and related logging.
Service wiring / DI (generated)
cmd/balance-worker/wire_gen.go, cmd/billing-worker/wire_gen.go, cmd/notification-service/wire_gen.go, cmd/server/wire_gen.go, cmd/sink-worker/wire_gen.go
Updated telemetry wiring to remove runtime collector from NewTelemetryHandler calls; moved runtime collector creation to a later point and added it to Application structs; adjusted init/cleanup and error paths to match new lifecycle.
Service wiring (manual wire files)
cmd/balance-worker/wire.go, cmd/billing-worker/wire.go, cmd/notification-service/wire.go, cmd/sink-worker/wire.go
Added public RuntimeMetricsCollector common.RuntimeMetricsCollector field to Application structs and adjusted embedded fields/formatting where applicable.
Telemetry server & termination
cmd/server/wire_gen.go
Telemetry initialization reordered; NewTelemetryServer and NewTerminationChecker constructors surfaced in wiring; cleanup/error handling updated for new telemetry/termination resources.
Logging tweaks
app/common/openmeter_sinkworker.go, openmeter/ingest/kafkaingest/collector.go, openmeter/sink/sink.go
Replaced slog.String("err", err.Error()) with slog.Any("error", err) in several error logs to emit structured error values instead of error strings.
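For reference, the toggle described in the "Telemetry core & config" row would presumably be set from configuration along these lines. The key path and spelling are assumed from the change summary and review comments, not verified against the config loader:

```yaml
telemetry:
  metrics:
    exporters:
      prometheus:
        enabled: true
        # Assumed key for PrometheusExportersMetricsTelemetryConfig.DisableDefaultCollectors;
        # defaults to false, i.e. the default Go/Process collectors stay registered.
        disableDefaultCollectors: true
```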

Sequence Diagram(s)

(Skipped — changes are refactor/DI adjustments, config, and lightweight lifecycle; no new multi-component sequential flow requiring diagram.)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • GAlexIHU
  • gergely-kurucz-konghq
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 30.00%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled.
Title Check | ✅ Passed | The title accurately describes the main refactoring effort: migrating from OpenMetrics to Go runtime tracing in Prometheus. Changes throughout the PR show removal of the RuntimeMetricsCollector parameter from NewTelemetryHandler, addition of DisableDefaultCollectors config, and restructuring of runtime metrics initialization across multiple wire files, all aligned with the stated objective.


@mark-vass-konghq
Contributor Author

@CodeRabbit review

@coderabbitai
Contributor

coderabbitai bot commented Mar 17, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@app/config/telemetry.go`:
- Around line 265-266: The CollectorsEnabled field is inverted (true causes the
default Go/Process collectors to be unregistered) and the ConfigureTelemetry()
path doesn't explicitly set a default for
telemetry.metrics.exporters.prometheus.collectorsEnabled; rename the field to a
clearer name (e.g., DisableDefaultCollectors or UseOTelRuntimeMetrics) in the
telemetry config struct (replace CollectorsEnabled with the new name and update
all usages such as where collectors are unregistered in
app/common/telemetry.go), and update ConfigureTelemetry() to explicitly set a
sensible default for telemetry.metrics.exporters.prometheus.collectorsEnabled
(true/false as appropriate for backward compatibility) so the behavior is
discoverable and consistent.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: dd63c0eb-86a5-4c61-ae25-d86b4182fc1c

📥 Commits

Reviewing files that changed from the base of the PR and between 4f30760 and 033341a.

📒 Files selected for processing (7)
  • app/common/telemetry.go
  • app/config/telemetry.go
  • cmd/balance-worker/wire_gen.go
  • cmd/billing-worker/wire_gen.go
  • cmd/notification-service/wire_gen.go
  • cmd/server/wire_gen.go
  • cmd/sink-worker/wire_gen.go


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
app/common/telemetry.go (1)

231-235: Consider only unregistering collectors when Prometheus is enabled.

The unregistration happens even if Prometheus.Enabled is false (checked later at line 239). While not a bug, it's a bit wasteful to modify the global registry when Prometheus metrics aren't being exposed anyway.

🔧 Optional refactor to scope the unregistration
-	if metricsConf.Exporters.Prometheus.DisableDefaultCollectors {
-		prometheus.Unregister(collectors.NewGoCollector())
-		prometheus.Unregister(collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}))
-	}
-
 	router := chi.NewRouter()
 	router.Mount("/debug", middleware.Profiler())

 	if metricsConf.Exporters.Prometheus.Enabled {
+		if metricsConf.Exporters.Prometheus.DisableDefaultCollectors {
+			prometheus.Unregister(collectors.NewGoCollector())
+			prometheus.Unregister(collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}))
+		}
 		router.Handle("/metrics", promhttp.Handler())
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/common/telemetry.go` around lines 231 - 235, The code currently
unregisters default Prometheus collectors unconditionally via
metricsConf.Exporters.Prometheus.DisableDefaultCollectors (calling
prometheus.Unregister on collectors.NewGoCollector and
collectors.NewProcessCollector) even when Prometheus is not enabled later;
change the logic to check Prometheus.Enabled first and only perform the
unregister when both metricsConf.Exporters.Prometheus.DisableDefaultCollectors
is true and Prometheus.Enabled is true, so the global Prometheus registry is
only modified when Prometheus metrics are actually being exposed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@app/common/telemetry.go`:
- Around line 231-235: The code currently unregisters default Prometheus
collectors unconditionally via
metricsConf.Exporters.Prometheus.DisableDefaultCollectors (calling
prometheus.Unregister on collectors.NewGoCollector and
collectors.NewProcessCollector) even when Prometheus is not enabled later;
change the logic to check Prometheus.Enabled first and only perform the
unregister when both metricsConf.Exporters.Prometheus.DisableDefaultCollectors
is true and Prometheus.Enabled is true, so the global Prometheus registry is
only modified when Prometheus metrics are actually being exposed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6b82a969-d2e7-4a3c-9def-3ead917b4235

📥 Commits

Reviewing files that changed from the base of the PR and between 033341a and c309633.

📒 Files selected for processing (2)
  • app/common/telemetry.go
  • app/config/telemetry.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • app/config/telemetry.go


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@app/common/telemetry.go`:
- Around line 297-312: The Wire provider for runtime metrics
(NewRuntimeMetricsCollector which calls runtime.Start) is being dropped unless
referenced; add a RuntimeMetricsCollector field to the Application struct in
both cmd/balance-worker/wire.go and cmd/notification-service/wire.go so Wire
keeps the provider in those graphs, then re-run wire to regenerate the DI code;
ensure the Application structs reference the RuntimeMetricsCollector type (not
just the marker) so NewRuntimeMetricsCollector is retained.
- Around line 231-234: The current block unregisters both
collectors.NewGoCollector() and collectors.NewProcessCollector(...); change it
to only unregister the Go runtime collector: keep the call that unregisters
collectors.NewProcessCollector(...) intact and remove/unskip unregistering
collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}) so only
prometheus.Unregister(collectors.NewGoCollector()) runs when
metricsConf.Exporters.Prometheus.DisableDefaultCollectors is true; ensure the
condition still references
metricsConf.Exporters.Prometheus.DisableDefaultCollectors and leave
prometheus.Unregister(collectors.NewProcessCollector(...)) out so OS process
metrics remain exposed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 92f52a18-ea5a-4647-bf1a-23fd0df3a217

📥 Commits

Reviewing files that changed from the base of the PR and between c309633 and 78ad9e0.

📒 Files selected for processing (7)
  • app/common/telemetry.go
  • app/config/telemetry.go
  • cmd/balance-worker/wire_gen.go
  • cmd/billing-worker/wire_gen.go
  • cmd/notification-service/wire_gen.go
  • cmd/server/wire_gen.go
  • cmd/sink-worker/wire_gen.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • cmd/sink-worker/wire_gen.go
  • cmd/billing-worker/wire_gen.go

Comment on lines +231 to +234
if metricsConf.Exporters.Prometheus.DisableDefaultCollectors {
prometheus.Unregister(collectors.NewGoCollector())
prometheus.Unregister(collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}))
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Does the OpenTelemetry Go package go.opentelemetry.io/contrib/instrumentation/runtime collect OS process metrics comparable to Prometheus client_golang's ProcessCollector, or only Go runtime metrics?

💡 Result:

go.opentelemetry.io/contrib/instrumentation/runtime collects Go runtime metrics (e.g., go.memory.*, go.goroutine.count, go.processor.limit, GC-related values), sourced from Go’s runtime/metrics (and optionally some deprecated runtime.* metrics). It does not collect OS-level process metrics like CPU seconds, RSS, open FDs, etc. that Prometheus ProcessCollector provides. [1] [2]

Prometheus ProcessCollector explicitly exports process metrics such as CPU, memory, file-descriptor usage, and process start time. [3]

So: OpenTelemetry .../instrumentation/runtime == Go runtime metrics only; Prometheus ProcessCollector == OS process metrics. [1] [3]

Sources:
[1] https://pkg.go.dev/go.opentelemetry.io/contrib/instrumentation/runtime
[2] https://github.com/open-telemetry/opentelemetry-go-contrib/blob/instrumentation/runtime/v0.67.0/instrumentation/runtime/runtime.go
[3] https://pkg.go.dev/github.com/prometheus/client_golang/prometheus/collectors


Only unregister NewGoCollector() here—keep ProcessCollector().

OTel's runtime.Start() replaces Go runtime metrics, but it doesn't cover OS process metrics like CPU time, memory usage, and file descriptor counts. Those come exclusively from Prometheus's ProcessCollector, so disabling it would silently drop those signals from /metrics. If you need to disable the default Go collector, just remove that one.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/common/telemetry.go` around lines 231 - 234, The current block
unregisters both collectors.NewGoCollector() and
collectors.NewProcessCollector(...); change it to only unregister the Go runtime
collector: keep the call that unregisters collectors.NewProcessCollector(...)
intact and remove/unskip unregistering
collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}) so only
prometheus.Unregister(collectors.NewGoCollector()) runs when
metricsConf.Exporters.Prometheus.DisableDefaultCollectors is true; ensure the
condition still references
metricsConf.Exporters.Prometheus.DisableDefaultCollectors and leave
prometheus.Unregister(collectors.NewProcessCollector(...)) out so OS process
metrics remain exposed.


@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
openmeter/sink/sink.go (1)

720-725: ⚠️ Potential issue | 🔴 Critical

Data race: goroutine mutates the outer err variable without synchronization.

The goroutine launched at line 720 assigns to err from the enclosing Run function scope on line 723 (if err = json.Unmarshal(...)). Since the main consumer loop and other code paths also write to this same variable, concurrent modifications create a race condition that can corrupt error handling and clobber unrelated error values.

The pattern in openmeter/ingest/kafkaingest/collector.go (line 193) handles the identical Kafka stats scenario correctly by using a short declaration instead.

🔧 Suggested fix
 			go func() {
 				var stats kafkastats.Stats

-				if err = json.Unmarshal([]byte(e.String()), &stats); err != nil {
+				if err := json.Unmarshal([]byte(e.String()), &stats); err != nil {
 					logger.WarnContext(ctx, "failed to unmarshal Kafka client stats", slog.Any("error", err))
+					return
 				}

-				ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
+				statsCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
 				defer cancel()

-				s.kafkaMetrics.Add(ctx, &stats)
+				s.kafkaMetrics.Add(statsCtx, &stats)
 			}()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openmeter/sink/sink.go` around lines 720 - 725, The goroutine handling Kafka
stats is mutating the outer Run-scope variable err (via `if err =
json.Unmarshal(...)`), causing a data race; change that assignment to use a new
local error variable (e.g., `if err := json.Unmarshal(...)`) inside the
anonymous function so the goroutine uses its own `err` and does not modify the
enclosing `err`. Locate the anonymous func where `var stats kafkastats.Stats`
and `json.Unmarshal([]byte(e.String()), &stats)` are used in sink.go and replace
the inner `if err = ...` with a short declaration to eliminate the race.
♻️ Duplicate comments (1)
app/common/telemetry.go (1)

231-234: ⚠️ Potential issue | 🟠 Major

Small gotcha: keep the Prometheus process collector registered.

runtime.Start() only replaces Go runtime metrics. Line 233 unregisters collectors.NewProcessCollector(...) too, which means /metrics loses process CPU, RSS, start-time, and FD signals when this flag is enabled. I’d only unregister NewGoCollector() here.

🛠️ Suggested tweak
 if metricsConf.Exporters.Prometheus.DisableDefaultCollectors {
 	prometheus.Unregister(collectors.NewGoCollector())
-	prometheus.Unregister(collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}))
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/common/telemetry.go` around lines 231 - 234, The code unregisters both
runtime and process collectors when
metricsConf.Exporters.Prometheus.DisableDefaultCollectors is true; change the
block so only collectors.NewGoCollector() is unregistered and leave
collectors.NewProcessCollector(...) registered. Locate the conditional that
checks metricsConf.Exporters.Prometheus.DisableDefaultCollectors and replace the
prometheus.Unregister call that targets
collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}) so it is not
called (only call prometheus.Unregister(collectors.NewGoCollector())). Ensure
the rest of the export setup remains unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@openmeter/sink/sink.go`:
- Around line 720-725: The goroutine handling Kafka stats is mutating the outer
Run-scope variable err (via `if err = json.Unmarshal(...)`), causing a data
race; change that assignment to use a new local error variable (e.g., `if err :=
json.Unmarshal(...)`) inside the anonymous function so the goroutine uses its
own `err` and does not modify the enclosing `err`. Locate the anonymous func
where `var stats kafkastats.Stats` and `json.Unmarshal([]byte(e.String()),
&stats)` are used in sink.go and replace the inner `if err = ...` with a short
declaration to eliminate the race.

---

Duplicate comments:
In `@app/common/telemetry.go`:
- Around line 231-234: The code unregisters both runtime and process collectors
when metricsConf.Exporters.Prometheus.DisableDefaultCollectors is true; change
the block so only collectors.NewGoCollector() is unregistered and leave
collectors.NewProcessCollector(...) registered. Locate the conditional that
checks metricsConf.Exporters.Prometheus.DisableDefaultCollectors and replace the
prometheus.Unregister call that targets
collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}) so it is not
called (only call prometheus.Unregister(collectors.NewGoCollector())). Ensure
the rest of the export setup remains unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 03e48eb3-a8c7-4b29-be45-50607966608f

📥 Commits

Reviewing files that changed from the base of the PR and between 78ad9e0 and 8a59e79.

📒 Files selected for processing (14)
  • app/common/openmeter_sinkworker.go
  • app/common/telemetry.go
  • app/config/telemetry.go
  • cmd/balance-worker/wire.go
  • cmd/balance-worker/wire_gen.go
  • cmd/billing-worker/wire.go
  • cmd/billing-worker/wire_gen.go
  • cmd/notification-service/wire.go
  • cmd/notification-service/wire_gen.go
  • cmd/server/wire_gen.go
  • cmd/sink-worker/wire.go
  • cmd/sink-worker/wire_gen.go
  • openmeter/ingest/kafkaingest/collector.go
  • openmeter/sink/sink.go


Labels

kind/refactor Code refactor, cleanup or minor improvement
