HYPERFLEET-762: Add OpenTelemetry configuration to standard environme…#89
HYPERFLEET-762: Add OpenTelemetry configuration to standard environme…#89ldornele wants to merge 9 commits intoopenshift-hyperfleet:mainfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @ldornele. Thanks for your PR. I'm waiting for a openshift-hyperfleet member to verify that this patch is reasonable to test. If it is, they should reply with Tip We noticed you've done this a few times! Consider joining the org to skip this step and gain Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughThe PR removes the Sequence Diagram(s)sequenceDiagram
participant Startup as Server Startup (cmd)
participant Env as Environment Variables (TRACING_ENABLED, OTEL_*)
participant Config as Application Config (logging.otel.enabled)
participant Telemetry as telemetry.InitTraceProvider
participant Exporter as OTLP/Stdout Exporter
Startup->>Env: Read TRACING_ENABLED and OTEL_* vars
Startup->>Config: Load config (logging.otel.enabled)
Startup->>Startup: Resolve tracingEnabled (TRACING_ENABLED overrides config if set)
Startup->>Config: Set Config.Logging.OTel.Enabled = tracingEnabled
Startup->>Telemetry: InitTraceProvider(ctx, serviceName, version)
Telemetry->>Env: Read OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_EXPORTER_OTLP_PROTOCOL, OTEL_TRACES_SAMPLER, OTEL_TRACES_SAMPLER_ARG, OTEL_SERVICE_NAME
Telemetry->>Exporter: Create OTLP exporter (HTTP or gRPC) if endpoint set, else stdout exporter
Telemetry->>Telemetry: Configure sampler from OTEL_TRACES_SAMPLER / OTEL_TRACES_SAMPLER_ARG
Telemetry->>Telemetry: Set TextMapPropagator (TraceContext + Baggage)
Telemetry->>Startup: Return TracerProvider
sequenceDiagram
participant OldStartup as Previous Startup Flow
participant OldConfig as App Config (logging.otel.sampling_rate)
participant OldTelemetry as InitTraceProvider(service, version, samplingRate)
participant OldExporter as stdouttrace Exporter
OldStartup->>OldConfig: Read logging.otel.sampling_rate (default 1.0)
OldStartup->>OldTelemetry: InitTraceProvider(ctx, serviceName, version, samplingRate)
OldTelemetry->>OldExporter: Create stdouttrace exporter
OldTelemetry->>OldStartup: Return TracerProvider
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (4)
pkg/logger/fields.go (1)
45-45: Inconsistent field naming convention.
FieldTracingEnableduses uppercase"TRACING_ENABLED"while all other field constants in this file usesnake_case(e.g.,FieldOTelEnabled = "otel_enabled",FieldSamplingRate = "sampling_rate"). This inconsistency could cause confusion in log queries.♻️ Suggested fix for consistency
- FieldTracingEnabled = "TRACING_ENABLED" + FieldTracingEnabled = "tracing_enabled"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/logger/fields.go` at line 45, FieldTracingEnabled uses uppercase "TRACING_ENABLED" which is inconsistent with the snake_case convention used by other constants (e.g., FieldOTelEnabled, FieldSamplingRate); rename or change the constant value to snake_case—replace the value of FieldTracingEnabled with "tracing_enabled" (or rename the identifier to match project convention) so it matches the pattern used by other fields in pkg/logger/fields.go.pkg/telemetry/otel_test.go (2)
140-145: Minor: Error message omits the actual error.The shutdown error message doesn't include the error value, making debugging harder.
♻️ Suggested fix
defer func(ctx context.Context, tp *trace.TracerProvider) { err := Shutdown(ctx, tp) if err != nil { - t.Errorf("Failed to shutdown trace provider") + t.Errorf("Failed to shutdown trace provider: %v", err) } }(ctx, tp)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/telemetry/otel_test.go` around lines 140 - 145, The defer's error path in the anonymous cleanup function calls t.Errorf("Failed to shutdown trace provider") without including the returned error; update the t.Errorf call inside the deferred function that wraps Shutdown(ctx, tp) to include the err value (e.g., t.Errorf("Failed to shutdown trace provider: %v", err)) so the Shutdown error from trace.TracerProvider is logged for debugging.
151-160: Incomplete sampling assertion forexpectedSample=truecases.When
expectedSampleis true, the test only verifiesspan.SpanContext().IsValid(). This doesn't confirm that the span was actually sampled—a span context can be valid but not sampled. Consider also checkingIsSampled()for consistency with theexpectedSample=falsebranch.♻️ Suggested fix
if tt.expectedSample { - if !span.SpanContext().IsValid() { - t.Error("Expected valid span context for sampling=true") + if !span.SpanContext().IsValid() || !span.SpanContext().TraceFlags().IsSampled() { + t.Error("Expected valid and sampled span context for sampling=true") } } else {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/telemetry/otel_test.go` around lines 151 - 160, The test's sampling assertion is incomplete: when expectedSample is true you only check span.SpanContext().IsValid(), which doesn't ensure the span was actually sampled; update the true-branch to also assert span.SpanContext().TraceFlags().IsSampled() (mirroring the false-branch check), so when expectedSample is true you fail the test if TraceFlags().IsSampled() is false while keeping the existing IsValid() check for span.SpanContext().docs/logging.md (1)
377-399: Code snippet differs from actual implementation.The initialization snippet at lines 377-399 simplifies error handling compared to the actual implementation in
cmd/hyperfleet-api/servecmd/cmd.go. Notably:
- Line 380 ignores the
strconv.ParseBoolerror, whereas the actual code logs a warning and falls back to config.- Line 395 uses
defer tp.Shutdown(...)directly, whereas the actual code usestelemetry.Shutdown()with proper timeout context.This is acceptable for a conceptual illustration, but consider adding a note that this is a simplified example.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/logging.md` around lines 377 - 399, The snippet in the docs simplifies error handling compared to the real implementation: update the text to note it’s a simplified example and call out the behavioral differences — specifically that strconv.ParseBool errors are logged via logger.WithError and fall back to environments.Environment().Config.Logging.OTel.Enabled instead of being ignored, and that the real code calls telemetry.Shutdown(ctx, tp) with a timeout context rather than deferring tp.Shutdown(...) directly after telemetry.InitTraceProvider; mention the symbols strconv.ParseBool, logger.WithError, telemetry.InitTraceProvider, tp.Shutdown and telemetry.Shutdown so readers can map the doc snippet to the actual implementation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/config.md`:
- Line 691: Update the checklist entry that currently lists
HYPERFLEET_CLUSTER_ADAPTERS and HYPERFLEET_NODEPOOL_ADAPTERS to use the
canonical environment variable names HYPERFLEET_ADAPTERS_REQUIRED_CLUSTER and
HYPERFLEET_ADAPTERS_REQUIRED_NODEPOOL so it matches the Complete Reference (the
entries referenced at lines 466-467); locate the checklist line in
docs/config.md and replace the two variable names accordingly to ensure
documentation consistency.
---
Nitpick comments:
In `@docs/logging.md`:
- Around line 377-399: The snippet in the docs simplifies error handling
compared to the real implementation: update the text to note it’s a simplified
example and call out the behavioral differences — specifically that
strconv.ParseBool errors are logged via logger.WithError and fall back to
environments.Environment().Config.Logging.OTel.Enabled instead of being ignored,
and that the real code calls telemetry.Shutdown(ctx, tp) with a timeout context
rather than deferring tp.Shutdown(...) directly after
telemetry.InitTraceProvider; mention the symbols strconv.ParseBool,
logger.WithError, telemetry.InitTraceProvider, tp.Shutdown and
telemetry.Shutdown so readers can map the doc snippet to the actual
implementation.
In `@pkg/logger/fields.go`:
- Line 45: FieldTracingEnabled uses uppercase "TRACING_ENABLED" which is
inconsistent with the snake_case convention used by other constants (e.g.,
FieldOTelEnabled, FieldSamplingRate); rename or change the constant value to
snake_case—replace the value of FieldTracingEnabled with "tracing_enabled" (or
rename the identifier to match project convention) so it matches the pattern
used by other fields in pkg/logger/fields.go.
In `@pkg/telemetry/otel_test.go`:
- Around line 140-145: The defer's error path in the anonymous cleanup function
calls t.Errorf("Failed to shutdown trace provider") without including the
returned error; update the t.Errorf call inside the deferred function that wraps
Shutdown(ctx, tp) to include the err value (e.g., t.Errorf("Failed to shutdown
trace provider: %v", err)) so the Shutdown error from trace.TracerProvider is
logged for debugging.
- Around line 151-160: The test's sampling assertion is incomplete: when
expectedSample is true you only check span.SpanContext().IsValid(), which
doesn't ensure the span was actually sampled; update the true-branch to also
assert span.SpanContext().TraceFlags().IsSampled() (mirroring the false-branch
check), so when expectedSample is true you fail the test if
TraceFlags().IsSampled() is false while keeping the existing IsValid() check for
span.SpanContext().
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9550d68f-106e-4b28-81a8-9e6a21fdcecc
📒 Files selected for processing (12)
charts/values.yamlcmd/hyperfleet-api/servecmd/cmd.godocs/config.mddocs/logging.mdpkg/config/flags.gopkg/config/loader.gopkg/config/loader_test.gopkg/config/logging.gopkg/config/logging_test.gopkg/logger/fields.gopkg/telemetry/otel.gopkg/telemetry/otel_test.go
💤 Files with no reviewable changes (3)
- pkg/config/loader.go
- pkg/config/logging_test.go
- pkg/config/loader_test.go
|
/ok-to-test |
|
@ldornele: Cannot trigger testing until a trusted user reviews the PR and leaves an DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
…lugin from Claude Code
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/telemetry/otel_test.go (1)
38-73: Consider addingt.Parallel()isolation or usingt.Setenv.The OTLP exporter test manipulates global environment variables. While tests in this file run sequentially by default, using
t.Setenv(Go 1.17+) would automatically handle cleanup and prevent interference if parallelism is enabled later.// Instead of: os.Setenv("OTEL_EXPORTER_OTLP_ENDPOINT", "...") defer os.Unsetenv("OTEL_EXPORTER_OTLP_ENDPOINT") // Consider: t.Setenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://fake-otel-collector:4317")This is a minor improvement for maintainability.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/telemetry/otel_test.go` around lines 38 - 73, The test TestInitTraceProvider_OTLPExporter mutates global env vars with os.Setenv/os.Unsetenv; replace that pattern with t.Setenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://fake-otel-collector:4317") to automatically handle cleanup and avoid cross-test interference, and optionally add t.Parallel() at the top of TestInitTraceProvider_OTLPExporter if you want the test to run concurrently; ensure references to InitTraceProvider and Shutdown remain unchanged and that the tracer assertions still follow the t.Setenv call.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/telemetry/otel_test.go`:
- Around line 38-73: The test TestInitTraceProvider_OTLPExporter mutates global
env vars with os.Setenv/os.Unsetenv; replace that pattern with
t.Setenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://fake-otel-collector:4317") to
automatically handle cleanup and avoid cross-test interference, and optionally
add t.Parallel() at the top of TestInitTraceProvider_OTLPExporter if you want
the test to run concurrently; ensure references to InitTraceProvider and
Shutdown remain unchanged and that the tracer assertions still follow the
t.Setenv call.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d7e5e533-cfbb-43f3-8928-71044a158f42
📒 Files selected for processing (5)
cmd/hyperfleet-api/servecmd/cmd.godocs/config.mddocs/logging.mdpkg/logger/fields.gopkg/telemetry/otel_test.go
✅ Files skipped from review due to trivial changes (2)
- pkg/logger/fields.go
- docs/logging.md
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/telemetry/otel_test.go`:
- Around line 52-62: The test currently calls Shutdown(ctx, tp) only after
assertions, which can leak the trace provider if an assertion fails; immediately
defer the cleanup right after creating the trace provider (tp) so Shutdown(ctx,
tp) is always executed—i.e., place a defer func() { _ = Shutdown(ctx, tp) }()
(or check/handle error inside defer) immediately after tp is
initialized/returned and before calling otel.Tracer("test") so the tracer, tp,
Shutdown, and ctx references are preserved and cleanup always runs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b3955823-9df9-4051-9ce6-ee7b96f04b8f
📒 Files selected for processing (1)
pkg/telemetry/otel_test.go
|
/ok-to-test |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@go.mod`:
- Around line 35-41: Update the vulnerable dependency
go.opentelemetry.io/otel/sdk from v1.38.0 to at least v1.40.0 in go.mod to
mitigate CVE-2026-24051; specifically change the module version string for
go.opentelemetry.io/otel/sdk to v1.40.0 (or later) and run `go mod tidy`/`go
get` to refresh the lockfile and ensure the build picks up the patched release
that uses the absolute /usr/sbin/ioreg path.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 105558a0-d208-456d-90df-dd5e89435b26
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (3)
charts/templates/configmap.yamlcharts/values.yamlgo.mod
💤 Files with no reviewable changes (1)
- charts/templates/configmap.yaml
✅ Files skipped from review due to trivial changes (1)
- charts/values.yaml
| // Determine if tracing is enabled | ||
| // Precedence: TRACING_ENABLED (tracing standard) > config (env/flags) > default | ||
| var tracingEnabled bool | ||
| if tracingEnv := os.Getenv("TRACING_ENABLED"); tracingEnv != "" { |
There was a problem hiding this comment.
Can we align on using a single variable for enabling tracing?
I see the standard says TRACING_ENABLED and this may come from before standarising environment variables with prefix HYPERFLEET_
Then, the docs mention HYPERFLEET_LOGGING_OTEL_ENABLED
I think we should consolidate in a single one and keep it consistent across repositories.
AFAIK OTEL_* variables do not have the HYPERFLEET_ prefix since they are used directly by the OTEL libraries, but any other variable that we introduce should be prefixed by our standard convention.
There was a problem hiding this comment.
Got it! I’ll remove HYPERFLEET_LOGGING_OTEL_ENABLED then. I’ll keep the OTEL_ variables* for consistency with hyperfleet-sentinel. Does that make sense?
There was a problem hiding this comment.
Also to clarify, I meant to use HYPERFLEET_TRACING_ENABLED instead of simple TRACING_ENABLED
There was a problem hiding this comment.
HYPERFLEET_TRACING_ENABLED is not compliant with the tracing standards — we should use TRACING_ENABLED instead. See: tracing standards
docs/config.md
Outdated
| - ✅ Default values | ||
| **Core Configuration:** | ||
| - ✅ Database credentials set via `HYPERFLEET_DATABASE_*` environment variables | ||
| - ✅ Database SSL/TLS configured (`HYPERFLEET_DATABASE_SSL_MODE=require`) |
There was a problem hiding this comment.
This is specific per environment, e.g. when using Google Cloud SQL proxy the connection from the application may happen without TLS to the proxy, and then the proxy uses TLS to the Cloud SQL database.
There was a problem hiding this comment.
I've simplified config.md to be a clean technical reference. Removed the detailed checklists/tips and kept only the OpenTelemetry configuration tables (properties, env vars, defaults).
docs/config.md
Outdated
| **Core Configuration:** | ||
| - ✅ Database credentials set via `HYPERFLEET_DATABASE_*` environment variables | ||
| - ✅ Database SSL/TLS configured (`HYPERFLEET_DATABASE_SSL_MODE=require`) | ||
| - ✅ Server authentication enabled (`HYPERFLEET_SERVER_JWT_ENABLED=true`) |
There was a problem hiding this comment.
We don't have authentication yet for the API
There was a problem hiding this comment.
Got it! I've simplified config.md to be a clean technical reference. Removed the detailed checklists/tips and kept only the OpenTelemetry configuration tables (properties, env vars, defaults).
docs/config.md
Outdated
| - ✅ CLI flags (--kebab-case) | ||
| - ✅ Configuration files (YAML snake_case) | ||
| - ✅ Default values | ||
| **Core Configuration:** |
There was a problem hiding this comment.
This section overlaps with the contents at api-operator-guide.md
IMO we should keep the config.md for describing configs, and tips/checklist should go into the operator-guide
There was a problem hiding this comment.
Agreed! I've simplified config.md to be a clean technical reference. Removed the detailed checklists/tips and kept only the OpenTelemetry configuration tables (properties, env vars, defaults).
| serviceName := "hyperfleet-api" | ||
| if svcName := os.Getenv("OTEL_SERVICE_NAME"); svcName != "" { | ||
| serviceName = svcName | ||
| } | ||
|
|
||
| traceProvider, err := telemetry.InitTraceProvider(ctx, serviceName, api.Version) |
There was a problem hiding this comment.
Both sentinel and adapter have a name field
I wonder if we should add it to the API to avoid having this constant "hyperfleet-api" here
Then, here we are reading a variable that is otel specific.... I wonder if that should be done inside the telemetry.InitTraceProvider function, so main can become completely unaware of this setting
docs/logging.md
Outdated
| if environments.Environment().Config.Logging.OTel.Enabled { | ||
| samplingRate := environments.Environment().Config.Logging.OTel.SamplingRate | ||
| tp, err := telemetry.InitTraceProvider(ctx, "hyperfleet-api", api.Version, samplingRate) | ||
| // Precedence: TRACING_ENABLED (tracing standard) > config (env/flags) > default |
There was a problem hiding this comment.
I would limit the code that we put in docs to be absolute minimum, and just point to the go file.
If not, we create more work and increase the drift chances
There was a problem hiding this comment.
Good catch! I've minimized code duplication in logging.md
docs/logging.md
Outdated
| } | ||
| ``` | ||
|
|
||
| **Note:** This is a simplified example. The actual implementation differs in two key ways: |
There was a problem hiding this comment.
Same with this comments... if the code is already commented.... getting the comments to the documentation seems.... too much repetition. wdyt?
There was a problem hiding this comment.
Good catch! I've minimized it in logging.md
pkg/telemetry/otel.go
Outdated
| if otlpEndpoint := os.Getenv(envOtelExporterOtlpEndpoint); otlpEndpoint != "" { | ||
| protocol := os.Getenv(envOtelExporterOtlpProtocol) | ||
| switch strings.ToLower(protocol) { | ||
| case "http", "http/protobuf": |
There was a problem hiding this comment.
Category: Bug
The switch accepts bare "http" as equivalent to "http/protobuf" (line 52),
but per the OTEL
spec the valid
values for OTEL_EXPORTER_OTLP_PROTOCOL are grpc, http/protobuf, and
http/json. Bare "http" is ambiguous — someone setting it might expect
http/json but get http/protobuf silently.
Suggestion: either drop "http" from the case (let it fall through to the
default: branch which logs a warning and uses gRPC), or at minimum log a
deprecation warning when "http" is used:
case "http/protobuf":
exporter, err = otlptracehttp.New(ctx)
case "http":
logger.With(ctx, logger.FieldProtocol, protocol).Warn("Ambiguous protocol
'http', assuming 'http/protobuf'. Use 'http/protobuf' explicitly.")
exporter, err = otlptracehttp.New(ctx)There was a problem hiding this comment.
Got it — I removed "http" from the case, as accepting a bare "http" value is ambiguous. The protocol should be explicitly set to http/protobuf.
pkg/telemetry/otel.go
Outdated
| // - OTEL_TRACES_SAMPLER: sampler type (default: "parentbased_traceidratio") | ||
| // - OTEL_TRACES_SAMPLER_ARG: sampling rate 0.0-1.0 (default: 1.0) | ||
| // - OTEL_RESOURCE_ATTRIBUTES: additional resource attributes (k=v,k2=v2) | ||
| // - K8S_NAMESPACE: kubernetes namespace (added as k8s.namespace.name) |
There was a problem hiding this comment.
Category: Inconsistency
Both the InitTraceProvider docstring (line 43) and the OTelConfig
docstring in pkg/config/logging.go claim K8S_NAMESPACE is used to set
k8s.namespace.name, but the code never reads this env var.
resource.WithFromEnv() only parses OTEL_RESOURCE_ATTRIBUTES.
It should remove K8S_NAMESPACE from the docstrings (users should set it via
OTEL_RESOURCE_ATTRIBUTES=k8s.namespace.name=my-namespace).
There was a problem hiding this comment.
Mmm... why do we even need to send the namespace?
I mean, the sentinels will be all running in the same namespace, what value does it add?
Do we want to differentiate production namespaces from dev namespaces?
Or... is it to differentiate between dev environments?
There was a problem hiding this comment.
Got it! Removed K8S_NAMESPACE from pkg/telemetry/otel.go docstring. Users can set the Kubernetes namespace via OTEL_RESOURCE_ATTRIBUTES:
- export OTEL_RESOURCE_ATTRIBUTES=k8s.namespace.name=my-namespace
pkg/config/logging.go
Outdated
| OTel: OTelConfig{ | ||
| Enabled: false, | ||
| SamplingRate: 1.0, | ||
| Enabled: false, |
There was a problem hiding this comment.
Category: Standards
The [tracing standard](https://github.com/openshift-hyperfleet/architecture/bl
ob/main/hyperfleet/standards/tracing.md#configuration) defines
TRACING_ENABLED with a default of true, but the PR defaults to false
(via OTelConfig.Enabled: false in NewLoggingConfig()).
If this is intentional (e.g., tracing shouldn't be on unless explicitly
enabled), it might be worth either:
- Updating the tracing standard to reflect
falseas the default, or - Documenting the deviation in the PR description / code comment
Otherwise other components implementing the same standard might default to
true, leading to inconsistent cross-component behavior — which is exactly
what TRACING_ENABLED was designed to unify.
There was a problem hiding this comment.
Got it — setting OTelConfig.Enabled to true to align with the tracing standard default.
pkg/telemetry/otel.go
Outdated
| if parsedRate, err := strconv.ParseFloat(arg, 64); err == nil && parsedRate >= 0.0 && parsedRate <= 1.0 { | ||
| rate = parsedRate | ||
| } else { | ||
| logger.With(ctx, envOtelTracesSamplerArg, arg, "default", rate). |
There was a problem hiding this comment.
This produces log keys like {"OTEL_TRACES_SAMPLER_ARG": "...", "default":
1.0}, which is inconsistent with the rest of the PR where structured field
constants are used (logger.FieldProtocol, logger.FieldSampler, etc.). Consider
using the existing logger.FieldSamplingRate for the default value and a
dedicated field for the raw input:
| logger.With(ctx, envOtelTracesSamplerArg, arg, "default", rate). | |
| logger.With(ctx, logger.FieldSamplingRate, rate, "raw_value", arg). |
There was a problem hiding this comment.
Good catch! Fixing it!
Updates OpenTelemetry SDK and exporters from v1.38.0 to v1.40.0, addressing
potential security vulnerabilities and fixing CI test-integration failures
caused by missing dependency entries in go.sum.
Key updates:
- go.opentelemetry.io/otel: v1.38.0 → v1.40.0
- go.opentelemetry.io/otel/sdk: v1.38.0 → v1.40.0
- go.opentelemetry.io/otel/exporters/otlp/*: v1.38.0 → v1.40.0
- go.opentelemetry.io/otel/exporters/stdout/stdouttrace: v1.38.0 → v1.40.0
- google.golang.org/grpc: v1.75.0 → v1.78.0
Includes transitive dependency updates for protobuf, grpc-gateway, and
golang.org/x/{crypto,net,sys,text}.
All 551 unit tests pass.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
xueli181114
left a comment
There was a problem hiding this comment.
Developer review: inline follow-ups on consistency and operator-facing correctness.
docs/config.md
Outdated
| | `logging.output` | `HYPERFLEET_LOGGING_OUTPUT` | string | `stdout` | | ||
| | `logging.otel.enabled` | `HYPERFLEET_LOGGING_OTEL_ENABLED` | bool | `false` | | ||
| | `logging.otel.sampling_rate` | `HYPERFLEET_LOGGING_OTEL_SAMPLING_RATE` | float | `1.0` | | ||
| | `logging.otel.enabled` | `TRACING_ENABLED` | bool | `false` | |
There was a problem hiding this comment.
The default shown here (false) now conflicts with runtime default in pkg/config/logging.go (Enabled: true) and with the OpenTelemetry section above that states default true. Recommend aligning this table with actual runtime behavior to avoid deployment surprises.
There was a problem hiding this comment.
Great catch! Fixed in latest commit - updated all in config.md to show the correct default true value, now aligned with pkg/config/logging.go runtime behavior.
docs/config.md
Outdated
| @@ -423,7 +456,6 @@ All CLI flags and their corresponding configuration paths. | |||
| | `--log-format` | `logging.format` | string | | |||
| | `--log-output` | `logging.output` | string | | |||
| | `--log-otel-enabled` | `logging.otel.enabled` | bool | | |||
There was a problem hiding this comment.
This CLI flag appears stale. I couldn't find --log-otel-enabled in current flag registration (AddLoggingFlags), so this reference can mislead operators. Suggest removing this row or reintroducing the flag consistently.
There was a problem hiding this comment.
Good point! Removed the stale --log-otel-enabled flag reference from config.md
charts/values.yaml
Outdated
| # Configuration via standard environment variables. | ||
| # See: https://github.com/openshift-hyperfleet/architecture/blob/main/hyperfleet/standards/tracing.md#configuration | ||
| otel: | ||
| enabled: false |
There was a problem hiding this comment.
Potential behavior mismatch: chart default sets config.logging.otel.enabled=false while runtime default now resolves to tracing enabled (true) when no overrides are provided. If intentional, please document this Helm-specific deviation explicitly; otherwise align defaults for predictable behavior across deployment modes.
There was a problem hiding this comment.
Good point! Fixed in latest commit. Aligned Helm chart default to match runtime behavior.
go.mod
Outdated
| go.opentelemetry.io/otel/trace v1.38.0 | ||
| go.opentelemetry.io/otel v1.40.0 | ||
| go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.40.0 | ||
| go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp v1.38.0 |
There was a problem hiding this comment.
Most OTEL deps were bumped to v1.40.0, but otlptracehttp remains v1.38.0. Mixing versions may still work, but it increases maintenance risk and can hide subtle incompatibilities. Consider aligning otlptracehttp to the same OTEL minor version set.
There was a problem hiding this comment.
Great catch! Updated otlptracehttp from v1.38.0 to v1.40.0 in latest commit.
xueli181114
left a comment
There was a problem hiding this comment.
Good direction — moving to standard OTEL env vars is the right call. A few things I noticed beyond what rh-amarin and rafabene already flagged.
cmd/hyperfleet-api/servecmd/cmd.go
Outdated
| } | ||
|
|
||
| // Update config to ensure middleware registration sees the final value | ||
| environments.Environment().Config.Logging.OTel.Enabled = tracingEnabled |
There was a problem hiding this comment.
Mutating the shared global config object at runtime could be surprising. If anything reads environments.Environment().Config.Logging.OTel.Enabled before this line (e.g. from another goroutine or during config dump at line 70 above), it would see the pre-override value.
In practice this runs sequentially at startup so it's fine today, but it's fragile — consider passing tracingEnabled forward explicitly rather than back-patching the config struct.
There was a problem hiding this comment.
Excellent point! Fixed in latest commit. Refactored to pass tracingEnabled explicitly instead of
back-patching the config struct.
Changes:
NewAPIServer()→NewAPIServer(tracingEnabled bool)routes()→routes(tracingEnabled bool)- Removed config mutation (
Config.Logging.OTel.Enabled = tracingEnabled) - Updated test helper to pass explicit value
| l.bindEnv("logging.output") | ||
| l.bindEnv("logging.otel.enabled") | ||
| l.bindEnv("logging.otel.sampling_rate") | ||
| l.bindEnv("logging.masking.enabled") |
There was a problem hiding this comment.
Removing these bindings means HYPERFLEET_LOGGING_OTEL_ENABLED is now silently ignored. Any existing deployment setting this env var (e.g. directly on pod spec, not through Helm) will stop working without warning.
Consider either:
- Keeping the binding so the config file path still works as the fallback layer, OR
- Adding a startup warning if
HYPERFLEET_LOGGING_OTEL_ENABLEDis set, telling users to migrate toTRACING_ENABLED
There was a problem hiding this comment.
Good point! We introduced a deprecation warning in latest commit.
What happens now:
- If
HYPERFLEET_LOGGING_OTEL_ENABLEDis set → Warning logged at startup - Warning message clearly shows:
- Deprecated variable name
- Replacement variable (
TRACING_ENABLED)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Motivation
TRACING_ENABLED(Tracing standard) for unified control across hyperfleet-api, hyperfleet-sentinel, and adaptersChanges
Variable Migration
HYPERFLEET_LOGGING_OTEL_SAMPLING_RATEOTEL_TRACES_SAMPLER_ARGOTEL_SERVICE_NAMEOTEL_EXPORTER_OTLP_ENDPOINTOTEL_EXPORTER_OTLP_PROTOCOLOTEL_TRACES_SAMPLEROTEL_RESOURCE_ATTRIBUTESBackward Compatibility:
HYPERFLEET_LOGGING_OTEL_ENABLEDstill works via Viperlogging.otel.enabled: truecontinue to workTRACING_ENABLEDtakes precedence when setVariable Precedence
TRACING_ENABLED (Tracing standard)
↓
HYPERFLEET_LOGGING_OTEL_ENABLED (Viper env var)
↓
config.yaml: logging.otel.enabled
↓
Default (false)
Documentation
Summary by CodeRabbit
New Features
Changes
Documentation
Tests