Skip to content

HYPERFLEET-762: Add OpenTelemetry configuration to standard environme…#89

Open
ldornele wants to merge 9 commits intoopenshift-hyperfleet:mainfrom
ldornele:HYPERFLEET-762
Open

HYPERFLEET-762: Add OpenTelemetry configuration to standard environme…#89
ldornele wants to merge 9 commits intoopenshift-hyperfleet:mainfrom
ldornele:HYPERFLEET-762

Conversation

@ldornele
Copy link
Contributor

@ldornele ldornele commented Mar 21, 2026

Motivation

  • Standards Compliance: Use official OpenTelemetry environment variables instead of custom ones
  • Cross-Component Consistency: Implement TRACING_ENABLED (Tracing standard) for unified control across hyperfleet-api, hyperfleet-sentinel, and adapters
  • Ecosystem Integration: Enable seamless integration with OpenTelemetry Operator, collectors, and other OTEL-compatible tools
  • Simplified Configuration: Reduce confusion by using industry-standard variable names

Changes

  • Add TRACING_ENABLED as primary toggle (Tracing standard override)
  • Replace HYPERFLEET_LOGGING_OTEL_SAMPLING_RATE with OTEL_TRACES_SAMPLER_ARG
  • Implement standard OTEL_* environment variables:
    • OTEL_SERVICE_NAME
    • OTEL_EXPORTER_OTLP_ENDPOINT
    • OTEL_EXPORTER_OTLP_PROTOCOL
    • OTEL_TRACES_SAMPLER
    • OTEL_TRACES_SAMPLER_ARG
    • OTEL_RESOURCE_ATTRIBUTES
  • Update servecmd to respect TRACING_ENABLED > config precedence
  • Update logging.md with OpenTelemetry environment variables table
  • Update config.md with OpenTelemetry configuration section

Variable Migration

Old (Removed) New (Standard)
HYPERFLEET_LOGGING_OTEL_SAMPLING_RATE OTEL_TRACES_SAMPLER_ARG
N/A OTEL_SERVICE_NAME
N/A OTEL_EXPORTER_OTLP_ENDPOINT
N/A OTEL_EXPORTER_OTLP_PROTOCOL
N/A OTEL_TRACES_SAMPLER
N/A OTEL_RESOURCE_ATTRIBUTES

Backward Compatibility:

  • HYPERFLEET_LOGGING_OTEL_ENABLED still works via Viper
  • Existing config files with logging.otel.enabled: true continue to work
  • TRACING_ENABLED takes precedence when set

Variable Precedence

TRACING_ENABLED (Tracing standard)

HYPERFLEET_LOGGING_OTEL_ENABLED (Viper env var)

config.yaml: logging.otel.enabled

Default (false)

Documentation

Summary by CodeRabbit

  • New Features

    • Support for standard OpenTelemetry environment variables and TRACING_ENABLED to control tracing
  • Changes

    • Removed default sampling-rate from config/CLI/Helm; sampling now driven by OTEL_TRACES_SAMPLER/OTEL_TRACES_SAMPLER_ARG
    • Tracing now derives service name from OTEL_SERVICE_NAME and logs service metadata
    • Removed legacy sampling CLI/flag and config defaults
  • Documentation

    • Expanded OpenTelemetry guidance, precedence rules, and updated deployment checklist
  • Tests

    • Added telemetry tests for exporter selection, sampler behavior, and shutdown handling

@openshift-ci openshift-ci bot requested review from jsell-rh and pnguyen44 March 21, 2026 20:25
@openshift-ci
Copy link

openshift-ci bot commented Mar 21, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign vkareh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Copy link

openshift-ci bot commented Mar 21, 2026

Hi @ldornele. Thanks for your PR.

I'm waiting for a openshift-hyperfleet member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@coderabbitai
Copy link

coderabbitai bot commented Mar 21, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

The PR removes the logging.otel.sampling_rate configuration, bindings, CLI flag, and related tests; introduces TRACING_ENABLED env var which, when present, overrides configured tracing enablement and is written back into the resolved application config; updates OpenTelemetry initialization to derive service name and sampler from standard OTEL_* env vars (removing samplingRate parameter), chooses OTLP exporter when OTEL_EXPORTER_OTLP_ENDPOINT is set (HTTP or gRPC by protocol), sets TextMapPropagator to TraceContext+Baggage, adjusts logging fields, adds telemetry tests, and updates docs/Helm/flags accordingly.

Sequence Diagram(s)

sequenceDiagram
  participant Startup as Server Startup (cmd)
  participant Env as Environment Variables (TRACING_ENABLED, OTEL_*)
  participant Config as Application Config (logging.otel.enabled)
  participant Telemetry as telemetry.InitTraceProvider
  participant Exporter as OTLP/Stdout Exporter
  Startup->>Env: Read TRACING_ENABLED and OTEL_* vars
  Startup->>Config: Load config (logging.otel.enabled)
  Startup->>Startup: Resolve tracingEnabled (TRACING_ENABLED overrides config if set)
  Startup->>Config: Set Config.Logging.OTel.Enabled = tracingEnabled
  Startup->>Telemetry: InitTraceProvider(ctx, serviceName, version)
  Telemetry->>Env: Read OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_EXPORTER_OTLP_PROTOCOL, OTEL_TRACES_SAMPLER, OTEL_TRACES_SAMPLER_ARG, OTEL_SERVICE_NAME
  Telemetry->>Exporter: Create OTLP exporter (HTTP or gRPC) if endpoint set, else stdout exporter
  Telemetry->>Telemetry: Configure sampler from OTEL_TRACES_SAMPLER / OTEL_TRACES_SAMPLER_ARG
  Telemetry->>Telemetry: Set TextMapPropagator (TraceContext + Baggage)
  Telemetry->>Startup: Return TracerProvider
Loading
sequenceDiagram
  participant OldStartup as Previous Startup Flow
  participant OldConfig as App Config (logging.otel.sampling_rate)
  participant OldTelemetry as InitTraceProvider(service, version, samplingRate)
  participant OldExporter as stdouttrace Exporter
  OldStartup->>OldConfig: Read logging.otel.sampling_rate (default 1.0)
  OldStartup->>OldTelemetry: InitTraceProvider(ctx, serviceName, version, samplingRate)
  OldTelemetry->>OldExporter: Create stdouttrace exporter
  OldTelemetry->>OldStartup: Return TracerProvider
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 45.45% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: adding OpenTelemetry configuration to standard environment variables, which is the primary focus of all modifications across documentation, configuration, and code files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (4)
pkg/logger/fields.go (1)

45-45: Inconsistent field naming convention.

FieldTracingEnabled uses uppercase "TRACING_ENABLED" while all other field constants in this file use snake_case (e.g., FieldOTelEnabled = "otel_enabled", FieldSamplingRate = "sampling_rate"). This inconsistency could cause confusion in log queries.

♻️ Suggested fix for consistency
-	FieldTracingEnabled   = "TRACING_ENABLED"
+	FieldTracingEnabled   = "tracing_enabled"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/logger/fields.go` at line 45, FieldTracingEnabled uses uppercase
"TRACING_ENABLED" which is inconsistent with the snake_case convention used by
other constants (e.g., FieldOTelEnabled, FieldSamplingRate); rename or change
the constant value to snake_case—replace the value of FieldTracingEnabled with
"tracing_enabled" (or rename the identifier to match project convention) so it
matches the pattern used by other fields in pkg/logger/fields.go.
pkg/telemetry/otel_test.go (2)

140-145: Minor: Error message omits the actual error.

The shutdown error message doesn't include the error value, making debugging harder.

♻️ Suggested fix
 			defer func(ctx context.Context, tp *trace.TracerProvider) {
 				err := Shutdown(ctx, tp)
 				if err != nil {
-					t.Errorf("Failed to shutdown trace provider")
+					t.Errorf("Failed to shutdown trace provider: %v", err)
 				}
 			}(ctx, tp)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/telemetry/otel_test.go` around lines 140 - 145, The defer's error path in
the anonymous cleanup function calls t.Errorf("Failed to shutdown trace
provider") without including the returned error; update the t.Errorf call inside
the deferred function that wraps Shutdown(ctx, tp) to include the err value
(e.g., t.Errorf("Failed to shutdown trace provider: %v", err)) so the Shutdown
error from trace.TracerProvider is logged for debugging.

151-160: Incomplete sampling assertion for expectedSample=true cases.

When expectedSample is true, the test only verifies span.SpanContext().IsValid(). This doesn't confirm that the span was actually sampled—a span context can be valid but not sampled. Consider also checking IsSampled() for consistency with the expectedSample=false branch.

♻️ Suggested fix
 			if tt.expectedSample {
-				if !span.SpanContext().IsValid() {
-					t.Error("Expected valid span context for sampling=true")
+				if !span.SpanContext().IsValid() || !span.SpanContext().TraceFlags().IsSampled() {
+					t.Error("Expected valid and sampled span context for sampling=true")
 				}
 			} else {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/telemetry/otel_test.go` around lines 151 - 160, The test's sampling
assertion is incomplete: when expectedSample is true you only check
span.SpanContext().IsValid(), which doesn't ensure the span was actually
sampled; update the true-branch to also assert
span.SpanContext().TraceFlags().IsSampled() (mirroring the false-branch check),
so when expectedSample is true you fail the test if TraceFlags().IsSampled() is
false while keeping the existing IsValid() check for span.SpanContext().
docs/logging.md (1)

377-399: Code snippet differs from actual implementation.

The initialization snippet at lines 377-399 simplifies error handling compared to the actual implementation in cmd/hyperfleet-api/servecmd/cmd.go. Notably:

  • Line 380 ignores the strconv.ParseBool error, whereas the actual code logs a warning and falls back to config.
  • Line 395 uses defer tp.Shutdown(...) directly, whereas the actual code uses telemetry.Shutdown() with proper timeout context.

This is acceptable for a conceptual illustration, but consider adding a note that this is a simplified example.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/logging.md` around lines 377 - 399, The snippet in the docs simplifies
error handling compared to the real implementation: update the text to note it’s
a simplified example and call out the behavioral differences — specifically that
strconv.ParseBool errors are logged via logger.WithError and fall back to
environments.Environment().Config.Logging.OTel.Enabled instead of being ignored,
and that the real code calls telemetry.Shutdown(ctx, tp) with a timeout context
rather than deferring tp.Shutdown(...) directly after
telemetry.InitTraceProvider; mention the symbols strconv.ParseBool,
logger.WithError, telemetry.InitTraceProvider, tp.Shutdown and
telemetry.Shutdown so readers can map the doc snippet to the actual
implementation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/config.md`:
- Line 691: Update the checklist entry that currently lists
HYPERFLEET_CLUSTER_ADAPTERS and HYPERFLEET_NODEPOOL_ADAPTERS to use the
canonical environment variable names HYPERFLEET_ADAPTERS_REQUIRED_CLUSTER and
HYPERFLEET_ADAPTERS_REQUIRED_NODEPOOL so it matches the Complete Reference (the
entries referenced at lines 466-467); locate the checklist line in
docs/config.md and replace the two variable names accordingly to ensure
documentation consistency.

---

Nitpick comments:
In `@docs/logging.md`:
- Around line 377-399: The snippet in the docs simplifies error handling
compared to the real implementation: update the text to note it’s a simplified
example and call out the behavioral differences — specifically that
strconv.ParseBool errors are logged via logger.WithError and fall back to
environments.Environment().Config.Logging.OTel.Enabled instead of being ignored,
and that the real code calls telemetry.Shutdown(ctx, tp) with a timeout context
rather than deferring tp.Shutdown(...) directly after
telemetry.InitTraceProvider; mention the symbols strconv.ParseBool,
logger.WithError, telemetry.InitTraceProvider, tp.Shutdown and
telemetry.Shutdown so readers can map the doc snippet to the actual
implementation.

In `@pkg/logger/fields.go`:
- Line 45: FieldTracingEnabled uses uppercase "TRACING_ENABLED" which is
inconsistent with the snake_case convention used by other constants (e.g.,
FieldOTelEnabled, FieldSamplingRate); rename or change the constant value to
snake_case—replace the value of FieldTracingEnabled with "tracing_enabled" (or
rename the identifier to match project convention) so it matches the pattern
used by other fields in pkg/logger/fields.go.

In `@pkg/telemetry/otel_test.go`:
- Around line 140-145: The defer's error path in the anonymous cleanup function
calls t.Errorf("Failed to shutdown trace provider") without including the
returned error; update the t.Errorf call inside the deferred function that wraps
Shutdown(ctx, tp) to include the err value (e.g., t.Errorf("Failed to shutdown
trace provider: %v", err)) so the Shutdown error from trace.TracerProvider is
logged for debugging.
- Around line 151-160: The test's sampling assertion is incomplete: when
expectedSample is true you only check span.SpanContext().IsValid(), which
doesn't ensure the span was actually sampled; update the true-branch to also
assert span.SpanContext().TraceFlags().IsSampled() (mirroring the false-branch
check), so when expectedSample is true you fail the test if
TraceFlags().IsSampled() is false while keeping the existing IsValid() check for
span.SpanContext().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9550d68f-106e-4b28-81a8-9e6a21fdcecc

📥 Commits

Reviewing files that changed from the base of the PR and between 34ae499 and 6e087ee.

📒 Files selected for processing (12)
  • charts/values.yaml
  • cmd/hyperfleet-api/servecmd/cmd.go
  • docs/config.md
  • docs/logging.md
  • pkg/config/flags.go
  • pkg/config/loader.go
  • pkg/config/loader_test.go
  • pkg/config/logging.go
  • pkg/config/logging_test.go
  • pkg/logger/fields.go
  • pkg/telemetry/otel.go
  • pkg/telemetry/otel_test.go
💤 Files with no reviewable changes (3)
  • pkg/config/loader.go
  • pkg/config/logging_test.go
  • pkg/config/loader_test.go

@ldornele
Copy link
Contributor Author

/ok-to-test

@openshift-ci
Copy link

openshift-ci bot commented Mar 21, 2026

@ldornele: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

Details

In response to this:

/ok-to-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
pkg/telemetry/otel_test.go (1)

38-73: Consider adding t.Parallel() isolation or using t.Setenv.

The OTLP exporter test manipulates global environment variables. While tests in this file run sequentially by default, using t.Setenv (Go 1.17+) would automatically handle cleanup and prevent interference if parallelism is enabled later.

// Instead of:
os.Setenv("OTEL_EXPORTER_OTLP_ENDPOINT", "...")
defer os.Unsetenv("OTEL_EXPORTER_OTLP_ENDPOINT")

// Consider:
t.Setenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://fake-otel-collector:4317")

This is a minor improvement for maintainability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/telemetry/otel_test.go` around lines 38 - 73, The test
TestInitTraceProvider_OTLPExporter mutates global env vars with
os.Setenv/os.Unsetenv; replace that pattern with
t.Setenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://fake-otel-collector:4317") to
automatically handle cleanup and avoid cross-test interference, and optionally
add t.Parallel() at the top of TestInitTraceProvider_OTLPExporter if you want
the test to run concurrently; ensure references to InitTraceProvider and
Shutdown remain unchanged and that the tracer assertions still follow the
t.Setenv call.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@pkg/telemetry/otel_test.go`:
- Around line 38-73: The test TestInitTraceProvider_OTLPExporter mutates global
env vars with os.Setenv/os.Unsetenv; replace that pattern with
t.Setenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://fake-otel-collector:4317") to
automatically handle cleanup and avoid cross-test interference, and optionally
add t.Parallel() at the top of TestInitTraceProvider_OTLPExporter if you want
the test to run concurrently; ensure references to InitTraceProvider and
Shutdown remain unchanged and that the tracer assertions still follow the
t.Setenv call.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d7e5e533-cfbb-43f3-8928-71044a158f42

📥 Commits

Reviewing files that changed from the base of the PR and between 6e087ee and e95d947.

📒 Files selected for processing (5)
  • cmd/hyperfleet-api/servecmd/cmd.go
  • docs/config.md
  • docs/logging.md
  • pkg/logger/fields.go
  • pkg/telemetry/otel_test.go
✅ Files skipped from review due to trivial changes (2)
  • pkg/logger/fields.go
  • docs/logging.md

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/telemetry/otel_test.go`:
- Around line 52-62: The test currently calls Shutdown(ctx, tp) only after
assertions, which can leak the trace provider if an assertion fails; immediately
defer the cleanup right after creating the trace provider (tp) so Shutdown(ctx,
tp) is always executed—i.e., place a defer func() { _ = Shutdown(ctx, tp) }()
(or check/handle error inside defer) immediately after tp is
initialized/returned and before calling otel.Tracer("test") so the tracer, tp,
Shutdown, and ctx references are preserved and cleanup always runs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b3955823-9df9-4051-9ce6-ee7b96f04b8f

📥 Commits

Reviewing files that changed from the base of the PR and between e95d947 and b804696.

📒 Files selected for processing (1)
  • pkg/telemetry/otel_test.go

@tirthct
Copy link

tirthct commented Mar 23, 2026

/ok-to-test

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@go.mod`:
- Around line 35-41: Update the vulnerable dependency
go.opentelemetry.io/otel/sdk from v1.38.0 to at least v1.40.0 in go.mod to
mitigate CVE-2026-24051; specifically change the module version string for
go.opentelemetry.io/otel/sdk to v1.40.0 (or later) and run `go mod tidy`/`go
get` to refresh the lockfile and ensure the build picks up the patched release
that uses the absolute /usr/sbin/ioreg path.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 105558a0-d208-456d-90df-dd5e89435b26

📥 Commits

Reviewing files that changed from the base of the PR and between 5dd7790 and 9d50749.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (3)
  • charts/templates/configmap.yaml
  • charts/values.yaml
  • go.mod
💤 Files with no reviewable changes (1)
  • charts/templates/configmap.yaml
✅ Files skipped from review due to trivial changes (1)
  • charts/values.yaml

// Determine if tracing is enabled
// Precedence: TRACING_ENABLED (tracing standard) > config (env/flags) > default
var tracingEnabled bool
if tracingEnv := os.Getenv("TRACING_ENABLED"); tracingEnv != "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we align on using a single variable for enabling tracing?

I see the standard says TRACING_ENABLED and this may come from before standarising environment variables with prefix HYPERFLEET_

Then, the docs mention HYPERFLEET_LOGGING_OTEL_ENABLED

I think we should consolidate in a single one and keep it consistent across repositories.

AFAIK OTEL_* variables do not have the HYPERFLEET_ prefix since they are used directly by the OTEL libraries, but any other variable that we introduce should be prefixed by our standard convention.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it! I’ll remove HYPERFLEET_LOGGING_OTEL_ENABLED then. I’ll keep the OTEL_ variables* for consistency with hyperfleet-sentinel. Does that make sense?

Copy link
Contributor

@rh-amarin rh-amarin Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also to clarify, I meant to use HYPERFLEET_TRACING_ENABLED instead of simple TRACING_ENABLED

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HYPERFLEET_TRACING_ENABLED is not compliant with the tracing standards — we should use TRACING_ENABLED instead. See: tracing standards

docs/config.md Outdated
- ✅ Default values
**Core Configuration:**
- ✅ Database credentials set via `HYPERFLEET_DATABASE_*` environment variables
- ✅ Database SSL/TLS configured (`HYPERFLEET_DATABASE_SSL_MODE=require`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is specific per environment, e.g. when using Google Cloud SQL proxy the connection from the application may happen without TLS to the proxy, and then the proxy uses TLS to the Cloud SQL database.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've simplified config.md to be a clean technical reference. Removed the detailed checklists/tips and kept only the OpenTelemetry configuration tables (properties, env vars, defaults).

docs/config.md Outdated
**Core Configuration:**
- ✅ Database credentials set via `HYPERFLEET_DATABASE_*` environment variables
- ✅ Database SSL/TLS configured (`HYPERFLEET_DATABASE_SSL_MODE=require`)
- ✅ Server authentication enabled (`HYPERFLEET_SERVER_JWT_ENABLED=true`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have authentication yet for the API

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it! I've simplified config.md to be a clean technical reference. Removed the detailed checklists/tips and kept only the OpenTelemetry configuration tables (properties, env vars, defaults).

docs/config.md Outdated
- ✅ CLI flags (--kebab-case)
- ✅ Configuration files (YAML snake_case)
- ✅ Default values
**Core Configuration:**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section overlaps with the contents at api-operator-guide.md

IMO we should keep the config.md for describing configs, and tips/checklist should go into the operator-guide

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed! I've simplified config.md to be a clean technical reference. Removed the detailed checklists/tips and kept only the OpenTelemetry configuration tables (properties, env vars, defaults).

Comment on lines +98 to +103
serviceName := "hyperfleet-api"
if svcName := os.Getenv("OTEL_SERVICE_NAME"); svcName != "" {
serviceName = svcName
}

traceProvider, err := telemetry.InitTraceProvider(ctx, serviceName, api.Version)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both sentinel and adapter have a name field
I wonder if we should add it to the API to avoid having this constant "hyperfleet-api" here

Then, here we are reading a variable that is otel specific.... I wonder if that should be done inside the telemetry.InitTraceProvider function, so main can become completely unaware of this setting

docs/logging.md Outdated
if environments.Environment().Config.Logging.OTel.Enabled {
samplingRate := environments.Environment().Config.Logging.OTel.SamplingRate
tp, err := telemetry.InitTraceProvider(ctx, "hyperfleet-api", api.Version, samplingRate)
// Precedence: TRACING_ENABLED (tracing standard) > config (env/flags) > default
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would limit the code that we put in docs to be absolute minimum, and just point to the go file.

If not, we create more work and increase the drift chances

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I've minimized code duplication in logging.md

docs/logging.md Outdated
}
```

**Note:** This is a simplified example. The actual implementation differs in two key ways:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same with this comments... if the code is already commented.... getting the comments to the documentation seems.... too much repetition. wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I've minimized it in logging.md

if otlpEndpoint := os.Getenv(envOtelExporterOtlpEndpoint); otlpEndpoint != "" {
protocol := os.Getenv(envOtelExporterOtlpProtocol)
switch strings.ToLower(protocol) {
case "http", "http/protobuf":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Category: Bug

The switch accepts bare "http" as equivalent to "http/protobuf" (line 52),
but per the OTEL
spec
the valid
values for OTEL_EXPORTER_OTLP_PROTOCOL are grpc, http/protobuf, and
http/json. Bare "http" is ambiguous — someone setting it might expect
http/json but get http/protobuf silently.

Suggestion: either drop "http" from the case (let it fall through to the
default: branch which logs a warning and uses gRPC), or at minimum log a
deprecation warning when "http" is used:

case "http/protobuf":
    exporter, err = otlptracehttp.New(ctx)
case "http":
    logger.With(ctx, logger.FieldProtocol, protocol).Warn("Ambiguous protocol
'http', assuming 'http/protobuf'. Use 'http/protobuf' explicitly.")
    exporter, err = otlptracehttp.New(ctx)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it — I removed "http" from the case, as accepting a bare "http" value is ambiguous. The protocol should be explicitly set to http/protobuf.

// - OTEL_TRACES_SAMPLER: sampler type (default: "parentbased_traceidratio")
// - OTEL_TRACES_SAMPLER_ARG: sampling rate 0.0-1.0 (default: 1.0)
// - OTEL_RESOURCE_ATTRIBUTES: additional resource attributes (k=v,k2=v2)
// - K8S_NAMESPACE: kubernetes namespace (added as k8s.namespace.name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Category: Inconsistency

Both the InitTraceProvider docstring (line 43) and the OTelConfig
docstring in pkg/config/logging.go claim K8S_NAMESPACE is used to set
k8s.namespace.name, but the code never reads this env var.
resource.WithFromEnv() only parses OTEL_RESOURCE_ATTRIBUTES.

It should remove K8S_NAMESPACE from the docstrings (users should set it via
OTEL_RESOURCE_ATTRIBUTES=k8s.namespace.name=my-namespace).

Copy link
Contributor

@rh-amarin rh-amarin Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm... why do we even need to send the namespace?
I mean, the sentinels will be all running in the same namespace, what value does it add?
Do we want to differentiate production namespaces from dev namespaces?
Or... is it to differentiate between dev environments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it! Removed K8S_NAMESPACE from pkg/telemetry/otel.go docstring. Users can set the Kubernetes namespace via OTEL_RESOURCE_ATTRIBUTES:

  • export OTEL_RESOURCE_ATTRIBUTES=k8s.namespace.name=my-namespace

OTel: OTelConfig{
Enabled: false,
SamplingRate: 1.0,
Enabled: false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Category: Standards

The [tracing standard](https://github.com/openshift-hyperfleet/architecture/bl
ob/main/hyperfleet/standards/tracing.md#configuration) defines
TRACING_ENABLED with a default of true, but the PR defaults to false
(via OTelConfig.Enabled: false in NewLoggingConfig()).

If this is intentional (e.g., tracing shouldn't be on unless explicitly
enabled), it might be worth either:

  1. Updating the tracing standard to reflect false as the default, or
  2. Documenting the deviation in the PR description / code comment

Otherwise other components implementing the same standard might default to
true, leading to inconsistent cross-component behavior — which is exactly
what TRACING_ENABLED was designed to unify.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it — setting OTelConfig.Enabled to true to align with the tracing standard default.

if parsedRate, err := strconv.ParseFloat(arg, 64); err == nil && parsedRate >= 0.0 && parsedRate <= 1.0 {
rate = parsedRate
} else {
logger.With(ctx, envOtelTracesSamplerArg, arg, "default", rate).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This produces log keys like {"OTEL_TRACES_SAMPLER_ARG": "...", "default":
1.0}, which is inconsistent with the rest of the PR where structured field
constants are used (logger.FieldProtocol, logger.FieldSampler, etc.). Consider
using the existing logger.FieldSamplingRate for the default value and a
dedicated field for the raw input:

Suggested change
logger.With(ctx, envOtelTracesSamplerArg, arg, "default", rate).
logger.With(ctx, logger.FieldSamplingRate, rate, "raw_value", arg).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Fixing it!

ldornele and others added 2 commits March 23, 2026 18:31
Updates OpenTelemetry SDK and exporters from v1.38.0 to v1.40.0, addressing
potential security vulnerabilities and fixing CI test-integration failures
caused by missing dependency entries in go.sum.

Key updates:
- go.opentelemetry.io/otel: v1.38.0 → v1.40.0
- go.opentelemetry.io/otel/sdk: v1.38.0 → v1.40.0
- go.opentelemetry.io/otel/exporters/otlp/*: v1.38.0 → v1.40.0
- go.opentelemetry.io/otel/exporters/stdout/stdouttrace: v1.38.0 → v1.40.0
- google.golang.org/grpc: v1.75.0 → v1.78.0

Includes transitive dependency updates for protobuf, grpc-gateway, and
golang.org/x/{crypto,net,sys,text}.

All 551 unit tests pass.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Copy link
Collaborator

@xueli181114 xueli181114 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Developer review: inline follow-ups on consistency and operator-facing correctness.

docs/config.md Outdated
| `logging.output` | `HYPERFLEET_LOGGING_OUTPUT` | string | `stdout` |
| `logging.otel.enabled` | `HYPERFLEET_LOGGING_OTEL_ENABLED` | bool | `false` |
| `logging.otel.sampling_rate` | `HYPERFLEET_LOGGING_OTEL_SAMPLING_RATE` | float | `1.0` |
| `logging.otel.enabled` | `TRACING_ENABLED` | bool | `false` |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default shown here (false) now conflicts with runtime default in pkg/config/logging.go (Enabled: true) and with the OpenTelemetry section above that states default true. Recommend aligning this table with actual runtime behavior to avoid deployment surprises.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch! Fixed in latest commit - updated all in config.md to show the correct default true value, now aligned with pkg/config/logging.go runtime behavior.

docs/config.md Outdated
@@ -423,7 +456,6 @@ All CLI flags and their corresponding configuration paths.
| `--log-format` | `logging.format` | string |
| `--log-output` | `logging.output` | string |
| `--log-otel-enabled` | `logging.otel.enabled` | bool |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This CLI flag appears stale. I couldn't find --log-otel-enabled in current flag registration (AddLoggingFlags), so this reference can mislead operators. Suggest removing this row or reintroducing the flag consistently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Removed the stale --log-otel-enabled flag reference from config.md

# Configuration via standard environment variables.
# See: https://github.com/openshift-hyperfleet/architecture/blob/main/hyperfleet/standards/tracing.md#configuration
otel:
enabled: false
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential behavior mismatch: chart default sets config.logging.otel.enabled=false while runtime default now resolves to tracing enabled (true) when no overrides are provided. If intentional, please document this Helm-specific deviation explicitly; otherwise align defaults for predictable behavior across deployment modes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Fixed in latest commit. Aligned Helm chart default to match runtime behavior.

go.mod Outdated
go.opentelemetry.io/otel/trace v1.38.0
go.opentelemetry.io/otel v1.40.0
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.40.0
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp v1.38.0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most OTEL deps were bumped to v1.40.0, but otlptracehttp remains v1.38.0. Mixing versions may still work, but it increases maintenance risk and can hide subtle incompatibilities. Consider aligning otlptracehttp to the same OTEL minor version set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch! Updated otlptracehttp from v1.38.0 to v1.40.0 in latest commit.

Copy link
Collaborator

@xueli181114 xueli181114 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good direction — moving to standard OTEL env vars is the right call. A few things I noticed beyond what rh-amarin and rafabene already flagged.

}

// Update config to ensure middleware registration sees the final value
environments.Environment().Config.Logging.OTel.Enabled = tracingEnabled
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mutating the shared global config object at runtime could be surprising. If anything reads environments.Environment().Config.Logging.OTel.Enabled before this line (e.g. from another goroutine or during config dump at line 70 above), it would see the pre-override value.

In practice this runs sequentially at startup so it's fine today, but it's fragile — consider passing tracingEnabled forward explicitly rather than back-patching the config struct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent point! Fixed in latest commit. Refactored to pass tracingEnabled explicitly instead of
back-patching the config struct.

Changes:

  • NewAPIServer()NewAPIServer(tracingEnabled bool)
  • routes()routes(tracingEnabled bool)
  • Removed config mutation (Config.Logging.OTel.Enabled = tracingEnabled)
  • Updated test helper to pass explicit value

l.bindEnv("logging.output")
l.bindEnv("logging.otel.enabled")
l.bindEnv("logging.otel.sampling_rate")
l.bindEnv("logging.masking.enabled")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing these bindings means HYPERFLEET_LOGGING_OTEL_ENABLED is now silently ignored. Any existing deployment setting this env var (e.g. directly on pod spec, not through Helm) will stop working without warning.

Consider either:

  • Keeping the binding so the config file path still works as the fallback layer, OR
  • Adding a startup warning if HYPERFLEET_LOGGING_OTEL_ENABLED is set, telling users to migrate to TRACING_ENABLED

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! We introduced a deprecation warning in latest commit.

What happens now:

  • If HYPERFLEET_LOGGING_OTEL_ENABLED is set → Warning logged at startup
  • Warning message clearly shows:
    • Deprecated variable name
    • Replacement variable (TRACING_ENABLED)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants