CI observability enhancements. #1193

Open
jglogan wants to merge 14 commits into apple:main from jglogan:ci-observability

Conversation

@jglogan
Contributor

@jglogan jglogan commented Feb 11, 2026

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update

Motivation and Context

Fix CI builds, add observability.

  • Currently we don't collect logs on CI builds, and we don't have permission to run the log command there. This PR adds a `--log-root` parameter for `container system start`, which gets propagated everywhere via the `CONTAINER_LOG_ROOT` variable, similarly to what we do for `--app-root` and `--install-root`.
  • Use FilePath from swift-system for the log root. Foundation URL is a bit of a footgun for filesystem paths, so unless we identify a showstopper, we should incrementally transition to this type everywhere except where we really need network URLs.
  • Set log root for the CI test phase, and archive/upload logs when tests fail.
  • Output the hostname of the CI runner at the start of the test phase so we can identify runner-specific issues where they exist.
  • Breaking change to CLI output - plumb log root into container system status, rework the command for consistency with resource list/inspect commands (i.e. table and JSON output), and add a unit test.
  • Adds command reference documentation for `--log-root`.
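As a rough illustration of the flow described above: only `--log-root` and `CONTAINER_LOG_ROOT` come from this PR; the log file name below is invented, and the `container` invocation is left commented out so the archive step can be tried anywhere.

```shell
#!/bin/sh
# Sketch of the CI log-collection flow (file names hypothetical).
set -eu
LOG_ROOT="$(mktemp -d)"

# On CI this would be:
#   container system start --log-root "$LOG_ROOT"
# and helper services would pick the path up via CONTAINER_LOG_ROOT.
# Stand in a dummy log writer so the archive step is demonstrable anywhere:
echo "sample service log line" > "$LOG_ROOT/container-apiserver.log"

# When tests fail, bundle everything under the log root for upload:
tar -czf logs.tar.gz -C "$LOG_ROOT" .
tar -tzf logs.tar.gz
```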

Testing

  • Tested locally
  • Added/updated tests
  • Added/updated docs

@jglogan jglogan changed the base branch from ci-observability to main on February 11, 2026 04:47
@Ronitsabhaya75
Contributor

@jglogan can you have a look at #1167? I added retry logic for the API server so that we get at least 5 tries during install-kernel.

I saw CI has the same failure, so instead of trying to install the kernel once, I added retry logic to try 5 times.

Let me know if this fix can help.

@jglogan
Contributor Author

jglogan commented Feb 12, 2026

@Ronitsabhaya75 Retrying around undiagnosed problems is not the answer, and in this case you can retry a million times and it still won't work.

@Ronitsabhaya75
Contributor

@Ronitsabhaya75 Retrying around undiagnosed problems is not the answer, and in this case you can retry a million times and it still won't work.

Let me see what the potential issue with install-kernel could be. That's where CI has been failing all the time.

@jglogan
Contributor Author

jglogan commented Feb 12, 2026

I am committing a likely fix. The problem has nothing to do with installing the kernel.

@Ronitsabhaya75
Contributor

Ronitsabhaya75 commented Feb 12, 2026

Verifying apiserver is running...
make: *** [install-kernel] Error 1
tests failed with status: 2
Removing data directory /Users/runner/actions-runner/_work/_temp/tmp.Moka3ycaZe
Error: Process completed with exit code 2.

Yeah, timeout issue (wording mistake).

@jglogan jglogan force-pushed the ci-observability branch 3 times, most recently from 16eb03e to 7534596 on February 12, 2026 04:32
@Ronitsabhaya75
Contributor

@jglogan do you think it would be a good idea to add an endpoint that captures the system state (running containers, network status, service health) at any point,

then call it automatically before operations where CI is failing (like the install-kernel stage)?

This could give us before/after snapshots for comparison.
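A rough sketch of that idea, leaning only on the `container system status --output json` output this PR adds; the `snapshot` helper and the snapshot file names are invented for illustration, and the CLI call is allowed to fail so the sketch degrades gracefully on machines where `container` isn't installed.

```shell
#!/bin/sh
# Sketch of the proposed before/after state snapshot (shape hypothetical).
snapshot() {
  # Capture whatever state we can; tolerate the CLI being absent on dev boxes.
  container system status --output json > "state-$1.json" 2>/dev/null \
    || echo '{}' > "state-$1.json"
}

snapshot before
# ... run the suspect CI step here, e.g. `make install-kernel` ...
snapshot after

# Compare the two snapshots to see what changed across the failing step:
if diff -q state-before.json state-after.json > /dev/null 2>&1; then
  echo "no state change across the step"
else
  diff state-before.json state-after.json || true
fi
```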

@Ronitsabhaya75
Contributor

Hey @jglogan, I was thinking about this last night after our conversation. You're right that retries won't fix this, but I think I might have an idea about what's actually going wrong.

I'm wondering if the install-kernel timeout is actually a race condition. Like, we're trying to install the kernel before the API server is fully ready to accept XPC connections. The service might be "started" but not fully initialized yet.

Instead of retrying or just increasing the timeout, what if we add a proper readiness check? And actually, your new JSON status output makes this really easy to implement:

install-kernel: start-system
	@echo "Waiting for system to be ready..."
	@ready=0; for i in 1 2 3 4 5; do \
		if $(CONTAINER) system status --output json | jq -e '.apiServerRunning == true' > /dev/null 2>&1; then \
			echo "API server ready"; ready=1; break; \
		fi; \
		echo "Waiting for API server (attempt $$i/5)..."; \
		sleep 2; \
	done; \
	if [ "$$ready" -eq 0 ]; then \
		echo "Warning: API server did not become ready, attempting install anyway..."; \
	fi
	$(CONTAINER) kernel install

Sample output:

$ make install-kernel
Waiting for system to be ready...
Waiting for API server (attempt 1/5)...
Waiting for API server (attempt 2/5)...
Waiting for API server (attempt 3/5)...
Waiting for API server (attempt 4/5)...
Waiting for API server (attempt 5/5)...
Warning: API server did not become ready, attempting install anyway...
Installing kernel extension...
Error: Failed to connect to API server (timeout)

@jglogan jglogan force-pushed the ci-observability branch 3 times, most recently from cf477f4 to e661893 on February 12, 2026 22:26
@jglogan
Contributor Author

jglogan commented Feb 12, 2026

I'm wondering if the install-kernel timeout is actually a race condition. Like, we're trying to install the kernel before the API server is fully ready to accept XPC connections. The service might be "started" but not fully initialized yet.

Perhaps. But didn't the API server already accept an XPC connection at ClientHealthCheck.ping in SystemStart, before we try to make the XPC connection through the installDefaultKernel call further down?

@jglogan jglogan force-pushed the ci-observability branch 3 times, most recently from c7d5c6b to c75e290 on February 13, 2026 00:50
let log = RuntimeLinuxHelper.setupLogger(debug: debug, metadata: ["uuid": "\(uuid)"])
let commandName = RuntimeLinuxHelper._commandName
let logPath = logRoot.map { $0.appending("\(commandName)-\(uuid).log") }
let log = ServiceLogger.bootstrap(category: "NetworkVmnetHelper", metadata: ["uuid": "\(uuid)"], debug: debug, logPath: logPath)
Contributor

Should be RuntimeLinuxHelper.

@JaewonHur JaewonHur self-requested a review February 13, 2026 19:38
@jglogan jglogan force-pushed the ci-observability branch 3 times, most recently from f538082 to 4d3481e on February 16, 2026 18:42
) -> Logger {
LoggingSystem.bootstrap { label in
if let logPath {
if let handler = try? FileLogHandler(label: label, path: logPath) {
Contributor

Will the user be aware of a fallback to OS logs here in case of failure?

Contributor Author

Logging isn't set up and this is in a service context, so I don't see a good way to communicate it.

We could try to make note of the condition here and then log a warning above L46, but that would still go into the OS log; it wouldn't be immediately visible to the user.

Contributor

Makes sense. I'm not seeing a good way to go about it either. Not blocking for this PR, but maybe something to think about.

@jglogan jglogan force-pushed the ci-observability branch 2 times, most recently from 29850b4 to a867300 on February 18, 2026 01:12