CI observability enhancements. #1193

Open
jglogan wants to merge 14 commits into apple:main from jglogan:ci-observability

Conversation

@jglogan
Contributor

@jglogan jglogan commented Feb 11, 2026

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update

Motivation and Context

Fix CI builds, add observability.

  • Currently we don't collect logs on CI builds, and we don't have permission to run the log command there. This PR adds a `--log-root` parameter for `container system start`, which gets propagated everywhere via the `CONTAINER_LOG_ROOT` variable, similarly to what we do for `--app-root` and `--install-root`.
  • Use FilePath from swift-system for the log root. Foundation URL is a bit of a footgun for filesystem paths, so unless we identify a showstopper, we should incrementally transition to this type everywhere except where we really need network URLs.
  • Set log root for the CI test phase, and archive/upload logs when tests fail.
  • Output the hostname of the CI runner at the start of the test phase so we can identify runner-specific issues where they exist.
  • Breaking change to CLI output - plumb log root into container system status, rework the command for consistency with resource list/inspect commands (i.e. table and JSON output), and add a unit test.
  • Adds command reference documentation for `--log-root`.
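As a rough illustration of the flow described above: only `--log-root` and `CONTAINER_LOG_ROOT` come from this PR; the log file name below is invented, and the `container` invocation is left commented out so the archive step can be tried anywhere.

```shell
#!/bin/sh
# Sketch of the CI log-collection flow (file names hypothetical).
set -eu
LOG_ROOT="$(mktemp -d)"

# On CI this would be:
#   container system start --log-root "$LOG_ROOT"
# and helper services would pick the path up via CONTAINER_LOG_ROOT.
# Stand in a dummy log writer so the archive step is demonstrable anywhere:
echo "sample service log line" > "$LOG_ROOT/container-apiserver.log"

# When tests fail, bundle everything under the log root for upload:
tar -czf logs.tar.gz -C "$LOG_ROOT" .
tar -tzf logs.tar.gz
```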

Testing

  • Tested locally
  • Added/updated tests
  • Added/updated docs

@jglogan jglogan changed the base branch from ci-observability to main on February 11, 2026 04:47
@Ronitsabhaya75
Contributor

@jglogan can you have a look at #1167? I added retry logic for the API server so that we get at least 5 tries during install-kernel.

I saw CI has the same failure, so instead of trying to install the kernel once, I added retry logic to try 5 times.

Let me know if this fix can help.

@jglogan
Contributor Author

jglogan commented Feb 12, 2026

@Ronitsabhaya75 Retrying around undiagnosed problems is not the answer, and in this case you can retry a million times and it still won't work.

@Ronitsabhaya75
Contributor

@Ronitsabhaya75 Retrying around undiagnosed problems is not the answer, and in this case you can retry a million times and it still won't work.

Let me see what the potential issue with install-kernel could be. That's where CI has been failing all the time.

@jglogan
Contributor Author

jglogan commented Feb 12, 2026

I am committing a likely fix. The problem has nothing to do with installing the kernel.

@Ronitsabhaya75
Contributor

Ronitsabhaya75 commented Feb 12, 2026

Verifying apiserver is running...
make: *** [install-kernel] Error 1
tests failed with status: 2
Removing data directory /Users/runner/actions-runner/_work/_temp/tmp.Moka3ycaZe
Error: Process completed with exit code 2.

Yeah, timeout issue (wording mistake).

@jglogan jglogan force-pushed the ci-observability branch 3 times, most recently from 16eb03e to 7534596 on February 12, 2026 04:32
@Ronitsabhaya75
Contributor

@jglogan do you think it would be a good idea to add an endpoint that captures the system state (running containers, network status, service health) at any point,

then call it automatically before operations where CI is failing (like the install-kernel stage)?

This could give us before/after snapshots for comparison.
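A rough sketch of that idea, leaning only on the `container system status --output json` output this PR adds; the `snapshot` helper and the snapshot file names are invented for illustration, and the CLI call is allowed to fail so the sketch degrades gracefully on machines where `container` isn't installed.

```shell
#!/bin/sh
# Sketch of the proposed before/after state snapshot (shape hypothetical).
snapshot() {
  # Capture whatever state we can; tolerate the CLI being absent on dev boxes.
  container system status --output json > "state-$1.json" 2>/dev/null \
    || echo '{}' > "state-$1.json"
}

snapshot before
# ... run the suspect CI step here, e.g. `make install-kernel` ...
snapshot after

# Compare the two snapshots to see what changed across the failing step:
if diff -q state-before.json state-after.json > /dev/null 2>&1; then
  echo "no state change across the step"
else
  diff state-before.json state-after.json || true
fi
```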

@Ronitsabhaya75
Contributor

Hey @jglogan, I was thinking about this last night after our conversation. You're right that retries won't fix this, but I think I might have an idea about what's actually going wrong.

I'm wondering if the install-kernel timeout is actually a race condition. Like, we're trying to install the kernel before the API server is fully ready to accept XPC connections. The service might be "started" but not fully initialized yet.

Instead of retrying or just increasing the timeout, what if we add a proper readiness check? And actually, your new JSON status output makes this really easy to implement:

install-kernel: start-system
	@echo "Waiting for system to be ready..."
	@ready=0; for i in 1 2 3 4 5; do \
		if $(CONTAINER) system status --output json | jq -e '.apiServerRunning == true' > /dev/null 2>&1; then \
			echo "API server ready"; ready=1; break; \
		fi; \
		echo "Waiting for API server (attempt $$i/5)..."; \
		sleep 2; \
	done; \
	if [ "$$ready" -eq 0 ]; then \
		echo "Warning: API server did not become ready, attempting install anyway..."; \
	fi
	$(CONTAINER) kernel install

Sample output:

$ make install-kernel
Waiting for system to be ready...
Waiting for API server (attempt 1/5)...
Waiting for API server (attempt 2/5)...
Waiting for API server (attempt 3/5)...
Waiting for API server (attempt 4/5)...
Waiting for API server (attempt 5/5)...
Warning: API server did not become ready, attempting install anyway...
Installing kernel extension...
Error: Failed to connect to API server (timeout)

@jglogan jglogan force-pushed the ci-observability branch 3 times, most recently from cf477f4 to e661893 on February 12, 2026 22:26
@jglogan
Contributor Author

jglogan commented Feb 12, 2026

I'm wondering if the install-kernel timeout is actually a race condition. Like, we're trying to install the kernel before the API server is fully ready to accept XPC connections. The service might be "started" but not fully initialized yet.

Perhaps. But didn't the API server already accept an XPC connection at ClientHealthCheck.ping in SystemStart, before we try to make the XPC connection through the installDefaultKernel call further down?

@jglogan jglogan force-pushed the ci-observability branch 3 times, most recently from c7d5c6b to c75e290 on February 13, 2026 00:50
let log = RuntimeLinuxHelper.setupLogger(debug: debug, metadata: ["uuid": "\(uuid)"])
let commandName = RuntimeLinuxHelper._commandName
let logPath = logRoot.map { $0.appending("\(commandName)-\(uuid).log") }
let log = ServiceLogger.bootstrap(category: "NetworkVmnetHelper", metadata: ["uuid": "\(uuid)"], debug: debug, logPath: logPath)
Contributor

Should be RuntimeLinuxHelper.

@JaewonHur JaewonHur self-requested a review February 13, 2026 19:38
@jglogan jglogan force-pushed the ci-observability branch 3 times, most recently from f538082 to 4d3481e on February 16, 2026 18:42
) -> Logger {
LoggingSystem.bootstrap { label in
if let logPath {
if let handler = try? FileLogHandler(label: label, path: logPath) {
Contributor

Will the user be aware of a fallback to OS logs here in case of failure?

Contributor Author

Logging isn't set up and this is in a service context, so I don't see a good way to communicate it.

We could try to make note of the condition here and then log a warning above L46, but that would still go into the OS log; it wouldn't be immediately visible to the user.

Contributor

Makes sense. I'm not seeing a good way to go about it either. Not blocking for this PR, but maybe something to think about.

@jglogan jglogan force-pushed the ci-observability branch 2 times, most recently from 29850b4 to a867300 on February 18, 2026 01:12