Conversation
force-pushed 3f7a39b to 37447a2
force-pushed 37447a2 to c98a999
@Ronitsabhaya75 Retrying around undiagnosed problems is not the answer, and in this case you can retry a million times and it still won't work.
Let me see what the potential issue with install-kernel could be. That's where CI has been failing all along.

I am committing a likely fix. The problem has nothing to do with installing the kernel.

Yeah, a timeout issue (wording mistake).
force-pushed 16eb03e to 7534596
@jglogan do you think it's a good idea to add an endpoint that captures the system state, like running containers, and call it automatically before operations where CI is failing (like the install-kernel stage)? This could give us before/after snapshots for comparison.
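A rough sketch of what that could look like, with the real `container system status --output json` call stubbed out so it runs standalone (the `capture` helper and the stub are assumptions for illustration, not anything in this PR):

```shell
#!/bin/sh
# Capture a system-state snapshot before and after a flaky CI step, then diff.
# `status_stub` stands in for `container system status --output json`.
status_stub() { printf '{"apiServerRunning": %s}\n' "$STATE"; }

capture() {
  # Real version would be: container system status --output json > "$1"
  status_stub > "$1"
}

STATE=true
capture /tmp/state-before.json   # snapshot before the operation

STATE=false                      # simulate the API server dying mid-step
capture /tmp/state-after.json    # snapshot after the operation

# A non-empty diff shows exactly what changed while the step ran.
diff /tmp/state-before.json /tmp/state-after.json || echo "state changed during the step"
```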
Hey @jglogan, I was thinking about this last night after our conversation. You're right that retries won't fix this, but I think I might have an idea about what's actually going wrong. I'm wondering if the install-kernel timeout is actually a race condition: we're trying to install the kernel before the API server is fully ready to accept XPC connections. The service might be "started" but not fully initialized yet. Instead of retrying or just increasing the timeout, what if we add a proper readiness check? And actually, your new JSON status output makes this really easy to implement:

```make
install-kernel: start-system
	@echo "Waiting for system to be ready..."
	@for i in 1 2 3 4 5; do \
		if $(CONTAINER) system status --output json | jq -e '.apiServerRunning == true' > /dev/null 2>&1; then \
			echo "API server ready"; \
			break; \
		fi; \
		echo "Waiting for API server (attempt $$i/5)..."; \
		sleep 2; \
	done
	$(CONTAINER) kernel install
```

Sample output:

```
$ make install-kernel
Waiting for system to be ready...
Waiting for API server (attempt 1/5)...
Waiting for API server (attempt 2/5)...
Waiting for API server (attempt 3/5)...
Waiting for API server (attempt 4/5)...
Waiting for API server (attempt 5/5)...
Warning: API server did not become ready, attempting install anyway...
Installing kernel extension...
Error: Failed to connect to API server (timeout)
```
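For what it's worth, the poll above could be factored into a reusable helper instead of being inlined in the recipe. A sketch, with the status check stubbed so it runs anywhere (`wait_ready` and `check` are made-up names, not part of the Makefile):

```shell
#!/bin/sh
# wait_ready CMD TRIES DELAY: run CMD until it exits 0, up to TRIES times,
# sleeping DELAY seconds between attempts. Returns 1 if it never succeeds.
wait_ready() {
  cmd=$1; tries=$2; delay=$3; i=1
  while [ "$i" -le "$tries" ]; do
    if $cmd; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    echo "waiting (attempt $i/$tries)..."
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after $tries attempts" >&2
  return 1
}

# Stub for `container system status --output json | jq -e '.apiServerRunning == true'`:
# fails twice, then succeeds, as if the API server finished initializing.
n=0
check() { n=$((n + 1)); [ "$n" -ge 3 ]; }

wait_ready check 5 0
```

The Makefile could then call `wait_ready` with the real `jq` check, keeping the retry policy in one place.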
force-pushed cf477f4 to e661893
Perhaps. But the API server accepted an XPC connection at
force-pushed c7d5c6b to c75e290
```diff
-let log = RuntimeLinuxHelper.setupLogger(debug: debug, metadata: ["uuid": "\(uuid)"])
+let commandName = RuntimeLinuxHelper._commandName
+let logPath = logRoot.map { $0.appending("\(commandName)-\(uuid).log") }
+let log = ServiceLogger.bootstrap(category: "NetworkVmnetHelper", metadata: ["uuid": "\(uuid)"], debug: debug, logPath: logPath)
```
Should be RuntimeLinuxHelper.
force-pushed f538082 to 4d3481e
```swift
) -> Logger {
    LoggingSystem.bootstrap { label in
        if let logPath {
            if let handler = try? FileLogHandler(label: label, path: logPath) {
```
Will the user be aware of a fallback to OS logs here in case of failure?
Logging isn't set up and this is in a service context, so I don't see a good way to communicate it.
We could try to make note of the condition here and then log a warning above L46, but that would still go into the OS log; it wouldn't be immediately visible to the user.
Makes sense. I'm not seeing a good way to go about it either. Not blocking for this PR, but maybe something to think about.
- Currently we don't collect logs on CI builds, and we don't have permission to run the `log` command there. This PR adds a `--log-root` parameter for `container system start`, which gets propagated everywhere via the `CONTAINER_LOG_ROOT` variable, similarly to what we do for `--app-root` and `--install-root`.
- Use FilePath from swift-system for the log root. Foundation URL is a bit of a footgun for filesystem paths, so unless we identify a showstopper, we should incrementally transition to this type everywhere except where we really need network URLs.
- Set log root for the CI test phase, and archive/upload logs when tests fail.
- Output the hostname of the CI runner at the start of the test phase so we can identify runner-specific issues where they exist.
- Breaking change to CLI output - plumb log root into `container system status`, rework the command for consistency with resource list/inspect commands (i.e. table and JSON output), and add a unit test.
- Adds command reference documentation for `--log-root`.
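The "archive/upload logs when tests fail" step could look roughly like this in a CI script. This is a sketch only: the paths, the simulated service log, and the failing `run_tests` stand-in are all assumptions, and the real pipeline would run `container system start --log-root` instead of writing the file directly.

```shell
#!/bin/sh
# Simulate the CI flow: point services at a log root, run tests,
# and tar up the log root if the test phase fails.
LOG_ROOT=$(mktemp -d)

# Real pipeline: container system start --log-root "$LOG_ROOT"
# Here we just simulate a service writing a log file under the root.
echo "apiserver: listening" > "$LOG_ROOT/apiserver.log"

hostname   # emit the runner hostname, as the PR does at the start of the test phase

run_tests() { return 1; }   # stand-in for the real test phase, failing here

if ! run_tests; then
  # Archive everything under the log root for upload as a CI artifact.
  tar -czf /tmp/ci-logs.tgz -C "$LOG_ROOT" .
  echo "tests failed; archived logs from $LOG_ROOT"
fi
```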
force-pushed 29850b4 to a867300
force-pushed a867300 to bd5709e
force-pushed d7e3a2b to b86cda2
Type of Change
Motivation and Context
Fix CI builds, add observability.
Testing