
Add overcommit spec and status#257

Draft
PhilippMatthes wants to merge 3 commits into main from add-overcommit-spec-and-status

Conversation

@PhilippMatthes
Member

We want to reflect the overcommit ratio in the Hypervisor CRD. This is useful for Cortex, so it can calculate how much space is left on a hypervisor. By default, the overcommit for each resource is assumed to be 1. If anything else is specified, the kvm node agent will reflect it in the new effectiveCapacity status field. This allows Cortex to take control of the hypervisor overcommit ratio, statically for now and dynamically in the future.

@fwiesel
Contributor

fwiesel commented Mar 10, 2026

I have two issues with that change:

  • It doesn't make sense to me to set it in the spec and default it to 1. Who will set that value at the hypervisor level? My understanding is that Cortex is supposed to decide the overcommit ratio. An override over Cortex's logic I can understand, but that contradicts "default is 1". If you need a configuration value for Cortex, I'd put it somewhere else.

  • Use of corev1.ResourceName. That is an enum for Kubernetes-internal resources, which has a big overlap with what we have in OpenStack (three out of four values), but won't be the same. Kubernetes has its own extension logic for resources beyond these, so the enum will always be limited to them, but in OpenStack there is an extension mechanism for these resources as well, and I suspect we will have to make use of it eventually.

@PhilippMatthes
Member Author

Let me give you some more context on the rationale behind the implementation. I understand the Hypervisor CRD as an API that controls and surfaces a hypervisor's desired and actual state. This addition would make the desired overcommit and the actual state (effective capacity) transparent to consumers of the API.

By default, the overcommit will look as follows:

spec:
  overcommit: {}

which means no overcommit is specified, i.e. we have a ratio of 1.

If any consumer of the api, such as cortex, wants to set this overcommit ratio to something else, it would be defined like this:

spec:
  overcommit:
    cpu: 3

highlighting that the desired state is for the hypervisor to expose 3 times as many CPUs as it actually has. The actual state is then reflected in the status.effectiveCapacity field. Doesn't it make sense to expose this API at this level?
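For illustration, a hypothetical combined view of spec and status under this proposal. The overcommit and effectiveCapacity field names follow the proposal; the status.capacity field name and all values are made up:

```yaml
# Hypothetical example; values and the status.capacity field name are illustrative.
spec:
  overcommit:
    cpu: 3
status:
  capacity:          # autodiscovered by the kvm node agent
    cpu: "32"
  effectiveCapacity: # capacity scaled by the overcommit ratio
    cpu: "96"        # 32 * 3
```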

I'm not sure about the float vs. uint aspect; we can reconsider. We can also use a less limited map[string]any to depict resource types other than the ones standardized by corev1, if we need to. Currently we wouldn't need it, though.

@fwiesel
Contributor

fwiesel commented Mar 10, 2026

I think we should be clear about who sets a value and who reads a value.
Is `spec.overcommit` supposed to be an input for Cortex, or an output?

@PhilippMatthes
Member Author

The spec is the desired state. In those terms, Cortex desires the hypervisor to have overcommit y for resource x. Then the controller reconciling the relevant parts of the resource, in this case the kvm node agent, would accept this desired state and reflect it in the status by calculating status.effectiveCapacity. This status can be used by Cortex to do capacity calculations.
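The calculation described here is simple enough to sketch. A minimal, self-contained version, where the function name, plain string keys, and float64 values are assumptions for illustration (the real CRD would use typed resource quantities):

```go
package main

import "fmt"

// effectiveCapacity applies per-resource overcommit ratios to the
// autodiscovered capacity. Resources without an explicit ratio default
// to 1 (no overcommit), matching the proposed spec semantics.
func effectiveCapacity(capacity, overcommit map[string]float64) map[string]float64 {
	out := make(map[string]float64, len(capacity))
	for name, value := range capacity {
		ratio, ok := overcommit[name]
		if !ok {
			ratio = 1 // default: no overcommit
		}
		out[name] = value * ratio
	}
	return out
}

func main() {
	capacity := map[string]float64{"cpu": 32, "memory": 256}
	overcommit := map[string]float64{"cpu": 3} // spec.overcommit: {cpu: 3}
	fmt.Println(effectiveCapacity(capacity, overcommit)) // map[cpu:96 memory:256]
}
```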

@auhlig
Member

auhlig commented Mar 11, 2026

  • Use of corev1.ResourceName. That is an enum for Kubernetes-internal resources, which has a big overlap with what we have in OpenStack (three out of four values), but won't be the same. Kubernetes has its own extension logic for resources beyond these, so the enum will always be limited to them, but in OpenStack there is an extension mechanism for these resources as well, and I suspect we will have to make use of it eventually.

I agree with @fwiesel that this should be strongly typed, but we need to control it.
How about we introduce a separate type:

type ResourceName string

const (
  ResourceCPU ResourceName = "cpu"
  ...
)

...

// Use like:
// map[ResourceName]resource.Quantity

EDIT: Basically copy&paste this definition
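A minimal sketch of how such an operator-owned type could be used. The ResourceMemory constant and the float64 value type are illustrative assumptions; the CRD itself would map to resource.Quantity as suggested above:

```go
package main

import "fmt"

// ResourceName is an operator-owned resource key, decoupled from
// corev1.ResourceName so OpenStack-specific resources can be added later.
type ResourceName string

const (
	ResourceCPU    ResourceName = "cpu"
	ResourceMemory ResourceName = "memory"
)

func main() {
	// Typed keys prevent arbitrary strings from sneaking into the map.
	overcommit := map[ResourceName]float64{ResourceCPU: 3}
	fmt.Println(overcommit[ResourceCPU]) // 3
}
```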

@github-actions

Merging this branch will not change overall coverage

| Impacted Packages | Coverage Δ | 🤖 |
|---|---|---|
| github.com/cobaltcore-dev/openstack-hypervisor-operator/api/v1 | 0.00% (ø) | |
| github.com/cobaltcore-dev/openstack-hypervisor-operator/applyconfigurations/api/v1 | 0.00% (ø) | |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
|---|---|---|---|---|---|
| github.com/cobaltcore-dev/openstack-hypervisor-operator/api/v1/hypervisor_types.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/cobaltcore-dev/openstack-hypervisor-operator/api/v1/zz_generated.deepcopy.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/cobaltcore-dev/openstack-hypervisor-operator/applyconfigurations/api/v1/hypervisorspec.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/cobaltcore-dev/openstack-hypervisor-operator/applyconfigurations/api/v1/hypervisorstatus.go | 0.00% (ø) | 0 | 0 | 0 | |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

@fwiesel
Contributor

fwiesel commented Mar 11, 2026

Then, the controller reconciling the relevant parts of the resource, in this case the kvm node agent, ...

I really do not think we want to build up duplicate infrastructure and secondary communication channels here.
It looks like we are working towards a God-object.

From the compute perspective, the nova agent reports the resources to Placement. You own Placement and can do with that data what you want. Also store it in this CRD, if you insist. But at least I do not see a point in involving the kvm-node-agent here.

@PhilippMatthes
Member Author

The kvm node agent is our source for hypervisor allocation and capacity. It extracts this information directly from libvirt. So to me it makes sense that the node agent confirms the desired capacity has changed after the spec was adjusted. In this way, only one controller is tasked with owning this part of the status, and cortex is not involved apart from watching the state.

@fwiesel
Contributor

fwiesel commented Mar 11, 2026

The initial purpose and intent of the kvm-node-agent, according to the discussion I remember having with @auhlig, @dhoeller, @bashar-alkhateeb and @notandy, was that "duplication" is okay if it presents us with another view, from the libvirt perspective. The equivalent of the vCenter. I do not object to exposing more information from libvirt, but here we are talking about a different concern.

The kvm node agent is our source for hypervisor allocation and capacity. [...]
It extracts this information directly from libvirt.

Yes, and that starts to worry me.
It reports capacity and allocations as seen from libvirt, which are not the capacity and allocations stored in Placement.
The allocations in Placement happen before anything happens on the host. They need to happen before, as otherwise things will break with running VMs.

libvirt (and therefore the kvm-node-agent) is only guaranteed to know about running VMs. We need that view to be able to see, like in the vCenter, what is actually running on the host. If you want to also use it for scheduling purposes, feel free to do so. You are still responsible for the correctness of the scheduling decisions. I push back here on the implicit request for higher guarantees from the kvm-node-agent than a purely informational function.

Another problem is that only the resources you have right now are modeled in libvirt. It doesn't cover any custom resources; network interfaces, (GPUs?), IOPS, etc. won't be represented in libvirt, and therefore not synced by the kvm-node-agent, and in my opinion never should be.
The "controller" for these is the nova-compute agent, which will send that information to Placement.

I see any kind of capacity management or workload management functionality, aside from the basic reporting of readily available information in libvirt (or nova), as outside the scope of both the kvm-node-agent and the openstack-hypervisor-operator.

So to me it makes sense that the node agent confirms the desired capacity has changed after the spec was adjusted.

How can the node-agent make any informed decision there besides the most basic maths?
Where is the gain in doing that decentrally, distributed over multiple agents which get replaced via a rolling update?

The question here is who maintains and is responsible for what. We need clear responsibilities and separation of concerns. This feels to me like you are moving responsibilities out of Cortex that clearly belong there.
Hence a public API change for something that should, in my mind, be handled internally.

@notandy Do you see it differently?

@PhilippMatthes
Member Author

Thanks for the elaborate feedback. Regarding the point that allocation will be reflected only after the VM spins up: this will be handled by Cortex's Reservation CRD, which reserves space before the VM is ordered to be scheduled. Apart from the delayed nature, the autodiscovered capacity and allocation plus overcommit are assumed to be accurate. We autodiscover CPU and memory for now. Additions may come later, if needed, but the CRD proposal is already open enough to support them.

To your second point, I see a clear separation of concerns. Cortex, or any consumer, is able to express its need for more space, or can give some back, via the desired state (the spec). Yes, the logic backing status.effectiveCapacity will be quite shallow, but it leaves the status ownership of autodiscovered capacity with its current owner, the kvm-node-agent.

@notandy
Contributor

notandy commented Mar 11, 2026

How can the node-agent make any informed decision there besides the most basic maths? Where is the gain in doing that decentrally, distributed over multiple agents which get replaced via a rolling update?

The question here is who maintains and is responsible for what. We need clear responsibilities and separation of concerns. This feels to me like you are moving responsibilities out of Cortex that clearly belong there. Hence a public API change for something that should, in my mind, be handled internally.

@notandy Do you see it differently?

I must say I don't see this as a blocking argument. I am fine with the node-agent updating the status based on the overcommit value and the existing VMs reported by libvirt. If it is, and stays, "basic math" (something we would have done in modern SQL if etcd were SQL), I am fine with letting the kvm-node-agent do it, since it's the easiest place to do it, instead of introducing another reconciler that goes over all CROs just to calculate 1+1.

I respect that you would like to see responsibilities separated as cleanly as we can, but the architecture (as well as how OpenStack does it with Placement, Nova, etc.) shows it's not always easy or possible.

Also, in the absence of alternative approaches, I am inclined to approve this PR.
