[linux-nvidia-6.17] Backport MPAM fixes and support for CPU-less NUMA nodes #348

Open
fyu1 wants to merge 563 commits into NVIDIA:24.04_linux-nvidia-6.17-next from
fyu1:24.04_linux-nvidia-6.17-next.mpam.extras.fixes2

fyu1 commented Mar 20, 2026

This PR replaces #328

This branch fixes several MPAM issues, including:

  1. Performance issue due to small MBW_MIN on Grace: https://nvbugspro.nvidia.com/bug/5928376
  2. Performance issue due to 0 CMAX on Vera: https://nvbugspro.nvidia.com/bug/5717435
  3. Online/offline stress test issue on Vera: https://nvbugspro.nvidia.com/bug/5919525
  4. NUMA node MBA/MBM code cleanup to avoid future issues.

There are 49 patches in total:

  1. Patches 1-10 revert ARM's extra patches (NUMA node, event filter, and memory hotplug). Those patches are buggy and cause most of the issues above.
  2. Patches 11 and 12 revert the old, buggy T241-MPAM-4 Grace erratum workaround and apply an updated one.
  3. Patches 13-42 come from upstream resctrl, mainly to align the monitoring types for the later NUMA patches.
  4. Patches 43-49 mainly add support for CPU-less NUMA nodes and fix IOMMU, MSC teardown, and MBWU type issues.

Patch list:
0001-Revert-NVIDIA-SAUCE-untested-arm_mpam-resctrl-Allow-.patch
0002-Revert-NVIDIA-SAUCE-arm_mpam-resctrl-Add-NUMA-node-n.patch
0003-Revert-NVIDIA-SAUCE-untested-arm_mpam-resctrl-Split-.patch
0004-Revert-NVIDIA-SAUCE-arm_mpam-resctrl-Change-domain_h.patch
0005-Revert-NVIDIA-SAUCE-arm_mpam-resctrl-Pick-whether-MB.patch
0006-Revert-NVIDIA-SAUCE-Fix-unused-variable-warning.patch
0007-Revert-NVIDIA-SAUCE-fs-resctrl-Add-mount-option-for-.patch
0008-Revert-NVIDIA-SAUCE-fs-resctrl-Take-memory-hotplug-l.patch
0009-Revert-NVIDIA-SAUCE-mm-memory_hotplug-Add-lockdep-as.patch
0010-Revert-NVIDIA-SAUCE-untested-arm_mpam-resctrl-Allow-.patch
0011-Revert-NVIDIA-SAUCE-arm_mpam-Add-workaround-for-T241.patch
0012-NVIDIA-SAUCE-arm_mpam-Add-workaround-for-T241-MPAM-4.patch
0013-x86-fs-resctrl-Improve-domain-type-checking.patch
0014-x86-resctrl-Move-L3-initialization-into-new-helper-f.patch
0015-x86-resctrl-Refactor-domain_remove_cpu_mon-ready-for.patch
0016-x86-resctrl-Clean-up-domain_remove_cpu_ctrl.patch
0017-x86-fs-resctrl-Refactor-domain-create-remove-using-s.patch
0018-fs-resctrl-Split-L3-dependent-parts-out-of-mon_eve.patch
0019-x86-fs-resctrl-Use-struct-rdt_domain_hdr-when-readin.patch
0020-x86-fs-resctrl-Rename-struct-rdt_mon_domain-and-rdt.patch
0021-x86-fs-resctrl-Rename-some-L3-specific-functions.patch
0022-fs-resctrl-Make-event-details-accessible-to-function.patch
0023-x86-fs-resctrl-Handle-events-that-can-be-read-from-a.patch
0024-x86-fs-resctrl-Support-binary-fixed-point-event-coun.patch
0025-x86-fs-resctrl-Add-an-architectural-hook-called-for-.patch
0026-x86-fs-resctrl-Add-and-initialize-a-resource-for-pac.patch
0027-fs-resctrl-Emphasize-that-L3-monitoring-resource-is-.patch
0028-x86-resctrl-Discover-hardware-telemetry-events.patch
0029-x86-fs-resctrl-Fill-in-details-of-events-for-perform.patch
0030-x86-fs-resctrl-Add-architectural-event-pointer.patch
0031-x86-resctrl-Find-and-enable-usable-telemetry-events.patch
0032-x86-resctrl-Read-telemetry-events.patch
0033-fs-resctrl-Refactor-mkdir_mondata_subdir.patch
0034-fs-resctrl-Refactor-rmdir_mondata_subdir_allrdtgrp.patch
0035-x86-fs-resctrl-Handle-domain-creation-deletion-for-R.patch
0036-x86-resctrl-Add-energy-perf-choices-to-rdt-boot-opti.patch
0037-x86-resctrl-Handle-number-of-RMIDs-supported-by-RDT.patch
0038-fs-resctrl-Move-allocation-free-of-closid_num_dirty_.patch
0039-x86-fs-resctrl-Compute-number-of-RMIDs-as-minimum-ac.patch
0040-fs-resctrl-Move-RMID-initialization-to-first-mount.patch
0041-x86-resctrl-Enable-RDT_RESOURCE_PERF_PKG.patch
0042-x86-fs-resctrl-Update-documentation-for-telemetry-ev.patch
0043-NVIDIA-VR-SAUCE-arm_mpam-Fix-compilation-errors.patch
0044-NVIDIA-SAUCE-arm_mpam-Avoid-MSC-teardown-for-the-SW-.patch
0045-NVIDIA-VR-SAUCE-arm_mpam-Handle-CPU-less-numa-nodes.patch
0046-NVIDIA-VR-SAUCE-arm_mpam-Include-all-associated-MSC-.patch
0047-NVIDIA-SAUCE-resctrl-mpam-reset-RIS-by-applying-expl.patch
0048-NVIDIA-SAUCE-iommu-arm-smmu-v3-Fix-MPAM-for-indentit.patch
0049-NVIDIA-VR-SAUCE-arm_mpam-Resolve-MBWU-type-before-fe.patch

Test results are in http://10.112.214.86/vera/tests/, including:

  1. init registers test
  2. iommu assignment test
  3. online/offline test
  4. Spec2017 performance test
  5. CXL test

The GPU MPAM test is not covered because there is no SBIOS support for the feature yet.


LP: https://bugs.launchpad.net/ubuntu/+source/linux-nvidia-6.17/+bug/2146389

ianm-nv and others added 30 commits February 13, 2026 16:49
Ignore: yes
Signed-off-by: Ian May <ianm@nvidia.com>
Signed-off-by: Jacob Martin <jacob.martin@canonical.com>
… DGX Spark

BugLink: https://bugs.launchpad.net/bugs/2138269

This driver manages PCIe link for NVIDIA ConnectX-7 (CX7) hot-plug/unplug
on DGX Spark systems with GB10 SoC. It disables the PCIe link
on cable removal and enables it on cable insertion.

Upstream-friendly improvements over 6.14 driver:
- Separated from MTK pinctrl driver into NVIDIA platform driver
- Configuration via ACPI (_CRS and _DSD), no hardcoded values
- Device-managed resources (devm_*) for automatic cleanup
- Thread-safe state management with locking
- Enhanced error handling and logging
- Uses standard Linux kernel APIs

The driver exposes a sysfs interface to emulate cable plug in/out:
  echo 1 > /sys/devices/platform/MTKP0001:00/pcie_hotplug/debug_state  # plug in
  echo 0 > /sys/devices/platform/MTKP0001:00/pcie_hotplug/debug_state  # plug out

It also provides a runtime enable/disable switch via sysfs:
  echo 1 > /sys/devices/platform/MTKP0001:00/pcie_hotplug/hotplug_enabled  # Enable
  echo 0 > /sys/devices/platform/MTKP0001:00/pcie_hotplug/hotplug_enabled  # Disable

This allows enabling/disabling hotplug functionality. Hotplug is disabled by default
and must be explicitly enabled via userspace.

It also implements uevent notifications for coordination with userspace:

* cable plug-in:
    Report plug-in uevent (driver)
    Enable PCIe link (driver)
    Rescan CX7 devices (application)

* cable removal:
    Report removal uevent (driver)
    Remove CX7 devices (application)
    Disable PCIe link (driver)

Signed-off-by: Vaibhav Vyas <vavyas@nvidia.com>
Signed-off-by: Scott Fudally <sfudally@nvidia.com>
Signed-off-by: Surabhi Chythanya Kumar <schythanyaku@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2059814

Signed-off-by: Brad Figg <bfigg@nvidia.com>
Acked-by: Brad Figg <bfigg@nvidia.com>
Acked-by: Ian May <ian.may@canonical.com>
Signed-off-by: Ian May <ian.may@canonical.com>
Signed-off-by: Jacob Martin <jacob.martin@canonical.com>
(cherry picked from commit a64b597 linux-nvidia-6.14)
Signed-off-by: Abdur Rahman <abdur.rahman@canonical.com>
Ignore: yes
Signed-off-by: Abdur Rahman <abdur.rahman@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/2137561
Properties: no-test-build
Signed-off-by: Abdur Rahman <abdur.rahman@canonical.com>
…ernel-versions (main/d2025.12.18)

BugLink: https://bugs.launchpad.net/bugs/1786013
Signed-off-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Abdur Rahman <abdur.rahman@canonical.com>
… control

BugLink: https://bugs.launchpad.net/bugs/2138755

The selection of MLO mode should depend on the capabilities of the STA
rather than those of the peer AP to avoid compatibility issues with
certain APs, such as Xiaomi BE5000 WiFi7 router.

Fixes: 69acd6d ("wifi: mt76: mt7925: add mt7925_change_vif_links")
Signed-off-by: Leon Yen <leon.yen@mediatek.com>
(backported from https://lore.kernel.org/all/20251211123836.4169436-1-leon.yen@mediatek.com/)
Signed-off-by: Muteeb Akram <mdoctor@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
Ignore: yes
Signed-off-by: Abdur Rahman <abdur.rahman@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/2138765
Properties: no-test-build
Signed-off-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Abdur Rahman <abdur.rahman@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/2138329

Tegra410 and Tegra241 have deprecated HIDREV register. It is
recommended to use ARM SMCCC calls to get chip_id, major and minor
revisions.

Use ARM SMCCC to get chip_id, major and minor revision.

Signed-off-by: Kartik Rajput <kkartik@nvidia.com>
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
…ERIC_ARCH_TOPOLOGY

BugLink: https://bugs.launchpad.net/bugs/2138375

The arm_pmu driver is using topology_core_has_smt() for retrieving
the SMT implementation which depends on CONFIG_GENERIC_ARCH_TOPOLOGY.
The config is optional on arm platforms so provide a
!CONFIG_GENERIC_ARCH_TOPOLOGY stub for topology_core_has_smt().
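
The stub pattern described above can be sketched in user space as follows.
This is an illustration only: HAVE_ARCH_TOPOLOGY is a hypothetical stand-in
for CONFIG_GENERIC_ARCH_TOPOLOGY, and returning false from the stub is an
assumption about what "no topology information" should report.

```c
#include <stdbool.h>

/*
 * HAVE_ARCH_TOPOLOGY is a hypothetical stand-in for
 * CONFIG_GENERIC_ARCH_TOPOLOGY so that this sketch builds outside the kernel.
 */
#ifdef HAVE_ARCH_TOPOLOGY
/* Real implementation is provided elsewhere when topology support is built. */
bool topology_core_has_smt(int cpu);
#else
/* Stub (assumed behavior): without topology data, report no SMT. */
static inline bool topology_core_has_smt(int cpu)
{
	(void)cpu;
	return false;
}
#endif
```

With a stub like this, callers such as the PMU driver can compile unchanged
whether or not the option is enabled.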

Fixes: c3d78c3 ("perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT cores")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202511041757.vuCGOmFc-lkp@intel.com/
Suggested-by: Will Deacon <will@kernel.org>
Signed-off-by: Yicong Yang <yangyccccc@gmail.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
(cherry picked from commit 7ab06ea)
Signed-off-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2138266

Type 2 devices are being introduced and will require finer-grained
reset mechanisms beyond bus-wide reset methods.

Add support for CXL reset per CXL v3.2 Section 9.6/9.7

Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
(backported from https://lore.kernel.org/all/20250221043906.1593189-3-smadhavan@nvidia.com/)
[Nirmoy: Add #include "../cxl/cxlpci.h" and fix a compile error with if (reg & CXL_DVSEC_CXL_RST_CAPABLE == 0)]
Signed-off-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2138266

The CXL core in Linux was updated to support committed
decoders of zero size, because this is allowed by
the CXL spec.

This patch updates cxl_test to enable decoders 1 and 2
in the host-bridge 0 port, in a switch uport under hb0,
and in the endpoint ports with size zero, simulating
committed zero-sized decoders.

Signed-off-by: Vishal Aslot <vaslot@nvidia.com>
(backported from https://lore.kernel.org/all/20251015024019.1189713-1-vaslot@nvidia.com/)
Signed-off-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2138266

The CXL spec permits committing zero-sized decoders.
Linux currently treats them as an error.

Zero-sized decoders are helpful when the BIOS
is committing them. Often the BIOS will also lock
them to prevent them from being changed due to the
TSP requirement, for example when the type 3
device is part of a TCB.

The host bridge, switch, and end-point decoders
can all be committed with zero-size. If they are
locked along the VH, it is often to prevent
hotplugging of a new device that could not be
attested post boot and cannot be included in
TCB.

The caller leaves the decoder allocated but does
not add it. It simply continues to the next decoder.

Signed-off-by: Vishal Aslot <vaslot@nvidia.com>
(backported from https://lore.kernel.org/all/20251015024019.1189713-1-vaslot@nvidia.com/)
Signed-off-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2138266

The loop condition in __cxl_dpa_reserve() is missing the comparison
operator, causing a potential infinite loop and out-of-bounds array access:

    for (int i = 0; cxlds->nr_partitions; i++)

Should be:

    for (int i = 0; i < cxlds->nr_partitions; i++)

Without the '<' operator, if no partition matches the decoder's DPA
resource, 'i' increments beyond the part[] array bounds (size 2),
triggering UBSAN errors and corrupting the part index.
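
A minimal user-space sketch of the corrected loop shape follows. The
structures and the find_partition() helper are hypothetical simplifications,
not the kernel's definitions; only the array size of 2 and the
nr_partitions bound come from the description above.

```c
/* Hypothetical, simplified stand-ins for the kernel structures. */
struct dpa_range {
	unsigned long start, end;
};

#define NR_PARTS_MAX 2

struct dev_state {
	int nr_partitions;
	struct dpa_range part[NR_PARTS_MAX];
};

/* Return the index of the partition containing [start, end], or -1. */
static int find_partition(const struct dev_state *ds,
			  unsigned long start, unsigned long end)
{
	/* The "i <" comparison is what bounds the walk over part[]. */
	for (int i = 0; i < ds->nr_partitions; i++) {
		if (start >= ds->part[i].start && end <= ds->part[i].end)
			return i;
	}
	return -1;	/* no match: the caller must not index part[] */
}
```

With the comparison in place, a resource that matches no partition ends the
loop after nr_partitions iterations instead of running past part[].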

Fixes: be5cbd0 ("cxl: Kill enum cxl_decoder_mode")
Signed-off-by: Koba Ko <kobak@nvidia.com>
Signed-off-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
…access

BugLink: https://bugs.launchpad.net/bugs/2138266

Check partition index bounds before accessing cxlds->part[] to prevent
out-of-bounds access when part is -1 or otherwise invalid.
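
As a rough illustration of the check, the sketch below wraps the array
access in an accessor that rejects bad indices. The part_table structure and
get_partition() are hypothetical names used only for this example.

```c
#include <stddef.h>

#define PART_TABLE_MAX 2

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct part_range {
	unsigned long start, end;
};

struct part_table {
	int nr_partitions;
	struct part_range part[PART_TABLE_MAX];
};

/* Return the partition at index 'part', or NULL if the index is invalid. */
static const struct part_range *get_partition(const struct part_table *t,
					      int part)
{
	if (part < 0 || part >= t->nr_partitions || part >= PART_TABLE_MAX)
		return NULL;
	return &t->part[part];
}
```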

Fixes: 5ec6759 ("cxl/region: Drop goto pattern of construct_region()")
Signed-off-by: Koba Ko <kobak@nvidia.com>
Signed-off-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2138266

Signed-off-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2138238

Add compatible and the hardware struct for Tegra256. Tegra256 controllers
use a different parent clock. Hence the timing parameters are different
from the previous generations to meet the expected frequencies.

Signed-off-by: Akhil R <akhilrajeev@nvidia.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
(cherry picked from commit 6e3cb25)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2138238

On Tegra264, not all I2C controllers have the necessary interface to
GPC DMA; this causes failures when tegra_i2c_init_dma() is called.

Ensure that "dmas" device-tree property is present before initializing
DMA in function tegra_i2c_init_dma().

Signed-off-by: Kartik Rajput <kkartik@nvidia.com>
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Thierry Reding <treding@nvidia.com>
(backported from https://lore.kernel.org/linux-tegra/20251118140620.549-1-akhilrajeev@nvidia.com/)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
…stplus

BugLink: https://bugs.launchpad.net/bugs/2138238

The current implementation uses a single value of THIGH, TLOW and setup
hold time for both fast and fastplus. But these values can be different
for each speed mode and should be using separate variables. Split the
variables used for fast and fast plus mode.

Signed-off-by: Akhil R <akhilrajeev@nvidia.com>
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Thierry Reding <treding@nvidia.com>
(backported from https://lore.kernel.org/linux-tegra/20251118140620.549-1-akhilrajeev@nvidia.com/)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2138238

Update the timing parameters of Tegra256 so that the signals are compliant
with the I2C specification for SCL low time.

Signed-off-by: Akhil R <akhilrajeev@nvidia.com>
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Thierry Reding <treding@nvidia.com>
(backported from https://lore.kernel.org/linux-tegra/20251118140620.549-1-akhilrajeev@nvidia.com/)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2138238

Add support for High Speed (HS) mode transfers for Tegra194 and later
chips. While HS mode has been documented in the technical reference
manuals since Tegra20, the hardware implementation appears to be broken
on all chips prior to Tegra194.

When HS mode is not supported, set the frequency to FM+ instead.

Signed-off-by: Akhil R <akhilrajeev@nvidia.com>
Signed-off-by: Kartik Rajput <kkartik@nvidia.com>
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Thierry Reding <treding@nvidia.com>
(backported from https://lore.kernel.org/linux-tegra/20251118140620.549-1-akhilrajeev@nvidia.com/)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2138238

Add support for the SW mutex register introduced in Tegra264 to provide
an option to share the interface between multiple firmwares and/or
VMs. This involves the following steps:

 - A firmware/OS writes its unique ID to the mutex REQUEST field.
 - Ownership is established when reading the GRANT field returns the
   same ID.
 - If GRANT shows a different non-zero ID, the firmware/OS retries
   until timeout.
 - After completing access, it releases the mutex by writing 0.

However, the hardware does not ensure any protection based on the
values. The driver/firmware should honor the peer who already holds
the mutex.
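
The handshake above can be sketched with a toy software model of the
register. Everything here is an assumption made for illustration: the
sw_mutex structure, the model_* helpers that imitate the hardware, the
retry count standing in for the timeout, and the choice to release only
when the caller is the current owner.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the mutex register: a REQUEST field and a GRANT field. */
struct sw_mutex {
	uint32_t request;
	uint32_t grant;
};

/* Model of a register write: 0 releases, a non-zero ID wins only if free. */
static void model_write_request(struct sw_mutex *m, uint32_t id)
{
	m->request = id;
	if (id == 0)
		m->grant = 0;
	else if (m->grant == 0)
		m->grant = id;
}

static uint32_t model_read_grant(const struct sw_mutex *m)
{
	return m->grant;
}

/* Write our ID to REQUEST and retry until GRANT reads back the same ID. */
static bool sw_mutex_acquire(struct sw_mutex *m, uint32_t id, int max_tries)
{
	for (int i = 0; i < max_tries; i++) {
		model_write_request(m, id);
		if (model_read_grant(m) == id)
			return true;
	}
	return false;	/* timed out: a peer still holds the mutex */
}

/* Honor the peer: only the current owner releases by writing 0. */
static void sw_mutex_release(struct sw_mutex *m, uint32_t id)
{
	if (model_read_grant(m) == id)
		model_write_request(m, 0);
}
```

Because the hardware does not enforce ownership, the owner check in
sw_mutex_release() is a software convention in this sketch, matching the
requirement that each side honors the peer that holds the mutex.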

Signed-off-by: Kartik Rajput <kkartik@nvidia.com>
Signed-off-by: Akhil R <akhilrajeev@nvidia.com>
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Thierry Reding <treding@nvidia.com>
(backported from https://lore.kernel.org/linux-tegra/20251118140620.549-1-akhilrajeev@nvidia.com/)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2138238

Add support for Tegra264 SoC which supports 17 generic I2C controllers,
two of which are in the AON (always-on) partition of the SoC. In
addition to the features supported by Tegra194 it also supports a
SW mutex register to allow sharing the same I2C instance across
multiple firmware.

Signed-off-by: Akhil R <akhilrajeev@nvidia.com>
Signed-off-by: Kartik Rajput <kkartik@nvidia.com>
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Thierry Reding <treding@nvidia.com>
(backported from https://lore.kernel.org/linux-tegra/20251118140620.549-1-akhilrajeev@nvidia.com/)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
…y DVC and VI

BugLink: https://bugs.launchpad.net/bugs/2138238

Replace the per-instance boolean flags with an enum tegra_i2c_variant
since DVC and VI are mutually exclusive. Update IS_DVC/IS_VI and variant
initialization accordingly.
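
A minimal sketch of the idea, assuming hypothetical enumerator names and a
simplified device structure (only the enum name tegra_i2c_variant comes from
the description above):

```c
#include <stdbool.h>

/* Enumerator names are hypothetical; a controller has exactly one variant. */
enum tegra_i2c_variant {
	VARIANT_DEFAULT,
	VARIANT_DVC,
	VARIANT_VI,
};

/* Simplified stand-in for the driver's per-device state. */
struct i2c_dev_sketch {
	enum tegra_i2c_variant variant;
};

static bool is_dvc(const struct i2c_dev_sketch *dev)
{
	return dev->variant == VARIANT_DVC;
}

static bool is_vi(const struct i2c_dev_sketch *dev)
{
	return dev->variant == VARIANT_VI;
}
```

With a single enum field, the invalid "both DVC and VI" state that two
independent booleans allowed can no longer be represented.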

Suggested-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Kartik Rajput <kkartik@nvidia.com>
(backported from https://lore.kernel.org/all/20260107142649.14917-1-kkartik@nvidia.com/)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2138238

Move the variant field into tegra_i2c_hw_feature and populate it for all
SoCs. Add dedicated SoC data for "nvidia,tegra20-i2c-dvc" and
"nvidia,tegra210-i2c-vi" compatibles. Drop the compatible-string checks
from tegra_i2c_parse_dt to initialize the Tegra I2C variant.

Signed-off-by: Kartik Rajput <kkartik@nvidia.com>
(backported from https://lore.kernel.org/all/20260107142649.14917-1-kkartik@nvidia.com/)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
…r offsets

BugLink: https://bugs.launchpad.net/bugs/2138238

Tegra410 uses different offsets for existing I2C registers; update
the logic to use the appropriate offsets per SoC.

As the register offsets are now also defined for DVC and VI, the following
functions are no longer required and are removed:
 - tegra_i2c_reg_addr(): No translation required.
 - dvc_writel(): Replaced with i2c_writel() with DVC check.
 - dvc_readl(): Replaced with i2c_readl().

Signed-off-by: Kartik Rajput <kkartik@nvidia.com>
(backported from https://lore.kernel.org/all/20260107142649.14917-1-kkartik@nvidia.com/)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2138238

Add support for the Tegra410 SoC, which has 4 I2C controllers. The
controllers are feature-equivalent to Tegra264; only the register
offsets differ.

Signed-off-by: Kartik Rajput <kkartik@nvidia.com>
(backported from https://lore.kernel.org/all/20260107142649.14917-1-kkartik@nvidia.com/)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
aegl and others added 22 commits March 25, 2026 22:18
Each CPU collects data for telemetry events that it sends to the nearest
telemetry event aggregator either when the value of MSR_IA32_PQR_ASSOC.RMID
changes, or when a two millisecond timer expires.

There is a feature type ("energy" or "perf"), GUID, and MMIO region associated
with each aggregator. This combination links to an XML description of the
set of telemetry events tracked by the aggregator. XML files are published
by Intel in a GitHub repository¹.

The telemetry event aggregators maintain per-RMID per-event counts of the
total seen for all the CPUs. There may be multiple telemetry event aggregators
per package.

There are separate sets of aggregators for each feature type. Aggregators
in a set may have different GUIDs. All aggregators with the same feature
type and GUID are symmetric keeping counts for the same set of events for
the CPUs that provide data to them.

The XML file for each aggregator provides the following information:
0) Feature type of the events ("perf" or "energy")
1) Which telemetry events are tracked by the aggregator.
2) The order in which the event counters appear for each RMID.
3) The value type of each event counter (integer or fixed-point).
4) The number of RMIDs supported.
5) Which additional aggregator status registers are included.
6) The total size of the MMIO region for an aggregator.

Introduce struct event_group that condenses the relevant information from
an XML file. Hereafter an "event group" refers to a group of events of a
particular feature type (event_group::pfname set to "energy" or "perf") with
a particular GUID.

Use event_group::pfname to determine the feature id needed to obtain the
aggregator details. It will later be used in console messages and with the
rdt= boot parameter.

The INTEL_PMT_TELEMETRY driver enumerates support for telemetry events.
This driver provides intel_pmt_get_regions_by_feature() to list all available
telemetry event aggregators of a given feature type. The list includes the
"guid", the base address in MMIO space for the region where the event counters
are exposed, and the package id where all the CPUs that report to this
aggregator are located.

Call INTEL_PMT_TELEMETRY's intel_pmt_get_regions_by_feature() for each event
group to obtain a private copy of that event group's aggregator data. Duplicate
the aggregator data between event groups that have the same feature type
but different GUID. Further processing on this private copy will be unique
to the event group.

  ¹https://github.com/intel/Intel-PMT

  [ bp: Zap text explaining the code, s/guid/GUID/g ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
(cherry picked from commit 1fb2daa)
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
…GUIDs

The telemetry event aggregators of the Intel Clearwater Forest CPU support two
RMID-based feature types: "energy" with GUID 0x26696143¹, and "perf" with
GUID 0x26557651².

The event counter offsets in an aggregator's MMIO space are arranged in groups
for each RMID.

E.g., the "energy" counters for GUID 0x26696143 are arranged like this:

  MMIO offset:0x0000 Counter for RMID 0 PMT_EVENT_ENERGY
  MMIO offset:0x0008 Counter for RMID 0 PMT_EVENT_ACTIVITY
  MMIO offset:0x0010 Counter for RMID 1 PMT_EVENT_ENERGY
  MMIO offset:0x0018 Counter for RMID 1 PMT_EVENT_ACTIVITY
  ...
  MMIO offset:0x23F0 Counter for RMID 575 PMT_EVENT_ENERGY
  MMIO offset:0x23F8 Counter for RMID 575 PMT_EVENT_ACTIVITY

After all counters there are three status registers that provide indications
of how many times an aggregator was unable to process event counts, the time
stamp for the most recent loss of data, and the time stamp of the most recent
successful update.

  MMIO offset:0x2400 AGG_DATA_LOSS_COUNT
  MMIO offset:0x2408 AGG_DATA_LOSS_TIMESTAMP
  MMIO offset:0x2410 LAST_UPDATE_TIMESTAMP
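
The layout listed above follows a simple formula, sketched below for the
"energy" group. The macro and function names are hypothetical; the values
(2 events per RMID, RMIDs 0 to 575, 8-byte counters) are read off the
offsets in the listing.

```c
#include <stdint.h>

#define NUM_EVENTS	2u	/* PMT_EVENT_ENERGY, PMT_EVENT_ACTIVITY */
#define NUM_RMIDS	576u	/* RMIDs 0..575 */
#define COUNTER_SIZE	8u	/* each counter is 8 bytes wide */

/* Offset of the counter for (rmid, event_idx) within the MMIO region. */
static uint32_t counter_offset(uint32_t rmid, uint32_t event_idx)
{
	return (rmid * NUM_EVENTS + event_idx) * COUNTER_SIZE;
}

/* Offset of status register status_idx, placed after all counters. */
static uint32_t status_offset(uint32_t status_idx)
{
	return NUM_RMIDS * NUM_EVENTS * COUNTER_SIZE +
	       status_idx * COUNTER_SIZE;
}
```

For example, counter_offset(575, 1) gives 0x23F8 and status_offset(0) gives
0x2400, matching the last counter and the first status register above.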

Define event_group structures for both of these aggregator types and define
the events tracked by the aggregators in the file system code.

PMT_EVENT_ENERGY and PMT_EVENT_ACTIVITY are produced in fixed point format.
File system code must output them as floating point values.
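
One way to print a binary fixed-point value as a decimal without using
floating point arithmetic is sketched below. The number of fractional bits
(frac_bits) is a hypothetical parameter, since the format width is not
stated here, and fixed_split() is an illustrative helper rather than the
kernel's implementation.

```c
#include <assert.h>
#include <stdint.h>

struct fixed_parts {
	uint64_t whole;		/* integer part */
	uint32_t micro;		/* fractional part, in millionths */
};

/* Split an unsigned fixed-point value with frac_bits fractional bits. */
static struct fixed_parts fixed_split(uint64_t val, unsigned int frac_bits)
{
	struct fixed_parts p;
	uint64_t frac;

	/* Keeps frac * 1000000 well inside 64 bits. */
	assert(frac_bits >= 1 && frac_bits <= 32);

	p.whole = val >> frac_bits;
	frac = val & ((1ULL << frac_bits) - 1);
	/* Scale to millionths, rounding to nearest. */
	frac = (frac * 1000000ULL + (1ULL << (frac_bits - 1))) >> frac_bits;
	if (frac == 1000000ULL) {	/* rounding carried into the integer part */
		p.whole++;
		frac = 0;
	}
	p.micro = (uint32_t)frac;
	return p;
}
```

The two parts can then be printed as "whole.micro" with the fraction
zero-padded to six digits, so 0x18000 with 16 fractional bits reads as
1.500000.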

  ¹https://github.com/intel/Intel-PMT/blob/main/xml/CWF/OOBMSM/RMID-ENERGY/cwf_aggregator.xml
  ²https://github.com/intel/Intel-PMT/blob/main/xml/CWF/OOBMSM/RMID-PERF/cwf_aggregator.xml

  [ bp: Massage commit message. ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
(cherry picked from commit 8f6b6ad)
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
The resctrl file system layer passes the domain, RMID, and event id to the
architecture to fetch an event counter.

Fetching a telemetry event counter requires additional information that is
private to the architecture, for example, the offset into MMIO space from
where the counter should be read.

Add mon_evt::arch_priv that architecture can use for any private data related
to the event. The resctrl filesystem initializes mon_evt::arch_priv when the
architecture enables the event and passes it back to architecture when needing
to fetch an event counter.

Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
(backported from commit 8ccb1f8)
[fenghuay: fix minor conflicts in __check_limbo()]
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
Every event group has a private copy of the data of all telemetry event
aggregators (aka "telemetry regions") tracking its feature type. Included
may be regions that have the same feature type but track a different GUID
from the event group's.

Traverse the event group's telemetry region data and mark all regions that
are not usable by the event group as unusable by clearing those regions'
MMIO addresses. A region is considered unusable if:
1) GUID does not match the GUID of the event group.
2) Package ID is invalid.
3) The enumerated size of the MMIO region does not match the expected
   value from the XML description file.

Hereafter any telemetry region with an MMIO address is considered valid for
the event group it is associated with.
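
The filtering step can be sketched as follows. The region structure, the
mark_unusable() helper, and the specific package ID validity test (negative
or beyond the number of packages) are assumptions for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified telemetry region descriptor. */
struct region {
	uint32_t guid;
	int package_id;
	size_t mmio_size;
	void *mmio_addr;	/* NULL marks the region as unusable */
};

/*
 * Clear the MMIO address of every region the event group cannot use.
 * Returns true if at least one usable region remains.
 */
static bool mark_unusable(struct region *r, size_t n, uint32_t guid,
			  int nr_packages, size_t expected_size)
{
	bool usable = false;

	for (size_t i = 0; i < n; i++) {
		if (r[i].guid != guid ||
		    r[i].package_id < 0 || r[i].package_id >= nr_packages ||
		    r[i].mmio_size != expected_size)
			r[i].mmio_addr = NULL;
		if (r[i].mmio_addr)
			usable = true;
	}
	return usable;
}
```

After this pass, later code only needs to test mmio_addr to decide whether a
region belongs to the event group.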

Enable all the event group's events as long as there is at least one usable
region from where data for its events can be read. Enabling of an event can
fail if the same event has already been enabled as part of another event
group. It should never happen that the same event is described by different
GUID supported by the same system so just WARN (via resctrl_enable_mon_event())
and skip the event.

Note that it is architecturally possible that some telemetry events are only
supported by a subset of the packages in the system. It is not expected that
systems will ever do this. If they do the user will see event files in resctrl
that always return "Unavailable".

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
(cherry picked from commit 7e6df96)
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
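
The three "unusable" criteria in the commit message above can be illustrated with a small userspace C model. This is only a sketch: `struct region_sketch`, its fields, and `mark_unusable_regions()` are hypothetical stand-ins, and treating a negative or out-of-range package ID as "invalid" is an assumption, not the kernel's actual check.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one telemetry region. */
struct region_sketch {
	uint32_t guid;
	int pkg_id;	/* assumed invalid when negative or >= num_pkgs */
	size_t size;	/* enumerated MMIO size in bytes */
	void *addr;	/* NULL marks the region as unusable */
};

/* Clear addr for every unusable region; return how many remain usable. */
static int mark_unusable_regions(struct region_sketch *r, int n,
				 uint32_t guid, int num_pkgs,
				 size_t expected_size)
{
	int usable = 0;

	for (int i = 0; i < n; i++) {
		bool ok = r[i].guid == guid &&
			  r[i].pkg_id >= 0 && r[i].pkg_id < num_pkgs &&
			  r[i].size == expected_size;

		if (!ok)
			r[i].addr = NULL;
		else if (r[i].addr)
			usable++;
	}
	return usable;
}
```

In this model the event group would be enabled only when the function returns a non-zero count, matching the "at least one usable region" condition above.
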
Introduce intel_aet_read_event() to read telemetry events for resource
RDT_RESOURCE_PERF_PKG. There may be multiple aggregators tracking each
package, so scan all of them and add up all counters. Aggregators may return
an invalid data indication if they have received no records for a given RMID.
The user will see "Unavailable" if none of the aggregators on a package
provide valid counts.

Resctrl now uses readq() so depends on X86_64. Update Kconfig.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
(cherry picked from commit 51541f6)
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
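
The aggregation described above (add up all aggregators, report "Unavailable" only if none has valid data) might look roughly like the following userspace C sketch. The valid-bit encoding in `SKETCH_DATA_VALID` and the function name are assumptions made for illustration; the commit message does not specify how the invalid data indication is encoded.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch only: the top bit flags a valid sample. */
#define SKETCH_DATA_VALID (1ULL << 63)

/*
 * Sum the valid samples from all aggregators of one package.
 * Returns false (shown to the user as "Unavailable") if none is valid;
 * *total is left untouched in that case.
 */
static bool sum_aggregators(const uint64_t *raw, int n, uint64_t *total)
{
	bool any_valid = false;
	uint64_t sum = 0;

	for (int i = 0; i < n; i++) {
		if (!(raw[i] & SKETCH_DATA_VALID))
			continue;
		sum += raw[i] & ~SKETCH_DATA_VALID;
		any_valid = true;
	}
	if (any_valid)
		*total = sum;
	return any_valid;
}
```
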
Population of a monitor group's mon_data directory is unreasonably complicated
because of the support for Sub-NUMA Cluster (SNC) mode.

Split out the SNC code into a helper function to make it easier to add support
for a new telemetry resource.

Move all the duplicated code to make and set owner of domain directories into
the mon_add_all_files() helper and rename to _mkdir_mondata_subdir().

Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
(cherry picked from commit 0ec1db4)
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
Clearing a monitor group's mon_data directory is complicated because of the
support for Sub-NUMA Cluster (SNC) mode.

Refactor the SNC case into a helper function to make it easier to add support
for a new telemetry resource.

Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
(cherry picked from commit 93d9fd8)
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
…_PKG

The L3 resource has several requirements for domains. There are per-domain
structures that hold the 64-bit values of counters, and elements to keep
track of the overflow and limbo threads.

None of these are needed for the PERF_PKG resource. The hardware counters
are wide enough that they do not wrap around for decades.

Define a new rdt_perf_pkg_mon_domain structure which just consists of the
standard rdt_domain_hdr to keep track of domain id and CPU mask.

Update resctrl_online_mon_domain() for RDT_RESOURCE_PERF_PKG. The only action
needed for this resource is to create and populate domain directories if a
domain is added while resctrl is mounted.

Similarly resctrl_offline_mon_domain() only needs to remove domain directories.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
(cherry picked from commit f4e0cd8)
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
Legacy resctrl features are enumerated by X86_FEATURE_* flags. These may be
overridden by quirks to disable features in the case of errata.  Users can use
kernel command line options to either disable a feature, or to force enable
a feature that was disabled by a quirk.

A different approach is needed for hardware features that do not have an
X86_FEATURE_* flag.

Update parsing of the "rdt=" boot parameter to call the telemetry driver
directly to handle new "perf" and "energy" options that control activation of
telemetry monitoring of the named type. By itself a "perf" or "energy" option
controls the forced enabling or disabling (with ! prefix) of all event groups
of the named type. A ":guid" suffix allows for fine grained control per event
group.

  [ bp: s/intel_aet_option/intel_handle_aet_option/g ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
(backported from commit 842e7f9)
[fenghuay: fix a minor conflict in kernel-parameters.txt doc]
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
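
To make the option syntax above concrete, here is a userspace C sketch that parses a single token such as `perf`, `!energy` or `perf:<guid>`. The types, the function name, and the choice to parse the GUID as a hexadecimal 32-bit value are assumptions for illustration; this is not the telemetry driver's actual parser.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum aet_type { AET_PERF, AET_ENERGY };

struct aet_opt {
	enum aet_type type;
	bool force_on;	/* false when the token starts with '!' */
	bool has_guid;	/* true when a ":guid" suffix was given */
	uint32_t guid;
};

/* Parse one token; returns false if the token is not recognised. */
static bool parse_aet_token(const char *tok, struct aet_opt *o)
{
	const char *colon;
	size_t len;

	o->force_on = true;
	if (*tok == '!') {
		o->force_on = false;
		tok++;
	}

	colon = strchr(tok, ':');
	len = colon ? (size_t)(colon - tok) : strlen(tok);

	if (len == 4 && !strncmp(tok, "perf", 4))
		o->type = AET_PERF;
	else if (len == 6 && !strncmp(tok, "energy", 6))
		o->type = AET_ENERGY;
	else
		return false;

	o->has_guid = false;
	o->guid = 0;
	if (colon) {
		char *end;
		/* Assumed format: hexadecimal, optional "0x" prefix. */
		unsigned long v = strtoul(colon + 1, &end, 16);

		if (end == colon + 1 || *end != '\0' || v > UINT32_MAX)
			return false;
		o->has_guid = true;
		o->guid = (uint32_t)v;
	}
	return true;
}
```

A token without a suffix applies to every event group of the named type, while a suffixed token selects one group, mirroring the description above.
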
There are now three meanings for "number of RMIDs":

1) The number for legacy features enumerated by CPUID leaf 0xF. This is the
   maximum number of distinct values that can be loaded into MSR_IA32_PQR_ASSOC.
   Note that systems with Sub-NUMA Cluster mode enabled will force scaling down
   the CPUID enumerated value by the number of SNC nodes per L3-cache.

2) The number of registers in MMIO space for each event. This is enumerated in
   the XML files and is the value initialized into event_group::num_rmid.

3) The number of "hardware counters" (this isn't a strictly accurate
   description of how things work, but serves as a useful analogy that does
   describe the limitations) feeding to those MMIO registers. This is enumerated
   in telemetry_region::num_rmids returned by intel_pmt_get_regions_by_feature().

Event groups with insufficient "hardware counters" to track all RMIDs are
difficult for users to use, since the system may reassign "hardware counters"
at any time. This means that users cannot reliably collect two consecutive
event counts to compute the rate at which events are occurring.

Disable such event groups by default. The user may override this with
a command line "rdt=" option. In this case limit an under-resourced event
group's number of possible monitor resource groups to the lowest number of
"hardware counters".

Scan all enabled event groups and assign the RDT_RESOURCE_PERF_PKG resource
"num_rmid" value to the smallest of these values as this value will be used
later to compare against the number of RMIDs supported by other resources to
determine how many monitoring resource groups are supported.

N.B. Change type of resctrl_mon::num_rmid to u32 to match its usage and the
type of event_group::num_rmid so that min(r->num_rmid, e->num_rmid) won't
complain about mixing signed and unsigned types.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
(cherry picked from commit 67640e3)
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
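
The policy for under-resourced event groups described above can be summarised in a few lines of C. This is a sketch under stated assumptions: `struct evgroup_sketch` and its fields are hypothetical, and returning 0 to mean "group stays disabled" is a convention chosen for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced view of an event group. */
struct evgroup_sketch {
	uint32_t num_rmid;		/* MMIO registers per event (from XML) */
	uint32_t min_hw_counters;	/* lowest count over the group's regions */
	bool force_on;			/* user override via "rdt=" */
};

/* Number of RMIDs the group can use, or 0 if it stays disabled. */
static uint32_t evgroup_usable_rmids(const struct evgroup_sketch *e)
{
	/* Enough "hardware counters" to track every RMID: nothing to limit. */
	if (e->min_hw_counters >= e->num_rmid)
		return e->num_rmid;

	/* Under-resourced: off by default, limited when forced on. */
	return e->force_on ? e->min_hw_counters : 0;
}
```
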
closid_num_dirty_rmid[] and rmid_ptrs[] are allocated together during resctrl
initialization and freed together during resctrl exit.

Telemetry events are enumerated on resctrl mount, so the number of RMIDs
supported by all monitoring resources, which determines the size of
rmid_ptrs[], is only known at resctrl mount.

Separate closid_num_dirty_rmid[] and rmid_ptrs[] allocation and free in
preparation for rmid_ptrs[] to be allocated on resctrl mount.

Keep the rdtgroup_mutex protection around the allocation and free of
closid_num_dirty_rmid[] as ARM needs this to guarantee memory ordering.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
(cherry picked from commit ee7f6af)
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
resctrl assumes that only the L3 resource supports monitor events, so it
simply takes the rdt_resource::num_rmid from RDT_RESOURCE_L3 as the system's
number of RMIDs.

The addition of telemetry events in a different resource breaks that
assumption.

Compute the number of available RMIDs as the minimum value across all
mon_capable resources (analogous to how the number of CLOSIDs is computed
across alloc_capable resources).

Note that mount time enumeration of the telemetry resource means that
this number can be reduced. If this happens, then some memory will
be wasted as the allocations for rdt_l3_mon_domain::mbm_states[] and
rdt_l3_mon_domain::rmid_busy_llc created during resctrl initialization will
be larger than needed.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
(cherry picked from commit 0ecc988)
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
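
The "minimum across all mon_capable resources" rule above, sketched as a userspace C model. `struct res_sketch` and `sketch_system_num_rmid()` are hypothetical names; the real code walks resctrl's resource list.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced view of a resctrl resource. */
struct res_sketch {
	bool mon_capable;
	uint32_t num_rmid;
};

/* Smallest num_rmid among monitoring-capable resources, or 0 if none. */
static uint32_t sketch_system_num_rmid(const struct res_sketch *res, int n)
{
	uint32_t num = UINT32_MAX;
	bool found = false;

	for (int i = 0; i < n; i++) {
		if (!res[i].mon_capable)
			continue;
		if (res[i].num_rmid < num)
			num = res[i].num_rmid;
		found = true;
	}
	return found ? num : 0;
}
```
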
L3 monitor features are enumerated during resctrl initialization and
rmid_ptrs[] that tracks all RMIDs and depends on the number of supported
RMIDs is allocated during this time.

Telemetry monitor features are enumerated during first resctrl mount and
may support a different number of RMIDs compared to L3 monitor features.

Delay allocation and initialization of rmid_ptrs[] until first mount.
Since the number of RMIDs cannot change on later mounts, keep the same set of
rmid_ptrs[] until resctrl_exit(). This is required because the limbo handler
keeps running after resctrl is unmounted and needs to access rmid_ptrs[]
as it keeps tracking busy RMIDs after unmount.

Rename routines to match what they now do:
dom_data_init() -> setup_rmid_lru_list()
dom_data_exit() -> free_rmid_lru_list()

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
(backported from commit d089164)
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
[fenghuay: fix minor conflicts in setup_rmid_lru_list() and dom_data_exit()]
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
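
The lifetime rule above (allocate on first mount, keep across unmount, free only at exit) can be modelled in userspace C as follows. All names here are hypothetical and locking is omitted; this is a sketch of the pattern, not the resctrl code.

```c
#include <stdlib.h>

static int *rmid_table;	/* stand-in for rmid_ptrs[] */
static int alloc_count;	/* counts real allocations, for illustration */

/* First mount allocates; later mounts reuse the same table. */
static int sketch_mount(unsigned int num_rmid)
{
	if (rmid_table)
		return 0;

	rmid_table = calloc(num_rmid, sizeof(*rmid_table));
	if (!rmid_table)
		return -1;
	alloc_count++;
	return 0;
}

/* Unmount deliberately keeps the table: the limbo handler still uses it. */
static void sketch_unmount(void)
{
}

/* Only the exit path releases the table. */
static void sketch_exit(void)
{
	free(rmid_table);
	rmid_table = NULL;
}
```
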
Since telemetry events are enumerated on resctrl mount the RDT_RESOURCE_PERF_PKG
resource is not considered "monitoring capable" during early resctrl initialization.
This means that the domain list for RDT_RESOURCE_PERF_PKG is not built when the CPU
hotplug notifiers are registered and run for the first time right after resctrl
initialization.

Mark the RDT_RESOURCE_PERF_PKG as "monitoring capable" upon successful telemetry
event enumeration to ensure future CPU hotplug events include this resource and
initialize its domain list for CPUs that are already online.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
(cherry picked from commit 4bbfc90)
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
Update resctrl filesystem documentation with the details about the resctrl
files that support telemetry events.

  [ bp: Drop the debugfs hunk of the documentation until a better debugging
    solution is found. ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
(cherry picked from commit a8848c4)
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
…rl L3 domain and arch API updates

Upstream resctrl renamed the L3 monitor domain type and extended the arch
hooks:
1. Use struct rdt_l3_mon_domain in MPAM's resctrl integration,
2. Pass struct rdt_domain_hdr * into resctrl_online_mon_domain() /
   resctrl_offline_mon_domain(),
3. Match the new resctrl_arch_rmid_read() prototype (header pointer +
   arch_priv).
4. Update resctrl_arch_cntr_read(), resctrl_arch_reset_rmid(),
   resctrl_arch_reset_cntr(), and resctrl_arch_config_cntr() to take
   struct rdt_l3_mon_domain *.
5. Call the new resctrl_enable_mon_event() signature when wiring monitor
   events and set mon_capable from its return value.
6. Add a no-op resctrl_arch_pre_mount() so MPAM builds with the generic
   resctrl mount path.

Fixes: a42549e ("NVIDIA: SAUCE: arm_mpam: resctrl: Add boilerplate cpuhp and domain allocation")
Fixes: ae2a29c ("NVIDIA: SAUCE: arm_mpam: resctrl: Add support for csu counters")
Fixes: 1cbc0f2 ("NVIDIA: SAUCE: arm_mpam: resctrl: Add resctrl_arch_config_cntr() for ABMC use")
Fixes: dd44394 ("NVIDIA: SAUCE: arm_mpam: resctrl: Add resctrl_arch_rmid_read() and resctrl_arch_reset_rmid()")
Fixes: 8429670 ("NVIDIA: SAUCE: arm_mpam: resctrl: Add resctrl_arch_cntr_read() & resctrl_arch_reset_cntr()")

Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
…rors

There is no need to destroy the MSC instance on user/admin programming
errors, since they do not cause any functional issues.

Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
(cherry picked from 316e5833ccb2ef66f50290e48c45b70bf286c8fd dev/dev-main-nvidia-pset-linux-6.19.6)
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
In a NUMA system, each node may include CPUs, memory, MPAM MSC
instances, or any combination thereof. Some high-end servers may
have NUMA nodes that include MPAM MSCs but no CPUs. In such cases,
associate all possible CPUs with those MSCs.

Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
(cherry picked from f902b5abf39fe10a50b7062dc9ae9d2cfc723248 dev/dev-main-nvidia-pset-linux-6.19.6)
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
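
A minimal sketch of the fallback described above, modelling CPU masks as 64-bit bitmaps in userspace C. The function name and the bitmap representation are assumptions; the driver itself works with struct cpumask.

```c
#include <stdint.h>

/*
 * Affinity for an MSC that belongs to a NUMA node.
 * node_cpus:     CPUs of the MSC's node (0 for a CPU-less node)
 * possible_cpus: all possible CPUs in the system
 */
static uint64_t sketch_msc_affinity(uint64_t node_cpus, uint64_t possible_cpus)
{
	/* A CPU-less node falls back to every possible CPU. */
	return node_cpus ? node_cpus : possible_cpus;
}
```
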
…ring domain setup

The current MPAM driver only considers the first component associated
with an online/offline CPU during domain creation and teardown. This
is insufficient, as CPU-initiated traffic may traverse multiple MSCs
before reaching the target, and each MSC must be programmed consistently
for proper resource partitioning.

Update the MPAM driver to include all components associated with a
given CPU during domain setup/teardown to expose expected schemata
to userspace for effective resource control.

Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
(backported from 4309ce9856f87170670c9db40546d9f2fc9dbb86 dev/dev-main-nvidia-pset-linux-6.19.6)
[fenghuay: In addition to the core change, this backport includes the
following adaptations to bridge the gap between the 24.04 (6.17) MPAM
driver and the 6.19.6 base the original was written against:

  - Add for_each_mpam_resctrl_control() and for_each_mpam_resctrl_mon()
    iteration macros (from pset c15c066 and 4f42221)
  - Add MPAM_MAX_EVENT constant to bound the monitor event array
  - Add traffic_matches_l3() to validate that a memory-class MSC's
    traffic matches L3 egress topology (from pset ebc0760)
    Remove redundant if (class->type != MPAM_CLASS_MEMORY)
  - Replace exposed_alloc_capable/exposed_mon_capable static bools
    with dynamic resctrl_arch_alloc_capable()/resctrl_arch_mon_capable()
    that iterate over resources
  - Change mpam_resctrl_offline_cpu() return type from int to void
  - Change mpam_resctrl_monitor_init() return type from void to int
    and propagate errors
  - Change num_rmid from mpam_pmg_max + 1 to
    resctrl_arch_system_num_rmid_idx()
  - Use guard(mutex) for domain_list_lock
  - Use INIT_LIST_HEAD_RCU for domain lists
  - Fix the "MBA not found" issue on GMEM by checking traffic_matches_l3() in
    mpam_resctrl_pick_mba() only for classes that don't have a NUMA node]
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
…onfig

Reset an RIS by building a default mpam_config and applying it via
mpam_reprogram_ris_partid(), like any other config.

- mpam_init_reset_cfg(): set features and default values only for
  controls supported by the RIS (cpor_part, mbw_part, mbw_max,
  mbw_prop, cmax_cmax, cmax_cmin). Use full masks for CPBM/MBW_PBM
  and MPAMCFG_* defaults for MBW_MAX, CMAX, CMIN.
- mpam_reprogram_ris_partid(): apply cfg for all supported controls
  (no separate reset path).

Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
(backported from c076b208842db87ed50b1c63cff302975a9c8f67 dev/dev-main-nvidia-pset-linux-6.19.6)
[fenghuay: Fix porting conflicts and compilation errors.
 Remove this sentence in the commit message to avoid confusion because
 MBW_PROP feature is not supported on Vera/Grace:
 "Include mpam_feat_mbw_prop when supported so MBW_PROP is written to 0
  on reset."]
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
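
The idea of "reset by applying a default config" can be sketched as follows in userspace C. The feature bits, `struct sketch_cfg`, and in particular `SK_MBW_MAX_DEFAULT` are hypothetical; the real defaults come from the MPAMCFG_* definitions, which are not shown in this PR text.

```c
#include <stdint.h>

#define SK_FEAT_CPOR_PART (1u << 0)
#define SK_FEAT_MBW_PART  (1u << 1)
#define SK_FEAT_MBW_MAX   (1u << 2)

/* Assumed "no limit" value for this sketch, not the real register default. */
#define SK_MBW_MAX_DEFAULT 0xffffu

struct sketch_cfg {
	uint32_t features;	/* which of the fields below are valid */
	uint32_t cpbm;
	uint32_t mbw_pbm;
	uint16_t mbw_max;
};

/* All-ones mask for a bitmap that is 'width' bits wide. */
static uint32_t sketch_full_mask(unsigned int width)
{
	return width >= 32 ? UINT32_MAX : (1u << width) - 1;
}

/* Build the reset config: set defaults only for supported controls. */
static void sketch_init_reset_cfg(uint32_t supported, unsigned int cpbm_wd,
				  unsigned int mbw_pbm_bits,
				  struct sketch_cfg *cfg)
{
	cfg->features = 0;
	cfg->cpbm = 0;
	cfg->mbw_pbm = 0;
	cfg->mbw_max = 0;

	if (supported & SK_FEAT_CPOR_PART) {
		cfg->cpbm = sketch_full_mask(cpbm_wd);
		cfg->features |= SK_FEAT_CPOR_PART;
	}
	if (supported & SK_FEAT_MBW_PART) {
		cfg->mbw_pbm = sketch_full_mask(mbw_pbm_bits);
		cfg->features |= SK_FEAT_MBW_PART;
	}
	if (supported & SK_FEAT_MBW_MAX) {
		cfg->mbw_max = SK_MBW_MAX_DEFAULT;
		cfg->features |= SK_FEAT_MBW_MAX;
	}
}
```

The resulting config would then go through the same programming path as any user-supplied config, which is the point of removing the separate reset path.
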
There is no struct arm_smmu_domain context for domains configured
with identity mappings. Use the device to obtain the necessary
information to program PARTID and PMGID.

Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
(backported from e5020b38475ef58c5bb3d1a92028d4e0dd7aff4d dev/dev-main-nvidia-pset-linux-6.19.6)
[fenghuay: Koba Ko fixes a typo in iommu_group_get_qos_params():
s/!ops->set_group_qos_params/!ops->get_group_qos_params/]
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
…n mpam_msmon_read

Resolve mpam_feat_msmon_mbwu to the concrete counter type (31/44/63)
before mpam_has_feature() and before filling the mon_read arg. This
avoids -EOPNOTSUPP when only a specific MBWU feature is set, and
ensures _msmon_read() gets the resolved type in arg.type.

Fixes: 5b91005 ("NVIDIA: SAUCE: arm_mpam: Use long MBWU counters if supported")
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
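
A sketch of resolving the generic MBWU request to a concrete counter type before the feature check, as a userspace C model. The `SK_*` feature bits and the widest-first preference order are assumptions for illustration; the real driver uses its own mpam_feat_* enum and selection logic.

```c
#include <stdint.h>

/* Hypothetical feature bits for this sketch. */
enum sketch_feat {
	SK_MBWU    = 1u << 0,	/* generic "some MBWU counter" request */
	SK_MBWU_31 = 1u << 1,
	SK_MBWU_44 = 1u << 2,
	SK_MBWU_63 = 1u << 3,
};

/*
 * Map a generic MBWU request onto a concrete counter type.
 * Returns 0 when the monitor supports nothing suitable, which a caller
 * would turn into -EOPNOTSUPP.
 */
static uint32_t sketch_resolve_mbwu(uint32_t requested, uint32_t supported)
{
	if (requested != SK_MBWU)
		return (supported & requested) ? requested : 0;

	/* Assumption: prefer the widest counter that is available. */
	if (supported & SK_MBWU_63)
		return SK_MBWU_63;
	if (supported & SK_MBWU_44)
		return SK_MBWU_44;
	if (supported & SK_MBWU_31)
		return SK_MBWU_31;
	return 0;
}
```

Checking support against the resolved type, rather than the generic one, is what avoids the spurious failure when only a specific MBWU feature bit is set.
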
@fyu1 fyu1 force-pushed the 24.04_linux-nvidia-6.17-next.mpam.extras.fixes2 branch from cc8ab11 to ecd11fd March 25, 2026 22:19
@fyu1
Collaborator Author

fyu1 commented Mar 25, 2026

I fixed a blocking issue on GMEM test failure in the patch "NVIDIA: VR: SAUCE: arm_mpam: Include all associated MSC components during domain setup" and updated its commit message. Here is the fix patch:
diff --git a/drivers/resctrl/mpam_resctrl.c b/drivers/resctrl/mpam_resctrl.c
index f7c2bf8aba99..0accede8cc09 100644
--- a/drivers/resctrl/mpam_resctrl.c
+++ b/drivers/resctrl/mpam_resctrl.c
@@ -1162,7 +1162,9 @@ static void mpam_resctrl_pick_mba(void)
 			continue;
 		}
 
-		if (!traffic_matches_l3(class)) {
+		/* Check memory at egress from L3 for MSC with L3 */
+		if (!cpumask_equal(&class->affinity, cpu_possible_mask) &&
+		    !traffic_matches_l3(class)) {
 			pr_debug("class %u traffic doesn't match L3 egress\n",
 				 class->level);
 			continue;

With this fix, I no longer see the MBA/MBM issue in the GMEM test with an engineer-built SBIOS that enables GPU MPAM.
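
The new condition in the fix can be read as a small predicate. The following userspace C sketch models CPU masks as 64-bit bitmaps; the function name is hypothetical, and the interpretation that an affinity equal to the possible mask identifies an MSC on a CPU-less node is inferred from the commit messages in this PR rather than stated in the diff.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Should this memory class be skipped when picking the MBA class?
 * affinity:   CPUs associated with the class
 * possible:   all possible CPUs
 * matches_l3: result of the L3 egress traffic check
 */
static bool sketch_skip_class(uint64_t affinity, uint64_t possible,
			      bool matches_l3)
{
	/* The L3 egress check only applies when affinity is a real subset. */
	return affinity != possible && !matches_l3;
}
```
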

If this PR is good for you, please merge it to 6.17 BaseOS.

Thank you very much for your help!

@nvmochs
Collaborator

nvmochs commented Mar 25, 2026

Re-reviewed and confirmed this was the only change. No issues with the change.

Acked-by: Matthew R. Ochs <mochs@nvidia.com>

@jamieNguyenNVIDIA
Collaborator

Acked-by: Jamie Nguyen <jamien@nvidia.com>

@nvmochs nvmochs changed the title from "Please merge MPAM fixes branch: 24.04 linux nvidia 6.17 next.mpam.extras.fixes2" to "[linux-nvidia-6.17] Backport MPAM fixes and support for CPU-less NUMA nodes" Mar 25, 2026
return false;
}

cpu = cpumask_any_and(&class->affinity, cpu_online_mask);
Collaborator

Should we add a cpu >= nr_cpu_ids check here, like the one in topology_matches_l3()?

Collaborator Author

Adding another sanity check wouldn't hurt, but there is no issue without it because the next statements catch an invalid cpu anyway:
err = find_l3_equivalent_bitmask(cpu, tmp_cpumask);
if (err) {

Collaborator

Great, thanks for looking.
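
For context on the check discussed in this thread: the kernel's cpumask_any_and() returns a value >= nr_cpu_ids when the two masks have no CPU in common. The userspace C model below sketches that behaviour and the explicit check that was suggested; `SKETCH_NR_CPU_IDS` and both function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NR_CPU_IDS 64u

/*
 * Model of "any CPU set in both masks": index of the first common bit,
 * or SKETCH_NR_CPU_IDS when the intersection is empty.
 */
static unsigned int sketch_any_and(uint64_t a, uint64_t b)
{
	uint64_t both = a & b;

	for (unsigned int cpu = 0; cpu < SKETCH_NR_CPU_IDS; cpu++)
		if (both & (1ULL << cpu))
			return cpu;
	return SKETCH_NR_CPU_IDS;
}

/* The explicit sanity check proposed in the review comment. */
static bool sketch_cpu_is_valid(unsigned int cpu)
{
	return cpu < SKETCH_NR_CPU_IDS;
}
```
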

@nvmochs
Collaborator

nvmochs commented Mar 25, 2026

PR sent to Canonical.

@nvidia-bfigg nvidia-bfigg force-pushed the 24.04_linux-nvidia-6.17-next branch from 9364d8b to 8dab82a April 2, 2026 12:01