
[linux-nvidia-6.17-next] Add CXL Type-2 device support, RAS error handling, reset, state save/restore, and interleaving support #342

Open
JiandiAnNVIDIA wants to merge 642 commits into NVIDIA:24.04_linux-nvidia-6.17-next from JiandiAnNVIDIA:cxl_2026-03-04

Conversation


JiandiAnNVIDIA commented Mar 12, 2026

Description

This patch series adds comprehensive CXL (Compute Express Link) support to the
nvidia-6.17 kernel, including:

  1. CXL Type-2 device support - Enables accelerator devices (like GPUs and
    SmartNICs) to use CXL for coherent memory access via firmware-provisioned
    regions
  2. CXL RAS (Reliability, Availability, Serviceability) error handling -
    Implements PCIe Port Protocol error handling and logging for CXL Root Ports,
    Downstream Switch Ports, and Upstream Switch Ports
  3. CXL DVSEC and HDM state save/restore - Preserves CXL DVSEC control/range
    registers and HDM decoder programming across PCI resets and link transitions,
    enabling device re-initialization after reset for firmware-provisioned
    configurations
  4. CXL Reset support - Implements the CXL Reset method (CXL Spec v3.2,
    Sections 8.1.3, 9.6, 9.7) via a sysfs interface for Type-2 devices,
    including memory offlining, cache flushing, multi-function sibling
    coordination, and DVSEC reset sequencing
  5. Multi-level interleaving fix - Supports firmware-configured CXL
    interleaving where lower levels use smaller granularities than parent ports
    (reverse HPA bit ordering)
  6. Prerequisite CXL and PCI driver updates - Cherry-picked commits from
    upstream torvalds/master covering the range from v6.17.9 to the merge
    point of Terry Bowman's v14 series into v7.0
  7. CXL DAX support - Enables direct memory access to CXL RAM regions and
    mapping CXL DAX devices as System-RAM
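The CXL Reset sequencing in item 4 can be illustrated with a stand-in sketch. None of these function names are kernel APIs; they are hypothetical stubs that only encode the step ordering listed above (offline region memory, flush CPU caches, coordinate multi-function siblings, then run the DVSEC reset sequence):

```c
#include <assert.h>

/* Hypothetical stubs -- not kernel APIs -- encoding only the ordering of
 * the CXL Reset steps described above for a Type-2 device. */
static int step;

static void offline_region_memory(void)   { assert(step == 0); step = 1; } /* take region memory offline */
static void flush_cpu_caches(void)        { assert(step == 1); step = 2; } /* cpu_cache_invalidate_memregion() in the series */
static void pause_sibling_functions(void) { assert(step == 2); step = 3; } /* Non-CXL Function Map DVSEC coordination */
static void dvsec_cxl_reset(void)         { assert(step == 3); step = 4; } /* initiate the DVSEC reset sequence */

static int cxl_reset_sketch(void)
{
	offline_region_memory();
	flush_cpu_caches();
	pause_sibling_functions();
	dvsec_cxl_reset();
	return step;
}
```

In the actual series this flow is triggered from the sysfs `cxl_reset` attribute; the sketch only pins down the relative ordering of the four phases.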

Key Features Added:

  • CXL Type-2 accelerator device registration and memory management
  • CXL region creation by Type-2 drivers
  • DPA (Device Physical Address) allocation interface for accelerators
  • HPA (Host Physical Address) free space enumeration
  • Multi-level CXL address translation (SPA↔HPA↔DPA)
  • CXL protocol error detection, forwarding, and recovery
  • CXL RAS error handling for Endpoints, RCH, and Switch Ports
    (replacing the old PCIEAER_CXL symbol with the new CXL_RAS def_bool)
  • CXL extended linear cache region support
  • CXL DVSEC and HDM decoder state save/restore across PCI resets
  • CXL Reset sysfs interface (/sys/bus/pci/devices/.../cxl_reset) for
    Type-2 devices with Reset Capable bit set
  • Multi-function sibling coordination during CXL reset via Non-CXL
    Function Map DVSEC
  • CPU cache flush using cpu_cache_invalidate_memregion() during reset
  • Multi-level interleaving with smaller granularities for lower decoder
    levels (firmware-provisioned configurations)
  • CXL DAX device access (DEV_DAX_CXL) and System-RAM mapping
    (DEV_DAX_KMEM)
  • CXL protocol error injection via APEI EINJ (ACPI_APEI_EINJ_CXL)
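At a single decoder level, the address-translation and interleaving items above reduce to standard CXL modulo arithmetic. A minimal illustrative sketch (not the kernel's implementation; power-of-2 `ways` and `gran` assumed, and the multi-level reverse-HPA-bit-ordering case the fix addresses is not modeled here):

```c
#include <assert.h>
#include <stdint.h>

/* For an HPA offset within a region interleaved across `ways` targets at
 * `gran`-byte granularity, pick the owning target and compute the DPA
 * offset by stripping the way-selection bits. */
static unsigned int cxl_target(uint64_t hpa, unsigned int ways, uint64_t gran)
{
	return (hpa / gran) % ways;          /* which endpoint owns this chunk */
}

static uint64_t cxl_dpa(uint64_t hpa, unsigned int ways, uint64_t gran)
{
	/* chunks owned by one target are gran bytes each, spaced
	 * gran * ways apart in HPA space */
	return (hpa / (gran * ways)) * gran + (hpa % gran);
}
```

For example, with 2 ways at 256-byte granularity, HPA offset 256 lands on target 1 at DPA offset 0, and HPA offset 512 lands back on target 0 at DPA offset 256. The multi-level fix in this series handles firmware configurations where a lower level uses a smaller granularity than its parent port.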

Justification

CXL Type-2 device support is critical for next-generation NVIDIA accelerators
and data center workloads:

  • Enables coherent memory sharing between CPUs and accelerators
  • Supports firmware-provisioned CXL regions for accelerator memory
  • Provides proper error handling and reporting for CXL fabric errors
  • Enables device reset and state recovery for CXL Type-2 devices
  • Preserves firmware-programmed DVSEC and HDM decoder state across resets
  • Required for upcoming NVIDIA hardware with CXL capabilities

Source

Patch Breakdown (142 patches + 1 revert, 143 total):

 #  Category                                            Count  Source
 1  Revert old CXL reset (f198764)                          1  OOT (cleanup)
 2  Upstream CXL/PCI prerequisite cherry-picks            103  Upstream torvalds/master (v6.17.9 → merge of Terry Bowman v14 into v7.0)
 3  Smita Koralahalli's CXL EINJ series v6, patch 3/9       1  LKML (v6, not yet merged)
 4  Alejandro Lucero's CXL Type-2 series v23               22  LKML (v23, not yet merged)
 5  Robert Richter's multi-level interleaving fix           1  LKML (v1, not yet merged)
 6  Srirangan Madhavan's CXL state save/restore series      5  LKML (v1, not yet merged)
 7  Srirangan Madhavan's CXL reset series                   7  LKML (v5, not yet merged)
 8  Config annotations update                               3  OOT (build config)
    TOTAL                                                 143

Notes on the upstream cherry-picks (item 2):

The 103 upstream commits span 1bfd0faa78d0 (v6.17.9) to
0da3050bdded (Merge of for-7.0/cxl-aer-prep into cxl-for-next).
This range includes 17 out of 34 patches from Terry Bowman's v14 series
that were reworked by the CXL maintainer and merged into v7.0 via the
for-7.0/cxl-aer-prep branch. The remaining 17 patches from Terry's v14
were refactored into v15 (9 patches, not yet merged) and are not included
in this port.

Notes on the save/restore and reset series (items 6–7):

Srirangan's patches were authored against upstream v7.0-rc1 (which does not
include Alejandro's v23 Type-2 series). For this port, the header
reorganization in patch 2/5 of the save/restore series was adapted to align
with Alejandro's v23 approach: HDM decoder and register map definitions were
moved to include/cxl/cxl.h (not include/cxl/pci.h as in the original
patch) to follow the convention established by Alejandro's series. Upstream
reviewers have indicated that Srirangan's series should be rebased on top of
Alejandro's once it merges.

Lore Links:

Upstream Status:

Series                              Status
103 upstream cherry-picks           ✅ Merged in torvalds/master (v7.0 range)
Terry Bowman v14 (17 patches)       ✅ Merged into v7.0 via for-7.0/cxl-aer-prep
Terry Bowman v15 (9 patches)        ⏳ Under review, not needed for this port
Smita v6 patch 3/9                  ⏳ Under review, not yet merged
Alejandro v23 (22 patches)          ⏳ Under review, not yet merged
Robert Richter v1 (1 patch)         ⏳ Under review, not yet merged
Srirangan save/restore (5 patches)  ⏳ Under review, not yet merged
Srirangan cxl_reset v5 (7 patches)  ⏳ Under review, not yet merged

Testing

Build Validation:

  • Built successfully for ARM64 4K page size kernel
  • Built successfully for ARM64 64K page size kernel

Config Verification:

CXL-related configs enabled as expected:

CONFIG_ACPI_APEI_EINJ_CXL=y
CONFIG_PCI_CXL=y
CONFIG_CXL_BUS=y
CONFIG_CXL_PCI=y
CONFIG_CXL_MEM_RAW_COMMANDS=y
CONFIG_CXL_ACPI=m
CONFIG_CXL_PMEM=m
CONFIG_CXL_MEM=y
CONFIG_CXL_FEATURES=y
# CONFIG_CXL_EDAC_MEM_FEATURES is not set
CONFIG_CXL_PORT=y
CONFIG_CXL_SUSPEND=y
CONFIG_CXL_REGION=y
# CONFIG_CXL_REGION_INVALIDATION_TEST is not set
CONFIG_CXL_RAS=y
# CONFIG_CACHEMAINT_FOR_HOTPLUG is not set
# CONFIG_SFC_CXL is not set
CONFIG_CXL_PMU=m
CONFIG_DEV_DAX=y
CONFIG_DEV_DAX_PMEM=m
CONFIG_DEV_DAX_HMEM=m
CONFIG_DEV_DAX_CXL=y
CONFIG_DEV_DAX_HMEM_DEVICES=y
CONFIG_DEV_DAX_KMEM=y
CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION=y
CONFIG_GENERIC_CPU_CACHE_MAINTENANCE=y

Runtime Testing:

  • Boot test on ARM64 system
  • CXL device enumeration test (ls /sys/bus/cxl/devices/)
  • CXL interleaving testing
  • CXL reset test (echo 1 > /sys/bus/pci/devices/<dev>/cxl_reset)
  • DVSEC save/restore verified (CXLCtl, Range registers preserved)

Notes

  • CONFIG_PCIEAER_CXL has been removed from Kconfig by upstream commit
    d18f1b7beadf (PCI/AER: Replace PCIEAER_CXL symbol with CXL_RAS).
    The debian.master annotation for PCIEAER_CXL=y is overridden to -
    in debian.nvidia-6.17/config/annotations.
  • CONFIG_CXL_BUS, CONFIG_CXL_PCI, CONFIG_CXL_MEM, CONFIG_CXL_PORT
    remain tristate (not bool) — the v14 series kept them as tristate,
    unlike earlier draft versions.
  • CONFIG_DEV_DAX, CONFIG_DEV_DAX_CXL, and CONFIG_DEV_DAX_KMEM are
    overridden from m (debian.master default) to y to support built-in
    CXL RAM region DAX access and System-RAM mapping.
  • CONFIG_PCI_CXL is a new hidden bool introduced by the save/restore
    series; auto-enabled when CXL_BUS=y. Gates compilation of
    drivers/pci/cxl.o for DVSEC and HDM state save/restore.
  • CONFIG_GENERIC_CPU_CACHE_MAINTENANCE and
    CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION are new configs
    introduced by the upstream cherry-picks; arm64 auto-selects both.
    cpu_cache_invalidate_memregion() is also used by the CXL reset
    series for cache flushing during reset.
  • Kernel config annotations updated in debian.nvidia-6.17/config/annotations
    to reflect all of the above changes.
  • Srirangan's save/restore series header reorganization was adapted to
    align with Alejandro's v23 approach (include/cxl/cxl.h instead of
    include/cxl/pci.h). See commit message on patch 2/5 for details.
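The CONFIG_PCI_CXL note above describes a hidden, auto-enabled bool. A hypothetical Kconfig/Makefile sketch of that wiring (not the actual patch text; the exact expression in the series may differ, e.g. in how the tristate CXL_BUS is coerced):

```kconfig
# drivers/pci/Kconfig (sketch -- not the actual patch)
config PCI_CXL
	def_bool CXL_BUS

# drivers/pci/Makefile (sketch): gates DVSEC/HDM state save/restore
obj-$(CONFIG_PCI_CXL) += cxl.o
```

Because PCI_CXL has no prompt, it never appears in menuconfig; it simply follows CXL_BUS, which is why the annotations override is needed rather than a config entry.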

nvmochs and others added 30 commits February 13, 2026 16:49
BugLink: https://bugs.launchpad.net/bugs/2139315

This reverts commit 93c54a5 so that it
can be replaced by the upstream equivalents.

Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2139315

Add cpu part and model macro definitions for NVIDIA Olympus core.

Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Signed-off-by: Will Deacon <will@kernel.org>
(cherry picked from commit e185c8a)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2139315

Add the part number and MIDR for NVIDIA Olympus.

Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
Reviewed-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
(cherry picked from commit d5e4c71)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2139315

Add NVIDIA Olympus MIDR to neoverse_spe range list.

Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
Reviewed-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
(backported from commit d852b83)
[mochs: Minor context cleanup due to absence of "perf arm_spe: Add CPU variants supporting common data source packet"]
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2139315

The documentation in nvidia-pmu.rst contains PMUs specific
to NVIDIA Tegra241 SoC. Rename the file for this specific
SoC to have better distinction with other NVIDIA SoC.

Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
(backported from https://lore.kernel.org/all/20260126181155.2776097-1-bwicaksono@nvidia.com/)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2139315

Adds Unified Coherent Fabric PMU support in Tegra410 SOC.

Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
(backported from https://lore.kernel.org/all/20260126181155.2776097-1-bwicaksono@nvidia.com/)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2139315

Add interface to get ACPI device associated with the
PMU. This ACPI device may contain additional properties
not covered by the standard properties.

Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
(backported from https://lore.kernel.org/all/20260126181155.2776097-1-bwicaksono@nvidia.com/)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2139315

Adds PCIE PMU support in Tegra410 SOC.

Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
(backported from https://lore.kernel.org/all/20260126181155.2776097-1-bwicaksono@nvidia.com/)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2139315

Adds PCIE-TGT PMU support in Tegra410 SOC.

Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
(backported from https://lore.kernel.org/all/20260126181155.2776097-1-bwicaksono@nvidia.com/)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2139315

Adds CPU Memory (CMEM) Latency PMU support in Tegra410 SOC.

Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
(backported from https://lore.kernel.org/all/20260126181155.2776097-1-bwicaksono@nvidia.com/)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2139315

Adds NVIDIA C2C PMU support in Tegra410 SOC.

Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
(backported from https://lore.kernel.org/all/20260126181155.2776097-1-bwicaksono@nvidia.com/)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2139315

Enable driver for NVIDIA TEGRA410 CMEM Latency and C2C PMU device.

Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
(backported from https://lore.kernel.org/all/20260126181155.2776097-1-bwicaksono@nvidia.com/)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
… events

BugLink: https://bugs.launchpad.net/bugs/2139315

Add JSON files for NVIDIA Tegra410 Olympus core PMU events.
Also updated the common-and-microarch.json.

Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
(backported from https://lore.kernel.org/all/20260127225909.3296202-1-bwicaksono@nvidia.com/)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
…EGRA410_CMEM_LATENCY_PMU

BugLink: https://bugs.launchpad.net/bugs/2139315

Set the following kconfigs to enable these PMUs on T410:
    CONFIG_NVIDIA_TEGRA410_C2C_PMU=m
    CONFIG_NVIDIA_TEGRA410_CMEM_LATENCY_PMU=m

Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2138131

Properly pass the variadic arguments so it can be called with or without
them depending on the format.

Signed-off-by: Lucas De Marchi <ldemarchi@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2140343

A vDEVICE has been a hard requirement for attaching a nested domain to the
device. This makes sense when installing a guest STE, since a vSID must be
present and given to the kernel during the vDEVICE allocation.

But, when CR0.SMMUEN is disabled, VM doesn't really need a vSID to program
the vSMMU behavior as GBPA will take effect, in which case the vSTE in the
nested domain could have carried the bypass or abort configuration in GBPA
register. Thus, having such a hard requirement doesn't work well for GBPA.

Skip vmaster allocation in arm_smmu_attach_prepare_vmaster() for an abort
or bypass vSTE. Note that device on this attachment won't report vevents.

Update the uAPI doc accordingly.

Link: https://patch.msgid.link/r/20251103172755.2026145-1-nicolinc@nvidia.com
Tested-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
(backported from commit 81c45c6)
Signed-off-by: Nathan Chen <nathanc@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2140343

The function hugetlb_reserve_pages() returns the number of pages added
to the reservation map on success and a negative error code on failure
(e.g. -EINVAL, -ENOMEM). However, in some error paths, it may return -1
directly.

For example, a failure at:

    if (hugetlb_acct_memory(h, gbl_reserve) < 0)
        goto out_put_pages;

results in returning -1 (since add = -1), which may be misinterpreted
in userspace as -EPERM.

Fix this by explicitly capturing and propagating the return values from
helper functions, and using -EINVAL for all other failure cases.

Link: https://lkml.kernel.org/r/20251125171350.86441-1-skolothumtho@nvidia.com
Fixes: 986f5f2 ("mm/hugetlb: make hugetlb_reserve_pages() return nr of entries updated")
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew R. Ochs <mochs@nvidia.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicolin Chen <nicolinc@nvidia.com>
Cc: Vivek Kasireddy <vivek.kasireddy@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(backported from commit 9ee5d17)
Signed-off-by: Nathan Chen <nathanc@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2140343

The Enable bits in CMDQV/VINTF/VCMDQ_CONFIG registers do not actually reset
the HW registers. So, the driver explicitly clears all the registers when a
VINTF or VCMDQ is being initialized calling its hw_deinit() function.

However, a userspace VCMDQ is not properly reset, unlike an in-kernel VCMDQ
getting reset in tegra241_vcmdq_hw_init().

Meanwhile, tegra241_vintf_hw_init() calling tegra241_vintf_hw_deinit() will
not deinit any VCMDQ, since there is no userspace VCMDQ mapped to the VINTF
at that stage.

Then, this may result in dirty VCMDQ registers, which can fail the VM.

Like tegra241_vcmdq_hw_init(), reset a VCMDQ in tegra241_vcmdq_hw_init() to
fix this bug. This is required by a host kernel.

Fixes: 6717f26ab1e7 ("iommu/tegra241-cmdqv: Add user-space use support")
Cc: stable@vger.kernel.org
Reported-by: Bao Nguyen <ncqb@google.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
(backported from commit 80f1a2c)
Signed-off-by: Nathan Chen <nathanc@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
… transfer

BugLink: https://bugs.launchpad.net/bugs/2139640

When the ISR thread wakes up late and finds that the timeout handler
has already processed the transfer (curr_xfer is NULL), return
IRQ_HANDLED instead of IRQ_NONE.

Use a similar approach to tegra_qspi_handle_timeout() by reading
QSPI_TRANS_STATUS and checking the QSPI_RDY bit to determine if the
hardware actually completed the transfer. If QSPI_RDY is set, the
interrupt was legitimate and triggered by real hardware activity.
The fact that the timeout path handled it first doesn't make it
spurious. Returning IRQ_NONE incorrectly suggests the interrupt
wasn't for this device, which can cause issues with shared interrupt
lines and interrupt accounting.

Fixes: b4e002d ("spi: tegra210-quad: Fix timeout handling")
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Link: https://patch.msgid.link/20260126-tegra_xfer-v2-1-6d2115e4f387@debian.org
Signed-off-by: Mark Brown <broonie@kernel.org>
(cherry picked from commit aabd8ea linux-next)
Signed-off-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2139640

Move the assignment of the transfer pointer from curr_xfer inside the
spinlock critical section in both handle_cpu_based_xfer() and
handle_dma_based_xfer().

Previously, curr_xfer was read before acquiring the lock, creating a
window where the timeout path could clear curr_xfer between reading it
and using it. By moving the read inside the lock, the handlers are
guaranteed to see a consistent value that cannot be modified by the
timeout path.

Fixes: 921fc18 ("spi: tegra210-quad: Add support for Tegra210 QSPI controller")
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Thierry Reding <treding@nvidia.com>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Jon Hunter <jonathanh@nvidia.com>
Link: https://patch.msgid.link/20260126-tegra_xfer-v2-2-6d2115e4f387@debian.org
Signed-off-by: Mark Brown <broonie@kernel.org>
(cherry picked from commit ef13ba3 linux-next)
Signed-off-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
…transfer_one

BugLink: https://bugs.launchpad.net/bugs/2139640

When the timeout handler processes a completed transfer and signals
completion, the transfer thread can immediately set up the next transfer
and assign curr_xfer to point to it.

If a delayed ISR from the previous transfer then runs, it checks if
(!tqspi->curr_xfer) (currently without the lock also -- to be fixed
soon) to detect stale interrupts, but this check passes because
curr_xfer now points to the new transfer. The ISR then incorrectly
processes the new transfer's context.

Protect the curr_xfer assignment with the spinlock to ensure the ISR
either sees NULL (and bails out) or sees the new value only after the
assignment is complete.

Fixes: 921fc18 ("spi: tegra210-quad: Add support for Tegra210 QSPI controller")
Signed-off-by: Breno Leitao <leitao@debian.org>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Link: https://patch.msgid.link/20260126-tegra_xfer-v2-3-6d2115e4f387@debian.org
Signed-off-by: Mark Brown <broonie@kernel.org>
(cherry picked from commit f5a4d7f linux-next)
Signed-off-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2139640

The curr_xfer field is read by the IRQ handler without holding the lock
to check if a transfer is in progress. When clearing curr_xfer in the
combined sequence transfer loop, protect it with the spinlock to prevent
a race with the interrupt handler.

Protect the curr_xfer clearing at the exit path of
tegra_qspi_combined_seq_xfer() with the spinlock to prevent a race
with the interrupt handler that reads this field.

Without this protection, the IRQ handler could read a partially updated
curr_xfer value, leading to NULL pointer dereference or use-after-free.

Fixes: b4e002d ("spi: tegra210-quad: Fix timeout handling")
Signed-off-by: Breno Leitao <leitao@debian.org>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Link: https://patch.msgid.link/20260126-tegra_xfer-v2-4-6d2115e4f387@debian.org
Signed-off-by: Mark Brown <broonie@kernel.org>
(cherry picked from commit bf4528a linux-next)
Signed-off-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
…ined_seq_xfer

BugLink: https://bugs.launchpad.net/bugs/2139640

Protect the curr_xfer clearing in tegra_qspi_non_combined_seq_xfer()
with the spinlock to prevent a race with the interrupt handler that
reads this field to check if a transfer is in progress.

Fixes: b4e002d ("spi: tegra210-quad: Fix timeout handling")
Signed-off-by: Breno Leitao <leitao@debian.org>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Link: https://patch.msgid.link/20260126-tegra_xfer-v2-5-6d2115e4f387@debian.org
Signed-off-by: Mark Brown <broonie@kernel.org>
(cherry picked from commit 6d7723e linux-next)
Signed-off-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2139640

Now that all other accesses to curr_xfer are done under the lock,
protect the curr_xfer NULL check in tegra_qspi_isr_thread() with the
spinlock. Without this protection, the following race can occur:

  CPU0 (ISR thread)              CPU1 (timeout path)
  ----------------               -------------------
  if (!tqspi->curr_xfer)
    // sees non-NULL
                                 spin_lock()
                                 tqspi->curr_xfer = NULL
                                 spin_unlock()
  handle_*_xfer()
    spin_lock()
    t = tqspi->curr_xfer  // NULL!
    ... t->len ...        // NULL dereference!

With this patch, all curr_xfer accesses are now properly synchronized.

Although all accesses to curr_xfer are done under the lock, in
tegra_qspi_isr_thread() it checks for NULL, releases the lock and
reacquires it later in handle_cpu_based_xfer()/handle_dma_based_xfer().
There is a potential for an update in between, which could cause a NULL
pointer dereference.

To handle this, add a NULL check inside the handlers after acquiring
the lock. This ensures that if the timeout path has already cleared
curr_xfer, the handler will safely return without dereferencing the
NULL pointer.
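As a userspace analogue of the pattern this patch establishes (a pthread mutex standing in for the driver spinlock; simplified, not the driver code), the handler re-reads curr_xfer under the lock and bails out when the timeout path has already cleared it:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int transfer;
static int *curr_xfer = &transfer;

/* ISR-thread analogue: all curr_xfer access happens under the lock. */
static int handle_xfer(void)
{
	int handled = 0;

	pthread_mutex_lock(&lock);
	int *t = curr_xfer;        /* read under the lock */
	if (t) {                   /* NULL means the timeout path won the race */
		*t += 1;           /* safe: cannot be cleared while we hold the lock */
		handled = 1;
	}
	pthread_mutex_unlock(&lock);
	return handled;
}

/* Timeout-path analogue: clears curr_xfer under the same lock. */
static void timeout_path(void)
{
	pthread_mutex_lock(&lock);
	curr_xfer = NULL;
	pthread_mutex_unlock(&lock);
}
```

Once the timeout path has run, a late handler sees NULL under the lock and returns without touching the stale transfer.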

Fixes: b4e002d ("spi: tegra210-quad: Fix timeout handling")
Signed-off-by: Breno Leitao <leitao@debian.org>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Link: https://patch.msgid.link/20260126-tegra_xfer-v2-6-6d2115e4f387@debian.org
Signed-off-by: Mark Brown <broonie@kernel.org>
(cherry picked from commit edf9088 linux-next)
Signed-off-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2139648

Currently cpu-clock event always returns 0 count, e.g.,

perf stat -e cpu-clock -- sleep 1

 Performance counter stats for 'sleep 1':
                 0      cpu-clock                        #    0.000 CPUs utilized
       1.002308394 seconds time elapsed

The root cause is the commit 'bc4394e5e79c ("perf: Fix the throttle
 error of some clock events")' adds PERF_EF_UPDATE flag check before
calling cpu_clock_event_update() to update the count, however the
PERF_EF_UPDATE flag is never set when the cpu-clock event is stopped in
counting mode (pmu->del() -> cpu_clock_event_del() ->
cpu_clock_event_stop()). This leads to the cpu-clock event count never
being updated.

To fix this issue, force to set PERF_EF_UPDATE flag for cpu-clock event
just like what task-clock does.

Fixes: bc4394e ("perf: Fix the throttle error of some clock events")
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://patch.msgid.link/20251112080526.3971392-1-dapeng1.mi@linux.intel.com
(cherry picked from commit f1f9651)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2093957

Signed-off-by: Jeremy Szu <jszu@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2138892

Remove this declaration, which is no longer needed now that the function
is used within the file after merging upstream "vfio/nvgrace-gpu: register
device memory for poison handling".

Signed-off-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2136828

Add PCI_VENDOR_ID_ASPEED to the shared pci_ids.h header and remove the
duplicate local definition from ehci-pci.c.

This prepares for adding a PCI quirk for ASPEED devices.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nirmoy Das <nirmoyd@nvidia.com>
(backported from https://lore.kernel.org/linux-iommu/20251217154529.377586-1-nirmoyd@nvidia.com/)
Signed-off-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2136828

ASPEED BMC controllers have VGA and USB functions behind a PCIe-to-PCI
bridge that causes them to share the same stream ID:

  [e0]---00.0-[e1-e2]----00.0-[e2]--+-00.0  ASPEED Graphics Family
                                    \-02.0  ASPEED USB Controller

Both devices get stream ID 0x5e200 due to bridge aliasing, causing the
USB controller to be rejected with 'Aliasing StreamID unsupported'.

Per ASPEED, the AST1150 doesn't use a real PCI bus and always forwards
the original requester ID from downstream devices rather than replacing
it with any alias.

Add a new PCI_DEV_FLAGS_PCI_BRIDGE_NO_ALIASES flag and apply it to the
AST1150.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Nirmoy Das <nirmoyd@nvidia.com>
(backported from https://lore.kernel.org/linux-iommu/20251217154529.377586-2-nirmoyd@nvidia.com/)
[nirmoy: set PCI_DEV_FLAGS_PCI_BRIDGE_NO_ALIASES to (1 << 15) instead of (1 << 14)]
Signed-off-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
…ivation (LFA)

BugLink: https://bugs.launchpad.net/bugs/2138342

The Arm Live Firmware Activation (LFA) specification [1] describes
activating firmware components without a reboot. Those components
(like TF-A's BL31, EDK-II, TF-RMM, secure payloads) would be updated the
usual way: via fwupd, FF-A or other secure storage methods, or via some
IMPDEF Out-Of-Band method. The user can then activate this new firmware,
at system runtime, without requiring a reboot.
The specification covers the SMCCC interface to list and query available
components and eventually trigger the activation.

Add a new directory under /sys/firmware to present firmware components
capable of live activation. Each of them is a directory under lfa/,
identified by its GUID. Activation is triggered by echoing
"1" into the "activate" file:
==========================================
/sys/firmware/lfa # ls -l . 6c*
.:
total 0
drwxr-xr-x    2 0 0         0 Jan 19 11:33 47d4086d-4cfe-9846-9b95-2950cbbd5a00
drwxr-xr-x    2 0 0         0 Jan 19 11:33 6c0762a6-12f2-4b56-92cb-ba8f633606d9
drwxr-xr-x    2 0 0         0 Jan 19 11:33 d6d0eea7-fcea-d54b-9782-9934f234b6e4

6c0762a6-12f2-4b56-92cb-ba8f633606d9:
total 0
--w-------    1 0        0             4096 Jan 19 11:33 activate
-r--r--r--    1 0        0             4096 Jan 19 11:33 activation_capable
-r--r--r--    1 0        0             4096 Jan 19 11:33 activation_pending
--w-------    1 0        0             4096 Jan 19 11:33 cancel
-r--r--r--    1 0        0             4096 Jan 19 11:33 cpu_rendezvous
-r--r--r--    1 0        0             4096 Jan 19 11:33 current_version
-rw-r--r--    1 0        0             4096 Jan 19 11:33 force_cpu_rendezvous
-r--r--r--    1 0        0             4096 Jan 19 11:33 may_reset_cpu
-r--r--r--    1 0        0             4096 Jan 19 11:33 name
-r--r--r--    1 0        0             4096 Jan 19 11:33 pending_version
/sys/firmware/lfa/6c0762a6-12f2-4b56-92cb-ba8f633606d9 # grep . *
grep: activate: Permission denied
activation_capable:1
activation_pending:1
grep: cancel: Permission denied
cpu_rendezvous:1
current_version:0.0
force_cpu_rendezvous:1
may_reset_cpu:0
name:TF-RMM
pending_version:0.0
/sys/firmware/lfa/6c0762a6-12f2-4b56-92cb-ba8f633606d9 # echo 1 > activate
[ 2825.797871] Arm LFA: firmware activation succeeded.
/sys/firmware/lfa/6c0762a6-12f2-4b56-92cb-ba8f633606d9 #
==========================================

[1] https://developer.arm.com/documentation/den0147/latest/

Signed-off-by: Salman Nabi <salman.nabi@arm.com>
Signed-off-by: Vedashree Vidwans <vvidwans@nvidia.com>
(backported from https://lore.kernel.org/all/20260119122729.287522-2-salman.nabi@arm.com/)
[nirmoyd: Added image_name fallback to fw_uuid in update_fw_image_node()]
Signed-off-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
alucerop and others added 18 commits March 24, 2026 11:58
Use cxl api for creating a region using the endpoint decoder related to
a DPA range.

Signed-off-by: Alejandro Lucero <alucerop@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
(backported from https://lore.kernel.org/linux-cxl/20260201155438.2664640-1-alejandro.lucero-palau@amd.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
A PIO buffer is a region of device memory to which the driver can write a
packet for TX, with the device handling the transmit doorbell without
requiring a DMA to fetch the packet data, which helps reduce latency
in certain exchanges. With the CXL.mem protocol this latency can be lowered
further.

With a device supporting CXL and successfully initialised, use the CXL
region to map the memory range and use this mapping for the PIO buffers.

Disable those CXL-based PIO buffers if the CXL code invokes the callback
for potential CXL endpoint removal.

Signed-off-by: Alejandro Lucero <alucerop@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
(backported from https://lore.kernel.org/linux-cxl/20260201155438.2664640-1-alejandro.lucero-palau@amd.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
…smaller granularities for lower levels

The CXL specification supports multi-level interleaving "as long as
all the levels use different, but consecutive, HPA bits to select the
target and no Interleave Set has more than 8 devices" (from 3.2).

Currently the kernel expects that a decoder's "interleave granularity
is a multiple of @parent_port granularity". That is, the granularity
of a lower level is bigger than that of the parent and uses the outer
HPA bits as the selector. This works e.g. for the following 8-way config:

 * cross-link (cross-hostbridge config in CFMWS):
   * 4-way
   * 256 granularity
   * Selector: HPA[8:9]
 * sub-link (CXL Host bridge config of the HDM):
   * 2-way
   * 1024 granularity
   * Selector: HPA[10]

Now, if the outer HPA bits are used for the cross-hostbridge, an 8-way
config could look like this:

 * cross-link (cross-hostbridge config in CFMWS):
   * 4-way
   * 512 granularity
   * Selector: HPA[9:10]
 * sub-link (CXL Host bridge config of the HDM):
   * 2-way
   * 256 granularity
   * Selector: HPA[8]

The enumeration of decoders for this configuration then fails with the
following error:

 cxl region0: pci0000:00:port1 cxl_port_setup_targets expected iw: 2 ig: 1024 [mem 0x10000000000-0x1ffffffffff flags 0x200]
 cxl region0: pci0000:00:port1 cxl_port_setup_targets got iw: 2 ig: 256 state: enabled 0x10000000000:0x1ffffffffff
 cxl_port endpoint12: failed to attach decoder12.0 to region0: -6

Note that this happens only if firmware is setting up the decoders
(CXL_REGION_F_AUTO). For userspace region assembly the granularities
are chosen to increase from root down to the lower levels. That is,
outer HPA bits are always used for lower interleaving levels.

Rework the implementation to also support multi-level interleaving
with smaller granularities for lower levels. Determine the interleave
set of autodetected decoders. Check that it is a subset of the root
interleave.

The HPA selector bits are extracted for all decoders of the set and
checked for overlap and consecutiveness. All decoders can now be
programmed to use any bit range within the region's target selector.

Signed-off-by: Robert Richter <rrichter@amd.com>
(backported from https://lore.kernel.org/all/20251028094754.72816-1-rrichter@amd.com/)
[jan: Resolved minor conflicts]
Signed-off-by: Jiandi An <jan@nvidia.com>
PCI: Add CXL DVSEC control, lock, and range register definitions

Add register offset and field definitions for CXL DVSEC registers needed
by CXL state save/restore across resets:

  - CTRL2 (offset 0x10) and LOCK (offset 0x14) registers
  - CONFIG_LOCK bit in the LOCK register
  - RWL (read-write-when-locked) field masks for CTRL and range base
    registers.

Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260306080026.116789-1-smadhavan@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
… to include/cxl/cxl.h

Move CXL HDM decoder register defines, register map structs
(cxl_reg_map, cxl_component_reg_map, cxl_device_reg_map,
cxl_pmu_reg_map, cxl_register_map), cxl_hdm_decoder_count(),
enum cxl_regloc_type, and cxl_find_regblock()/cxl_setup_regs()
declarations from internal CXL headers to include/cxl/pci.h.

This makes them accessible to code outside the CXL subsystem, in
particular the PCI core CXL state save/restore support added in a
subsequent patch.

No functional change.

Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260306080026.116789-1-smadhavan@nvidia.com/)
[jan: Resolve conflicts by moving certain definitions to include/cxl/cxl.h instead of to include/cxl/pci.h to align with its dependency of Alejandro's series]
Signed-off-by: Jiandi An <jan@nvidia.com>
…state

Add pci_add_virtual_ext_cap_save_buffer() to allocate save buffers
using virtual cap IDs (above PCI_EXT_CAP_ID_MAX) that don't require
a real capability in config space.

The existing pci_add_ext_cap_save_buffer() cannot be used for
CXL DVSEC state because it calls pci_find_saved_ext_cap()
which searches for a matching capability in PCI config space.
The CXL state saved here is a synthetic snapshot (DVSEC+HDM)
and should not be tied to a real extended-cap instance. A
virtual extended-cap save buffer API (cap IDs above
PCI_EXT_CAP_ID_MAX) allows PCI to track this state without
a backing config space capability.

Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260306080026.116789-1-smadhavan@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
Save and restore CXL DVSEC control registers (CTRL, CTRL2), range
base registers, and lock state across PCI resets.

When the DVSEC CONFIG_LOCK bit is set, certain DVSEC fields
become read-only and hardware may have updated them. Blindly
restoring saved values would be silently ignored or conflict
with hardware state. Instead, a read-merge-write approach is
used: current hardware values are read for the RWL
(read-write-when-locked) fields and merged with saved state,
so only writable bits are restored while locked bits retain
their hardware values.

Hooked into pci_save_state()/pci_restore_state() so all PCI reset
paths automatically preserve CXL DVSEC configuration.

Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260306080026.116789-1-smadhavan@nvidia.com/)
[jan: Resolve minor conflict in drivers/pci/Makefile due to code line shifts]
Signed-off-by: Jiandi An <jan@nvidia.com>
Save and restore CXL HDM decoder registers (global control,
per-decoder base/size/target-list, and commit state) across PCI
resets. On restore, decoders that were committed are reprogrammed
and recommitted with a 10ms timeout. Locked decoders that are
already committed are skipped, since their state is protected by
hardware and reprogramming them would fail.

The Register Locator DVSEC is parsed directly via PCI config space
reads rather than calling cxl_find_regblock()/cxl_setup_regs(),
since this code lives in the PCI core and must not depend on CXL
module symbols.

MSE is temporarily enabled during save/restore to allow MMIO
access to the HDM decoder register block.

Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260306080026.116789-1-smadhavan@nvidia.com/)
[jan: Include <cxl/cxl.h> in drivers/pci/cxl.c due to conflict resolution in "4acbc27592b8 NVIDIA: VR: SAUCE: cxl: Move HDM decoder and register map definitions to include/cxl/cxl.h"]
Signed-off-by: Jiandi An <jan@nvidia.com>
…efinitions

Add CXL DVSEC register definitions needed for CXL device reset per
CXL r3.2 section 8.1.3.1:
- Capability bits: RST_CAPABLE, CACHE_CAPABLE, CACHE_WBI_CAPABLE,
  RST_TIMEOUT, RST_MEM_CLR_CAPABLE
- Control2 register: DISABLE_CACHING, INIT_CACHE_WBI, INIT_CXL_RST,
  RST_MEM_CLR_EN
- Status2 register: CACHE_INV, RST_DONE, RST_ERR
- Non-CXL Function Map DVSEC register offset

Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260306092322.148765-1-smadhavan@nvidia.com/)
[jan: Resolve conflicts where PCI_DVSEC_CXL_CACHE_CAPABLE is already added by "72bd823fb4f1 NVIDIA: VR: SAUCE: PCI: Allow ATS to be always on for CXL.cache capable devices"]
Signed-off-by: Jiandi An <jan@nvidia.com>
…_restore()

Export pci_dev_save_and_disable() and pci_dev_restore() so that
subsystems performing non-standard reset sequences (e.g. CXL)
can reuse the PCI core standard pre/post reset lifecycle:
driver reset_prepare/reset_done callbacks, PCI config space
save/restore, and device disable/re-enable.

Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260306092322.148765-1-smadhavan@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
Add infrastructure for quiescing the CXL data path before reset:

- Memory offlining: check if CXL-backed memory is online and offline
  it via offline_and_remove_memory() before reset, per CXL
  spec requirement to quiesce all CXL.mem transactions before issuing
  CXL Reset.
- CPU cache flush: invalidate cache lines before reset
  as a safety measure after memory offline.

Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260306092322.148765-1-smadhavan@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
…XL reset

Add sibling PCI function save/disable/restore coordination for CXL
reset. Before reset, all CXL.cachemem sibling functions are locked,
saved, and disabled; after reset they are restored. The Non-CXL Function
Map DVSEC and per-function DVSEC capability register are consulted to
skip non-CXL and CXL.io-only functions. A global mutex serializes
concurrent resets to prevent deadlocks between sibling functions.

Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260306092322.148765-1-smadhavan@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
…ration

cxl_dev_reset() implements the hardware reset sequence:
optionally enable memory clear, initiate reset via
CTRL2, wait for completion, and re-enable caching.

cxl_do_reset() orchestrates the full reset flow:
  1. CXL pre-reset: mem offlining and cache flush (when memdev present)
  2. PCI save/disable: pci_dev_save_and_disable() automatically saves
     CXL DVSEC and HDM decoder state via PCI core hooks
  3. Sibling coordination: save/disable CXL.cachemem sibling functions
  4. Execute CXL DVSEC reset
  5. Sibling restore: always runs to re-enable sibling functions
  6. PCI restore: pci_dev_restore() automatically restores CXL state

The CXL-specific DVSEC and HDM save/restore is handled
by the PCI core's CXL save/restore infrastructure (drivers/pci/cxl.c).

Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260306092322.148765-1-smadhavan@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
Add a "cxl_reset" sysfs attribute to PCI devices that support CXL
Reset (CXL r3.2 section 8.1.3.1). The attribute is visible only on
devices with both CXL.cache and CXL.mem capabilities and the CXL
Reset Capable bit set in the DVSEC.

Writing "1" to the attribute triggers the full CXL reset flow via
cxl_do_reset(). The interface is decoupled from memdev creation:
when a CXL memdev exists, memory offlining and cache flush are
performed; otherwise reset proceeds without the memory management.

The sysfs attribute is managed entirely by the CXL module using
sysfs_create_group() / sysfs_remove_group() rather than the PCI
core's static attribute groups. This avoids cross-module symbol
dependencies between the PCI core (always built-in) and CXL_BUS
(potentially modular).

At module init, existing PCI devices are scanned and a PCI bus
notifier handles hot-plug/unplug. kernfs_drain() makes sure that
any in-flight store() completes before sysfs_remove_group() returns,
preventing use-after-free during module unload.

Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260306092322.148765-1-smadhavan@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
…tribute

Document the cxl_reset sysfs attribute added to PCI devices that
support CXL Reset.

Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260306092322.148765-1-smadhavan@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
…and RAS support

Add Ubuntu kernel config annotations for CXL-related configs introduced
or changed by the following cherry-picked patch series:
  - drivers/cxl changes between v6.17.9 and upstream 7.0 (which includes
    a portion of Terry Bowman's v14 CXL RAS series merged via
    for-7.0/cxl-aer-prep)
  - Alejandro Lucero's v23 CXL Type-2 device support series
  - Smita Koralahalli's v6 patch 3/9 (cxl/region: Skip decoder reset on
    detach for autodiscovered regions)

CONFIG_CXL_BUS:           Enable CXL bus support built-in; required for
                          CXL Type-2 device and RAS support
CONFIG_CXL_PCI:           Enable CXL PCI management built-in; auto-selects
                          CXL_MEM; required for CXL Type-2 device support
CONFIG_CXL_MEM:           Auto-selected by CXL_PCI; required for CXL
                          memory expansion and Type-2 device support
CONFIG_CXL_PORT:          Required for CXL port enumeration; defaults to
                          CXL_BUS value
CONFIG_FWCTL:             Selected by CXL_BUS when CXL_FEATURES is enabled;
                          required for CXL feature mailbox access
CONFIG_CXL_RAS:           New def_bool replacing PCIEAER_CXL (Terry Bowman
                          v14); auto-enabled with ACPI_APEI_GHES+PCIEAER+
                          CXL_BUS for CXL RAS error handling
CONFIG_SFC_CXL:           Solarflare SFC9100-family CXL Type-2 device
                          support; not needed for NVIDIA platforms (n)
CONFIG_ACPI_APEI_EINJ:    Required prerequisite for CONFIG_ACPI_APEI_EINJ_CXL
CONFIG_ACPI_APEI_EINJ_CXL: CXL protocol error injection support via APEI EINJ

CONFIG_PCIEAER_CXL: Remove it from debian.master policy. This config
  was removed from Kconfig by upstream commit d18f1b7
 (PCI/AER: Replace PCIEAER_CXL symbol with CXL_RAS) which is included
 in this port.

CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION: Override debian.master
  amd64-only policy to include arm64. Commit 4d873c5 added
  'select ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION' to arch/arm64/Kconfig,
  making this y on arm64 as well.

CONFIG_GENERIC_CPU_CACHE_MAINTENANCE: New bool config defined by
  c460697 in lib/Kconfig. Selected by arm64 via 4d873c5;
  not selected by x86. Set arm64: y, amd64: -.

CONFIG_CACHEMAINT_FOR_HOTPLUG: New optional menuconfig defined by
  2ec3b54 in drivers/cache/Kconfig. Depends on
  GENERIC_CPU_CACHE_MAINTENANCE so becomes visible on arm64. Defaults
  to n; HiSilicon HHA driver not needed for NVIDIA platforms.
  Set arm64: n, amd64: -.

Signed-off-by: Jiandi An <jan@nvidia.com>
…memory access

Override debian.master policy (m->y) for DEV_DAX, DEV_DAX_CXL, and
DEV_DAX_KMEM to ensure CXL memory regions are accessible as both raw
DAX devices and hotplugged System-RAM nodes.

debian.master sets these to 'm' (modules). For NVIDIA platforms with
CXL Type-2 devices, built-in (y) is required to ensure CXL memory
regions provisioned early in boot are immediately accessible without
relying on module loading order.

CONFIG_DEV_DAX:     Override m->y; prerequisite for DEV_DAX_CXL and
                    DEV_DAX_KMEM to be built-in; depends on
                    TRANSPARENT_HUGEPAGE (already y in debian.master)

CONFIG_DEV_DAX_CXL: Override m->y; creates /dev/daxX.Y devices for CXL
                    RAM regions not in the default system memory map
                    (Soft Reserved or dynamically provisioned regions);
                    depends on CXL_BUS+CXL_REGION+DEV_DAX (all y)

CONFIG_DEV_DAX_KMEM: Override m->y; onlines CXL DAX devices as System-RAM
                    NUMA nodes via memory hotplug, making CXL memory
                    available for normal kernel and userspace allocation

Signed-off-by: Jiandi An <jan@nvidia.com>
…/restore

Add Ubuntu kernel config annotation for CONFIG_PCI_CXL introduced by
the CXL DVSEC and HDM state save/restore series (Srirangan Madhavan).

CONFIG_PCI_CXL:  Hidden bool in drivers/pci/Kconfig; auto-enabled when
                 CXL_BUS=y. Gates compilation of drivers/pci/cxl.o which
                 saves and restores CXL DVSEC control/range registers and
                 HDM decoder state across PCI resets and link transitions.

Signed-off-by: Jiandi An <jan@nvidia.com>
@JiandiAnNVIDIA
Author

The patch "PCI: Update CXL DVSEC definitions" missed one rename:

nvidia@localhost:/home/nvidia/NV-Kernels$ make
  CALL    scripts/checksyscalls.sh
  CC      drivers/pci/ats.o
drivers/pci/ats.c: In function ‘pci_cxl_ats_always_on’:
drivers/pci/ats.c:221:44: error: ‘CXL_DVSEC_PCIE_DEVICE’ undeclared (first use in this function); did you mean ‘PCI_DVSEC_CXL_DEVICE’?
  221 |                                            CXL_DVSEC_PCIE_DEVICE);
      |                                            ^~~~~~~~~~~~~~~~~~~~~
      |                                            PCI_DVSEC_CXL_DEVICE
drivers/pci/ats.c:221:44: note: each undeclared identifier is reported only once for each function it appears in
drivers/pci/ats.c:225:45: error: ‘CXL_DVSEC_CAP_OFFSET’ undeclared (first use in this function)
  225 |         pci_read_config_word(pdev, offset + CXL_DVSEC_CAP_OFFSET, &cap);
      |                                             ^~~~~~~~~~~~~~~~~~~~
make[4]: *** [scripts/Makefile.build:287: drivers/pci/ats.o] Error 1
make[3]: *** [scripts/Makefile.build:556: drivers/pci] Error 2
make[2]: *** [scripts/Makefile.build:556: drivers] Error 2
make[1]: *** [/home/nvidia/NV-Kernels/Makefile:2016: .] Error 2
make: *** [Makefile:248: __sub-make] Error 2

Fixed.

@clsotog
Collaborator

clsotog commented Mar 24, 2026

I see the compiling issue is fixed. That was my concern.

@nvmochs nvmochs self-requested a review March 24, 2026 19:37
Collaborator

@nvmochs nvmochs left a comment


I reviewed the name change fix in "PCI: Update CXL DVSEC definitions" and confirmed it builds successfully for arm64.

No further issues or concerns from me.

Acked-by: Matthew R. Ochs <mochs@nvidia.com>

@clsotog clsotog self-requested a review March 24, 2026 19:58
Collaborator

@clsotog clsotog left a comment


Acked-by: Carol L Soto <csoto@nvidia.com>

@nirmoy
Collaborator

nirmoy commented Mar 25, 2026

Tried this on GB300 yesterday with the compilation issue manually fixed. Ran CUDA DVS tests like http://10.112.214.250:8002/. We still need to make sure that there are no regressions with the older RM driver. With that
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>

@JiandiAnNVIDIA JiandiAnNVIDIA changed the title [linux-nvidia-6.17-next] Add CXL Type-2 device support, RAS error handling, reset, and state save/restore [linux-nvidia-6.17-next] Add CXL Type-2 device support, RAS error handling, reset, state save/restore, and interleaving support Mar 30, 2026
@nvmochs
Collaborator

nvmochs commented Mar 30, 2026

PR sent to Canonical.

@nvidia-bfigg nvidia-bfigg force-pushed the 24.04_linux-nvidia-6.17-next branch from 9364d8b to 8dab82a Compare April 2, 2026 12:01