
[linux-nvidia-6.17-next] Add CXL Type-2 device support, RAS error handling, reset, state save/restore, and interleaving support #342

Open
JiandiAnNVIDIA wants to merge 642 commits into NVIDIA:24.04_linux-nvidia-6.17-next from JiandiAnNVIDIA:cxl_2026-03-04

Conversation


JiandiAnNVIDIA commented Mar 12, 2026

Description

This patch series adds comprehensive CXL (Compute Express Link) support to the
nvidia-6.17 kernel, including:

  1. CXL Type-2 device support - Enables accelerator devices (like GPUs and
    SmartNICs) to use CXL for coherent memory access via firmware-provisioned
    regions
  2. CXL RAS (Reliability, Availability, Serviceability) error handling -
    Implements PCIe Port Protocol error handling and logging for CXL Root Ports,
    Downstream Switch Ports, and Upstream Switch Ports
  3. CXL DVSEC and HDM state save/restore - Preserves CXL DVSEC control/range
    registers and HDM decoder programming across PCI resets and link transitions,
    enabling device re-initialization after reset for firmware-provisioned
    configurations
  4. CXL Reset support - Implements the CXL Reset method (CXL Spec v3.2,
    Sections 8.1.3, 9.6, 9.7) via a sysfs interface for Type-2 devices,
    including memory offlining, cache flushing, multi-function sibling
    coordination, and DVSEC reset sequencing
  5. Multi-level interleaving fix - Supports firmware-configured CXL
    interleaving where lower levels use smaller granularities than parent ports
    (reverse HPA bit ordering)
  6. Prerequisite CXL and PCI driver updates - Cherry-picked commits from
    upstream torvalds/master covering the range from v6.17.9 to the merge
    point of Terry Bowman's v14 series into v7.0
  7. CXL DAX support - Enables direct memory access to CXL RAM regions and
    mapping CXL DAX devices as System-RAM
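The CXL Reset sequencing in item 4 can be illustrated with a stand-in sketch. None of these function names are kernel APIs; they are hypothetical stubs that only encode the step ordering listed above (offline region memory, flush CPU caches, coordinate multi-function siblings, then run the DVSEC reset sequence):

```c
#include <assert.h>

/* Hypothetical stubs -- not kernel APIs -- encoding only the ordering of
 * the CXL Reset steps described above for a Type-2 device. */
static int step;

static void offline_region_memory(void)   { assert(step == 0); step = 1; } /* take region memory offline */
static void flush_cpu_caches(void)        { assert(step == 1); step = 2; } /* cpu_cache_invalidate_memregion() in the series */
static void pause_sibling_functions(void) { assert(step == 2); step = 3; } /* Non-CXL Function Map DVSEC coordination */
static void dvsec_cxl_reset(void)         { assert(step == 3); step = 4; } /* initiate the DVSEC reset sequence */

static int cxl_reset_sketch(void)
{
	offline_region_memory();
	flush_cpu_caches();
	pause_sibling_functions();
	dvsec_cxl_reset();
	return step;
}
```

In the actual series this flow is triggered from the sysfs `cxl_reset` attribute; the sketch only pins down the relative ordering of the four phases.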

Key Features Added:

  • CXL Type-2 accelerator device registration and memory management
  • CXL region creation by Type-2 drivers
  • DPA (Device Physical Address) allocation interface for accelerators
  • HPA (Host Physical Address) free space enumeration
  • Multi-level CXL address translation (SPA↔HPA↔DPA)
  • CXL protocol error detection, forwarding, and recovery
  • CXL RAS error handling for Endpoints, RCH, and Switch Ports
    (replacing the old PCIEAER_CXL symbol with the new CXL_RAS def_bool)
  • CXL extended linear cache region support
  • CXL DVSEC and HDM decoder state save/restore across PCI resets
  • CXL Reset sysfs interface (/sys/bus/pci/devices/.../cxl_reset) for
    Type-2 devices with Reset Capable bit set
  • Multi-function sibling coordination during CXL reset via Non-CXL
    Function Map DVSEC
  • CPU cache flush using cpu_cache_invalidate_memregion() during reset
  • Multi-level interleaving with smaller granularities for lower decoder
    levels (firmware-provisioned configurations)
  • CXL DAX device access (DEV_DAX_CXL) and System-RAM mapping
    (DEV_DAX_KMEM)
  • CXL protocol error injection via APEI EINJ (ACPI_APEI_EINJ_CXL)
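At a single decoder level, the address-translation and interleaving items above reduce to standard CXL modulo arithmetic. A minimal illustrative sketch (not the kernel's implementation; power-of-2 `ways` and `gran` assumed, and the multi-level reverse-HPA-bit-ordering case the fix addresses is not modeled here):

```c
#include <assert.h>
#include <stdint.h>

/* For an HPA offset within a region interleaved across `ways` targets at
 * `gran`-byte granularity, pick the owning target and compute the DPA
 * offset by stripping the way-selection bits. */
static unsigned int cxl_target(uint64_t hpa, unsigned int ways, uint64_t gran)
{
	return (hpa / gran) % ways;          /* which endpoint owns this chunk */
}

static uint64_t cxl_dpa(uint64_t hpa, unsigned int ways, uint64_t gran)
{
	/* chunks owned by one target are gran bytes each, spaced
	 * gran * ways apart in HPA space */
	return (hpa / (gran * ways)) * gran + (hpa % gran);
}
```

For example, with 2 ways at 256-byte granularity, HPA offset 256 lands on target 1 at DPA offset 0, and HPA offset 512 lands back on target 0 at DPA offset 256. The multi-level fix in this series handles firmware configurations where a lower level uses a smaller granularity than its parent port.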

Justification

CXL Type-2 device support is critical for next-generation NVIDIA accelerators
and data center workloads:

  • Enables coherent memory sharing between CPUs and accelerators
  • Supports firmware-provisioned CXL regions for accelerator memory
  • Provides proper error handling and reporting for CXL fabric errors
  • Enables device reset and state recovery for CXL Type-2 devices
  • Preserves firmware-programmed DVSEC and HDM decoder state across resets
  • Required for upcoming NVIDIA hardware with CXL capabilities

Source

Patch Breakdown (142 patches + 1 revert, 143 total):

 #  Category                                            Count  Source
 1  Revert old CXL reset (f198764)                          1  OOT (cleanup)
 2  Upstream CXL/PCI prerequisite cherry-picks            103  Upstream torvalds/master (v6.17.9 → merge of Terry Bowman v14 into v7.0)
 3  Smita Koralahalli's CXL EINJ series v6, patch 3/9       1  LKML (v6, not yet merged)
 4  Alejandro Lucero's CXL Type-2 series v23               22  LKML (v23, not yet merged)
 5  Robert Richter's multi-level interleaving fix           1  LKML (v1, not yet merged)
 6  Srirangan Madhavan's CXL state save/restore series      5  LKML (v1, not yet merged)
 7  Srirangan Madhavan's CXL reset series                   7  LKML (v5, not yet merged)
 8  Config annotations update                               3  OOT (build config)
    TOTAL                                                 143

Notes on the upstream cherry-picks (item 2):

The 103 upstream commits span 1bfd0faa78d0 (v6.17.9) to
0da3050bdded (Merge of for-7.0/cxl-aer-prep into cxl-for-next).
This range includes 17 out of 34 patches from Terry Bowman's v14 series
that were reworked by the CXL maintainer and merged into v7.0 via the
for-7.0/cxl-aer-prep branch. The remaining 17 patches from Terry's v14
were refactored into v15 (9 patches, not yet merged) and are not included
in this port.

Notes on the save/restore and reset series (items 6–7):

Srirangan's patches were authored against upstream v7.0-rc1 (which does not
include Alejandro's v23 Type-2 series). For this port, the header
reorganization in patch 2/5 of the save/restore series was adapted to align
with Alejandro's v23 approach: HDM decoder and register map definitions were
moved to include/cxl/cxl.h (not include/cxl/pci.h as in the original
patch) to follow the convention established by Alejandro's series. Upstream
reviewers have indicated that Srirangan's series should be rebased on top of
Alejandro's once it merges.

Lore Links:

Upstream Status:

Series                              Status
103 upstream cherry-picks           ✅ Merged in torvalds/master (v7.0 range)
Terry Bowman v14 (17 patches)       ✅ Merged into v7.0 via for-7.0/cxl-aer-prep
Terry Bowman v15 (9 patches)        ⏳ Under review, not needed for this port
Smita v6 patch 3/9                  ⏳ Under review, not yet merged
Alejandro v23 (22 patches)          ⏳ Under review, not yet merged
Robert Richter v1 (1 patch)         ⏳ Under review, not yet merged
Srirangan save/restore (5 patches)  ⏳ Under review, not yet merged
Srirangan cxl_reset v5 (7 patches)  ⏳ Under review, not yet merged

Testing

Build Validation:

  • Built successfully for ARM64 4K page size kernel
  • Built successfully for ARM64 64K page size kernel

Config Verification:

CXL-related configs enabled as expected:

CONFIG_ACPI_APEI_EINJ_CXL=y
CONFIG_PCI_CXL=y
CONFIG_CXL_BUS=y
CONFIG_CXL_PCI=y
CONFIG_CXL_MEM_RAW_COMMANDS=y
CONFIG_CXL_ACPI=m
CONFIG_CXL_PMEM=m
CONFIG_CXL_MEM=y
CONFIG_CXL_FEATURES=y
# CONFIG_CXL_EDAC_MEM_FEATURES is not set
CONFIG_CXL_PORT=y
CONFIG_CXL_SUSPEND=y
CONFIG_CXL_REGION=y
# CONFIG_CXL_REGION_INVALIDATION_TEST is not set
CONFIG_CXL_RAS=y
# CONFIG_CACHEMAINT_FOR_HOTPLUG is not set
# CONFIG_SFC_CXL is not set
CONFIG_CXL_PMU=m
CONFIG_DEV_DAX=y
CONFIG_DEV_DAX_PMEM=m
CONFIG_DEV_DAX_HMEM=m
CONFIG_DEV_DAX_CXL=y
CONFIG_DEV_DAX_HMEM_DEVICES=y
CONFIG_DEV_DAX_KMEM=y
CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION=y
CONFIG_GENERIC_CPU_CACHE_MAINTENANCE=y

Runtime Testing:

  • Boot test on ARM64 system
  • CXL device enumeration test (ls /sys/bus/cxl/devices/)
  • CXL interleaving testing
  • CXL reset test (echo 1 > /sys/bus/pci/devices/<dev>/cxl_reset)
  • DVSEC save/restore verified (CXLCtl, Range registers preserved)

Notes

  • CONFIG_PCIEAER_CXL has been removed from Kconfig by upstream commit
    d18f1b7beadf (PCI/AER: Replace PCIEAER_CXL symbol with CXL_RAS).
    The debian.master annotation for PCIEAER_CXL=y is overridden to -
    in debian.nvidia-6.17/config/annotations.
  • CONFIG_CXL_BUS, CONFIG_CXL_PCI, CONFIG_CXL_MEM, CONFIG_CXL_PORT
    remain tristate (not bool) — the v14 series kept them as tristate,
    unlike earlier draft versions.
  • CONFIG_DEV_DAX, CONFIG_DEV_DAX_CXL, and CONFIG_DEV_DAX_KMEM are
    overridden from m (debian.master default) to y to support built-in
    CXL RAM region DAX access and System-RAM mapping.
  • CONFIG_PCI_CXL is a new hidden bool introduced by the save/restore
    series; auto-enabled when CXL_BUS=y. Gates compilation of
    drivers/pci/cxl.o for DVSEC and HDM state save/restore.
  • CONFIG_GENERIC_CPU_CACHE_MAINTENANCE and
    CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION are new configs
    introduced by the upstream cherry-picks; arm64 auto-selects both.
    cpu_cache_invalidate_memregion() is also used by the CXL reset
    series for cache flushing during reset.
  • Kernel config annotations updated in debian.nvidia-6.17/config/annotations
    to reflect all of the above changes.
  • Srirangan's save/restore series header reorganization was adapted to
    align with Alejandro's v23 approach (include/cxl/cxl.h instead of
    include/cxl/pci.h). See commit message on patch 2/5 for details.
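The CONFIG_PCI_CXL note above describes a hidden, auto-enabled bool. A hypothetical Kconfig/Makefile sketch of that wiring (not the actual patch text; the exact expression in the series may differ, e.g. in how the tristate CXL_BUS is coerced):

```kconfig
# drivers/pci/Kconfig (sketch -- not the actual patch)
config PCI_CXL
	def_bool CXL_BUS

# drivers/pci/Makefile (sketch): gates DVSEC/HDM state save/restore
obj-$(CONFIG_PCI_CXL) += cxl.o
```

Because PCI_CXL has no prompt, it never appears in menuconfig; it simply follows CXL_BUS, which is why the annotations override is needed rather than a config entry.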

nvmochs and others added 30 commits February 13, 2026 16:49
BugLink: https://bugs.launchpad.net/bugs/2139315

This reverts commit 93c54a5 so that it
can be replaced by the upstream equivalents.

Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2139315

Add cpu part and model macro definitions for NVIDIA Olympus core.

Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Signed-off-by: Will Deacon <will@kernel.org>
(cherry picked from commit e185c8a)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2139315

Add the part number and MIDR for NVIDIA Olympus.

Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
Reviewed-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
(cherry picked from commit d5e4c71)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2139315

Add NVIDIA Olympus MIDR to neoverse_spe range list.

Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
Reviewed-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
(backported from commit d852b83)
[mochs: Minor context cleanup due to absence of "perf arm_spe: Add CPU variants supporting common data source packet"]
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2139315

The documentation in nvidia-pmu.rst contains PMUs specific
to NVIDIA Tegra241 SoC. Rename the file for this specific
SoC to have better distinction with other NVIDIA SoC.

Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
(backported from https://lore.kernel.org/all/20260126181155.2776097-1-bwicaksono@nvidia.com/)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2139315

Adds Unified Coherent Fabric PMU support in Tegra410 SOC.

Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
(backported from https://lore.kernel.org/all/20260126181155.2776097-1-bwicaksono@nvidia.com/)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2139315

Add interface to get ACPI device associated with the
PMU. This ACPI device may contain additional properties
not covered by the standard properties.

Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
(backported from https://lore.kernel.org/all/20260126181155.2776097-1-bwicaksono@nvidia.com/)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2139315

Adds PCIE PMU support in Tegra410 SOC.

Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
(backported from https://lore.kernel.org/all/20260126181155.2776097-1-bwicaksono@nvidia.com/)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2139315

Adds PCIE-TGT PMU support in Tegra410 SOC.

Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
(backported from https://lore.kernel.org/all/20260126181155.2776097-1-bwicaksono@nvidia.com/)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2139315

Adds CPU Memory (CMEM) Latency PMU support in Tegra410 SOC.

Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
(backported from https://lore.kernel.org/all/20260126181155.2776097-1-bwicaksono@nvidia.com/)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2139315

Adds NVIDIA C2C PMU support in Tegra410 SOC.

Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
(backported from https://lore.kernel.org/all/20260126181155.2776097-1-bwicaksono@nvidia.com/)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2139315

Enable driver for NVIDIA TEGRA410 CMEM Latency and C2C PMU device.

Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
(backported from https://lore.kernel.org/all/20260126181155.2776097-1-bwicaksono@nvidia.com/)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
… events

BugLink: https://bugs.launchpad.net/bugs/2139315

Add JSON files for NVIDIA Tegra410 Olympus core PMU events.
Also updated the common-and-microarch.json.

Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
(backported from https://lore.kernel.org/all/20260127225909.3296202-1-bwicaksono@nvidia.com/)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
…EGRA410_CMEM_LATENCY_PMU

BugLink: https://bugs.launchpad.net/bugs/2139315

Set the following kconfigs to enable these PMUs on T410:
    CONFIG_NVIDIA_TEGRA410_C2C_PMU=m
    CONFIG_NVIDIA_TEGRA410_CMEM_LATENCY_PMU=m

Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2138131

Properly pass the variadic arguments so it can be called with or without
them depending on the format.

Signed-off-by: Lucas De Marchi <ldemarchi@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2140343

A vDEVICE has been a hard requirement for attaching a nested domain to the
device. This makes sense when installing a guest STE, since a vSID must be
present and given to the kernel during the vDEVICE allocation.

But, when CR0.SMMUEN is disabled, VM doesn't really need a vSID to program
the vSMMU behavior as GBPA will take effect, in which case the vSTE in the
nested domain could have carried the bypass or abort configuration in GBPA
register. Thus, having such a hard requirement doesn't work well for GBPA.

Skip vmaster allocation in arm_smmu_attach_prepare_vmaster() for an abort
or bypass vSTE. Note that device on this attachment won't report vevents.

Update the uAPI doc accordingly.

Link: https://patch.msgid.link/r/20251103172755.2026145-1-nicolinc@nvidia.com
Tested-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
(backported from commit 81c45c6)
Signed-off-by: Nathan Chen <nathanc@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2140343

The function hugetlb_reserve_pages() returns the number of pages added
to the reservation map on success and a negative error code on failure
(e.g. -EINVAL, -ENOMEM). However, in some error paths, it may return -1
directly.

For example, a failure at:

    if (hugetlb_acct_memory(h, gbl_reserve) < 0)
        goto out_put_pages;

results in returning -1 (since add = -1), which may be misinterpreted
in userspace as -EPERM.

Fix this by explicitly capturing and propagating the return values from
helper functions, and using -EINVAL for all other failure cases.

Link: https://lkml.kernel.org/r/20251125171350.86441-1-skolothumtho@nvidia.com
Fixes: 986f5f2 ("mm/hugetlb: make hugetlb_reserve_pages() return nr of entries updated")
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew R. Ochs <mochs@nvidia.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicolin Chen <nicolinc@nvidia.com>
Cc: Vivek Kasireddy <vivek.kasireddy@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(backported from commit 9ee5d17)
Signed-off-by: Nathan Chen <nathanc@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2140343

The Enable bits in CMDQV/VINTF/VCMDQ_CONFIG registers do not actually reset
the HW registers. So, the driver explicitly clears all the registers when a
VINTF or VCMDQ is being initialized calling its hw_deinit() function.

However, a userspace VCMDQ is not properly reset, unlike an in-kernel VCMDQ
getting reset in tegra241_vcmdq_hw_init().

Meanwhile, tegra241_vintf_hw_init() calling tegra241_vintf_hw_deinit() will
not deinit any VCMDQ, since there is no userspace VCMDQ mapped to the VINTF
at that stage.

Then, this may result in dirty VCMDQ registers, which can fail the VM.

Like tegra241_vcmdq_hw_init(), reset a VCMDQ in tegra241_vcmdq_hw_init() to
fix this bug. This is required by a host kernel.

Fixes: 6717f26ab1e7 ("iommu/tegra241-cmdqv: Add user-space use support")
Cc: stable@vger.kernel.org
Reported-by: Bao Nguyen <ncqb@google.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
(backported from commit 80f1a2c)
Signed-off-by: Nathan Chen <nathanc@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
… transfer

BugLink: https://bugs.launchpad.net/bugs/2139640

When the ISR thread wakes up late and finds that the timeout handler
has already processed the transfer (curr_xfer is NULL), return
IRQ_HANDLED instead of IRQ_NONE.

Use a similar approach to tegra_qspi_handle_timeout() by reading
QSPI_TRANS_STATUS and checking the QSPI_RDY bit to determine if the
hardware actually completed the transfer. If QSPI_RDY is set, the
interrupt was legitimate and triggered by real hardware activity.
The fact that the timeout path handled it first doesn't make it
spurious. Returning IRQ_NONE incorrectly suggests the interrupt
wasn't for this device, which can cause issues with shared interrupt
lines and interrupt accounting.

Fixes: b4e002d ("spi: tegra210-quad: Fix timeout handling")
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Link: https://patch.msgid.link/20260126-tegra_xfer-v2-1-6d2115e4f387@debian.org
Signed-off-by: Mark Brown <broonie@kernel.org>
(cherry picked from commit aabd8ea linux-next)
Signed-off-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2139640

Move the assignment of the transfer pointer from curr_xfer inside the
spinlock critical section in both handle_cpu_based_xfer() and
handle_dma_based_xfer().

Previously, curr_xfer was read before acquiring the lock, creating a
window where the timeout path could clear curr_xfer between reading it
and using it. By moving the read inside the lock, the handlers are
guaranteed to see a consistent value that cannot be modified by the
timeout path.

Fixes: 921fc18 ("spi: tegra210-quad: Add support for Tegra210 QSPI controller")
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Thierry Reding <treding@nvidia.com>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Jon Hunter <jonathanh@nvidia.com>
Link: https://patch.msgid.link/20260126-tegra_xfer-v2-2-6d2115e4f387@debian.org
Signed-off-by: Mark Brown <broonie@kernel.org>
(cherry picked from commit ef13ba3 linux-next)
Signed-off-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
…transfer_one

BugLink: https://bugs.launchpad.net/bugs/2139640

When the timeout handler processes a completed transfer and signals
completion, the transfer thread can immediately set up the next transfer
and assign curr_xfer to point to it.

If a delayed ISR from the previous transfer then runs, it checks if
(!tqspi->curr_xfer) (currently without the lock also -- to be fixed
soon) to detect stale interrupts, but this check passes because
curr_xfer now points to the new transfer. The ISR then incorrectly
processes the new transfer's context.

Protect the curr_xfer assignment with the spinlock to ensure the ISR
either sees NULL (and bails out) or sees the new value only after the
assignment is complete.

Fixes: 921fc18 ("spi: tegra210-quad: Add support for Tegra210 QSPI controller")
Signed-off-by: Breno Leitao <leitao@debian.org>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Link: https://patch.msgid.link/20260126-tegra_xfer-v2-3-6d2115e4f387@debian.org
Signed-off-by: Mark Brown <broonie@kernel.org>
(cherry picked from commit f5a4d7f linux-next)
Signed-off-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2139640

The curr_xfer field is read by the IRQ handler without holding the lock
to check if a transfer is in progress. When clearing curr_xfer in the
combined sequence transfer loop, protect it with the spinlock to prevent
a race with the interrupt handler.

Protect the curr_xfer clearing at the exit path of
tegra_qspi_combined_seq_xfer() with the spinlock to prevent a race
with the interrupt handler that reads this field.

Without this protection, the IRQ handler could read a partially updated
curr_xfer value, leading to NULL pointer dereference or use-after-free.

Fixes: b4e002d ("spi: tegra210-quad: Fix timeout handling")
Signed-off-by: Breno Leitao <leitao@debian.org>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Link: https://patch.msgid.link/20260126-tegra_xfer-v2-4-6d2115e4f387@debian.org
Signed-off-by: Mark Brown <broonie@kernel.org>
(cherry picked from commit bf4528a linux-next)
Signed-off-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
…ined_seq_xfer

BugLink: https://bugs.launchpad.net/bugs/2139640

Protect the curr_xfer clearing in tegra_qspi_non_combined_seq_xfer()
with the spinlock to prevent a race with the interrupt handler that
reads this field to check if a transfer is in progress.

Fixes: b4e002d ("spi: tegra210-quad: Fix timeout handling")
Signed-off-by: Breno Leitao <leitao@debian.org>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Link: https://patch.msgid.link/20260126-tegra_xfer-v2-5-6d2115e4f387@debian.org
Signed-off-by: Mark Brown <broonie@kernel.org>
(cherry picked from commit 6d7723e linux-next)
Signed-off-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2139640

Now that all other accesses to curr_xfer are done under the lock,
protect the curr_xfer NULL check in tegra_qspi_isr_thread() with the
spinlock. Without this protection, the following race can occur:

  CPU0 (ISR thread)              CPU1 (timeout path)
  ----------------               -------------------
  if (!tqspi->curr_xfer)
    // sees non-NULL
                                 spin_lock()
                                 tqspi->curr_xfer = NULL
                                 spin_unlock()
  handle_*_xfer()
    spin_lock()
    t = tqspi->curr_xfer  // NULL!
    ... t->len ...        // NULL dereference!

With this patch, all curr_xfer accesses are now properly synchronized.

Although all accesses to curr_xfer are done under the lock, in
tegra_qspi_isr_thread() it checks for NULL, releases the lock and
reacquires it later in handle_cpu_based_xfer()/handle_dma_based_xfer().
There is a potential for an update in between, which could cause a NULL
pointer dereference.

To handle this, add a NULL check inside the handlers after acquiring
the lock. This ensures that if the timeout path has already cleared
curr_xfer, the handler will safely return without dereferencing the
NULL pointer.
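As a userspace analogue of the pattern this patch establishes (a pthread mutex standing in for the driver spinlock; simplified, not the driver code), the handler re-reads curr_xfer under the lock and bails out when the timeout path has already cleared it:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int transfer;
static int *curr_xfer = &transfer;

/* ISR-thread analogue: all curr_xfer access happens under the lock. */
static int handle_xfer(void)
{
	int handled = 0;

	pthread_mutex_lock(&lock);
	int *t = curr_xfer;        /* read under the lock */
	if (t) {                   /* NULL means the timeout path won the race */
		*t += 1;           /* safe: cannot be cleared while we hold the lock */
		handled = 1;
	}
	pthread_mutex_unlock(&lock);
	return handled;
}

/* Timeout-path analogue: clears curr_xfer under the same lock. */
static void timeout_path(void)
{
	pthread_mutex_lock(&lock);
	curr_xfer = NULL;
	pthread_mutex_unlock(&lock);
}
```

Once the timeout path has run, a late handler sees NULL under the lock and returns without touching the stale transfer.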

Fixes: b4e002d ("spi: tegra210-quad: Fix timeout handling")
Signed-off-by: Breno Leitao <leitao@debian.org>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Jon Hunter <jonathanh@nvidia.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Link: https://patch.msgid.link/20260126-tegra_xfer-v2-6-6d2115e4f387@debian.org
Signed-off-by: Mark Brown <broonie@kernel.org>
(cherry picked from commit edf9088 linux-next)
Signed-off-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2139648

Currently cpu-clock event always returns 0 count, e.g.,

perf stat -e cpu-clock -- sleep 1

 Performance counter stats for 'sleep 1':
                 0      cpu-clock                        #    0.000 CPUs utilized
       1.002308394 seconds time elapsed

The root cause is the commit 'bc4394e5e79c ("perf: Fix the throttle
 error of some clock events")' adds PERF_EF_UPDATE flag check before
calling cpu_clock_event_update() to update the count, however the
PERF_EF_UPDATE flag is never set when the cpu-clock event is stopped in
counting mode (pmu->del() -> cpu_clock_event_del() ->
cpu_clock_event_stop()). This leads to the cpu-clock event count never
being updated.

To fix this issue, force to set PERF_EF_UPDATE flag for cpu-clock event
just like what task-clock does.

Fixes: bc4394e ("perf: Fix the throttle error of some clock events")
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://patch.msgid.link/20251112080526.3971392-1-dapeng1.mi@linux.intel.com
(cherry picked from commit f1f9651)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2093957

Signed-off-by: Jeremy Szu <jszu@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2138892

Remove this declaration, which is no longer needed now that the function
is used within the file after merging upstream "vfio/nvgrace-gpu: register
device memory for poison handling".

Signed-off-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2136828

Add PCI_VENDOR_ID_ASPEED to the shared pci_ids.h header and remove the
duplicate local definition from ehci-pci.c.

This prepares for adding a PCI quirk for ASPEED devices.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nirmoy Das <nirmoyd@nvidia.com>
(backported from https://lore.kernel.org/linux-iommu/20251217154529.377586-1-nirmoyd@nvidia.com/)
Signed-off-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
BugLink: https://bugs.launchpad.net/bugs/2136828

ASPEED BMC controllers have VGA and USB functions behind a PCIe-to-PCI
bridge that causes them to share the same stream ID:

  [e0]---00.0-[e1-e2]----00.0-[e2]--+-00.0  ASPEED Graphics Family
                                    \-02.0  ASPEED USB Controller

Both devices get stream ID 0x5e200 due to bridge aliasing, causing the
USB controller to be rejected with 'Aliasing StreamID unsupported'.

Per ASPEED, the AST1150 doesn't use a real PCI bus and always forwards
the original requester ID from downstream devices rather than replacing
it with any alias.

Add a new PCI_DEV_FLAGS_PCI_BRIDGE_NO_ALIASES flag and apply it to the
AST1150.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Nirmoy Das <nirmoyd@nvidia.com>
(backported from https://lore.kernel.org/linux-iommu/20251217154529.377586-2-nirmoyd@nvidia.com/)
[nirmoy: set PCI_DEV_FLAGS_PCI_BRIDGE_NO_ALIASES to (1 << 15) instead of (1 << 14)]
Signed-off-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
…ivation (LFA)

BugLink: https://bugs.launchpad.net/bugs/2138342

The Arm Live Firmware Activation (LFA) specification [1] describes
activating firmware components without a reboot. Those components
(like TF-A's BL31, EDK-II, TF-RMM, secure payloads) would be updated the
usual way: via fwupd, FF-A or other secure storage methods, or via some
IMPDEF Out-Of-Band method. The user can then activate this new firmware,
at system runtime, without requiring a reboot.
The specification covers the SMCCC interface to list and query available
components and eventually trigger the activation.

Add a new directory under /sys/firmware to present firmware components
capable of live activation. Each of them is a directory under lfa/,
identified by its GUID. Activation is triggered by echoing
"1" into the "activate" file:
==========================================
/sys/firmware/lfa # ls -l . 6c*
.:
total 0
drwxr-xr-x    2 0 0         0 Jan 19 11:33 47d4086d-4cfe-9846-9b95-2950cbbd5a00
drwxr-xr-x    2 0 0         0 Jan 19 11:33 6c0762a6-12f2-4b56-92cb-ba8f633606d9
drwxr-xr-x    2 0 0         0 Jan 19 11:33 d6d0eea7-fcea-d54b-9782-9934f234b6e4

6c0762a6-12f2-4b56-92cb-ba8f633606d9:
total 0
--w-------    1 0        0             4096 Jan 19 11:33 activate
-r--r--r--    1 0        0             4096 Jan 19 11:33 activation_capable
-r--r--r--    1 0        0             4096 Jan 19 11:33 activation_pending
--w-------    1 0        0             4096 Jan 19 11:33 cancel
-r--r--r--    1 0        0             4096 Jan 19 11:33 cpu_rendezvous
-r--r--r--    1 0        0             4096 Jan 19 11:33 current_version
-rw-r--r--    1 0        0             4096 Jan 19 11:33 force_cpu_rendezvous
-r--r--r--    1 0        0             4096 Jan 19 11:33 may_reset_cpu
-r--r--r--    1 0        0             4096 Jan 19 11:33 name
-r--r--r--    1 0        0             4096 Jan 19 11:33 pending_version
/sys/firmware/lfa/6c0762a6-12f2-4b56-92cb-ba8f633606d9 # grep . *
grep: activate: Permission denied
activation_capable:1
activation_pending:1
grep: cancel: Permission denied
cpu_rendezvous:1
current_version:0.0
force_cpu_rendezvous:1
may_reset_cpu:0
name:TF-RMM
pending_version:0.0
/sys/firmware/lfa/6c0762a6-12f2-4b56-92cb-ba8f633606d9 # echo 1 > activate
[ 2825.797871] Arm LFA: firmware activation succeeded.
/sys/firmware/lfa/6c0762a6-12f2-4b56-92cb-ba8f633606d9 #
==========================================

[1] https://developer.arm.com/documentation/den0147/latest/

Signed-off-by: Salman Nabi <salman.nabi@arm.com>
Signed-off-by: Vedashree Vidwans <vvidwans@nvidia.com>
(backported from https://lore.kernel.org/all/20260119122729.287522-2-salman.nabi@arm.com/)
[nirmoyd: Added image_name fallback to fw_uuid in update_fw_image_node()]
Signed-off-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Noah Wager <noah.wager@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
alucerop and others added 18 commits March 24, 2026 11:58
Use cxl api for creating a region using the endpoint decoder related to
a DPA range.

Signed-off-by: Alejandro Lucero <alucerop@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
(backported from https://lore.kernel.org/linux-cxl/20260201155438.2664640-1-alejandro.lucero-palau@amd.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
A PIO buffer is a region of device memory to which the driver can write a
packet for TX, with the device handling the transmit doorbell without
requiring a DMA to fetch the packet data, which helps reduce latency
in certain exchanges. With the CXL.mem protocol this latency can be lowered
further.

With a device supporting CXL and successfully initialised, use the CXL
region to map the memory range and use this mapping for the PIO buffers.

Disable those CXL-based PIO buffers if the CXL code invokes the callback
for potential CXL endpoint removal.

Signed-off-by: Alejandro Lucero <alucerop@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
(backported from https://lore.kernel.org/linux-cxl/20260201155438.2664640-1-alejandro.lucero-palau@amd.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
…smaller granularities for lower levels

The CXL specification supports multi-level interleaving "as long as
all the levels use different, but consecutive, HPA bits to select the
target and no Interleave Set has more than 8 devices" (from 3.2).

Currently the kernel expects that a decoder's "interleave granularity
is a multiple of @parent_port granularity". That is, the granularity
of a lower level is bigger than that of the parent and uses the outer
HPA bits as the selector. This works e.g. for the following 8-way config:

 * cross-link (cross-hostbridge config in CFMWS):
   * 4-way
   * 256 granularity
   * Selector: HPA[8:9]
 * sub-link (CXL Host bridge config of the HDM):
   * 2-way
   * 1024 granularity
   * Selector: HPA[10]

Now, if the outer HPA bits are used for the cross-hostbridge, an 8-way
config could look like this:

 * cross-link (cross-hostbridge config in CFMWS):
   * 4-way
   * 512 granularity
   * Selector: HPA[9:10]
 * sub-link (CXL Host bridge config of the HDM):
   * 2-way
   * 256 granularity
   * Selector: HPA[8]

The enumeration of decoders for this configuration then fails with the
following error:

 cxl region0: pci0000:00:port1 cxl_port_setup_targets expected iw: 2 ig: 1024 [mem 0x10000000000-0x1ffffffffff flags 0x200]
 cxl region0: pci0000:00:port1 cxl_port_setup_targets got iw: 2 ig: 256 state: enabled 0x10000000000:0x1ffffffffff
 cxl_port endpoint12: failed to attach decoder12.0 to region0: -6

Note that this happens only if firmware is setting up the decoders
(CXL_REGION_F_AUTO). For userspace region assembly the granularities
are chosen to increase from root down to the lower levels. That is,
outer HPA bits are always used for lower interleaving levels.

Rework the implementation to also support multi-level interleaving
with smaller granularities for lower levels. Determine the interleave
set of autodetected decoders. Check that it is a subset of the root
interleave.

The HPA selector bits are extracted for all decoders of the set and
checked for overlap and consecutiveness. All decoders can now be
programmed to use any bit range within the region's target selector.

Signed-off-by: Robert Richter <rrichter@amd.com>
(backported from https://lore.kernel.org/all/20251028094754.72816-1-rrichter@amd.com/)
[jan: Resolved minor conflicts]
Signed-off-by: Jiandi An <jan@nvidia.com>
PCI: Add CXL DVSEC control, lock, and range register definitions

Add register offset and field definitions for CXL DVSEC registers needed
by CXL state save/restore across resets:

  - CTRL2 (offset 0x10) and LOCK (offset 0x14) registers
  - CONFIG_LOCK bit in the LOCK register
  - RWL (read-write-when-locked) field masks for CTRL and range base
    registers.

Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260306080026.116789-1-smadhavan@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
… to include/cxl/cxl.h

Move CXL HDM decoder register defines, register map structs
(cxl_reg_map, cxl_component_reg_map, cxl_device_reg_map,
cxl_pmu_reg_map, cxl_register_map), cxl_hdm_decoder_count(),
enum cxl_regloc_type, and cxl_find_regblock()/cxl_setup_regs()
declarations from internal CXL headers to include/cxl/pci.h.

This makes them accessible to code outside the CXL subsystem, in
particular the PCI core CXL state save/restore support added in a
subsequent patch.

No functional change.

Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260306080026.116789-1-smadhavan@nvidia.com/)
[jan: Resolve conflicts by moving certain definitions to include/cxl/cxl.h instead of to include/cxl/pci.h to align with its dependency of Alejandro's series]
Signed-off-by: Jiandi An <jan@nvidia.com>
…state

Add pci_add_virtual_ext_cap_save_buffer() to allocate save buffers
using virtual cap IDs (above PCI_EXT_CAP_ID_MAX) that don't require
a real capability in config space.

The existing pci_add_ext_cap_save_buffer() cannot be used for
CXL DVSEC state because it calls pci_find_saved_ext_cap()
which searches for a matching capability in PCI config space.
The CXL state saved here is a synthetic snapshot (DVSEC+HDM)
and should not be tied to a real extended-cap instance. A
virtual extended-cap save buffer API (cap IDs above
PCI_EXT_CAP_ID_MAX) allows PCI to track this state without
a backing config space capability.

Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260306080026.116789-1-smadhavan@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
Save and restore CXL DVSEC control registers (CTRL, CTRL2), range
base registers, and lock state across PCI resets.

When the DVSEC CONFIG_LOCK bit is set, certain DVSEC fields
become read-only and hardware may have updated them. Blindly
restoring saved values would be silently ignored or conflict
with hardware state. Instead, a read-merge-write approach is
used: current hardware values are read for the RWL
(read-write-when-locked) fields and merged with saved state,
so only writable bits are restored while locked bits retain
their hardware values.

Hooked into pci_save_state()/pci_restore_state() so all PCI reset
paths automatically preserve CXL DVSEC configuration.

Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260306080026.116789-1-smadhavan@nvidia.com/)
[jan: Resolve minor conflict in drivers/pci/Makefile due to code line shifts]
Signed-off-by: Jiandi An <jan@nvidia.com>
Save and restore CXL HDM decoder registers (global control,
per-decoder base/size/target-list, and commit state) across PCI
resets. On restore, decoders that were committed are reprogrammed
and recommitted with a 10ms timeout. Locked decoders that are
already committed are skipped, since their state is protected by
hardware and reprogramming them would fail.

The Register Locator DVSEC is parsed directly via PCI config space
reads rather than calling cxl_find_regblock()/cxl_setup_regs(),
since this code lives in the PCI core and must not depend on CXL
module symbols.

MSE is temporarily enabled during save/restore to allow MMIO
access to the HDM decoder register block.

Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260306080026.116789-1-smadhavan@nvidia.com/)
[jan: Include <cxl/cxl.h> in drivers/pci/cxl.c due to conflict resolution in "4acbc27592b8 NVIDIA: VR: SAUCE: cxl: Move HDM decoder and register map definitions to include/cxl/cxl.h"]
Signed-off-by: Jiandi An <jan@nvidia.com>
…efinitions

Add CXL DVSEC register definitions needed for CXL device reset per
CXL r3.2 section 8.1.3.1:
- Capability bits: RST_CAPABLE, CACHE_CAPABLE, CACHE_WBI_CAPABLE,
  RST_TIMEOUT, RST_MEM_CLR_CAPABLE
- Control2 register: DISABLE_CACHING, INIT_CACHE_WBI, INIT_CXL_RST,
  RST_MEM_CLR_EN
- Status2 register: CACHE_INV, RST_DONE, RST_ERR
- Non-CXL Function Map DVSEC register offset

Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260306092322.148765-1-smadhavan@nvidia.com/)
[jan: Resolve conflicts where PCI_DVSEC_CXL_CACHE_CAPABLE is already added by "72bd823fb4f1 NVIDIA: VR: SAUCE: PCI: Allow ATS to be always on for CXL.cache capable devices"]
Signed-off-by: Jiandi An <jan@nvidia.com>
…_restore()

Export pci_dev_save_and_disable() and pci_dev_restore() so that
subsystems performing non-standard reset sequences (e.g. CXL)
can reuse the PCI core standard pre/post reset lifecycle:
driver reset_prepare/reset_done callbacks, PCI config space
save/restore, and device disable/re-enable.

Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260306092322.148765-1-smadhavan@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
Add infrastructure for quiescing the CXL data path before reset:

- Memory offlining: check if CXL-backed memory is online and offline
  it via offline_and_remove_memory() before reset, per CXL
  spec requirement to quiesce all CXL.mem transactions before issuing
  CXL Reset.
- CPU cache flush: invalidate cache lines before reset
  as a safety measure after memory offline.

Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260306092322.148765-1-smadhavan@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
…XL reset

Add sibling PCI function save/disable/restore coordination for CXL
reset. Before reset, all CXL.cachemem sibling functions are locked,
saved, and disabled; after reset they are restored. The Non-CXL Function
Map DVSEC and per-function DVSEC capability register are consulted to
skip non-CXL and CXL.io-only functions. A global mutex serializes
concurrent resets to prevent deadlocks between sibling functions.

Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260306092322.148765-1-smadhavan@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
…ration

cxl_dev_reset() implements the hardware reset sequence:
optionally enable memory clear, initiate reset via
CTRL2, wait for completion, and re-enable caching.

cxl_do_reset() orchestrates the full reset flow:
  1. CXL pre-reset: mem offlining and cache flush (when memdev present)
  2. PCI save/disable: pci_dev_save_and_disable() automatically saves
     CXL DVSEC and HDM decoder state via PCI core hooks
  3. Sibling coordination: save/disable CXL.cachemem sibling functions
  4. Execute CXL DVSEC reset
  5. Sibling restore: always runs to re-enable sibling functions
  6. PCI restore: pci_dev_restore() automatically restores CXL state

The CXL-specific DVSEC and HDM save/restore is handled
by the PCI core's CXL save/restore infrastructure (drivers/pci/cxl.c).

Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260306092322.148765-1-smadhavan@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
Add a "cxl_reset" sysfs attribute to PCI devices that support CXL
Reset (CXL r3.2 section 8.1.3.1). The attribute is visible only on
devices with both CXL.cache and CXL.mem capabilities and the CXL
Reset Capable bit set in the DVSEC.

Writing "1" to the attribute triggers the full CXL reset flow via
cxl_do_reset(). The interface is decoupled from memdev creation:
when a CXL memdev exists, memory offlining and cache flush are
performed; otherwise reset proceeds without the memory management.

The sysfs attribute is managed entirely by the CXL module using
sysfs_create_group() / sysfs_remove_group() rather than the PCI
core's static attribute groups. This avoids cross-module symbol
dependencies between the PCI core (always built-in) and CXL_BUS
(potentially modular).

At module init, existing PCI devices are scanned and a PCI bus
notifier handles hot-plug/unplug. kernfs_drain() makes sure that
any in-flight store() completes before sysfs_remove_group() returns,
preventing use-after-free during module unload.

Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260306092322.148765-1-smadhavan@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
…tribute

Document the cxl_reset sysfs attribute added to PCI devices that
support CXL Reset.

Signed-off-by: Srirangan Madhavan <smadhavan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260306092322.148765-1-smadhavan@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
…and RAS support

Add Ubuntu kernel config annotations for CXL-related configs introduced
or changed by the following cherry-picked patch series:
  - drivers/cxl changes between v6.17.9 and upstream 7.0 (which includes
    a portion of Terry Bowman's v14 CXL RAS series merged via
    for-7.0/cxl-aer-prep)
  - Alejandro Lucero's v23 CXL Type-2 device support series
  - Smita Koralahalli's v6 patch 3/9 (cxl/region: Skip decoder reset on
    detach for autodiscovered regions)

CONFIG_CXL_BUS:           Enable CXL bus support built-in; required for
                          CXL Type-2 device and RAS support
CONFIG_CXL_PCI:           Enable CXL PCI management built-in; auto-selects
                          CXL_MEM; required for CXL Type-2 device support
CONFIG_CXL_MEM:           Auto-selected by CXL_PCI; required for CXL
                          memory expansion and Type-2 device support
CONFIG_CXL_PORT:          Required for CXL port enumeration; defaults to
                          CXL_BUS value
CONFIG_FWCTL:             Selected by CXL_BUS when CXL_FEATURES is enabled;
                          required for CXL feature mailbox access
CONFIG_CXL_RAS:           New def_bool replacing PCIEAER_CXL (Terry Bowman
                          v14); auto-enabled with ACPI_APEI_GHES+PCIEAER+
                          CXL_BUS for CXL RAS error handling
CONFIG_SFC_CXL:           Solarflare SFC9100-family CXL Type-2 device
                          support; not needed for NVIDIA platforms (n)
CONFIG_ACPI_APEI_EINJ:    Required prerequisite for CONFIG_ACPI_APEI_EINJ_CXL
CONFIG_ACPI_APEI_EINJ_CXL: CXL protocol error injection support via APEI EINJ

CONFIG_PCIEAER_CXL: Remove it from debian.master policy. This config
  was removed from Kconfig by upstream commit d18f1b7
 (PCI/AER: Replace PCIEAER_CXL symbol with CXL_RAS) which is included
 in this port.

CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION: Override debian.master
  amd64-only policy to include arm64. Commit 4d873c5 added
  'select ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION' to arch/arm64/Kconfig,
  making this y on arm64 as well.

CONFIG_GENERIC_CPU_CACHE_MAINTENANCE: New bool config defined by
  c460697 in lib/Kconfig. Selected by arm64 via 4d873c5;
  not selected by x86. Set arm64: y, amd64: -.

CONFIG_CACHEMAINT_FOR_HOTPLUG: New optional menuconfig defined by
  2ec3b54 in drivers/cache/Kconfig. Depends on
  GENERIC_CPU_CACHE_MAINTENANCE so becomes visible on arm64. Defaults
  to n; HiSilicon HHA driver not needed for NVIDIA platforms.
  Set arm64: n, amd64: -.

Signed-off-by: Jiandi An <jan@nvidia.com>
…memory access

Override debian.master policy (m->y) for DEV_DAX, DEV_DAX_CXL, and
DEV_DAX_KMEM to ensure CXL memory regions are accessible as both raw
DAX devices and hotplugged System-RAM nodes.

debian.master sets these to 'm' (modules). For NVIDIA platforms with
CXL Type-2 devices, built-in (y) is required to ensure CXL memory
regions provisioned early in boot are immediately accessible without
relying on module loading order.

CONFIG_DEV_DAX:     Override m->y; prerequisite for DEV_DAX_CXL and
                    DEV_DAX_KMEM to be built-in; depends on
                    TRANSPARENT_HUGEPAGE (already y in debian.master)

CONFIG_DEV_DAX_CXL: Override m->y; creates /dev/daxX.Y devices for CXL
                    RAM regions not in the default system memory map
                    (Soft Reserved or dynamically provisioned regions);
                    depends on CXL_BUS+CXL_REGION+DEV_DAX (all y)

CONFIG_DEV_DAX_KMEM: Override m->y; onlines CXL DAX devices as System-RAM
                    NUMA nodes via memory hotplug, making CXL memory
                    available for normal kernel and userspace allocation

Signed-off-by: Jiandi An <jan@nvidia.com>
…/restore

Add Ubuntu kernel config annotation for CONFIG_PCI_CXL introduced by
the CXL DVSEC and HDM state save/restore series (Srirangan Madhavan).

CONFIG_PCI_CXL:  Hidden bool in drivers/pci/Kconfig; auto-enabled when
                 CXL_BUS=y. Gates compilation of drivers/pci/cxl.o which
                 saves and restores CXL DVSEC control/range registers and
                 HDM decoder state across PCI resets and link transitions.

Signed-off-by: Jiandi An <jan@nvidia.com>
@JiandiAnNVIDIA
Author

The patch "PCI: Update CXL DVSEC definitions" missed one rename:

nvidia@localhost:/home/nvidia/NV-Kernels$ make
  CALL    scripts/checksyscalls.sh
  CC      drivers/pci/ats.o
drivers/pci/ats.c: In function ‘pci_cxl_ats_always_on’:
drivers/pci/ats.c:221:44: error: ‘CXL_DVSEC_PCIE_DEVICE’ undeclared (first use in this function); did you mean ‘PCI_DVSEC_CXL_DEVICE’?
  221 |                                            CXL_DVSEC_PCIE_DEVICE);
      |                                            ^~~~~~~~~~~~~~~~~~~~~
      |                                            PCI_DVSEC_CXL_DEVICE
drivers/pci/ats.c:221:44: note: each undeclared identifier is reported only once for each function it appears in
drivers/pci/ats.c:225:45: error: ‘CXL_DVSEC_CAP_OFFSET’ undeclared (first use in this function)
  225 |         pci_read_config_word(pdev, offset + CXL_DVSEC_CAP_OFFSET, &cap);
      |                                             ^~~~~~~~~~~~~~~~~~~~
make[4]: *** [scripts/Makefile.build:287: drivers/pci/ats.o] Error 1
make[3]: *** [scripts/Makefile.build:556: drivers/pci] Error 2
make[2]: *** [scripts/Makefile.build:556: drivers] Error 2
make[1]: *** [/home/nvidia/NV-Kernels/Makefile:2016: .] Error 2
make: *** [Makefile:248: __sub-make] Error 2

Fixed.

@clsotog
Collaborator

clsotog commented Mar 24, 2026

I see the compiling issue is fixed. That was my concern.

@nvmochs nvmochs self-requested a review March 24, 2026 19:37
Collaborator

@nvmochs nvmochs left a comment


I reviewed the name change fix in "PCI: Update CXL DVSEC definitions" and confirmed it builds successfully for arm64.

No further issues or concerns from me.

Acked-by: Matthew R. Ochs <mochs@nvidia.com>

@clsotog clsotog self-requested a review March 24, 2026 19:58
Collaborator

@clsotog clsotog left a comment


Acked-by: Carol L Soto <csoto@nvidia.com>

@nirmoy
Collaborator

nirmoy commented Mar 25, 2026

Tried this on GB300 yesterday with the compilation issue manually fixed. Ran CUDA DVS tests like http://10.112.214.250:8002/. We still need to make sure that there are no regressions with the older RM driver. With that
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>

@JiandiAnNVIDIA JiandiAnNVIDIA changed the title [linux-nvidia-6.17-next] Add CXL Type-2 device support, RAS error handling, reset, and state save/restore [linux-nvidia-6.17-next] Add CXL Type-2 device support, RAS error handling, reset, state save/restore, and interleaving support Mar 30, 2026
@nvmochs
Collaborator

nvmochs commented Mar 30, 2026

PR sent to Canonical.

@nvidia-bfigg nvidia-bfigg force-pushed the 24.04_linux-nvidia-6.17-next branch from 9364d8b to 8dab82a Compare April 2, 2026 12:01