NVIDIA · fujitatomoya · Mar 1, 2026
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -84,7 +84,7 @@ pre-commit install
 
 Now code linters and formatters will be run each time you commit changes.
 
-You can skip these checks with `git commit --no-verify` or with the short version `git commit -n`, althoguh please note
+You can skip these checks with `git commit --no-verify` or with the short version `git commit -n`, although please note
 that this may result in pull requests being rejected if subsequent checks fail.
 
 ## Review Process

diff --git a/README.md b/README.md
@@ -552,9 +552,9 @@ These CUDA features are needed by some CUDA samples. They are provided by either
 
 CUFFT Callback Routines are user-supplied kernel routines that CUFFT will call when loading or storing data. These callback routines are only available on Linux x86_64 and ppc64le systems.
 
-#### CUDA Dynamic Parallellism
+#### CUDA Dynamic Parallelism
 
-CDP (CUDA Dynamic Parallellism) allows kernels to be launched from threads running on the GPU. CDP is only available on GPUs with SM architecture of 3.5 or above.
+CDP (CUDA Dynamic Parallelism) allows kernels to be launched from threads running on the GPU. CDP is only available on GPUs with SM architecture of 3.5 or above.
 
 #### Multi-block Cooperative Groups
 

diff --git a/Samples/1_Utilities/topologyQuery/README.md b/Samples/1_Utilities/topologyQuery/README.md
@@ -2,7 +2,7 @@
 
 ## Description
 
-A simple exemple on how to query the topology of a system with multiple GPU
+A simple example on how to query the topology of a system with multiple GPU
 
 ## Key Concepts
 

diff --git a/Samples/3_CUDA_Features/README.md b/Samples/3_CUDA_Features/README.md
@@ -2,7 +2,7 @@
 
 
 ### [bf16TensorCoreGemm](./bf16TensorCoreGemm)
-A CUDA sample demonstrating __nv_bfloat16 (e8m7) GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register presssure.
+A CUDA sample demonstrating __nv_bfloat16 (e8m7) GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register pressure.
 
 ### [binaryPartitionCG](./binaryPartitionCG)
 This sample is a simple code that illustrates binary partition cooperative groups and reduce within the thread block.
@@ -36,7 +36,7 @@ This sample demonstrates the use of the new CUDA WMMA API employing the Tensor C
 In addition to that, it demonstrates the use of the new CUDA function attribute cudaFuncAttributeMaxDynamicSharedMemorySize that allows the application to reserve an extended amount of shared memory than it is available by default.
 
 ### [dmmaTensorCoreGemm](./dmmaTensorCoreGemm)
-CUDA sample demonstrates double precision GEMM computation using the Double precision Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register presssure. Further, this sample also demonstrates how to use cooperative groups async copy interface over a group for performing gmem to shmem async loads.
+CUDA sample demonstrates double precision GEMM computation using the Double precision Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register pressure. Further, this sample also demonstrates how to use cooperative groups async copy interface over a group for performing gmem to shmem async loads.
 
 ### [globalToShmemAsyncCopy](./globalToShmemAsyncCopy)
 This sample implements matrix multiplication which uses asynchronous copy of data from global to shared memory when on compute capability 8.0 or higher. Also demonstrates arrive-wait barrier for synchronization.
@@ -69,7 +69,7 @@ A demonstration of CUDA Graphs creation, instantiation and launch using Graphs A
 This sample demonstrates basic use of stream priorities.
 
 ### [tf32TensorCoreGemm](./tf32TensorCoreGemm)
-A CUDA sample demonstrating tf32 (e8m10) GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register presssure.
+A CUDA sample demonstrating tf32 (e8m10) GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register pressure.
 
 ### [warpAggregatedAtomicsCG](./warpAggregatedAtomicsCG)
 This sample demonstrates how using Cooperative Groups (CG) to perform warp aggregated atomics to single and multiple counters, a useful technique to improve performance when many threads atomically add to a single or multiple counters.

diff --git a/Samples/3_CUDA_Features/bf16TensorCoreGemm/README.md b/Samples/3_CUDA_Features/bf16TensorCoreGemm/README.md
@@ -2,7 +2,7 @@
 
 ## Description
 
-A CUDA sample demonstrating __nv_bfloat16 (e8m7) GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register presssure.
+A CUDA sample demonstrating __nv_bfloat16 (e8m7) GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register pressure.
 
 ## Key Concepts
 

diff --git a/Samples/3_CUDA_Features/dmmaTensorCoreGemm/README.md b/Samples/3_CUDA_Features/dmmaTensorCoreGemm/README.md
@@ -2,7 +2,7 @@
 
 ## Description
 
-CUDA sample demonstrates double precision GEMM computation using the Double precision Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register presssure. Further, this sample also demonstrates how to use cooperative groups async copy interface over a group for performing gmem to shmem async loads.
+CUDA sample demonstrates double precision GEMM computation using the Double precision Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register pressure. Further, this sample also demonstrates how to use cooperative groups async copy interface over a group for performing gmem to shmem async loads.
 
 ## Key Concepts
 

diff --git a/Samples/3_CUDA_Features/tf32TensorCoreGemm/README.md b/Samples/3_CUDA_Features/tf32TensorCoreGemm/README.md
@@ -2,7 +2,7 @@
 
 ## Description
 
-A CUDA sample demonstrating tf32 (e8m10) GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register presssure.
+A CUDA sample demonstrating tf32 (e8m10) GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register pressure.
 
 ## Key Concepts
 

diff --git a/Samples/6_Performance/README.md b/Samples/6_Performance/README.md
@@ -8,7 +8,7 @@ A simple test, showing huge access speed gap between aligned and misaligned stru
 This sample demonstrates Matrix Transpose.  Different performance are shown to achieve high performance.
 
 ### [UnifiedMemoryPerf](./UnifiedMemoryPerf)
-This sample demonstrates the performance comparision using matrix multiplication kernel of Unified Memory with/without hints and other types of memory like zero copy buffers, pageable, pagelocked memory performing synchronous and Asynchronous transfers on a single GPU.
+This sample demonstrates the performance comparison using matrix multiplication kernel of Unified Memory with/without hints and other types of memory like zero copy buffers, pageable, pagelocked memory performing synchronous and Asynchronous transfers on a single GPU.
 
 ### [cudaGraphsPerfScaling](./cudaGraphsPerfScaling)
 This sample demonstrates the performance characteristics of cuda graphs. It is focused on how the apis scale with graph size.
diff --git a/Samples/6_Performance/UnifiedMemoryPerf/README.md b/Samples/6_Performance/UnifiedMemoryPerf/README.md
@@ -2,7 +2,7 @@
 
 ## Description
 
-This sample demonstrates the performance comparision using matrix multiplication kernel of Unified Memory with/without hints and other types of memory like zero copy buffers, pageable, pagelocked memory performing synchronous and Asynchronous transfers on a single GPU.
+This sample demonstrates the performance comparison using matrix multiplication kernel of Unified Memory with/without hints and other types of memory like zero copy buffers, pageable, pagelocked memory performing synchronous and Asynchronous transfers on a single GPU.
 
 ## Key Concepts
 

diff --git a/Samples/7_libNVVM/device-side-launch/README.md b/Samples/7_libNVVM/device-side-launch/README.md
@@ -2,7 +2,7 @@ Device-Side Launch From NVVM IR
 ===============================
 
 This document is for the programming language and compiler implementers who
-target NVVM IR and plan to support Dynamic Parallelism in their langauge.
+target NVVM IR and plan to support Dynamic Parallelism in their language.
 It provides the low-level details related to supporting kernel launches at
 the NVVM IR level.