2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -84,7 +84,7 @@ pre-commit install

Now code linters and formatters will be run each time you commit changes.

-You can skip these checks with `git commit --no-verify` or with the short version `git commit -n`, althoguh please note
+You can skip these checks with `git commit --no-verify` or with the short version `git commit -n`, although please note
that this may result in pull requests being rejected if subsequent checks fail.

## Review Process
4 changes: 2 additions & 2 deletions README.md
@@ -552,9 +552,9 @@ These CUDA features are needed by some CUDA samples. They are provided by either

CUFFT Callback Routines are user-supplied kernel routines that CUFFT will call when loading or storing data. These callback routines are only available on Linux x86_64 and ppc64le systems.

-#### CUDA Dynamic Parallellism
+#### CUDA Dynamic Parallelism

-CDP (CUDA Dynamic Parallellism) allows kernels to be launched from threads running on the GPU. CDP is only available on GPUs with SM architecture of 3.5 or above.
+CDP (CUDA Dynamic Parallelism) allows kernels to be launched from threads running on the GPU. CDP is only available on GPUs with SM architecture of 3.5 or above.

#### Multi-block Cooperative Groups

2 changes: 1 addition & 1 deletion Samples/1_Utilities/topologyQuery/README.md
@@ -2,7 +2,7 @@

## Description

-A simple exemple on how to query the topology of a system with multiple GPU
+A simple example on how to query the topology of a system with multiple GPU

## Key Concepts

6 changes: 3 additions & 3 deletions Samples/3_CUDA_Features/README.md
@@ -2,7 +2,7 @@


### [bf16TensorCoreGemm](./bf16TensorCoreGemm)
-A CUDA sample demonstrating __nv_bfloat16 (e8m7) GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register presssure.
+A CUDA sample demonstrating __nv_bfloat16 (e8m7) GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register pressure.

### [binaryPartitionCG](./binaryPartitionCG)
This sample is a simple code that illustrates binary partition cooperative groups and reduce within the thread block.
@@ -36,7 +36,7 @@ This sample demonstrates the use of the new CUDA WMMA API employing the Tensor C
In addition to that, it demonstrates the use of the new CUDA function attribute cudaFuncAttributeMaxDynamicSharedMemorySize that allows the application to reserve an extended amount of shared memory than it is available by default.

### [dmmaTensorCoreGemm](./dmmaTensorCoreGemm)
-CUDA sample demonstrates double precision GEMM computation using the Double precision Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register presssure. Further, this sample also demonstrates how to use cooperative groups async copy interface over a group for performing gmem to shmem async loads.
+CUDA sample demonstrates double precision GEMM computation using the Double precision Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register pressure. Further, this sample also demonstrates how to use cooperative groups async copy interface over a group for performing gmem to shmem async loads.

### [globalToShmemAsyncCopy](./globalToShmemAsyncCopy)
This sample implements matrix multiplication which uses asynchronous copy of data from global to shared memory when on compute capability 8.0 or higher. Also demonstrates arrive-wait barrier for synchronization.
@@ -69,7 +69,7 @@ A demonstration of CUDA Graphs creation, instantiation and launch using Graphs A
This sample demonstrates basic use of stream priorities.

### [tf32TensorCoreGemm](./tf32TensorCoreGemm)
-A CUDA sample demonstrating tf32 (e8m10) GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register presssure.
+A CUDA sample demonstrating tf32 (e8m10) GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register pressure.

### [warpAggregatedAtomicsCG](./warpAggregatedAtomicsCG)
This sample demonstrates how using Cooperative Groups (CG) to perform warp aggregated atomics to single and multiple counters, a useful technique to improve performance when many threads atomically add to a single or multiple counters.
2 changes: 1 addition & 1 deletion Samples/3_CUDA_Features/bf16TensorCoreGemm/README.md
@@ -2,7 +2,7 @@

## Description

-A CUDA sample demonstrating __nv_bfloat16 (e8m7) GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register presssure.
+A CUDA sample demonstrating __nv_bfloat16 (e8m7) GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register pressure.

## Key Concepts

2 changes: 1 addition & 1 deletion Samples/3_CUDA_Features/dmmaTensorCoreGemm/README.md
@@ -2,7 +2,7 @@

## Description

-CUDA sample demonstrates double precision GEMM computation using the Double precision Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register presssure. Further, this sample also demonstrates how to use cooperative groups async copy interface over a group for performing gmem to shmem async loads.
+CUDA sample demonstrates double precision GEMM computation using the Double precision Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register pressure. Further, this sample also demonstrates how to use cooperative groups async copy interface over a group for performing gmem to shmem async loads.

## Key Concepts

2 changes: 1 addition & 1 deletion Samples/3_CUDA_Features/tf32TensorCoreGemm/README.md
@@ -2,7 +2,7 @@

## Description

-A CUDA sample demonstrating tf32 (e8m10) GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register presssure.
+A CUDA sample demonstrating tf32 (e8m10) GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by cuda pipeline interface for gmem to shmem async loads which improves kernel performance and reduces register pressure.

## Key Concepts

2 changes: 1 addition & 1 deletion Samples/6_Performance/README.md
@@ -8,7 +8,7 @@ A simple test, showing huge access speed gap between aligned and misaligned stru
This sample demonstrates Matrix Transpose. Different performance are shown to achieve high performance.

### [UnifiedMemoryPerf](./UnifiedMemoryPerf)
-This sample demonstrates the performance comparision using matrix multiplication kernel of Unified Memory with/without hints and other types of memory like zero copy buffers, pageable, pagelocked memory performing synchronous and Asynchronous transfers on a single GPU.
+This sample demonstrates the performance comparison using matrix multiplication kernel of Unified Memory with/without hints and other types of memory like zero copy buffers, pageable, pagelocked memory performing synchronous and Asynchronous transfers on a single GPU.

### [cudaGraphsPerfScaling](./cudaGraphsPerfScaling)
This sample demonstrates the performance characteristics of cuda graphs. It is focused on how the apis scale with graph size.
2 changes: 1 addition & 1 deletion Samples/6_Performance/UnifiedMemoryPerf/README.md
@@ -2,7 +2,7 @@

## Description

-This sample demonstrates the performance comparision using matrix multiplication kernel of Unified Memory with/without hints and other types of memory like zero copy buffers, pageable, pagelocked memory performing synchronous and Asynchronous transfers on a single GPU.
+This sample demonstrates the performance comparison using matrix multiplication kernel of Unified Memory with/without hints and other types of memory like zero copy buffers, pageable, pagelocked memory performing synchronous and Asynchronous transfers on a single GPU.

## Key Concepts

2 changes: 1 addition & 1 deletion Samples/7_libNVVM/device-side-launch/README.md
@@ -2,7 +2,7 @@ Device-Side Launch From NVVM IR
===============================

This document is for the programming language and compiler implementers who
-target NVVM IR and plan to support Dynamic Parallelism in their langauge.
+target NVVM IR and plan to support Dynamic Parallelism in their language.
It provides the low-level details related to supporting kernel launches at
the NVVM IR level.
