cuSPARSELt: A High-Performance CUDA Library for Sparse Matrix-Matrix Multiplication#
NVIDIA cuSPARSELt is a high-performance CUDA library dedicated to general matrix-matrix operations in which at least one operand is a sparse matrix:
D = Activation(alpha · op(A) · op(B) + beta · op(C) + bias) · scale

where op() refers to in-place operations such as transpose/non-transpose, and alpha, beta, bias, and scale are scalars or vectors.
The cuSPARSELt APIs allow flexibility in the algorithm/operation selection, epilogue, and matrix characteristics, including memory layout, alignment, and data types.
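The operation above can be sketched in plain Python as a conceptual reference (this is not the library API; ReLU stands in for the configurable activation, and the scalar `alpha`, `beta`, `bias`, and `scale` arguments are illustrative):

```python
# Conceptual reference for D = Activation(alpha * op(A) * op(B) + beta * op(C) + bias) * scale.
# Pure Python on small dense matrices; the real library operates on a structured-sparse A
# on the GPU, and op() would apply transpose/non-transpose.

def matmul(A, B):
    """Dense matrix product of row-major nested lists."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def spmm_epilogue(A, B, C, alpha=1.0, beta=0.0, bias=0.0, scale=1.0,
                  activation=lambda x: max(x, 0.0)):  # ReLU as an example activation
    AB = matmul(A, B)
    return [[activation(alpha * AB[i][j] + beta * C[i][j] + bias) * scale
             for j in range(len(AB[0]))]
            for i in range(len(AB))]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
C = [[0.0, 0.0], [0.0, -10.0]]
print(spmm_epilogue(A, B, C, alpha=1.0, beta=1.0))  # ReLU clamps the negative entry
```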
Download: developer.nvidia.com/cusparselt/downloads
Provide Feedback: Math-Libs-Feedback@nvidia.com
Examples: cuSPARSELt Example 1, cuSPARSELt Example 2
Blog posts:
Exploiting NVIDIA Ampere Structured Sparsity with cuSPARSELt
Structured Sparsity in the NVIDIA Ampere Architecture and Applications in Search Engines
Making the Most of Structured Sparsity in the NVIDIA Ampere Architecture
Key Features#
NVIDIA Sparse MMA tensor core support
Mixed-precision computation support:

| Input A/B | Input C | Output D | Compute | Support arch |
|---|---|---|---|---|
| FP32 | FP32 | FP32 | FP32 | SM 8.0, 8.6, 8.7, 9.0 |
| BF16 | BF16 | BF16 | FP32 | SM 8.0, 8.6, 8.7, 9.0 |
| FP16 | FP16 | FP16 | FP32 | SM 8.0, 8.6, 8.7, 9.0 |
| FP16 | FP16 | FP16 | FP16 | SM 9.0 |
| INT8 | INT8 | INT8 | INT32 | SM 8.0, 8.6, 8.7, 9.0 |
| INT8 | INT32 | INT32 | INT32 | SM 8.0, 8.6, 8.7, 9.0 |
| INT8 | FP16 | FP16 | INT32 | SM 8.0, 8.6, 8.7, 9.0 |
| INT8 | BF16 | BF16 | INT32 | SM 8.0, 8.6, 8.7, 9.0 |
| E4M3 | FP16 | E4M3 | FP32 | SM 9.0 |
| E4M3 | BF16 | E4M3 | FP32 | SM 9.0 |
| E4M3 | FP16 | FP16 | FP32 | SM 9.0 |
| E4M3 | BF16 | BF16 | FP32 | SM 9.0 |
| E4M3 | FP32 | FP32 | FP32 | SM 9.0 |
| E5M2 | FP16 | E5M2 | FP32 | SM 9.0 |
| E5M2 | BF16 | E5M2 | FP32 | SM 9.0 |
| E5M2 | FP16 | FP16 | FP32 | SM 9.0 |
| E5M2 | BF16 | BF16 | FP32 | SM 9.0 |
| E5M2 | FP32 | FP32 | FP32 | SM 9.0 |
Matrix pruning and compression functionalities
Activation functions, bias vector, and output scaling
Batched computation (multiple matrices in a single run)
GEMM Split-K mode
Auto-tuning functionality (see cusparseLtMatmulSearch())
NVTX ranging and logging functionalities
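As an illustration of what pruning and compression do, here is a conceptual Python sketch of the 2:4 structured-sparsity format used by the Sparse MMA tensor cores (this is not the `cusparseLtSpMMAPrune`/`cusparseLtSpMMACompress` API, and the real metadata is packed as 2-bit indices rather than Python lists): each group of four consecutive values keeps its two largest-magnitude entries, and the compressed form stores only the kept values plus their in-group positions.

```python
def prune_2_4(row):
    """Zero out the two smallest-magnitude values in every group of four."""
    assert len(row) % 4 == 0
    out = list(row)
    for g in range(0, len(row), 4):
        # Indices of the two largest-magnitude entries in this group of four.
        keep = sorted(range(g, g + 4), key=lambda i: abs(row[i]), reverse=True)[:2]
        for i in range(g, g + 4):
            if i not in keep:
                out[i] = 0.0
    return out

def compress_2_4(pruned):
    """Keep only nonzero values plus per-group positional metadata (assumes no kept zeros)."""
    values, metadata = [], []
    for g in range(0, len(pruned), 4):
        idx = [i - g for i in range(g, g + 4) if pruned[i] != 0.0]
        values.extend(pruned[i] for i in range(g, g + 4) if pruned[i] != 0.0)
        metadata.append(idx)  # positions of the kept values within the group
    return values, metadata

row = [0.5, -2.0, 0.1, 3.0, 1.0, 0.0, -4.0, 0.2]
pruned = prune_2_4(row)
print(pruned)                # [0.0, -2.0, 0.0, 3.0, 1.0, 0.0, -4.0, 0.0]
print(compress_2_4(pruned))  # ([-2.0, 3.0, 1.0, -4.0], [[1, 3], [0, 2]])
```

The compressed matrix is roughly half the size of the dense one, which is where the memory and bandwidth savings of structured sparsity come from.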
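Split-K mode is likewise easy to picture with a hypothetical pure-Python sketch (the idea, not the library's GPU implementation): the K dimension of the product is split into slices, each slice yields a partial product, and the partials are summed in a final reduction, which exposes extra parallelism when M and N are small relative to K.

```python
def gemm(A, B):
    """Plain dense matrix product of row-major nested lists."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def gemm_split_k(A, B, splits):
    """Compute A @ B as a sum of partial products over K slices."""
    k, n, m = len(B), len(A), len(B[0])
    D = [[0.0] * m for _ in range(n)]
    step = -(-k // splits)  # ceil division: slice width along K
    for s in range(0, k, step):
        A_slice = [r[s:s + step] for r in A]  # n x step
        B_slice = B[s:s + step]               # step x m
        partial = gemm(A_slice, B_slice)      # would run concurrently on the GPU
        for i in range(n):                    # final reduction over partials
            for j in range(m):
                D[i][j] += partial[i][j]
    return D

A = [[1.0, 2.0, 3.0, 4.0]]
B = [[1.0], [1.0], [1.0], [1.0]]
print(gemm_split_k(A, B, splits=2))  # same result as a single gemm: [[10.0]]
```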
Support#
Supported SM Architectures: SM 8.0, SM 8.6, SM 8.7, SM 8.9, SM 9.0
Supported CPU architectures and operating systems: