
CUDA Utils

CUDA Utils is a header-only library that significantly simplifies complex CUDA kernel code. It provides intuitive wrapper classes for multi-dimensional tensors, making CUDA programming more readable and less error-prone, especially for advanced use cases like high-performance GEMM implementations.

Usage Examples

The following examples demonstrate how CUDA Utils can dramatically improve code readability and reduce complexity in CUDA kernels. These snippets are based on real-world usage in high-performance GEMM kernels like QuadMul, OctoMul, and OctoQuadMul.

Example 1: Simplified Dynamic Indexing and Masking

Before:

if (input_mask[batch_idx * num_heads * input_dim1 * input_dim2 +
               head_idx * input_dim1 * input_dim2 +
               mask_i * input_dim2 + mask_j] == 0) {
    output_tensor[batch_idx * num_heads * output_dim1 * output_dim2 +
                  head_idx * output_dim1 * output_dim2 +
                  i * output_dim2 + j] = -INFINITY;
}

After (with CUDA Utils):

GMemTensor4D<float> output(output_tensor, batch_size, num_heads, output_dim1, output_dim2);
GMemTensor4D<int> mask(input_mask, batch_size, num_heads, input_dim1, input_dim2);

if (mask.get(batch_idx, head_idx, mask_i, mask_j) == 0) {
    output.set(batch_idx, head_idx, i, j, -INFINITY);
}
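
Under the hood, a wrapper like this only needs to capture the tensor's dimensions once and flatten indices with row-major arithmetic. The sketch below is illustrative only and is not the library's actual header; the class name GMemTensor4DSketch, the member names, and the qualifiers are assumptions inferred from the usage shown above.

// Illustrative sketch of a 4D global-memory tensor wrapper (not the library's
// actual implementation). Indexing matches the row-major layout of the
// "Before" snippet above.
template <typename T>
struct GMemTensor4DSketch {
    T *data;
    int d0, d1, d2, d3;

    __host__ __device__ GMemTensor4DSketch(T *ptr, int dim0, int dim1, int dim2, int dim3)
        : data(ptr), d0(dim0), d1(dim1), d2(dim2), d3(dim3) {}

    // Flatten (i, j, k, l) into a linear offset: ((i * d1 + j) * d2 + k) * d3 + l
    __device__ int index(int i, int j, int k, int l) const {
        return ((i * d1 + j) * d2 + k) * d3 + l;
    }

    __device__ T get(int i, int j, int k, int l) const { return data[index(i, j, k, l)]; }
    __device__ void set(int i, int j, int k, int l, T value) { data[index(i, j, k, l)] = value; }
    __device__ T *get_ptr(int i, int j, int k, int l) { return &data[index(i, j, k, l)]; }
};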

Example 2: Simplified Memory Loading in GEMM Kernels

Before:

uint8_t *shared_ptr = &shared_A[stage][row * Config::kTileSizeK + col];
uint8_t *global_ptr = &A[batch_idx * M * Config::K +
                         (block_row_start + row) * Config::K +
                         k_offset + col];
__pipeline_memcpy_async(shared_ptr, global_ptr, sizeof(Data128B));

After (with CUDA Utils):

__pipeline_memcpy_async(
    smemA.get_ptr(stage, row, col),
    gmemA.get_ptr(batch_idx, block_row_start + row, k_offset + col),
    sizeof(Data128B));
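
For context, the smemA and gmemA objects above are constructed once, outside the copy loop, from the same raw pointers and dimensions used in the manual version. The class names SMemTensor3D and GMemTensor3D, and the Config fields kNumStages and kTileSizeM, are assumptions patterned on GMemTensor4D; the actual header may use different names.

// Assumed setup for the snippet above; names and signatures are illustrative.
// Shared-memory tile of A, viewed as [stage][row][col].
SMemTensor3D<uint8_t> smemA(&shared_A[0][0],
                            Config::kNumStages, Config::kTileSizeM, Config::kTileSizeK);
// Global operand A, viewed as [batch][row][k], matching the manual offsets
// in the "Before" code.
GMemTensor3D<uint8_t> gmemA(A, batch_size, M, Config::K);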

Benefits

  1. Improved Readability: Complex indexing operations become self-explanatory.
  2. Reduced Errors: Multi-dimensional index arithmetic is encapsulated in one place, eliminating hand-written offset calculations from kernel bodies.
  3. Performance-Oriented: Designed for high-performance computing with efficient memory access patterns.
  4. Type-Safe Memory Reinterpretation: get_reinterpreted<>() and set_reinterpreted<>() methods allow safe and easy reinterpretation of memory (see the sketch after this list).
  5. Simplified Shared Memory Management: Easier setup and access to shared memory in complex kernels.
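
As a rough illustration of point 4, reinterpretation is typically used to issue vectorized accesses, for example reading 16 consecutive bytes as a single 128-bit value. The call below is a sketch of how get_reinterpreted<>() might be used; the exact signature is not documented in this README, and the pointer and dimension names are placeholders.

// Hypothetical use of get_reinterpreted<>() for a 128-bit vectorized load;
// the signature is assumed, so consult the header for the real API.
GMemTensor4D<uint8_t> scores(score_bytes, batch_size, num_heads, rows, cols);

// Read 16 consecutive uint8_t elements as one float4 (128 bits) in a single
// transaction instead of 16 separate byte loads.
float4 packed = scores.get_reinterpreted<float4>(batch_idx, head_idx, row, col);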

License

MIT
