NVIDIA/cutlass

cutlass

CUDA Templates and Python DSLs for High-Performance Linear Algebra

Builder

NVIDIA • big-tech

Stars

9,525

Using upstream star count

Forks

1,768

Using upstream fork count

Open Issues

0

Activity Score

0/100

0 commits in 30d

Created

Nov 30, 2017

Project creation date

README Summary

CUTLASS is a collection of CUDA C++ template abstractions for implementing high-performance matrix multiplication (GEMM) and related computations at all levels and scales within CUDA kernels. Its modular, composable API separates concerns so that it applies to a broad range of programs and provides a programming model for specializing and tuning CUDA kernels. The library also includes Python DSLs for code generation and supports mixed-precision computation optimized for modern GPU architectures.

AI Dev Skills

Unmapped

CUDA Programming, GPU Kernel Optimization, High-Performance Computing, Linear Algebra Operations, Deep Learning Acceleration, Matrix Multiplication Optimization, Memory Hierarchy Optimization, Tensor Operations, GPU Architecture Understanding, Performance Profiling and Optimization

Tags

CUDA Programming, GPU Kernel Optimization, High-Performance Computing, Linear Algebra Operations, Deep Learning Acceleration, Matrix Multiplication Optimization, Memory Hierarchy Optimization, Tensor Operations, GPU Architecture Understanding, Performance Profiling and Optimization, High-Performance Matrix Computations, Numerical Data, Tensors, Custom CUDA Kernel Development, Cloud GPU Instances, Edge GPU Devices, Efficient Deep Learning, GPU Servers, Neural Network Inference Optimization, GPU Acceleration, High-Performance AI Infrastructure, On-premise GPU Clusters, GPU-Accelerated Scientific Computing, Deep Learning Model Acceleration, Training Loop Acceleration, Hardware-Software Co-optimization, C++

Taxonomy

Recent Activity

Updated 1 month ago

7 Days

0

30 Days

0

90 Days

0

Quality

Quality
high
Maturity
production

Categories

Inference & Serving (Primary), ML Platform & Infrastructure, Edge & Mobile AI, Other AI / ML, Model Training

PM Skills

Data & Evaluation, Cost & Efficiency

Languages

C++ 100.0%

Timeline

Project created
Nov 30, 2017
Forked
Mar 14, 2026
Your last push
1 month ago
Upstream last push
11 days ago
Tracked since
Mar 12, 2026
