Reporium

NVIDIA/cutlass

cutlass

CUDA Templates and Python DSLs for High-Performance Linear Algebra


Builder

NVIDIA • big-tech

Stars

9,810

Using upstream star count

Forks

1,883

Using upstream fork count

Open Issues

0

Activity Score

0/100

0 commits in 30d

Created

Nov 30, 2017

Project creation date

README Summary

CUTLASS is a collection of abstractions for implementing high-performance matrix-matrix multiplication (GEMM) and related computations at all levels and scales within CUDA. It incorporates strategies for hierarchical decomposition and data movement. CUTLASS decomposes these "moving parts" into reusable, modular software components and abstractions.
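As a rough illustration of what "hierarchical decomposition and data movement" means for GEMM, the sketch below splits the output matrix into tiles, walks the K dimension one tile at a time, and accumulates partial products per tile. It is plain Python and does not use the CUTLASS API; the names `tiled_gemm`, `naive_gemm`, and the `block` parameter are hypothetical, and the mapping of loop levels to GPU thread blocks and shared memory is only an analogy.

```python
def naive_gemm(a, b):
    """Reference C = A @ B on lists of lists, with no tiling."""
    n = len(b[0])
    return [[sum(a_row[kk] * b[kk][j] for kk in range(len(b)))
             for j in range(n)] for a_row in a]


def tiled_gemm(a, b, block=(4, 4, 4)):
    """C = A @ B computed tile by tile.

    block = (bm, bn, bk) are illustrative tile sizes, not CUTLASS defaults.
    """
    m, k = len(a), len(a[0])
    n = len(b[0])
    assert len(b) == k, "inner dimensions must match"
    bm, bn, bk = block
    c = [[0.0] * n for _ in range(m)]

    # Outer level: each (i0, j0) output tile is loosely analogous to the
    # work assigned to one GPU thread block.
    for i0 in range(0, m, bm):
        for j0 in range(0, n, bn):
            # Middle level: step along K, copying out one tile of A and one
            # tile of B at a time (an analogy for staging data in fast memory).
            for k0 in range(0, k, bk):
                a_tile = [row[k0:k0 + bk] for row in a[i0:i0 + bm]]
                b_tile = [row[j0:j0 + bn] for row in b[k0:k0 + bk]]
                # Inner level: multiply-accumulate within the staged tiles.
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        acc = c[i0 + i][j0 + j]
                        for kk, a_val in enumerate(a_row):
                            acc += a_val * b_tile[kk][j]
                        c[i0 + i][j0 + j] = acc
    return c
```

Tile sizes that do not divide the matrix dimensions are handled by the slices, which simply return shorter edge tiles. On a GPU the same structure is what lets each level reuse data from a faster memory tier; here it only shows the loop nesting.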


AI Dev Skills

Unmapped

CUDA Programming, Deep Learning Acceleration, GPU Architecture Understanding, GPU Kernel Optimization, High-Performance Computing, Linear Algebra Operations, Matrix Multiplication Optimization, Memory Hierarchy Optimization, Performance Profiling and Optimization, Tensor Operations

Tags

CUDA Programming, Deep Learning Acceleration, GPU Architecture Understanding, GPU Kernel Optimization, High-Performance Computing, Linear Algebra Operations, Matrix Multiplication Optimization, Memory Hierarchy Optimization, Performance Profiling and Optimization, Tensor Operations, AI Safety, C++, CLI Tool, Curated List, Forked, GPU / CUDA, Node.js, Open Source, PyTorch, Python

Taxonomy

AI Trends

Hardware-Software Co-optimization, Efficient Deep Learning, GPU Acceleration, High-Performance AI Infrastructure

Category

Dev Tools & Automation, Model Training, Inference & Serving, Learning Resources, Security & Safety

Deployment Context

GPU Servers, Cloud GPU Instances, On-premise GPU Clusters, Edge GPU Devices

Modalities

Numerical Data, Tensors

Skill Areas

CUDA Programming, GPU Kernel Optimization, High-Performance Computing, Linear Algebra Operations, Deep Learning Acceleration, Matrix Multiplication Optimization, Memory Hierarchy Optimization, Tensor Operations, GPU Architecture Understanding, Performance Profiling and Optimization

Tag

AI Safety, C++, CLI Tool, Curated List, Forked, GPU / CUDA, Node.js, Open Source, PyTorch, Python

Use Cases

Deep Learning Model Acceleration, High-Performance Matrix Computations, GPU-Accelerated Scientific Computing, Neural Network Inference Optimization, Training Loop Acceleration, Custom CUDA Kernel Development

Recent Activity

Updated 2 months ago

7 Days

0

30 Days

0

90 Days

4

docs: Fix float16 documentation in elementwise_add notebook (#2949) (#3047)

Blake Ledden • Mar 12, 2026

087c84d

Support for Group GEMM in CUTLASS Profiler for Geforce and Spark (#3092)

dePaul Miller • Mar 7, 2026

73c59c0

[fix] Boolean.__dsl_and__ emits arith.andi directly for i1 operands (#3087)

Johnsonms • Mar 5, 2026

e5fcd12

Quality

Quality: high
Maturity: production

Categories

Dev Tools & Automation (Primary), Inference & Serving, Learning Resources, Security & Safety, Model Training, Safety & Alignment, Other AI / ML

PM Skills

Safety & Alignment, Developer Platform

Languages

C++ 100.0%

Timeline

Project created: Nov 30, 2017
Forked: Mar 14, 2026
Your last push: 2 months ago
Upstream last push: 21 days ago
Tracked since: Mar 12, 2026
