vllm-project/vllm
vllm
A high-throughput and memory-efficient inference and serving engine for LLMs

vLLM
vllm-project • startup
Stars: 75,076 (using upstream star count)
Forks: 15,115 (using upstream fork count)
Open Issues: 0
Activity Score: 0/100 (1,038 commits in 30 days)
Created: Feb 9, 2023
README Summary
vLLM is a fast, memory-efficient inference and serving engine for large language models (LLMs). It delivers high-throughput serving through continuous batching, optimized CUDA kernels, and support for popular model families such as Llama and GPT. It offers both an offline inference API and an OpenAI-compatible online serving endpoint, making LLM applications easy to deploy and scale, and it maximizes GPU utilization while minimizing memory usage through techniques such as PagedAttention.
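The offline API mentioned above is compact; here is a minimal sketch, assuming the `vllm` package is installed and a supported GPU is available. `facebook/opt-125m` is just a small placeholder checkpoint; any supported model ID works.

```python
from vllm import LLM, SamplingParams

# Load the model; vLLM manages KV-cache paging (PagedAttention)
# and continuous batching internally.
llm = LLM(model="facebook/opt-125m")

# Decoding controls: temperature, nucleus sampling, output length.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() accepts a batch of prompts and schedules them together.
prompts = [
    "Hello, my name is",
    "The capital of France is",
]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

For online serving, the same engine is exposed as an OpenAI-compatible HTTP server (the `vllm serve <model>` CLI in recent releases), so existing OpenAI client code only needs its base URL pointed at the local endpoint.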
AI Dev Skills: Unmapped
Tags
Taxonomy
AI Trends
Deployment Context
Modalities
Skill Areas
Recent Activity
Updated 27 days ago
Commits in last 7 days: 194
Commits in last 30 days: 1,038
Commits in last 90 days: 3,592
Quality
- Quality: high
- Maturity: production
Categories
PM Skills
Languages
Timeline
- Project created: Feb 9, 2023
- Forked: Mar 13, 2026
- Your last push: 27 days ago
- Upstream last push: 6 days ago
- Tracked since: Mar 17, 2026
Similar Repos