Reporium

huggingface/datasets

datasets

🤗 The largest hub of ready-to-use datasets for AI models with fast, easy-to-use and efficient data manipulation tools

View on GitHub ↗ • Upstream: huggingface/datasets ↗

Builder

HuggingFace

huggingface • ai-lab

Stars

21,548

Using upstream star count

Forks

3,227

Using upstream fork count

Open Issues

0

Activity Score

0/100

0 commits in 30d

Created

Mar 26, 2020

Project creation date

README Summary

[Logo: Hugging Face Datasets Library]

AI Dev Skills

Unmapped

Cross-platform Data Compatibility, Data Preprocessing and Feature Engineering, Dataset Curation and Management, Data Versioning and Reproducibility, Distributed Data Loading, Large-scale Data Processing, Machine Learning Pipeline Development, Memory-efficient Data Handling

Tags

Cross-platform Data Compatibility, Data Preprocessing and Feature Engineering, Dataset Curation and Management, Data Versioning and Reproducibility, Distributed Data Loading, Large-scale Data Processing, Machine Learning Pipeline Development, Memory-efficient Data Handling, Caching, Course, Curated List, Data Science, Evals, Forked, HuggingFace, Machine Learning, NumPy, Pandas, PyTorch, Python, Real-Time / Streaming, Research / Papers, TensorFlow, Transformers

Taxonomy

AI Trends

Open Source AI, Reproducible AI Research, Community-driven AI Development, Standardized ML Workflows

Category

Learning Resources, Foundation Models, Model Training, Evals & Benchmarking, Inference & Serving, Data Science & Analytics

Deployment Context

Cloud API, Self-hosted, On-premise

Modalities

Text, Image, Audio, Video, Tabular

Skill Areas

Dataset Curation and Management, Data Preprocessing and Feature Engineering, Large-scale Data Processing, Machine Learning Pipeline Development, Data Versioning and Reproducibility, Distributed Data Loading, Memory-efficient Data Handling, Cross-platform Data Compatibility

Tag

Caching, Course, Curated List, Data Science, Evals, Forked, HuggingFace, Machine Learning, NumPy, Pandas, PyTorch, Python, Real-Time / Streaming, Research / Papers, TensorFlow, Transformers

Use Cases

Training Language Models, Computer Vision Model Development, Speech Recognition Training, Dataset Benchmarking and Evaluation, Research Reproducibility, Multi-task Learning, Transfer Learning Experiments, Data Augmentation Workflows

Recent Activity

Updated 2 months ago

7 Days

0

30 Days

0

90 Days

20

set dev version (#8083)

Quentin Lhoest • Mar 19, 2026

4a0c9c9

Release 4.8.3 (#8082)

Quentin Lhoest • Mar 19, 2026

d4942e2

Fix split_dataset_by_node step (#8081)

Quentin Lhoest • Mar 19, 2026

1b31309

Quality

Quality: high
Maturity: production

Categories

Foundation Models (Primary), Model Training, Evals & Benchmarking, Inference & Serving, Data Science & Analytics, Search & Knowledge, Other AI / ML, Learning Resources

PM Skills

Cost & Efficiency, Scale & Reliability, Data & Evaluation

Languages

Python 100.0%

Timeline

Project created: Mar 26, 2020
Forked: Mar 22, 2026
Your last push: 2 months ago
Upstream last push: 16 days ago
Tracked since: Mar 19, 2026

Similar Repos

pgvector cosine similarity · $0
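
The metric named above can be illustrated with a minimal sketch. It assumes each repo is represented by an embedding vector. The vectors and repo names below are made up for illustration, and how Reporium actually builds or queries its embeddings is not shown on this page. In pgvector, the `<=>` operator returns cosine distance, which is 1 minus the cosine similarity computed here.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_similar(query, candidates):
    """Sort (name, score) pairs by descending similarity to the query vector."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical 3-dimensional embeddings, purely illustrative.
query = [1.0, 0.0, 1.0]
candidates = {
    "repo_a": [1.0, 0.0, 1.0],  # same direction as the query
    "repo_b": [0.0, 1.0, 0.0],  # orthogonal to the query
    "repo_c": [1.0, 1.0, 0.0],  # partially overlapping
}
ranking = rank_similar(query, candidates)
```

Because the score depends only on the direction of the vectors and not on their length, this kind of ranking is not affected by the overall magnitude of an embedding.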