Dataset Sources and Tools
High-quality data matters in many fields, particularly machine learning and artificial intelligence (AI/ML). Two identical systems can produce different results when the data fed to them differ in quality or quantity.
The list below highlights 50+ interesting repositories that either share datasets or provide tools for generating or processing them.
aaron-xichen/pytorch-playground | Base pretrained models and datasets in pytorch (MNIST, SVHN, CIFAR10, CIFAR100, STL10, AlexNet, VGG16, VGG19, ResNet, Inception, SqueezeNet) |
apple/ml-hypersim | The Hypersim Toolkit is a set of tools for generating photorealistic synthetic datasets from V-Ray scenes. |
argoai/argoverse-api | Official GitHub repository for Argoverse dataset |
awesomedata/awesome-public-datasets | A topic-centric list of HQ open datasets. |
benedekrozemberczki/datasets | A repository of pretty cool datasets that I collected for network science and machine learning research. |
cair/TsetlinMachine | The code and datasets for the Tsetlin Machine |
chiphuyen/lazynlp | Library to scrape and clean web pages to create massive datasets. |
chrieke/awesome-satellite-imagery-datasets | List of satellite image training datasets with annotations for computer vision and deep learning |
ckan/ckan | CKAN is an open-source DMS (data management system) for powering data hubs and data portals. CKAN makes it easy to publish, share and use data. It powers datahub.io, catalog.data.gov and europeandataportal.eu/data/en/dataset among many other sites. |
cocodataset/cocoapi | COCO API - Dataset @ http://cocodataset.org/ |
commaai/comma2k19 | A driving dataset for the development and validation of fused pose estimators and mapping algorithms |
covid19-data/covid19-data | COVID-19 workflows and datasets. |
CSAILVision/semantic-segmentation-pytorch | PyTorch implementation for Semantic Segmentation/Scene Parsing on the MIT ADE20K dataset |
datasets/covid-19 | Novel Coronavirus 2019 time series data on cases |
datitran/raccoon_dataset | The dataset is used to train my own raccoon detector and I blogged about it on Medium |
deepmind/mathematics_dataset | This dataset code generates mathematical question and answer pairs, from a range of question types at roughly school-level difficulty. |
experiencor/keras-yolo2 | Easy training on a custom dataset. Various backends (MobileNet and SqueezeNet) are supported. A YOLO demo that detects raccoons and runs entirely in the browser is accessible at https://git.io/vF7vI (not on Windows). |
facebookresearch/fastMRI | A large-scale dataset of both raw MRI measurements and clinical MRI images |
facebookresearch/ParlAI | A framework for training and evaluating AI models on a variety of openly available dialogue datasets. |
fbdesignpro/sweetviz | Visualize and compare datasets, target values and associations, with one line of code. |
github/CodeSearchNet | Datasets, tools, and benchmarks for representation learning of code. |
google-research-datasets/natural-questions | Natural Questions (NQ) contains real user questions issued to Google search, and answers found from Wikipedia by annotators. NQ is designed for the training and evaluation of automatic question answering systems. |
google-research-datasets/Objectron | Objectron is a dataset of short, object-centric video clips. The videos also contain AR session metadata, including camera poses, sparse point clouds and planes. In each video, the camera moves around and above the object and captures it from different views. Each object is annotated with a 3D bounding box that describes the object's position, orientation, and dimensions. The dataset contains about 15K annotated video clips and 4M annotated images in the following categories: bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes. |
googlecreativelab/quickdraw-dataset | Documentation on how to access and use the Quick, Draw! Dataset. |
huggingface/nlp | nlp: datasets and evaluation metrics for Natural Language Processing in NumPy, Pandas, PyTorch and TensorFlow |
ieee8023/covid-chestxray-dataset | We are building an open database of COVID-19 cases with chest X-ray or CT images. |
justmarkham/pandas-videos | Jupyter notebook and datasets from the pandas Q&A video series |
kroncrv/datasets | Datasets used for articles and stories made available on Pointer (www.pointer.nl) |
lindawangg/COVID-Net | COVID-Net model for COVID-19 detection on COVIDx dataset |
louisowen6/NLP_bahasa_resources | A Curated List of Dataset and Usable Library Resources for NLP in Bahasa Indonesia |
lyft/nuscenes-devkit | Devkit for the public 2019 Lyft Level 5 AV Dataset (fork of https://github.com/nutonomy/nuscenes-devkit) |
mdeff/fma | FMA: A Dataset For Music Analysis |
mims-harvard/TDC | Therapeutics Data Commons: Machine Learning Datasets for Therapeutics |
minimaxir/textgenrnn | Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code. |
neomatrix369/nlp_profiler | A simple NLP library for profiling datasets with one or more text columns. Given a dataset and the name of a column containing text data, NLP Profiler returns either high-level insights or low-level, granular statistical information about the text in that column. |
NVlabs/ffhq-dataset | Flickr-Faces-HQ Dataset (FFHQ) |
oarriaga/face_classification | Real-time face detection and emotion/gender classification using the fer2013/imdb datasets with a Keras CNN model and OpenCV. |
ondyari/FaceForensics | GitHub repository of the FaceForensics dataset |
openimages/dataset | The Open Images dataset |
PAIR-code/facets | Visualizations for machine learning datasets |
pangeo-data/WeatherBench | A benchmark dataset for data-driven weather forecasting |
paraschopra/bayesian-neural-network-mnist | Bayesian neural network using Pyro and PyTorch on MNIST dataset |
PolyAI-LDN/conversational-datasets | Large datasets for conversational AI |
pydata/xarray | N-D labeled arrays and datasets in Python |
pytorch/vision | Datasets, Transforms and Models specific to Computer Vision |
rajpurkar/SQuAD-explorer | Visually Explore the Stanford Question Answering Dataset |
Ranlot/single-parameter-fit | Real numbers, data science and chaos: How to fit any dataset with a single parameter |
RedditSota/state-of-the-art-result-for-machine-learning-problems | This repository provides state-of-the-art (SoTA) results for all machine learning problems. We do our best to keep this repository up to date. If you find that a problem’s SoTA result is out of date or missing, please raise an issue or submit the Google form (with this information: research paper name, dataset, metric, source code and year). We will fix it immediately. |
sebastianruder/NLP-progress | Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks. |
simonw/datasette | A tool for exploring and publishing data |
streamlit/demo-self-driving | Streamlit app demonstrating an image browser for the Udacity self-driving-car dataset with realtime object detection using YOLO. |
switchablenorms/DeepFashion2 | DeepFashion2 Dataset https://arxiv.org/pdf/1901.07973.pdf |
target/matrixprofile-ts | A Python library for detecting patterns and anomalies in massive datasets using the Matrix Profile |
tensorflow/datasets | TFDS is a collection of datasets ready to use with TensorFlow |
tensorflow/tensor2tensor | Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research. |
ThisIsIsaac/COVID-19_Korea_Dataset | COVID-19 Korea Dataset & Comprehensive Medical Dataset & visualizer |
TorchCraft/StarData | Starcraft AI Research Dataset |
UCSD-AI4H/COVID-CT | COVID-CT-Dataset: A CT Scan Dataset about COVID-19 |
ufoym/imbalanced-dataset-sampler | A (PyTorch) imbalanced dataset sampler for oversampling low-frequency classes and undersampling high-frequency ones. |
waymo-research/waymo-open-dataset | Waymo Open Dataset |
willhaslett/covid-19-growth | Daily COVID-19 epidemiological data, piped into friendly Pandas dataframes, functions for dataset construction |
wizyoung/YOLOv3_TensorFlow | Complete YOLO v3 TensorFlow implementation. Supports training on your own dataset. |
yining1023/doodleNet | A doodle classifier (CNN), trained on all 345 categories from the Quick, Draw! dataset. |
Yochengliu/awesome-point-cloud-analysis | A list of papers and datasets about point cloud analysis (processing) |
YunYang1994/tensorflow-yolov3 | Pure TensorFlow implementation of YOLOv3 with support for training on your own dataset |
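
Several entries in the list deal with dataset processing rather than with the data itself. As one illustration, the idea behind the ufoym/imbalanced-dataset-sampler entry (oversampling rare classes and undersampling common ones) can be sketched in plain Python. This is a minimal sketch that uses only the standard library and does not reproduce that repository's API. The function names `inverse_frequency_weights` and `resample_balanced`, the toy labels, and the seed are all assumptions made for illustration.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each sample by 1 / (number of samples in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


def resample_balanced(samples, labels, k, seed=0):
    """Draw k samples with replacement so that rare classes are favoured.

    Hypothetical helper for illustration; not part of any listed library.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    weights = inverse_frequency_weights(labels)
    indices = rng.choices(range(len(samples)), weights=weights, k=k)
    return [samples[i] for i in indices], [labels[i] for i in indices]


# Toy, made-up dataset: 90 "cat" samples and 10 "dog" samples.
labels = ["cat"] * 90 + ["dog"] * 10
samples = list(range(len(labels)))

weights = inverse_frequency_weights(labels)
drawn_samples, drawn_labels = resample_balanced(samples, labels, k=1000)

# After resampling, both classes should appear in roughly equal numbers.
print(Counter(drawn_labels))
```

Because every class ends up with the same total weight, each class is equally likely to be drawn regardless of its original size. Sampling with replacement means rare samples are repeated, so this balances class frequencies without adding new information.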