November 19, 2020

Dataset Sources and Tools

Data quality matters in many fields, and especially in machine learning and artificial intelligence (AI/ML). Two identical systems can produce different results if the data fed to them differ in quality or quantity.

In the list below, we highlight 50+ interesting repositories that either share datasets or provide tools for dataset generation and processing.

aaron-xichen/pytorch-playground Base pretrained models and datasets in PyTorch (MNIST, SVHN, CIFAR10, CIFAR100, STL10, AlexNet, VGG16, VGG19, ResNet, Inception, SqueezeNet)
apple/ml-hypersim The Hypersim Toolkit is a set of tools for generating photorealistic synthetic datasets from V-Ray scenes.
argoai/argoverse-api Official GitHub repository for Argoverse dataset
awesomedata/awesome-public-datasets A topic-centric list of HQ open datasets.
benedekrozemberczki/datasets A repository of pretty cool datasets that I collected for network science and machine learning research.
cair/TsetlinMachine The code and datasets for the Tsetlin Machine
chiphuyen/lazynlp Library to scrape and clean web pages to create massive datasets.
chrieke/awesome-satellite-imagery-datasets List of satellite image training datasets with annotations for computer vision and deep learning
ckan/ckan CKAN is an open-source DMS (data management system) for powering data hubs and data portals. CKAN makes it easy to publish, share and use data. It powers datahub.io, catalog.data.gov and europeandataportal.eu/data/en/dataset among many other sites.
cocodataset/cocoapi COCO API - Dataset @ http://cocodataset.org/
commaai/comma2k19 A driving dataset for the development and validation of fused pose estimators and mapping algorithms
covid19-data/covid19-data COVID-19 workflows and datasets.
CSAILVision/semantic-segmentation-pytorch PyTorch implementation for Semantic Segmentation/Scene Parsing on MIT ADE20K dataset
datasets/covid-19 Novel Coronavirus 2019 time series data on cases
datitran/raccoon_dataset The dataset is used to train my own raccoon detector and I blogged about it on Medium
deepmind/mathematics_dataset This dataset code generates mathematical question and answer pairs, from a range of question types at roughly school-level difficulty.
experiencor/keras-yolo2 Easy training on custom dataset. Various backends (MobileNet and SqueezeNet) supported. A YOLO demo that detects raccoons and runs entirely in the browser is accessible at https://git.io/vF7vI (not on Windows).
facebookresearch/fastMRI A large-scale dataset of both raw MRI measurements and clinical MRI images
facebookresearch/ParlAI A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
fbdesignpro/sweetviz Visualize and compare datasets, target values and associations, with one line of code.
github/CodeSearchNet Datasets, tools, and benchmarks for representation learning of code.
google-research-datasets/natural-questions Natural Questions (NQ) contains real user questions issued to Google search, and answers found from Wikipedia by annotators. NQ is designed for the training and evaluation of automatic question answering systems.
google-research-datasets/Objectron Objectron is a dataset of short, object-centric video clips. The videos also contain AR session metadata including camera poses, sparse point-clouds and planes. In each video, the camera moves around and above the object and captures it from different views. Each object is annotated with a 3D bounding box, which describes the object's position, orientation, and dimensions. The dataset contains about 15K annotated video clips and 4M annotated images in the following categories: bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes.
googlecreativelab/quickdraw-dataset Documentation on how to access and use the Quick, Draw! Dataset.
huggingface/nlp nlp: datasets and evaluation metrics for Natural Language Processing in NumPy, Pandas, PyTorch and TensorFlow
ieee8023/covid-chestxray-dataset We are building an open database of COVID-19 cases with chest X-ray or CT images.
justmarkham/pandas-videos Jupyter notebook and datasets from the pandas Q&A video series
kroncrv/datasets Datasets used for articles and stories made available on Pointer (www.pointer.nl)
lindawangg/COVID-Net COVID-Net model for COVID-19 detection on COVIDx dataset
louisowen6/NLP_bahasa_resources A Curated List of Dataset and Usable Library Resources for NLP in Bahasa Indonesia
lyft/nuscenes-devkit Devkit for the public 2019 Lyft Level 5 AV Dataset (fork of https://github.com/nutonomy/nuscenes-devkit)
mdeff/fma FMA: A Dataset For Music Analysis
mims-harvard/TDC Therapeutics Data Commons: Machine Learning Datasets for Therapeutics
minimaxir/textgenrnn Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code.
neomatrix369/nlp_profiler A simple NLP library that allows profiling datasets with one or more text columns. When given a dataset and a column name containing text data, NLP Profiler will return either high-level insights or low-level/granular statistical information about the text in that column.
NVlabs/ffhq-dataset Flickr-Faces-HQ Dataset (FFHQ)
oarriaga/face_classification Real-time face detection and emotion/gender classification using fer2013/imdb datasets with a keras CNN model and openCV.
ondyari/FaceForensics GitHub repository of the FaceForensics dataset
openimages/dataset The Open Images dataset
PAIR-code/facets Visualizations for machine learning datasets
pangeo-data/WeatherBench A benchmark dataset for data-driven weather forecasting
paraschopra/bayesian-neural-network-mnist Bayesian neural network using Pyro and PyTorch on MNIST dataset
PolyAI-LDN/conversational-datasets Large datasets for conversational AI
pydata/xarray N-D labeled arrays and datasets in Python
pytorch/vision Datasets, Transforms and Models specific to Computer Vision
rajpurkar/SQuAD-explorer Visually Explore the Stanford Question Answering Dataset
Ranlot/single-parameter-fit Real numbers, data science and chaos: How to fit any dataset with a single parameter
RedditSota/state-of-the-art-result-for-machine-learning-problems This repository provides state of the art (SoTA) results for all machine learning problems. We do our best to keep this repository up to date. If you find that a problem’s SoTA result is out of date or missing, please raise an issue or submit the Google form (with this information: research paper name, dataset, metric, source code and year). We will fix it immediately.
sebastianruder/NLP-progress Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.
simonw/datasette A tool for exploring and publishing data
streamlit/demo-self-driving Streamlit app demonstrating an image browser for the Udacity self-driving-car dataset with realtime object detection using YOLO.
switchablenorms/DeepFashion2 DeepFashion2 Dataset https://arxiv.org/pdf/1901.07973.pdf
target/matrixprofile-ts A Python library for detecting patterns and anomalies in massive datasets using the Matrix Profile
tensorflow/datasets TFDS is a collection of datasets ready to use with TensorFlow
tensorflow/tensor2tensor Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
ThisIsIsaac/COVID-19_Korea_Dataset COVID-19 Korea Dataset & Comprehensive Medical Dataset & visualizer
TorchCraft/StarData Starcraft AI Research Dataset
UCSD-AI4H/COVID-CT COVID-CT-Dataset: A CT Scan Dataset about COVID-19
ufoym/imbalanced-dataset-sampler A (PyTorch) imbalanced dataset sampler for oversampling low-frequency classes and undersampling high-frequency ones.
waymo-research/waymo-open-dataset Waymo Open Dataset
willhaslett/covid-19-growth Daily COVID-19 epidemiological data, piped into friendly Pandas dataframes, functions for dataset construction
wizyoung/YOLOv3_TensorFlow Complete YOLO v3 TensorFlow implementation. Supports training on your own dataset.
yining1023/doodleNet A doodle classifier (CNN), trained on all 345 categories from the Quickdraw dataset.
Yochengliu/awesome-point-cloud-analysis A list of papers and datasets about point cloud analysis (processing)
YunYang1994/tensorflow-yolov3 Pure TensorFlow implementation of YOLOv3 with support for training on your own dataset
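
Several entries in the list deal with dataset processing rather than the data itself. As one illustration, the idea behind the ufoym/imbalanced-dataset-sampler entry (oversampling rare classes and undersampling common ones) can be sketched in plain Python. This is a minimal sketch of inverse-frequency weighted sampling, not that library's API: the function names `inverse_frequency_weights` and `sample_balanced_indices` and the toy labels are assumptions made up for this example.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one sampling weight per example, equal to 1 / (size of its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


def sample_balanced_indices(labels, num_samples, seed=0):
    """Draw example indices with replacement so that each class is
    picked about equally often, regardless of its original size."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    weights = inverse_frequency_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=num_samples)


if __name__ == "__main__":
    # Hypothetical toy dataset: 90 examples of one class, 10 of another.
    labels = ["cat"] * 90 + ["dog"] * 10
    picked = sample_balanced_indices(labels, num_samples=1000)
    # Class counts among the drawn samples should be roughly equal.
    print(Counter(labels[i] for i in picked))
```

Because every example's weight is the reciprocal of its class size, the weights within each class sum to 1, so each class is drawn with roughly the same probability. Sampling with replacement means rare examples are repeated, which can encourage overfitting on very small classes.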