Classification datasets

| File | Description | Source link (with details) | Preprocessing applied | Label column |
| --- | --- | --- | --- | --- |
| generated.csv | Automatically generated dataset whose samples fall into very well-delineated categories. This can be considered a “best-case scenario” test case. | | | label |
| defaults.csv | Defaults on credit card payments | UCI | Minor (column name reformatting) | defaulted |
| winequality.csv | Quality ratings of Portuguese white wines | UCI | Added binarized label column `recommend` indicating quality >= 7 | recommend |
| vehicles.csv | Recognizing vehicle type from its silhouette | OpenML | None | Class |
| eeg.csv | EEG eye state measurements | OpenML | Dropped a few outlier rows | Class |
| kick_starter.csv | Kickstarter project state | Kaggle | Dropped unnamed columns; minor column name reformatting; calculated the duration of each project and dropped start and end dates; dropped some rows with wrong input type; dropped main category column and kept category column; randomly sampled 30% of the data; filled NA with 0 for numeric values | state |
| mushrooms.csv | Classifying mushroom edibility based on physical features | UCI | Renamed the column `class` to `edibility` for descriptiveness | edibility |
| Surgical-deepnet.csv | Surgical cases and associated complications | Kaggle | None | complication |
| gender_classification.csv | Guessing gender from hobbies | Kaggle | None | Gender |
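As an illustration of the preprocessing described for winequality.csv, the binarized label could be derived roughly as follows. This is a minimal sketch, not the script actually used: it assumes the raw rating is stored in a column named `quality` and that the label is encoded as 0/1, and it uses a small made-up DataFrame in place of the real file.

```python
import pandas as pd

# Toy stand-in for the raw wine data. The column name "quality" and the
# values below are assumptions for illustration only.
wines = pd.DataFrame({"quality": [5, 6, 7, 8, 4]})

# Binarize: 1 if the wine is rated 7 or higher, otherwise 0.
# The 0/1 encoding is an assumption; the file may use another encoding.
wines["recommend"] = (wines["quality"] >= 7).astype(int)

print(wines)
```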

These can all be loaded using Pandas:

import pandas as pd
dataset = pd.read_csv("file.csv")
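
Once a file is loaded, the label column listed above usually needs to be separated from the feature columns. The sketch below shows one way to do that. The `LABEL_COLUMNS` mapping and the `split_features_and_label` helper are hypothetical names introduced here, and the inline CSV is made-up data standing in for a real file.

```python
import io

import pandas as pd

# Label column for each file, copied from the listing above.
LABEL_COLUMNS = {
    "generated.csv": "label",
    "defaults.csv": "defaulted",
    "winequality.csv": "recommend",
    "vehicles.csv": "Class",
    "eeg.csv": "Class",
    "kick_starter.csv": "state",
    "mushrooms.csv": "edibility",
    "Surgical-deepnet.csv": "complication",
    "gender_classification.csv": "Gender",
}


def split_features_and_label(dataset, filename):
    """Return (features, labels) for a DataFrame loaded from `filename`."""
    label_column = LABEL_COLUMNS[filename]
    features = dataset.drop(columns=[label_column])
    labels = dataset[label_column]
    return features, labels


# Made-up rows standing in for winequality.csv; with the real file, use
# pd.read_csv("winequality.csv") instead of the in-memory buffer.
demo_csv = io.StringIO("alcohol,quality,recommend\n9.5,6,0\n12.1,7,1\n")
dataset = pd.read_csv(demo_csv)

features, labels = split_features_and_label(dataset, "winequality.csv")
print(features.columns.tolist())
print(labels.tolist())
```

For winequality.csv in particular, if the original rating column is still present in the file, it would likely need to be dropped from the features as well, since the label is derived directly from it.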