9  MNIST dataset

This is a well-known dataset for handwritten digit recognition. More information can be found on its website. Several versions of the dataset exist, and almost every machine learning library ships one among its built-in datasets. Here I provide the version from the Hugging Face Hub. The TensorFlow version is outdated but kept here for reference.

9.1 Huggingface Hub

The dataset and its description can be found on the Hugging Face Hub. You may use the following code to load the dataset. The installation guide for the datasets library can be found on its homepage.

9.1.1 Dataset mode

We may load the whole dataset directly in dataset mode.

from datasets import load_dataset

mnist_train = load_dataset("ylecun/mnist", split='train')
mnist_test = load_dataset("ylecun/mnist", split='test')

mnist_train
Dataset({
    features: ['image', 'label'],
    num_rows: 60000
})
mnist_test
Dataset({
    features: ['image', 'label'],
    num_rows: 10000
})

A single data point can be accessed by index, and its fields by name:

mnist_train[0]['image']

mnist_train[0]['label']
5
mnist_train['image'][:3]
[<PIL.PngImagePlugin.PngImageFile image mode=L size=28x28>,
 <PIL.PngImagePlugin.PngImageFile image mode=L size=28x28>,
 <PIL.PngImagePlugin.PngImageFile image mode=L size=28x28>]
mnist_train['label'][:3]
[5, 0, 4]

Note that you may either slice the dataset itself, or slice one of its fields (e.g. image or label).

Since we would like to work with matrices instead of PIL image objects, we can use map to convert each image into a NumPy array.

import numpy as np

def pil_to_array(data):
    # Convert the PIL image to an array and flatten it into one row of pixels.
    data['image'] = np.array(data['image']).reshape(1, -1)
    return data

# Apply the conversion to every example in both splits.
mnist_train_processed = mnist_train.map(pil_to_array)
mnist_test_processed = mnist_test.map(pil_to_array)

# Collect the images and labels into NumPy arrays.
X_train = np.array(mnist_train_processed['image'])
y_train = np.array(mnist_train_processed['label']).reshape(-1)
X_test = np.array(mnist_test_processed['image'])
y_test = np.array(mnist_test_processed['label']).reshape(-1)
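Since pil_to_array reshapes every image to a single row of shape (1, 784), the stacked array may carry an extra singleton axis. The sketch below uses random arrays as a stand-in for the real images (fake_images, stacked, X and X_scaled are illustrative names, not part of the original code) to show how one might check the shape and obtain an (n_samples, 784) matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for three 28x28 grayscale images with pixel values 0..255.
fake_images = rng.integers(0, 256, size=(3, 28, 28), dtype=np.uint8)

# Same per-image transformation as pil_to_array: one row of 784 pixels.
rows = [img.reshape(1, -1) for img in fake_images]

# Stacking rows of shape (1, 784) gives an array of shape (3, 1, 784).
stacked = np.array(rows)

# Collapse everything after the first axis: one row per image.
X = stacked.reshape(len(stacked), -1)

# Optional: rescale the pixel values to floats in [0, 1].
X_scaled = X.astype(np.float32) / 255.0
```

Printing X_train.shape after the conversion is a quick way to see whether this extra reshape is needed for the real data.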

9.1.2 Streaming mode

mnist_train_streaming = load_dataset("ylecun/mnist", split='train', streaming=True)
mnist_test_streaming = load_dataset("ylecun/mnist", split='test', streaming=True)

The loaded datasets contain images, which we may visualize directly. Note that because the dataset is loaded in streaming mode, it is an iterable that yields the examples one by one.

nextdata = next(iter(mnist_train_streaming))
pic = nextdata['image']
label = nextdata['label']
pic

label
5
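Because a streaming dataset hands out examples one at a time, a common pattern is to take only the first few of them. The sketch below illustrates this with a plain Python generator standing in for the stream; fake_stream and its contents are hypothetical, and the same itertools.islice call should work on any iterable.

```python
from itertools import islice

def fake_stream():
    """Hypothetical stand-in for a streaming dataset: yields one example at a time."""
    for label in [5, 0, 4, 1, 9]:
        yield {"image": None, "label": label}

# Take only the first three examples; the rest of the stream is never produced.
first_three = list(islice(fake_stream(), 3))
labels = [example["label"] for example in first_three]
```

Nothing beyond the requested examples is generated, which is the main reason to prefer streaming for data that does not fit in memory.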

A dataset that was loaded in dataset mode can also be converted to streaming mode.

mnist_train = load_dataset("ylecun/mnist", split='train')
iter_train = mnist_train.to_iterable_dataset()

nextdata = next(iter(iter_train))
pic = nextdata['image']
pic

Transformations can be applied to the streaming data as well.

import numpy as np

def pil_to_array(data):
    data['image'] = np.array(data['image']).reshape(1, -1)
    return data
    
mnist_train = load_dataset("ylecun/mnist", split='train')
mnist_test = load_dataset("ylecun/mnist", split='test')
iter_train = mnist_train.to_iterable_dataset().map(pil_to_array)
iter_test = mnist_test.to_iterable_dataset().map(pil_to_array)
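A mapped stream is typically consumed in small batches rather than all at once. The following is a minimal sketch of one way to group streamed examples into matrices. It assumes each example is a dict with an image of shape (1, 784) and an integer label; fake_mapped_stream is a hypothetical stand-in for iter_train, and batches is an illustrative helper, not a function of the datasets library.

```python
import numpy as np

def fake_mapped_stream(n):
    """Hypothetical stand-in for iter_train: yields dicts with a (1, 784) image."""
    for i in range(n):
        yield {"image": np.full((1, 784), i, dtype=np.uint8), "label": i % 10}

def batches(stream, batch_size):
    """Group streamed examples into (X, y) arrays with at most batch_size rows."""
    images, labels = [], []
    for example in stream:
        # np.asarray also covers the case where the image arrives as a nested list.
        images.append(np.asarray(example["image"]).reshape(-1))
        labels.append(example["label"])
        if len(images) == batch_size:
            yield np.stack(images), np.array(labels)
            images, labels = [], []
    if images:  # emit the final, possibly smaller, batch
        yield np.stack(images), np.array(labels)

shapes = [(X.shape, y.shape) for X, y in batches(fake_mapped_stream(5), batch_size=2)]
```

With five examples and a batch size of two, the helper yields two full batches followed by one batch holding the single remaining example.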

9.2 tensorflow version (possibly outdated)

TensorFlow/Keras provides the data with the original split. This version is not recommended: Keras has changed considerably in recent releases, so the following code might not work with newer versions. In addition, the TensorFlow library takes a long time to install.

from tensorflow import keras

# Returns NumPy arrays: images of shape (n, 28, 28) and labels of shape (n,).
(X_train, y_train), (X_test, y_test) = keras.datasets.mnist.load_data()