MNIST dataset
This is a famous dataset for handwritten digit recognition. More information can be found on its website. There are several versions of the dataset, and almost all machine learning libraries include it among their datasets. Here I provide the version from the Hugging Face Hub. The tensorflow version is outdated but kept here for reference.
The dataset and its description can be found on the Hugging Face Hub. You may use the following code to load the dataset. The installation guide for the datasets library can be found on its homepage.
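As a sketch, loading in streaming mode might look like the following. It assumes the `streaming=True` option of `load_dataset`; the helper name `load_mnist_streaming` is hypothetical and only used here for illustration.

```python
def load_mnist_streaming(split="train"):
    """Return the requested MNIST split as a streaming (iterable) dataset."""
    # Imported inside the function so that merely defining the helper
    # does not require the datasets library or network access.
    from datasets import load_dataset

    # streaming=True avoids downloading the whole split up front;
    # examples are fetched lazily while iterating.
    return load_dataset("ylecun/mnist", split=split, streaming=True)
```

Calling `next(iter(load_mnist_streaming()))` should then give the first example as a dictionary with `image` and `label` entries.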
We may also load it directly in dataset mode.
from datasets import load_dataset
mnist_train = load_dataset("ylecun/mnist", split='train')
mnist_test = load_dataset("ylecun/mnist", split='test')
mnist_train
Dataset({
features: ['image', 'label'],
num_rows: 60000
})
Data points can be accessed by indexing. For example, taking the first three images gives
[<PIL.PngImagePlugin.PngImageFile image mode=L size=28x28>,
<PIL.PngImagePlugin.PngImageFile image mode=L size=28x28>,
<PIL.PngImagePlugin.PngImageFile image mode=L size=28x28>]
Note that you may either slice the dataset, or slice its field (e.g. image or label).
Since we would like to work with matrices instead of image objects, we can use map to transform them.
import numpy as np

def pil_to_array(data):
    # Convert the PIL image to a NumPy array and flatten it into a single row.
    data['image'] = np.array(data['image']).reshape(1, -1)
    return data

# Apply the conversion to every example in both splits.
mnist_train_processed = mnist_train.map(pil_to_array)
mnist_test_processed = mnist_test.map(pil_to_array)

# Collect the features and labels as NumPy arrays.
X_train = np.array(mnist_train_processed['image'])
y_train = np.array(mnist_train_processed['label']).reshape(-1)
X_test = np.array(mnist_test_processed['image'])
y_test = np.array(mnist_test_processed['label']).reshape(-1)
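The effect of `reshape(1, -1)` on array shapes can be checked on synthetic data. The sketch below uses random 28x28 arrays as stand-ins for MNIST images (they are not real digits) and assumes the per-example rows are stacked as they are. It shows that stacking rows of shape (1, 784) keeps a singleton axis, and how one might drop it if a plain (n_samples, n_features) matrix is wanted.

```python
import numpy as np

# Synthetic stand-ins for three MNIST images: 28x28 grayscale, values 0..255.
rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(28, 28), dtype=np.uint8) for _ in range(3)]

# Same flattening as in pil_to_array: each image becomes a (1, 784) row.
rows = [img.reshape(1, -1) for img in images]

# Stacking the rows keeps the extra axis, giving shape (3, 1, 784).
X_stacked = np.array(rows)

# Collapse the singleton axis to get a (n_samples, n_features) matrix.
X = X_stacked.reshape(len(rows), -1)

print(X_stacked.shape, X.shape)
```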
The loaded dataset contains images, which we may visualize directly. Note that when the dataset is loaded in streaming mode, it behaves like a generator and yields images one by one.
If we first load the data in dataset mode, it can be converted to streaming mode.
mnist_train = load_dataset("ylecun/mnist", split='train')
iter_train = mnist_train.to_iterable_dataset()
nextdata = next(iter(iter_train))
pic = nextdata['image']
pic
We could also apply the transformations to the streaming data.
import numpy as np

def pil_to_array(data):
    # Convert the PIL image to a NumPy array and flatten it into a single row.
    data['image'] = np.array(data['image']).reshape(1, -1)
    return data

mnist_train = load_dataset("ylecun/mnist", split='train')
mnist_test = load_dataset("ylecun/mnist", split='test')

# Convert to streaming mode; the transformation is applied as examples are read.
iter_train = mnist_train.to_iterable_dataset().map(pil_to_array)
iter_test = mnist_test.to_iterable_dataset().map(pil_to_array)
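Since streaming data arrives one example at a time, it is often convenient to gather the first few examples into arrays. The following is a minimal sketch, assuming each example is a dictionary with `image` and `label` entries as above. The names `collect_examples` and `fake_stream` are hypothetical helpers, and the generator only imitates a streaming dataset with synthetic values.

```python
from itertools import islice

import numpy as np


def collect_examples(stream, n):
    """Take the first n examples from an iterable of dicts and stack them."""
    images, labels = [], []
    for example in islice(stream, n):
        # Flatten each image to a 1-D vector regardless of its incoming shape.
        images.append(np.asarray(example["image"]).reshape(-1))
        labels.append(example["label"])
    return np.stack(images), np.array(labels)


def fake_stream(num_examples):
    """Synthetic stand-in for a streaming dataset: yields examples lazily."""
    for i in range(num_examples):
        yield {"image": np.full((1, 784), i, dtype=np.uint8), "label": i % 10}


X_batch, y_batch = collect_examples(fake_stream(100), 5)
print(X_batch.shape, y_batch)
```

With the real data, `collect_examples(iter_train, 5)` would be the corresponding call.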
tensorflow version (possibly outdated)
tensorflow/keras provides the data with the original split. This version is not recommended: keras has changed a lot in recent updates, so the following code might not work with newer versions. In addition, installing the tensorflow library takes a long time.