Data Preparation

Learn, Test and Evaluation Data

Splitting into train and test sets

Your data is ready and you are eager to start training the classifier? Be careful: once your classifier is finished, you will need test data to evaluate it. If you evaluate the classifier with the same data that was used for learning, you may see surprisingly good results, because the classifier has already seen these samples. What we actually want to measure is how well it classifies unknown data.

For this purpose, we need to split our data into two parts:

  1. A training set with which the learning algorithm adapts or learns the model
  2. A test set to evaluate the generalization performance of the model

When you consider how machine learning normally works, the idea of a split between learning and test data makes sense. Real-world systems are trained on existing data, and when new data (from customers, sensors or other sources) comes in, the trained classifier has to predict or classify it. We can simulate this with a training and a test data set: the test data stands in for the "future data" that will enter the system in production.
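The effect of evaluating on already-seen data can be sketched as follows. This is only an illustration under assumptions: the 1-nearest-neighbour classifier (KNeighborsClassifier) is chosen here because it essentially memorises its training data, and the split uses scikit-learn's train_test_split function, which is introduced later in this chapter.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Hold back 30 of the 150 samples; the classifier never sees them during fit
train_data, test_data, train_labels, test_labels = train_test_split(
    iris.data, iris.target, test_size=30, random_state=0)

# Illustrative choice: a 1-nearest-neighbour classifier memorises its training data
classifier = KNeighborsClassifier(n_neighbors=1)
classifier.fit(train_data, train_labels)

# Scoring on the training data is overly optimistic ...
train_acc = classifier.score(train_data, train_labels)
# ... while the held-out data gives a fairer estimate
test_acc = classifier.score(test_data, test_labels)

print("Accuracy on the training data:", train_acc)
print("Accuracy on the held-out data:", test_acc)
```

The score on the training data should be perfect or nearly so, which says little about how the classifier behaves on new samples.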

In this chapter of our Python Machine Learning Tutorial, we will learn how to do the splitting with plain Python.

We will also see that doing it manually is not necessary, because the train_test_split function from scikit-learn's model_selection module can do it for us.

If the dataset is sorted by label, we will have to shuffle it before splitting.

So far we have talked about separating the dataset into a learn (a.k.a. training) dataset and a test dataset. Best practice is to split it into three parts: a learn, an evaluation (often called validation) and a test dataset.

We will train our model (classifier) step by step, and each time the result needs to be tested. If we only have a test dataset, the results of this testing can leak into the model, because we keep adjusting the model until it performs well on exactly this data. So we will use the evaluation dataset during the complete learning phase. When our classifier is finished, we will check it with the test dataset, which it has not "seen" before!
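A three-way split can be sketched by calling scikit-learn's train_test_split function twice. The sizes (90 / 30 / 30) and the variable names below are assumptions chosen for illustration, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Step 1: put the final test set aside (30 of the 150 samples)
rest_data, test_data, rest_labels, test_labels = train_test_split(
    iris.data, iris.target, test_size=30, random_state=42)

# Step 2: split the remaining 120 samples into a learn and an evaluation set
learn_data, eval_data, learn_labels, eval_labels = train_test_split(
    rest_data, rest_labels, test_size=30, random_state=42)

print(len(learn_data), len(eval_data), len(test_data))
```

The evaluation set is used while tuning the model; the test set is touched only once at the very end.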

Yet, in this tutorial, we will only use splits into learn and test datasets.

Splitting Example: Iris Data Set

We will demonstrate the previously discussed topics with the Iris Dataset.

The 150 samples of the Iris data set are sorted, i.e. the first 50 samples belong to the first flower class (0 = Setosa), the next 50 to the second flower class (1 = Versicolor) and the remaining 50 to the last class (2 = Virginica).

If we were to split our data in the ratio 2/3 (learning set) and 1/3 (test set) without shuffling, the learning set would contain all the flowers of the first two classes and the test set all the flowers of the third class. The classifier could only learn two classes, and the third class would be completely unknown to it. So we need to shuffle the data before splitting.
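The problem can be made visible with a few lines of code. This is a small sketch; the variable names are chosen for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
n_learn = 100  # two thirds of the 150 samples

# Split WITHOUT shuffling: simply cut the sorted data in two
learn_labels = iris.target[:n_learn]
test_labels = iris.target[n_learn:]

print("Classes in the learning set:", np.unique(learn_labels))  # [0 1]
print("Classes in the test set:", np.unique(test_labels))       # [2]
```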

Assuming all samples are independent of each other, we shuffle the data set randomly before we split it.

In the following we split the data manually:

import numpy as np
from sklearn.datasets import load_iris
iris = load_iris()

Looking at the labels of iris.target shows us that the data is sorted.

iris.target
Output:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

The first thing we have to do is rearrange the data so that it is no longer sorted. For this purpose, we will use the permutation function of NumPy's random submodule:

indices = np.random.permutation(len(iris.data))
indices
Output:
array([ 98,  56,  37,  60,  94, 142, 117, 121,  10,  15,  89,  85,  66,
        29,  44, 102,  24, 140,  58,  25,  19, 100,  83, 126,  28, 118,
        50, 127,  72,  99,  74,   0, 128,  11,  45, 143,  54,  79,  34,
        32,  95,  92,  46, 146,   3,   9,  73, 101,  23,  77,  39,  87,
       111, 129, 148,  67,  75, 147,  48,  76,  43,  30, 144,  27, 104,
        35,  93, 125,   2,  69,  63,  40, 141,   7, 133,  18,   4,  12,
       109,  33,  88,  71,  22, 110,  42,   8, 134,   5,  97, 114, 135,
       108,  91,  14,   6, 137, 124, 130, 145,  55,  17,  80,  36,  61,
        49,  62,  90,  84,  64, 139, 107, 112,   1,  70, 123,  38, 132,
        31,  16,  13,  21, 113, 120,  41, 106,  65,  20, 116,  86,  68,
        96,  78,  53,  47, 105, 136,  51,  57, 131, 149, 119,  26,  59,
       138, 122,  81, 103,  52, 115,  82])

# the last 12 shuffled indices form the test set, all others the learn set
n_test_samples = 12
learnset_data = iris.data[indices[:-n_test_samples]]
learnset_labels = iris.target[indices[:-n_test_samples]]
testset_data = iris.data[indices[-n_test_samples:]]
testset_labels = iris.target[indices[-n_test_samples:]]
# show the first four samples and labels of each set
print(learnset_data[:4], learnset_labels[:4])
print(testset_data[:4], testset_labels[:4])
[[5.1 2.5 3.  1.1]
 [6.3 3.3 4.7 1.6]
 [4.9 3.6 1.4 0.1]
 [5.  2.  3.5 1. ]] [1 1 0 1]
[[7.9 3.8 6.4 2. ]
 [5.9 3.  5.1 1.8]
 [6.  2.2 5.  1.5]
 [5.  3.4 1.6 0.4]] [2 2 2 0]
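
The manual steps above can be bundled into a small helper. This is a sketch, not part of the original code: the function name manual_split and its seed parameter are hypothetical, and a seeded NumPy random generator is used so that the shuffle can be repeated.

```python
import numpy as np
from sklearn.datasets import load_iris


def manual_split(data, labels, n_test_samples, seed=None):
    """Shuffle the sample indices and cut off the last n_test_samples as test set.

    Hypothetical helper for illustration; returns
    (learn_data, test_data, learn_labels, test_labels).
    """
    if not 0 < n_test_samples < len(data):
        raise ValueError("n_test_samples must be between 1 and len(data) - 1")
    rng = np.random.default_rng(seed)      # a fixed seed gives a repeatable shuffle
    indices = rng.permutation(len(data))   # random ordering of 0 .. len(data) - 1
    learn_idx = indices[:-n_test_samples]
    test_idx = indices[-n_test_samples:]
    return data[learn_idx], data[test_idx], labels[learn_idx], labels[test_idx]


iris = load_iris()
learn_data, test_data, learn_labels, test_labels = manual_split(
    iris.data, iris.target, n_test_samples=12, seed=42)
print(learn_data.shape, test_data.shape)
```

Calling the function again with the same seed yields the same split, which makes experiments easier to compare.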

Splits with Sklearn

Even though it was not difficult to split the data manually into a learn (training) set and a test set, we don't have to do the splitting by hand as shown above. Since this is often required in machine learning, scikit-learn has a predefined function for dividing data into training and test sets.

We will demonstrate this below. We will use 80% of the data as training and 20% as test data. We could just as well have taken 70% and 30%, because there are no hard and fast rules. The most important thing is that you evaluate your system fairly, based on data it did not see during training! In addition, there must be enough data in both data sets.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
data, labels = iris.data, iris.target

res = train_test_split(data, labels, 
                       train_size=0.8,
                       test_size=0.2,
                       random_state=42)
train_data, test_data, train_labels, test_labels = res    

n = 7   # number of samples to display
print(f"The first {n} data sets:")
print(test_data[:n])
print(f"The corresponding {n} labels:")
print(test_labels[:n])
The first 7 data sets:
[[6.1 2.8 4.7 1.2]
 [5.7 3.8 1.7 0.3]
 [7.7 2.6 6.9 2.3]
 [6.  2.9 4.5 1.5]
 [6.8 2.8 4.8 1.4]
 [5.4 3.4 1.5 0.4]
 [5.6 2.9 3.6 1.3]]
The corresponding 7 labels:
[1 0 2 1 1 0 1]
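
The role of the random_state parameter can be checked with a short sketch: two calls with the same seed should return identical splits. The variable names first, second and same are chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data, labels = load_iris(return_X_y=True)

# Two independent calls with the same seed
first = train_test_split(data, labels, test_size=0.2, random_state=42)
second = train_test_split(data, labels, test_size=0.2, random_state=42)

# Compare the four returned arrays pairwise
same = all(np.array_equal(a, b) for a, b in zip(first, second))
print("Identical splits with the same seed:", same)

train_data, test_data, train_labels, test_labels = first
print(train_data.shape, test_data.shape)
```

Without a fixed random_state, each call shuffles differently, so results vary from run to run.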

Stratified random sample

Especially with relatively small amounts of data, it is better to stratify the division. Stratification means that we keep the original class proportions of the data set in both the test and the training set. We calculate the class proportions of the previous split in percent using the following code. To calculate the number of occurrences of each class, we use the NumPy function bincount. It counts the number of occurrences of each value in the array of non-negative integers passed as an argument.

import numpy as np
print('All:', np.bincount(labels) / float(len(labels)) * 100.0)
print('Training:', np.bincount(train_labels) / float(len(train_labels)) * 100.0)
print('Test:', np.bincount(test_labels) / float(len(test_labels)) * 100.0)
All: [33.33333333 33.33333333 33.33333333]
Training: [33.33333333 34.16666667 32.5       ]
Test: [33.33333333 30.         36.66666667]
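
To see what bincount does in isolation, here is a tiny sketch with a made-up label array (example_labels is not part of the Iris data):

```python
import numpy as np

# Made-up labels: class 0 occurs once, class 1 twice, class 2 three times
example_labels = np.array([0, 1, 1, 2, 2, 2])

counts = np.bincount(example_labels)
print(counts)   # [1 2 3]

# Convert the counts into percentages of all samples
percentages = counts / len(example_labels) * 100
print(percentages)
```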

To stratify the division, we can pass the label array to the stratify parameter of the train_test_split function:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
data, labels = iris.data, iris.target

res = train_test_split(data, labels, 
                       train_size=0.8,
                       test_size=0.2,
                       random_state=42,
                       stratify=labels)
train_data, test_data, train_labels, test_labels = res 

print('All:', np.bincount(labels) / float(len(labels)) * 100.0)
print('Training:', np.bincount(train_labels) / float(len(train_labels)) * 100.0)
print('Test:', np.bincount(test_labels) / float(len(test_labels)) * 100.0)
All: [33.33333333 33.33333333 33.33333333]
Training: [33.33333333 33.33333333 33.33333333]
Test: [33.33333333 33.33333333 33.33333333]

This was a poor example for testing the stratified random sample, because the classes of the Iris data set are all the same size, i.e. each class has 50 elements.
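
The effect is easier to see with unevenly sized classes. The following sketch uses synthetic, made-up labels (75 samples of class 0 and 25 of class 1); the names toy_labels and toy_data are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 75 samples of class 0 and 25 samples of class 1
toy_labels = np.array([0] * 75 + [1] * 25)
toy_data = np.arange(len(toy_labels)).reshape(-1, 1)  # dummy feature column

# Plain random split: the class counts in the test set depend on chance
_, _, _, plain_test = train_test_split(
    toy_data, toy_labels, test_size=20, random_state=42)

# Stratified split: the 75:25 ratio is preserved in the test set
_, _, _, strat_test = train_test_split(
    toy_data, toy_labels, test_size=20, random_state=42, stratify=toy_labels)

print("Without stratify:", np.bincount(plain_test, minlength=2))
print("With stratify:   ", np.bincount(strat_test, minlength=2))  # [15  5]
```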

We will now work with the file strange_flowers.txt in the directory data. This data set is created in the chapter Generate Datasets in Python. The classes in this dataset have different numbers of items. First we load the data:

content = np.loadtxt("data/strange_flowers.txt", delimiter=" ")
data = content[:, :-1]    # cut off the target column
labels = content[:, -1]
labels.dtype
labels.shape
Output:
(795,)
res = train_test_split(data, labels, 
                       train_size=0.8,
                       test_size=0.2,
                       random_state=42,
                       stratify=labels)
train_data, test_data, train_labels, test_labels = res 

# np.bincount expects non-negative integers, so we cast the float labels to int:
print('All:', np.bincount(labels.astype(int))  / float(len(labels)) * 100.0)
print('Training:', np.bincount(train_labels.astype(int)) / float(len(train_labels)) * 100.0)
print('Test:', np.bincount(test_labels.astype(int)) / float(len(test_labels)) * 100.0)
All: [ 0.         23.89937107 25.78616352 28.93081761 21.3836478 ]
Training: [ 0.         23.89937107 25.78616352 28.93081761 21.3836478 ]
Test: [ 0.         23.89937107 25.78616352 28.93081761 21.3836478 ]
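
The leading 0. in each output row appears because the labels start at 1, so bincount reports zero occurrences of class 0. As an alternative sketch, np.unique with return_counts=True counts only the labels that actually occur and also works with float labels. The helper name class_percentages and the demo labels are hypothetical.

```python
import numpy as np


def class_percentages(labels):
    """Return a dict mapping each occurring label to its share in percent.

    Hypothetical helper for illustration; works for integer and float labels.
    """
    classes, counts = np.unique(labels, return_counts=True)
    shares = counts / counts.sum() * 100
    return dict(zip(classes.tolist(), shares.tolist()))


# Made-up float labels starting at 1, similar in kind to the loaded file
demo_labels = np.array([1.0, 1.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0])
result = class_percentages(demo_labels)
print(result)   # {1.0: 25.0, 2.0: 12.5, 3.0: 37.5, 4.0: 25.0}
```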