Train and Test Sets

Splitting into train and test sets

You have your data ready and you are eager to start training the classifier? But be careful: when your classifier is finished, you will need some test data to evaluate it. If you evaluate the classifier on the same data it was trained on, you may see surprisingly good results. What we actually want to measure is its performance on unseen data.

For this purpose, we need to split our data into two parts:

  1. A training set with which the learning algorithm adapts or learns the model
  2. A test set to evaluate the generalization performance of the model

In this chapter, we will learn how to do this with plain Python.

We will also see that doing it manually is not necessary, because the train_test_split function from sklearn's model_selection module can do it for us.

If the dataset is sorted by label, we will have to shuffle it before splitting.
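To see why shuffling matters, here is a quick sketch: if we naively took the last samples of the sorted Iris dataset as a test set, it would contain only one class.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# iris.target is sorted: 50 samples of class 0, then 50 of class 1,
# then 50 of class 2. Taking the tail as a test set is therefore useless:
naive_test_labels = iris.target[-12:]
print(np.unique(naive_test_labels))  # only class 2 appears
```

A classifier evaluated on such a test set would be judged on a single class only, and the training set would be missing part of that class.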

We separated the dataset into a learn (a.k.a. training) dataset and a test dataset. Best practice, however, is to split it into three: a training, an evaluation (validation) and a test dataset.

We will train our model (classifier) step by step, and each intermediate result needs to be tested. If we only had a test dataset, information about the test data could leak into the model through this repeated testing. So we use an evaluation dataset throughout the learning phase, and only when our classifier is finished do we check it with the test dataset, which it has not "seen" before!

Yet, during our tutorial, we will only use splits into learn and test datasets.
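For completeness, a three-way split can be sketched by calling train_test_split twice. The 60/20/20 ratios and the random_state value below are our own choices, not a prescription:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
data, labels = iris.data, iris.target

# first split off the final test set (20 % of all samples):
train_val_data, test_data, train_val_labels, test_labels = train_test_split(
    data, labels, test_size=0.2, random_state=42)

# then carve an evaluation (validation) set out of the remaining 80 %;
# 0.25 of 80 % is 20 % of the full dataset:
train_data, val_data, train_labels, val_labels = train_test_split(
    train_val_data, train_val_labels, test_size=0.25, random_state=42)

print(len(train_data), len(val_data), len(test_data))  # 90 30 30
```

The test set is split off first so that it is never touched during training or model selection.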

First, we split the Iris dataset manually.

import numpy as np
from sklearn.datasets import load_iris
iris = load_iris()

Looking at the labels of iris.target shows us that the data is sorted.

iris.target
Output::
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

The first thing we have to do is rearrange the data so that it is not sorted anymore.

# no random seed is set, so the permutation (and the output below)
# will differ on every run
indices = np.random.permutation(len(iris.data))
indices
Output::
array([115, 114,  52, 137,  51, 112, 110,  96,  59,  66, 105, 138,   5,
       146,   6,  78,  46,  53,  14,  91,  17,  47, 123, 148,  58, 133,
         8,  41,  88, 118,  48,  42,  65,   4,  55,  38,  68,  60, 147,
       128,  70,  25,  27, 119, 120,  10,  98,  18,  40,  79,  99, 121,
         2,  71,  89,  81, 143,  44,  97, 124, 106,  36,  61, 103,  30,
        64,   9,  24, 102,   3,  35,  54,  94,  77,  20,  23,  80,  34,
        95, 126,  50, 139,  49, 125,   0,   7,  76, 127,  67,  37, 132,
       113,  29, 134,  28, 129,  56,  19, 144,  83, 122,  45,  85, 135,
        16, 130,  11,  72,  33,  57,  13, 109,  93,  26,  74, 107,  15,
       116,  21,  86, 111,  82, 141, 131, 140, 101,  32, 145,  75,  69,
        87,  62,  84, 117,  39, 149, 104,  63, 142,   1,  92, 108,  90,
        73,  43, 100, 136,  31,  22,  12])
n_test_samples = 12
learnset_data = iris.data[indices[:-n_test_samples]]
learnset_labels = iris.target[indices[:-n_test_samples]]
testset_data = iris.data[indices[-n_test_samples:]]
testset_labels = iris.target[indices[-n_test_samples:]]
print(learnset_data[:4], learnset_labels[:4])
print(testset_data[:4], testset_labels[:4])
[[6.4 3.2 5.3 2.3]
 [5.8 2.8 5.1 2.4]
 [6.9 3.1 4.9 1.5]
 [6.4 3.1 5.5 1.8]] [2 2 1 2]
[[5.8 2.7 5.1 1.9]
 [4.9 3.  1.4 0.2]
 [5.8 2.6 4.  1.2]
 [6.7 2.5 5.8 1.8]] [2 0 1 2]
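After such a random split it is worth checking how the classes are distributed over the two sets. The following sketch repeats the manual split with a seeded generator (the seed value is our choice, for reproducibility) and counts the labels with np.bincount:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

rng = np.random.default_rng(0)              # seeded so results are reproducible
indices = rng.permutation(len(iris.data))

n_test_samples = 12
learnset_labels = iris.target[indices[:-n_test_samples]]
testset_labels = iris.target[indices[-n_test_samples:]]

# np.bincount counts how often each class label (0, 1, 2) occurs:
print(np.bincount(learnset_labels))
print(np.bincount(testset_labels))
```

With only 12 test samples, a purely random split can easily over- or under-represent a class in the test set.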

It was not difficult to split the data manually into a learn (train) and an evaluation (test) set. Yet, it isn't necessary, because sklearn provides us with a function to do it.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
data, labels = iris.data, iris.target

res = train_test_split(data, labels, 
                       train_size=0.8,
                       test_size=0.2,
                       random_state=42)
train_data, test_data, train_labels, test_labels = res    

print("Test data and labels:")
print(test_data[:5])
print(test_labels[:5])
Test data and labels:
[[6.1 2.8 4.7 1.2]
 [5.7 3.8 1.7 0.3]
 [7.7 2.6 6.9 2.3]
 [6.  2.9 4.5 1.5]
 [6.8 2.8 4.8 1.4]]
[1 0 2 1 1]
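train_test_split shuffles the data by default, but with small datasets the class proportions in the two sets can still drift apart. The stratify parameter fixes this by keeping the class proportions identical in both splits:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
data, labels = iris.data, iris.target

# stratify=labels preserves the 50/50/50 class balance in both splits:
train_data, test_data, train_labels, test_labels = train_test_split(
    data, labels, test_size=0.2, random_state=42, stratify=labels)

print(np.bincount(train_labels))  # [40 40 40]
print(np.bincount(test_labels))   # [10 10 10]
```

Since Iris has exactly 50 samples per class, a stratified 80/20 split always yields 40 training and 10 test samples per class.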