Evaluating the model
To evaluate an algorithm, it's necessary to judge its performance on data that was not used to train the model. For this reason, it's common to split the data into a training set and a test set. The training set is used to train the model, which means that it's used to find the parameters of our algorithm. For example, training a decision tree will determine the values and variables that create the splits of the branches of the tree. The test set must remain completely hidden from the training. This means that all operations, such as feature engineering or feature scaling, must be fitted on the training set only and then applied to the test set, as in the following example.
Usually, the training set will be 70-80% of the dataset, while the test set will be the rest:
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn import datasets

# import some data
iris = datasets.load_iris()

# hold out 30% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# fit the scaler on the training set only, then apply it to both sets
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_transformed = scaler.transform(X_train)
X_test_transformed = scaler.transform(X_test)

# train on the transformed training set and predict on the test set
clf = LinearRegression().fit(X_train_transformed, y_train)
predictions = clf.predict(X_test_transformed)
print('Predictions: ', predictions)
The most common way to evaluate a supervised learning algorithm offline is cross-validation. This technique consists of dividing the dataset into training and test parts multiple times, each time using one part for training and the other for testing. This allows us not only to check for overfitting, but also to evaluate the variance of our loss.
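For example, a minimal sketch of k-fold cross-validation on the iris data loaded above, using scikit-learn's cross_val_score (the choice of five folds here is arbitrary), could look like this:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn import datasets

iris = datasets.load_iris()
model = LinearRegression()

# the model is refitted on each training fold and scored on the held-out fold
scores = cross_val_score(model, iris.data, iris.target, cv=5)
print('Score per fold: ', scores)
print('Mean: ', scores.mean(), ' Variance: ', scores.var())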
For problems where it's not possible to randomly divide the data, such as a time series, scikit-learn has other splitting methods, such as the TimeSeriesSplit class.
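As a quick illustration on synthetic data (a sketch, not part of the original example), TimeSeriesSplit always places the test indices after the training indices, so no future information leaks into training:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# ten observations of a two-feature time series
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

tscv = TimeSeriesSplit(n_splits=3)
for train_index, test_index in tscv.split(X):
    print('Train:', train_index, 'Test:', test_index)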
In Keras, it's possible to specify a simple train/validation split directly during fit:
hist = model.fit(x, y, validation_split=0.2)
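Here, x, y, and model are assumed to already exist; a minimal sketch of what they could look like, using randomly generated data and a small, fully connected network (purely illustrative), is the following. Note that validation_split holds out the last fraction of the data, before any shuffling:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

# hypothetical data: 1,000 samples with 4 features and 3 classes
x = np.random.random((1000, 4))
y = to_categorical(np.random.randint(3, size=(1000,)), num_classes=3)

# a small network, just to make the call concrete
model = Sequential()
model.add(Dense(16, activation='relu', input_shape=(4,)))
model.add(Dense(3, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])

# the last 20% of x and y is used as the validation set
hist = model.fit(x, y, epochs=10, batch_size=32, validation_split=0.2)
print(hist.history['val_loss'])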
If the data does not fit in memory, it's also possible to use train_on_batch and test_on_batch.
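These methods let us write our own training loop and feed the model one batch at a time. The following rough sketch uses a hypothetical batch_generator function as a stand-in for code that reads batches from disk, together with the same kind of small model as above:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([Dense(16, activation='relu', input_shape=(4,)),
                    Dense(3, activation='softmax')])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])

def batch_generator(n_batches, batch_size=32):
    # stand-in for code that loads one batch at a time from disk
    for _ in range(n_batches):
        x_batch = np.random.random((batch_size, 4))
        y_batch = np.eye(3)[np.random.randint(3, size=batch_size)]
        yield x_batch, y_batch

# train one batch at a time, so the full dataset never has to fit in memory
for x_batch, y_batch in batch_generator(100):
    loss, acc = model.train_on_batch(x_batch, y_batch)

# evaluate in the same way on held-out batches
for x_batch, y_batch in batch_generator(10):
    test_loss, test_acc = model.test_on_batch(x_batch, y_batch)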
For image data, Keras can also use the folder structure to create the training and validation sets and infer the labels. To accomplish this, we use the flow_from_directory method of the ImageDataGenerator class, which will load the data with the labels and the train/validation split as specified by the directory layout. We will need the following directory structure:
data/
    train/
        category1/
            001.jpg
            002.jpg
            ...
        category2/
            003.jpg
            004.jpg
            ...
    validation/
        category1/
            0011.jpg
            0022.jpg
            ...
        category2/
            0033.jpg
            0044.jpg
            ...
Use the following method:
flow_from_directory(directory, target_size=(96, 96), color_mode='rgb', classes=None, class_mode='categorical', batch_size=128, shuffle=True, seed=11, save_to_dir=None, save_prefix='output', save_format='jpg', follow_links=False, subset=None, interpolation='nearest')
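Note that flow_from_directory is called on an ImageDataGenerator instance. A minimal sketch for the directory layout above could look as follows; the small convolutional model is only illustrative and not part of the original example:

from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense
from keras.preprocessing.image import ImageDataGenerator

# a tiny convolutional model for the two categories, just for illustration
model = Sequential([
    Conv2D(16, (3, 3), activation='relu', input_shape=(96, 96, 3)),
    Flatten(),
    Dense(2, activation='softmax')])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])

# generators that read the images and labels directly from the folders
train_datagen = ImageDataGenerator(rescale=1./255)
validation_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
    'data/train', target_size=(96, 96), batch_size=128,
    class_mode='categorical')
validation_generator = validation_datagen.flow_from_directory(
    'data/validation', target_size=(96, 96), batch_size=128,
    class_mode='categorical')

# the generators can then be passed straight to fit_generator
model.fit_generator(train_generator, steps_per_epoch=100, epochs=10,
                    validation_data=validation_generator,
                    validation_steps=20)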