FFNN in Python from scratch
To build our network, we will write a class similar to the one we created in the previous chapter for the perceptron. Contrary to what object-oriented programming (OOP) would dictate, we will not build on the perceptron class we previously created, as it's more convenient to work with matrices of weights.
Our goal is to show, through code, how to implement the theory we just explained; therefore, our solution will be quite specific to our use case. We know that our network will have three layers, that the input size will be 2, and that the number of neurons in the hidden layer is fixed in advance:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, roc_auc_score, confusion_matrix

class FFNN(object):

    def __init__(self, input_size=2, hidden_size=2, output_size=1):
        # Adding 1 as it will be our bias
        self.input_size = input_size + 1
        self.hidden_size = hidden_size + 1
        self.output_size = output_size
        self.o_error = 0
        self.o_delta = 0
        self.z1 = 0
        self.z2 = 0
        self.z3 = 0
        self.z2_error = 0
        # The whole weight matrix, from the inputs till the hidden layer
        self.w1 = np.random.randn(self.input_size, self.hidden_size)
        # The final set of weights, from the hidden layer till the output layer
        self.w2 = np.random.randn(self.hidden_size, self.output_size)
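Because of the extra bias unit, it's worth checking the weight shapes once: with the default sizes, w1 should come out as 3x3 and w2 as 3x1. A quick check such as the following (the variable name network is just for illustration) confirms it:
# Quick sanity check of the weight shapes with the default sizes
network = FFNN()
print(network.w1.shape)  # (3, 3): 2 inputs + bias -> 2 hidden neurons + bias column
print(network.w2.shape)  # (3, 1): 2 hidden neurons + bias -> 1 output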
As we decided to use sigmoid as the activation function, we can add it as an external function. Also, we know we need to compute its derivative, as we are using SGD; therefore, we will implement that as another standalone function. Using the preceding formulas, the implementation is pretty straightforward:
def sigmoid(s):
    # Activation function
    return 1 / (1 + np.exp(-s))

def sigmoid_prime(s):
    # Derivative of the sigmoid
    return sigmoid(s) * (1 - sigmoid(s))
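As a quick check that sigmoid_prime matches the analytical derivative, we can compare it against a central-difference approximation; the grid of points and the step h below are arbitrary choices for illustration:
# Compare the analytical derivative with a numerical (central-difference) one
s = np.linspace(-5, 5, 11)
h = 1e-5
numerical = (sigmoid(s + h) - sigmoid(s - h)) / (2 * h)
print(np.allclose(sigmoid_prime(s), numerical, atol=1e-6))  # expected: True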
We will then have one function to calculate the forward pass, and one for the backward pass. We will calculate the output using the dot product between the input and the weights, passing everything through the sigmoid:
    def forward(self, X):
        # Forward propagation through our network
        X['bias'] = 1  # Adding 1 to the inputs to include the bias in the weights
        self.z1 = np.dot(X, self.w1)  # dot product of X (input) and the first set of 3x3 weights
        self.z2 = sigmoid(self.z1)  # activation function
        self.z3 = np.dot(self.z2, self.w2)  # dot product of the hidden layer (z2) and the second set of 3x1 weights
        o = sigmoid(self.z3)  # final activation function
        return o
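To see the shapes in action, we can run a forward pass on a couple of hand-made points; the column names x1 and x2 match the dataset used later in this section, while the values and the variable names are arbitrary:
# Forward pass on two made-up points, just to inspect the output shape
toy_X = pd.DataFrame({'x1': [0.1, 0.9], 'x2': [0.2, 0.8]})
toy_net = FFNN()
toy_out = toy_net.forward(toy_X)
print(toy_out.shape)  # (2, 1): one sigmoid output per input row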
The forward propagation is also what we will use for predictions, but we will create an alias, as it's most common to use the name predict for this task:
    def predict(self, X):
        return self.forward(X)
The core idea of backpropagation is to propagate the error backward through the network so that we can adjust the weights and reduce that error. We implement this in the backward method. We start from the output and calculate the error between our prediction and the actual output; this is used to compute the delta that updates the weights. In each layer, we take the output of the neurons, pass it through the derivative of the sigmoid, and multiply it by the error and by the step, also known as the learning rate:
    def backward(self, X, y, output, step):
        # Backward propagation of the errors
        X['bias'] = 1  # Adding 1 to the inputs to include the bias in the weights
        self.o_error = y - output  # error in the output
        self.o_delta = self.o_error * sigmoid_prime(output) * step  # applying the derivative of the sigmoid to the error
        self.z2_error = self.o_delta.dot(self.w2.T)  # z2 error: how much our hidden layer weights contributed to the output error
        self.z2_delta = self.z2_error * sigmoid_prime(self.z2) * step  # applying the derivative of the sigmoid to the z2 error
        self.w1 += X.T.dot(self.z2_delta)  # adjusting the first set of weights
        self.w2 += self.z2.T.dot(self.o_delta)  # adjusting the second set of weights
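To get a feel for what a single update does, we can run one forward/backward cycle on a toy point and print the squared error before and after; the toy values and the step size are arbitrary, and with such a small step the error typically shrinks:
# One forward/backward cycle on a single made-up point
toy_X = pd.DataFrame({'x1': [0.5], 'x2': [-0.5]})
toy_y = np.array([[1.0]])
net = FFNN()
out_before = net.forward(toy_X)
net.backward(toy_X, toy_y, out_before, step=0.1)
out_after = net.forward(toy_X)
print((toy_y - out_before) ** 2, (toy_y - out_after) ** 2)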
When training the model, at each epoch we will do two passes, one forward and one backward. Therefore, our fit method will be as follows:
    def fit(self, X, y, epochs=10, step=0.05):
        for epoch in range(epochs):
            X['bias'] = 1  # Adding 1 to the inputs to include the bias in the weights
            output = self.forward(X)
            self.backward(X, y, output, step)
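If you want to keep an eye on convergence, a small stand-alone variation of this loop that prints the mean squared error every few epochs can help; fit_verbose and log_every below are hypothetical names used only for this sketch, not part of the class:
# A minimal monitoring loop around the same forward/backward steps
def fit_verbose(network, X, y, epochs=10, step=0.05, log_every=1000):
    X['bias'] = 1  # Adding 1 to the inputs to include the bias in the weights
    for epoch in range(epochs):
        output = network.forward(X)
        network.backward(X, y, output, step)
        if epoch % log_every == 0:
            print(epoch, np.mean((y - output) ** 2))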
Now, our NN is ready, and it can be used for our task. We will need a training and a testing set again:
# Splitting the dataset into a training and a test set
msk = np.random.rand(len(data)) < 0.8
# Roughly 80% of the data will go into the training set
train_x, train_y = data[['x1', 'x2']][msk], data[['type']][msk].values
# Everything else will go into the test set
test_x, test_y = data[['x1', 'x2']][~msk], data[['type']][~msk].values
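The data DataFrame with the x1, x2, and type columns comes from the previous section; if you want to run this section in isolation, a synthetic two-cluster stand-in such as the following (with arbitrary cluster centres and sizes) can be created before the split above:
# Stand-in dataset: two Gaussian clusters labelled 0 and 1 (arbitrary choices)
n = 200
cluster_0 = np.random.randn(n, 2) * 0.5 + [0, 0]
cluster_1 = np.random.randn(n, 2) * 0.5 + [3, 3]
data = pd.DataFrame(np.vstack([cluster_0, cluster_1]), columns=['x1', 'x2'])
data['type'] = [0] * n + [1] * n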
We can now train the network, as follows:
my_network = FFNN()
my_network.fit(train_x, train_y, epochs=10000, step=0.001)
We'll verify the performance of our algorithm, as follows:
pred_y = test_x.apply(my_network.forward, axis=1)
# Reshaping the data
test_y_ = [i[0] for i in test_y]
pred_y_ = [i[0] for i in pred_y]
print('MSE: ', mean_squared_error(test_y_, pred_y_))
print('AUC: ', roc_auc_score(test_y_, pred_y_))
After 10,000 epochs, the MSE is less than 0.01, which is a pretty good result. We also measured the performance using the ROC Area Under the Curve (AUC), which measures how well we ranked our predictions. With an AUC of over 0.99, we can be confident that, even if there are a few mistakes, the model is still working very well.
It's also possible to verify the performance using a confusion matrix. In this case, we have to fix a threshold to discriminate between predicting one label or the other. As the predictions are separated by a large gap, a threshold of 0.5 seems appropriate:
threshold = 0.5
pred_y_binary = [1 if i > threshold else 0 for i in pred_y_]
cm = confusion_matrix(test_y_, pred_y_binary, labels=[0, 1])
print(pd.DataFrame(cm,
                   index=['True 0', 'True 1'],
                   columns=['Predicted 0', 'Predicted 1']))
We will obtain a good result, which we can check with the following confusion matrix:

Visualizing the clusters, it's clear where the errors are, as shown in the following diagram:
