How to build a three-layer neural network from scratch

In this post, I'll walk through the steps required to build a three-layer neural network. I'll work through a problem and explain the process, along with the most important concepts, as we go.

The problem to solve

A farmer in Italy had a problem with his labelling machine: it mixed up the labels of three wine cultivars. Now he is left with 178 bottles, and nobody knows which cultivar produced them! To help this poor man, we will build a classifier that recognizes the wine based on 13 attributes of the wine.

The fact that our data is labelled (with one of the three cultivar labels) makes this a supervised learning problem. Essentially, what we want to do is use our input data (the 178 unclassified wine bottles), put it through our neural network, and then get the right cultivar label for each wine as output.

We will train our algorithm to get better and better at predicting (y-hat) which bottle belongs to which label.

Now it is time to start building the neural network!

Approach

Building a neural network is almost like building a very complicated function, or putting together a very difficult recipe. In the beginning, the ingredients or steps you will have to take can seem overwhelming. But if you break everything down and do it step by step, you will be fine.

In short:

  • The input layer (x) consists of 13 neurons, one for each of the 13 wine attributes (the 178 bottles are our training samples).
  • A1, the first layer, consists of 8 neurons.
  • A2, the second layer, consists of 5 neurons.
  • A3, the third and output layer, consists of 3 neurons.

Step 1: the usual prep

Import all the libraries you need (NumPy, scikit-learn, pandas) and the dataset, and define x and y.

# Package imports
import numpy as np
import pandas as pd

# Matplotlib
import matplotlib
import matplotlib.pyplot as plt

# SciKitLearn is a machine learning utilities library
import sklearn
# The sklearn dataset module helps generating datasets
import sklearn.datasets
import sklearn.linear_model
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score

# Importing the dataset
df = pd.read_csv('../input/W1data.csv')
df.head()
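The snippet above loads the data but does not show how x and y are defined. A minimal sketch, assuming the first 13 columns of W1data.csv are the wine attributes and the last three columns are the already one-hot encoded cultivar labels (the actual column layout of the file may differ), could look like this:

# Split the dataframe into features and one-hot labels (column layout is an assumption)
X = df.iloc[:, :13].values   # the 13 wine attributes
y = df.iloc[:, -3:].values   # one column per cultivar, one-hot encoded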

Step 2: initialization

Before we can use our weights, we have to initialize them. Because we don't have values to use for the weights yet, we use random values between 0 and 1.

In Python, the random.seed function generates "random numbers." However, random numbers are not truly random. The numbers generated are pseudorandom, meaning they are produced by a complicated formula that makes them look random. In order to generate numbers, the formula takes the previously generated value as its input. If no previous value was generated, it often takes the time as a first value.

That's why we seed the generator: to make sure that we always get the same random numbers. We provide a fixed value that the number generator can start with, which is zero in this case.

np.random.seed(0)
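The actual initialization of the weight and bias matrices is not shown here. A minimal sketch, assuming the initialise_parameters(nn_input_dim, nn_hdim, nn_output_dim) signature used in the training step later in this post (where both hidden layers share nn_hdim neurons; the exact layer sizes and scaling in the original notebook may differ), could look like this:

# A possible initialisation: random weights between 0 and 1, biases set to 0
def initialise_parameters(nn_input_dim, nn_hdim, nn_output_dim):
    W1 = np.random.rand(nn_input_dim, nn_hdim)
    b1 = np.zeros((1, nn_hdim))
    W2 = np.random.rand(nn_hdim, nn_hdim)
    b2 = np.zeros((1, nn_hdim))
    W3 = np.random.rand(nn_hdim, nn_output_dim)
    b3 = np.zeros((1, nn_output_dim))
    # Package everything into one dictionary: this is our "model"
    return {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2, 'W3': W3, 'b3': b3}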

Step 3: forward propagation

There are roughly two parts to training a neural network. First, you propagate forward through the NN. That is, you "take steps" forward and compare those results with the real values to get the difference between your output and what it should be. You basically see how the NN is doing and find the errors.

After we have initialized the weights with a pseudorandom number, we take a linear step forward. We calculate this by taking the dot product of our input A0 with the randomly initialized weights, and adding a bias. We start with a bias of 0. This is represented as:

z1 = A0 · W1 + b1

Now we take our z1 (our linear step) and pass it through our first activation function. Activation functions are very important in neural networks. Essentially, they convert an input signal into an output signal, which is why they are also known as transfer functions. They introduce non-linear properties to our functions by converting the linear input into a non-linear output, making it possible to represent more complex functions.

There are different kinds of activation functions (explained in depth in this article). For this model, we chose to use the tanh activation function for our two hidden layers, A1 and A2, which gives us an output value between -1 and 1.

Since this is a multi-class classification problem (we have 3 output labels), we will use the softmax function for the output layer, A3, because it computes the probabilities for the classes by spitting out a value between 0 and 1.

By passing z1 through the activation function, we have created our first hidden layer, A1, which can be used as input for the computation of the next linear step, z2.
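The forward_prop function below calls a softmax helper that is not defined in the snippet. A minimal sketch of it (a row-wise softmax; the max-shift is only there for numerical stability and does not change the result) could look like this:

# A possible softmax helper used by forward_prop below
def softmax(z):
    exp_scores = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)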

In Python, this process looks like this:

# This is the forward propagation function
def forward_prop(model, a0):
    # Load parameters from model
    W1, b1, W2, b2, W3, b3 = model['W1'], model['b1'], model['W2'], model['b2'], model['W3'], model['b3']
    # Do the first linear step
    z1 = a0.dot(W1) + b1
    # Put it through the first activation function
    a1 = np.tanh(z1)
    # Second linear step
    z2 = a1.dot(W2) + b2
    # Put it through the second activation function
    a2 = np.tanh(z2)
    # Third linear step
    z3 = a2.dot(W3) + b3
    # For the third activation we use the softmax function
    a3 = softmax(z3)
    # Store all results in the cache
    cache = {'a0': a0, 'z1': z1, 'a1': a1, 'z2': z2, 'a2': a2, 'z3': z3, 'a3': a3}
    return cache

In the end, all of our values are stored in the cache.

Step 4: backward propagation

After we forward propagate through our NN, we backward propagate our error gradient to update our weight parameters. We know our error, and want to minimize it as much as possible.

We do this by taking the derivative of the error function with respect to the weights (W) of our NN, using gradient descent.

Let's visualize this process with an analogy.

Imagine you went out for a walk in the mountains during the afternoon. But now it's an hour later and you are a bit hungry, so it's time to go home. The only problem is that it is dark and there are many trees, so you can't see either your home or where you are. Oh, and you forgot your phone at home.

But then you remember your house is in a valley, the lowest point in the whole area. So if you just walk down the mountain step by step until you don’t feel any slope, in theory you should arrive at your home.

So there you go, step by step, carefully going down. Now think of the mountain as the loss function, and of yourself as the algorithm, trying to find your home (i.e. the lowest point). Every time you take a step downwards, you update your location coordinates (the algorithm updates the parameters).

The loss function is represented by the mountain. To get to a low loss, the algorithm follows the slope — that is the derivative — of the loss function.

When we walk down the mountain, we are updating our location coordinates. The algorithm updates the parameters of the neural network. By getting closer to the minimum point, we are approaching our goal of minimizing our error.

In reality, gradient descent looks more like this:

We always start by calculating the slope of the loss function with respect to z, the output of the linear step we take.

Notation is as follows: dv is the derivative of the loss function with respect to a variable v.

Next we calculate the slope of the loss function with respect to our weights and biases. Because this is a 3-layer NN, we iterate this process for z3, z2, z1, together with W3, W2, W1 and b3, b2, b1, propagating backwards from the output layer to the input layer.

This is how this process looks in Python:

# This is the backward propagation function
def backward_prop(model, cache, y):
    # Load parameters from model
    W1, b1, W2, b2, W3, b3 = model['W1'], model['b1'], model['W2'], model['b2'], model['W3'], model['b3']
    # Load forward propagation results
    a0, a1, a2, a3 = cache['a0'], cache['a1'], cache['a2'], cache['a3']
    # Get number of samples
    m = y.shape[0]
    # Calculate loss derivative with respect to the output
    dz3 = loss_derivative(y=y, y_hat=a3)
    # Calculate loss derivative with respect to the third layer weights
    dW3 = 1/m * (a2.T).dot(dz3)
    # Calculate loss derivative with respect to the third layer bias
    db3 = 1/m * np.sum(dz3, axis=0)
    # Calculate loss derivative with respect to the second layer
    dz2 = np.multiply(dz3.dot(W3.T), tanh_derivative(a2))
    # Calculate loss derivative with respect to the second layer weights
    dW2 = 1/m * np.dot(a1.T, dz2)
    # Calculate loss derivative with respect to the second layer bias
    db2 = 1/m * np.sum(dz2, axis=0)
    # Calculate loss derivative with respect to the first layer
    dz1 = np.multiply(dz2.dot(W2.T), tanh_derivative(a1))
    dW1 = 1/m * np.dot(a0.T, dz1)
    db1 = 1/m * np.sum(dz1, axis=0)
    # Store gradients
    grads = {'dW3': dW3, 'db3': db3, 'dW2': dW2, 'db2': db2, 'dW1': dW1, 'db1': db1}
    return grads
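The backward_prop function relies on two helpers, loss_derivative and tanh_derivative, which are not shown in the snippet (a softmax_loss helper is also useful later for tracking the loss). Assuming a softmax output layer trained with the cross-entropy loss, which matches the forward pass above, minimal sketches look like this:

# Cross-entropy loss averaged over the m samples; the small constant avoids log(0)
def softmax_loss(y, y_hat):
    m = y.shape[0]
    return -np.sum(y * np.log(y_hat + 1e-12)) / m

# For softmax output with cross-entropy loss, the derivative w.r.t. z3 simplifies to (y_hat - y)
def loss_derivative(y, y_hat):
    return y_hat - y

# Derivative of tanh expressed in terms of its output a = tanh(z)
def tanh_derivative(a):
    return 1 - np.power(a, 2)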

Step 5: the training phase

In order to reach the optimal weights and biases that will give us the desired output (the three wine cultivars), we will have to train our neural network.

I think this is very intuitive. For almost anything in life, you have to train and practice many times before you are good at it. Likewise, a neural network will have to undergo many epochs or iterations to give us an accurate prediction.

When you are learning anything, let's say you are reading a book, you have a certain pace. This pace should not be too slow, as reading the book would take ages. But it should not be too fast either, since you might miss a very valuable lesson in the book.

In the same way, you have to specify a "learning rate" for the model. The learning rate is the multiplier used to update the parameters, and it determines how rapidly they can change. If the learning rate is low, training will take longer. However, if the learning rate is too high, we might overshoot a minimum. The update with the learning rate is expressed as:

w := w - α * dL(w)/dw

  • := means that this is a definition, not an equation or proven statement.
  • α is the learning rate, called alpha.
  • dL(w)/dw is the derivative of the total loss with respect to our weight w.

We chose a learning rate of 0.07 after some experimenting.

# This is what we return at the end
model = initialise_parameters(nn_input_dim=13, nn_hdim=5, nn_output_dim=3)
model = train(model, X, y, learning_rate=0.07, epochs=4500, print_loss=True)
plt.plot(losses)
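The train function itself is not shown in the snippet above. A minimal sketch, assuming the forward_prop, backward_prop, and softmax_loss functions defined earlier plus a hypothetical update_parameters helper that applies one gradient descent step per epoch, could look like this:

# Keep track of the loss per epoch so we can plot it afterwards
losses = []

# Apply one gradient descent step to every parameter
def update_parameters(model, grads, learning_rate):
    for param, grad in [('W1', 'dW1'), ('b1', 'db1'),
                        ('W2', 'dW2'), ('b2', 'db2'),
                        ('W3', 'dW3'), ('b3', 'db3')]:
        model[param] -= learning_rate * grads[grad]
    return model

# Repeat the forward pass, backward pass and parameter update for a number of epochs
def train(model, X, y, learning_rate, epochs, print_loss=False):
    for epoch in range(epochs):
        cache = forward_prop(model, X)                           # forward pass
        grads = backward_prop(model, cache, y)                   # backward pass
        model = update_parameters(model, grads, learning_rate)   # gradient descent step
        loss = softmax_loss(y, cache['a3'])                      # track the loss
        losses.append(loss)
        if print_loss and epoch % 100 == 0:
            print('Loss after epoch %i: %f' % (epoch, loss))
    return model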

Finally, there is our graph. You can plot your accuracy and/or loss over the epochs to get a nice picture of how the prediction improves. After 4,500 epochs, our algorithm reaches an accuracy of 99.44%.
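As an illustration of how that accuracy figure can be computed (a sketch, using the accuracy_score utility imported earlier and comparing the argmax of the softmax output against the true one-hot labels):

# Run one forward pass over the data and compare predicted classes with true classes
cache = forward_prop(model, X)
y_pred = np.argmax(cache['a3'], axis=1)
y_true = np.argmax(y, axis=1)
print(accuracy_score(y_true, y_pred))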

Brief summary

We start by feeding data into the neural network and performing several matrix operations on this input data, layer by layer. For each of our three layers, we take the dot product of the input with the weights and add a bias. Next, we pass this output through an activation function of choice.

The output of this activation function is then used as the input for the following layer, which follows the same procedure. This process is repeated three times, since we have three layers. Our final output is y-hat, the prediction of which wine belongs to which cultivar. This is the end of the forward propagation process.

We then calculate the difference between our prediction (y-hat) and the expected output (y) and use this error value during backpropagation.

During backpropagation, we take our error — the difference between our prediction y-hat and y — and we mathematically push it back through the NN in the other direction. We are learning from our mistakes.

By taking the derivative of the functions we used during the first process, we try to discover what value we should give the weights in order to achieve the best possible prediction. Essentially we want to know what the relationship is between the value of our weight and the error that we get out as the result.

And after many epochs or iterations, the NN has learned to give us more accurate predictions by adapting its parameters to our dataset.

This post was inspired by the week 1 challenge from the Bletchley Machine Learning Bootcamp that started on the 7th of February. In the coming nine weeks, I’m one of 50 students who will go through the fundamentals of Machine Learning. Every week we discuss a different topic and have to submit a challenge, which requires you to really understand the materials.

If you have any questions or suggestions, let me know!

Or if you want to check out the whole code, you can find it here on Kaggle.

Recommended videos to get a deeper understanding of neural networks:

  • 3Blue1Brown’s series on neural networks
  • Siraj Raval’s series on Deep Learning