Deep Learning and the problem of teaching trendy topics inÂ academia
From August 2018 to December 2018, I have been studying at the National University of Singapore for an exchange semester. It was basically my last semester before my final internship; so I mainly wanted to study things that I like but that I hadnât had the chance to explore in various topics in computer science, statistics and applied maths at my homeÂ university.
There is this big new and trending topics in research and in the industry, Deep Learning, this is not that much taught in universities^{1}. As far I know, the main thing are being accessible are MOOCS ^{2}and âquick handsonâ but that are technology or libraryspecific. Yet, there were several courses open for students at NUS that are tackling this topic â the most popular one being CS5242: Neural Networks and Deep Learning.
I registered for this course to understand how such a recent and moving field â sometimes prone to speculative theoretical claims â could be taught: Deep Learning is quite enormous and new as a topic, so much that is hard to keep up to date with the latest trends over everyÂ months.
Such a situation is quite disturbing for professors â who generally master their topic at their fingertips â who now have to adapt or completely change some part of their course eachÂ semester.
Hence, the approach taken by the lecturer was a pragmatically smart and honest one: he chose to essentially focus on the atomic building blocks and the main intuitions for results, expecially for the first introductive part of theÂ course.
âWhat I cannot create, I do not understandâ
Main concepts â that you probably already know â like the neuron (another
name for a nice package of a scalar product and a nonlinearity), layer,
backpropagation, and gradient methods, some multivariables calculs
derivations, etc. were covered exhaustively during the first weeks with simple
but meaningful results and even with some craftmanship I would say: to
concretely understand it, we had to build, train, and compare some
architectures ourselves only using numpy
matrices and theirÂ operations.
And reimplementing the wheel is something I really do love to do: should it takes time and should it be not a âproductiveâ exercice, it is maybe the best way to really understand those concepts in depth.
But I do not only like to build things that work, it even better when one can also make complex things simple so that they can then better be understood, extended orÂ redesigned.
And here, this is not something that is studied by machine learning generally: this is software design, and software engineering.
So instead of just completing the assignement, I decided to create a small library to easily play with all thoseÂ notions.
I call it joml
, which basically mean âjoml
One More
Layerâ^{3}.
Letâs go through the design and the process of creating, and training Neural
Networks using this library: this will also introduce the way joml
has beenÂ designed.
Notations andÂ definitions
Before diving in the details of the training procedure, letâs present some
notations. As well as a general overview on the way networks things are
implemented in joml
.
Network andÂ layers
Letâs note \(\mathcal{N}\) a neural network. As you know, this network consists of several â and possibly different â Layers.
For now, we will only stick to FullyConnected layers which will simply be referred as Layer in what follows. Such layers can be abstracted to allow extending this notion and define new type ofÂ layers.
Typically a Layer, is nothing more than a collection of neurons, that is a collection of weights and biases embedding a activation function. Those weights and bias are respectively used to perform a scalar product on a vector and an addition of the result of this firstÂ operation.
For a given neuron \(i\), if we note \(\omega_i\) its vector of weights, \(b_i\) the bias and \(x\) the vector to operate on ; considering vectors as columns we have the value \(z_i\) of the result of those two operationsÂ is:
What neural networks mainly consist in is stacking Layers together, but before doing it, we have to use something that binds this results â overwise stacking as many layers directly would result in just a linear transformation that is a simple Layer as far as we have them been defined, see the Bonus Track at the end for a quick example ofÂ this.
To bind this results, each neuron uses a nonlinearity that is function \(\sigma:\mathbb{R}\mapsto\mathbb{R}\) with nice property. Such a function is also called an activation function, which is defined bellow. So instead of \(z_i\) being the output of the neuron \(i\), we do have \(a_i\) that isÂ simply:
If we generalise it at the scale of the Layer by introducing the vector of bias \(b\), and the matrix \(W\) whose \(i\)th line correspond to the weights of the \(i\)â»th neurone, we have â assuming that the \(\sigma\) in \eqref{output} is simply the broadcasted version of \(\sigma\) on a wholeÂ vector.
Finally, if we note \(L\) the number of its layers. We can for each layer \(l\in \{1,\cdots, L\}\) note \(W^{(l1,l)}\) the matrix of weights used in the transformation from the previous layer \(l1\) to this considered \(l\) layer; we also note \(b^{(l)}\) the vector of biases of this layer. The equivalent of layer \(0\) corresponds to the input \(x\) and has no associated weights norÂ biases.
Hence, if we note \(N_l\) the size of the layer \(l\), we have byÂ construction:
Thus, a Layer \(l\) is entirely specified using a size \(N_l\), an activation function \(\sigma_l\), and a pair of parameters \((W^{(l1,l)}, b^{(l)})\).
In joml
, this concept is is implemented under the Layer
class. A
Layer
wraps a matrice of weights and biases as well and also embeds
optimizers and informations about a previous Layer
. Some caches are also
needed for the trainingÂ procedure.
Finally, we can define a network \(\mathcal{N}\) as an ordered collection of \(L\) layers, with an input size \(N_0\). It can be seen as a function \(\mathcal{N}: \mathbb{R}^{N_0} \to \mathbb{R}^{N_L}\) definedÂ as:
Prediction
For a vector \(x^{(j)}\) , we define \(\mathcal{N}(x^{(j)})\) as its prediction and we note it \(\widehat{y}^{(j)}\).
In the end, we will be comparing it tothe real value of the mapping of \(x^{(j)}\) that we will note \(y^{(j)}\).
In joml
, a network is implemented under the Network
class.
It basically wraps Layers
and define generic methods to perform all the steps
that are deleguated toÂ them.
ActivationÂ functions
Activation functions are functions that are used in Layer
to perform a
nonlinear mapping on the result of the scalarÂ product.
The sigmoid function â denoted \(s\), see equation \eqref{sigmoid} â used to be the canonical activation function used in neurons â as it was a really smooth function (it belongs to \(C^\infty\)), whose derivate â \(s'\) given in \eqref{dersigmoid} â is quiteÂ nice.
Now, people tend to only use letâs define the ReLu, and softmax activation functions as well as the crossentropyÂ function.
ReLu can be defined as a scalarÂ function:
It is a simple (but not invertible) activation function whose derivativeÂ is:
Generally, activation functions are seen as their broadcasted version on vectors and matrices. That is in the case of ReLu:
softmax is the function of choice when it comes to choose a modality over a set of multiples or to weight those modalities when given associate scorings in \(\mathbb{R}\).
softmax is definedÂ as:
One could perform this choice between modalities only using an easier normalisation or not an hard max that would select the modality with the highest score, but the pros of softmax is that it is flexible and more appropriate when one has to make such choices with greater values^{4}. It also nicely pairs with a cost function as we are going toÂ see.
In joml
, the broadcast on the value of the function and on its derivative is
is performed via a template method on the ActivationFunction.value
and
ActivationFunction.der
: this way only the scalar values are to be
given for each new definition of an activation function; see the source code
of
ActivationFunction
,
the one of
ReLu
,
and the one of
SoftMax
.
CostÂ Function
Cost functions, also called loss functions or objectives, are used to quantify the error that is to minimize to attain aÂ goal.
Here, our goal is to correctly predict output of new data â so we want to minimize a distance we have between the predictions and the realÂ outputs.
The most simple errors we could think of, is simply the mean absolute error, â abbreviated MAE â or the mean squared error â abbreviated MSE.
In fact, those can be used in the scalar case (think of predicting 0/1 only, or a scalar value only). But in the case of prediction between multiple modalities we need somethingÂ else.
The crossentropy, noted \(H\), is defined on two probabilistic distributions represented here by the vectors \(\hat{y}\) and \(y\)Â as:
The crossentropy measures the proximity of two probabilistic distributions: the smaller the crossentropy of two probabilistic distributions is, the more similar those two distributionsÂ are.
It is chosen because it is easy to optimize when used withÂ softmax.
We now have all the pieces to tackle the main pieces that is how to train neural networks.
TrainingÂ procedure
The internals of the backpropagation algorithm are not described here â yet, because writting it is long, and I donât have that much time now. This last (yet important) part of building neural network has been covered extensively in the litterature and there are really exceptional resources online^{5}.
Moreover, a vast variety of optimizers exists, Adam^{6} being a pretty nice and robustÂ one.
All the art and engineering consists in finding a good set of pais of matrices of weights and biases \(\{(W^{(l1,l)}, b^{(l)})\}_{l\in \{1, \cdots, L}\}\) so that a positive aggregation â the cost function â of distance between pairs of vectors \((y^{(j)}, \widehat{y}^{(j)})_j\) isÂ minimized.
Letâs summarize the workflow of the training of a neural network. It breaks up in 5Â steps:

ForwardPropagation: Each input \(x\in \mathbb{R}^{N_0}\) gets forwardpropagated in the network to get its output \(\hat{y}\in \mathbb{R}^{N_L}\). For each layer \(l\), its input \(a^{(l1)}\) as well as its affine transformation \(z^{(l)}\) gets persisted ;
\begin{equation} \left\{ \begin{matrix} z^{(0)} &=& x &\in \mathbb{R}^{N_0}\\ z^{(l)} &=& W^{(l,l1)} a^{(l1)} + b^{(l)} &\in \mathbb{R}^{N_l}\\ a^{(l)} &=& \sigma_l(z^{(l)}) &\in \mathbb{R}^{N_l}\\ \hat{y} &=& a^{(L)} &\in \mathbb{R}^{N_L} \end{matrix} \right. \end{equation}Here, we do not just consider a single pair \((x,y)\), but generally \(n\) pairs \((x^{(i)},y^{(i)})\).See how this is performed in
joml
in aNetwork
and in aLayer
. 
Cost Evaluation: The differents outputs \(\hat{y}^{(i)}\) of the inputs \(x^{(i)}\) are then compared to their labels \(y^{(i)}\): the cost for the crossentropy is computed using the estimator given in equation \eqref{estimatorCost}.
\begin{equation} \label{estimatorCost} \hat{H}\left(\{(\hat{y}^{(i)},y^{(i)})\}_{i \leq n}\}\right) = \frac{1}{n} \sum_{i=1}^n H(\hat{y}^{(i)},y^{(i)}) =  \frac{1}{n} \sum_{i=1}^n \sum_{k=1}^K \hat{y}^{(i)}_k \log y^{(i)}_k \end{equation} 
BackPropagation: The error \(\delta^{(L)}\) is calculated using \(\nabla_{a^{(L)}}\hat{H}\), the gradient of the (estimated) cost \(\hat{H}\) with respect to the vectors of outputs, \(a^{(L)} \in \mathbb{R}^{N_L}\) (see equation \eqref{lastErr}) and is then backpropagated in the network ^{7}.
\begin{equation} \label{lastErr} \delta^{(L)} = \nabla_{a^{(L)}}\hat{H} \odot \sigma_L'(z^{(L)}) \end{equation}From here, the error \(\delta^{(l)}\) associated to each layer \(l\) can be calculated using the error \(\delta^{(l+1)}\) associated to the layer \(l+1\) using the recurrence relation \eqref{recurrenceRelation}.\begin{equation} \label{recurrenceRelation} \delta^{(l)} = W^{(l,l+1)\top} \delta^{(l+1)} \odot \sigma_L'(z^{(l)}) \end{equation}Each layer \(l\) then persists its associated error \(\delta^{(l)}\).See how those two steps are performed in
joml
in aNetwork
and in aLayer
. 
Gradients calculation: For each layer, gradients of weights and biases get calculated using the persisted activation \(a_l\) and persisted error \(\delta_l\)^{8}.
\begin{equation} \label{gradients} \left\{ \begin{matrix} \nabla_{W^{(l,l1)}} \hat{H} &=& \delta^{(l)}\, a^{(l1)\top}\\ \nabla_{b^{(l)}} \hat{H} &=& \delta^{(l)}\\ \end{matrix} \right. \end{equation}See how this is performed in
joml
in aNetwork
, in aLayer
and in theSoftMaxCrossEntropyOutputLayer
. 
Parameters update using gradient descent: A optimisation routine is then executed to update the weights and biases accordingly. Here a first order method â generally a variant of the gradient descent â is used to optimise each layer weights and biases. The simplest update is of parameters is done in equation \eqref{updateParam}
\begin{equation} \label{updateParam} \left\{ \begin{matrix} W^{(l,l1)} &\longleftarrow& W^{(l,l1)} && \eta_l \nabla_{W^{(l,l1)}}\\ b^{(l)} &\longleftarrow& b^{(l)} && \eta_l \nabla_{b^{(l)}} \end{matrix} \right. \end{equation}See how this is performed in
joml
in aNetwork
, in aLayer
and byGradientDescent
and byAdam
.
Run the wheel: a simplistic toyÂ example
Letâs say that we are given a data set that represents a function that takes in
14 binary digits and output one of four numbers (\(\{0,1,2,3\}\)), using joml
we can easily construct neural network to train to learn thisÂ function:
The output vectors \((y^{(j)})_j\) are onehot vectors, i.e vector suchÂ that:
Letâs say that we want to construct three networks likeÂ this:
14100404 net
: \(1\) hidden layer of size \(100\), \(1\) hidden layer of size \(40\) (\(5704\)Â parameters)1428x64 net
: \(28\) hidden layer of size \(6\) (\(4596\)Â parameters)1414x284 net
: \(14\) hidden layer of size \(14\) (\(5940\)Â parameters)
each network havings the cross entropy as the cost function on top of the softmax function for the training. All but the last fully connected layers have ReLU as the activationÂ function.
Using joml
, we can define those networks asÂ follows:
import numpy as np
from joml.network import Network
from joml.layer import Layer, SoftMaxCrossEntropyOutputLayer
n1 = Network(input_size=14, name="14100404 net")
n1.stack(Layer(size=100))
n1.stack(Layer(size=40))
n1.output(SoftMaxCrossEntropyOutputLayer(size=4))
n2 = Network(input_size=14, name="1428x64 net")
for _ in range(6):
n2.stack(Layer(size=28))
n2.output(SoftMaxCrossEntropyOutputLayer(size=4))
n3 = Network(input_size=14, name="1414x284 net")
for _ in range(28):
n3.stack(Layer(size=14))
n3.output(SoftMaxCrossEntropyOutputLayer(size=4))
Training and TestingÂ Set
The dataset used consist of all the elements of \(\{1,1\}^{14}\), namely 16384 vectors. \(80\%\) of this space is used to train, the remaining \(20\%\) of it toÂ test.
Train  Test  

Samples  \(13107\)  \(3277\) 
Proportion  \(80\%\)  \(20\%\) 
To train those architectures, Adam have been used with the default parameters given in the original paper, that is \(\beta_1=0.9\), \(\beta_2=0.999\), \(\epsilon=10^{8}\) and a learning rate of \(\eta=0.002\). We are using minibatch of 32 example, and 10Â epochs.
Using joml
, performing the training on the 3 networks can be done with thisÂ snippet:
for network in [n1, n2, n3]:
network.train(x_train, y_train)
y_pred, y_hat, acc = network.test(x_test, y_test)
But one can also perform a benchmark with jolm
to better see andÂ understand
for network in [n1, n2, n3]:
# Define a log file
current_datetime = strftime("%Y%m%d%H:%M:%S", gmtime())
logs_folder = os.path.join("..", "logs")
out_file = os.path.join(logs_folder, f"benchmark_{network.name}{current_datetime}.csv")
# Benchmark the network
logger = network.benchmark(x_train,
y_train,
x_test,
y_test,
csv_file_name=out_file,
num_epochs=10)
# Dump results in a CSV file
logger.dump_results()
logger.plot_benchmark()
Results andÂ discussions
Firstly the networks succeed to learn the given mapping \(f:\{1,1\}^{14} \to \{1,2,3,4\}\): test accuracies of more than \(90\%\) are observed using the given procedure for all the networks (see the table bellow). The training cost is noisy: this is normal as we are using MiniBatch instead of the entire dataset to train the network for oneÂ iteration.
Architecture  Training Cost  Testing Cost  Training Accuracy  Testing Accuracy 

14100404 net 
0.0804  1.2273  1.0  0.9774 
1428x64 net 
5.3491  4.5165  0.8947  0.9150 
1414x284 net 
1.166  3.10  0.9736  0.9340 
The three following figures, show respectively
the cost and the accuracy for the 14100404 net
, 1428x64 net
,
1414x284 net
architectures over the iterations ^{9}.
Indeed, using the entire dataset would give a better estimate of the cost \(H\) and thus a good gradient and a hopefully (and idealy monotonically strictly) decreasing training cost. But this is computational demanding, and generally faster computations are preferred even if they do not give the best results: on the other hand, one sample at the time can be used to get faster computations, but this would give a really poor estimate of the cost function; thus resulting in having gradients that are not colinear to the real gradients of the cost (i.e. the one pointing to the steepest direction). Using minibatches is in between those two approaches, namely Batch Gradient Descent (that uses the entire set) and Stochastic Gradient Descent (that uses only one random sample): it is still quite fast (as fewer example get evaluated instead of all the examples of the training set) but a reasonably good estimate of the cost can be obtained (thus resulting in a generally decreasing cost) that is definitively better than in SGD. Hence, this explains the irregular training costs and accuracies observed for the threeÂ architectures.
Moreover, the shallowest architectures are also the ones that are the easiest
to train: the first 14100404 net
has a every iteration a better
accuracy than the second 1428x64 net
; the same observation between
the second and the last 1414x284 net
holds. Some parameters tuning
has been needed for the lastÂ architecture.
This might be intuitively â no yet theoretically â explainable because adapting fewer weights (i.e. optimising a function with less variables) is a easier problem. Though, more theoretical arguments should be given^{10}.
Finally, for the three architectures the testing accuracies seem to follow their associated training accuracies: this means that new example are as likely to be correctly classified as examples that are used to train those architectures. However, those claims should be verified: it might be possible that the testing accuracy drop after more epochs, meaning that the associated network is overfitting the data ^{11}.
InÂ brief
In this article â that I would have wanted it to be shorter â we have seen the
base foundation of multilayer perceptron and itâs optimisation procedure.
joml
has
also been presented as a way to see how could such networks be implementedÂ minimalistically.
Bonus Track â Equivalence of two NeuralÂ Networks
Two networks are said to be equivalent if, they have the same number of input and output nodes, for all inputs, the outputs of both networks are identical. Given two neural networks as shown below, all activation functions \(\sigma\) are identityÂ functions:
Letâs say that we are here given a neural network \(\mathcal{N}_A\) of three hidden layers; we thus have the following pairs ofÂ parameters:
Each layer has the same size and the same activation fonction defined to be the identity function,Â i.e:
We want to construct an equivalent network \(\mathcal{N}_B\) consisting of a simple layer having a single pair ofÂ parameters:
\(\mathcal{N}_A\) and \(\mathcal{N}_B\) are equivalent, if they share the same number of inputs and outputs and if they give the same output for an identical input,Â i.e:
The following values hold for all \(x\):
As \(\sigma = \text{Id}_{\mathbb{R}^N}\):
Distributing products, we finallyÂ get:
Thus, by identifying the terms, weÂ get:
If the weights and biases are given in the first network with two hidden layers, we can write a program to find the weights and biases of the second network with no hidden layers to make the two networksÂ equivalent.
 Expecially in France, where the acamedic communities do prefer having solid theoretical foundations. It is changing with for example StĂ©phane Mallat giving lectures about his research on a mathematical understanding of such systems. â©
 Like Andrewâs Ng, that give a broad overview of the topic. â©
 Yes. â©
 See this Stack Exchange thread for a quick overview â©
 See this article that covers optimisation methods for Deep Learning â©
 See the original article: Adam: A Method for Stochastic Optimization â©
 This is done per pair \((x^{(i)}, y^{(i)})\), but we do not indicate it here for the sake of notation. â©
 Actually, means of gradients associated to each pair \((x^{(i)}, y^{(i)})\) are used for the updates, but again we do not indicate it here to make it readable. â©
 An iteration is defined as the processing of oneÂ minibatch. â©
 A relation between the number of parameters and the training easiness of the same network with to the number of training examples is a result I am currently looking for. StĂ©phane Mallat gives a result about it (in French unfortunately)Â â©
 Such a verification could be done by denoising the training cost and keeping track of the gap between the training cost and the testing cost. Some methods â like early stopping â use this metric to avoid to train network that overfit the trainingÂ data â©