
# Course on Machine Learning and Central Banking

Digital Course. November 16 – 20, 2020.

During the first edition of this Course, held virtually in collaboration with the Deutsche Bundesbank and the Banco Central de Costa Rica, participants learned the fundamental concepts of Machine Learning and how central bank problems can be addressed with this set of techniques. In addition to the conceptual sessions, the Course included hands-on exercises showing how to implement the concepts in the programming language R.

**Day 1**

The first day started with a presentation by Professor Stefan Bender giving an overview of Machine Learning (ML) and how it can be a reliable tool for central banks to solve problems across their different areas, including payment systems and financial stability, among others. He outlined the general process for developing an ML-based system: i) understand the business problem, ii) map the original problem to an ML problem, iii) understand the data being used, iv) explore and prepare (preprocess) the data, v) select the best method, vi) evaluate, and vii) deploy. He then reviewed the main types of ML (supervised, semi-supervised and unsupervised) and mentioned examples of each, such as clustering algorithms and classification methods. Finally, he mentioned the main factors to consider when implementing these procedures: complexity, overfitting, robustness, interpretability, and training and testing time.

The following session was devoted to a use case in which the Deutsche Bundesbank developed an ML-based methodology to find links between records referring to the same entity in databases from different sources, given that unique identifiers were not available for direct linkage. Such a system can provide substantial benefits for information analysis. The process comprised these steps: preprocessing of the databases, reduction of the search space, comparison of records, classification (deciding whether or not a link exists), and evaluation of the results. The development also posed challenges regarding privacy.

Next, the data-splitting process common in ML was reviewed. A golden rule in ML is to evaluate models on data that was not used to train them. To this end, the whole dataset is divided (split) into different sets, almost always at random, and each set has a different purpose. The training set is used to train the model, which learns the underlying data patterns; the validation set is used to try different model configurations and select the one with the best performance; finally, the test set is used to see how well the model performs on unseen data.
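The Course's exercises used R; as an illustrative sketch of the splitting procedure described above, the following uses Python's scikit-learn on synthetic data (the 60/20/20 proportions are an assumption for illustration, not part of the Course material):

```python
# Minimal sketch of a random train/validation/test split.
# The synthetic dataset and the 60/20/20 proportions are illustrative.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                           # hypothetical features
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)    # hypothetical labels

# First set aside the test set (20%), then split the remainder
# into training (60% of the total) and validation (20% of the total).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```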

This session also reviewed the concept of cross-validation, an exhaustive procedure that tries different training and validation sets. It then introduced the confusion matrix, which tallies true positives, true negatives, false positives and false negatives, and which, along with measures such as accuracy, recall, precision and F1-score, serves for the validation and evaluation of ML models.
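To make the evaluation measures concrete, here is a small sketch in Python's scikit-learn (the Course itself worked in R); the labels and predictions are made up for illustration:

```python
# Sketch: confusion matrix and the evaluation measures derived from it.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # made-up ground truth
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # made-up model predictions

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                   # 3 1 1 3
print(accuracy_score(y_true, y_pred))   # 0.75 = (TP + TN) / total
print(precision_score(y_true, y_pred))  # 0.75 = TP / (TP + FP)
print(recall_score(y_true, y_pred))     # 0.75 = TP / (TP + FN)
print(f1_score(y_true, y_pred))         # 0.75 (harmonic mean of the two)
```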

The day ended with a series of hands-on exercises about the main R commands.

**Day 2**

The second day was devoted to deepening the study of shrinkage methods, which are variations of ordinary least squares (OLS) regression.

The presentation started with the main motivations that drive the use of these techniques: handling the curse of dimensionality, reducing over-parametrization and overfitting, and reducing computational resources. It was also mentioned that the core idea can also be found in econometric techniques such as partial least squares and principal components regression.

The session continued with a review of ridge regression. It was noted that the difference with respect to OLS regression lies in the addition of a new term to the objective function that imposes a penalty on the coefficients’ magnitude; this term makes use of the L2 norm. This subtle difference reduces the magnitude of the regression coefficients and improves the prediction of unseen values.
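The shrinking effect can be seen in a few lines; the Course's exercises were in R, so the following Python/scikit-learn sketch on synthetic data (penalty strength and data are illustrative assumptions) is only a stand-in:

```python
# Sketch: the L2 penalty in ridge regression shrinks the coefficient
# vector relative to OLS. Data and penalty strength are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
beta = np.zeros(10)
beta[:3] = [3.0, -2.0, 1.5]          # only three informative variables
y = X @ beta + rng.normal(size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # alpha controls the L2 penalty

# The penalized coefficient vector has smaller overall magnitude:
print(np.linalg.norm(ridge.coef_) < np.linalg.norm(ols.coef_))  # True
```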

Then lasso regression was reviewed. As in ridge regression, a penalization term is added to the objective function, but in this case the term makes use of the L1 norm (Manhattan distance). The new term forces the least important regression coefficients to be exactly zero, which is equivalent to excluding those variables from the model. This technique improves prediction and has a computational advantage, since it forces some coefficients to be zero.
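The variable-exclusion behavior can be sketched as follows (again in Python rather than the Course's R; the data and penalty value are illustrative assumptions):

```python
# Sketch: the L1 penalty in the lasso sets the least important
# coefficients to exactly zero, excluding those variables.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 20))
beta = np.zeros(20)
beta[:3] = [4.0, -3.0, 2.0]          # only three relevant variables
y = X @ beta + rng.normal(size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
n_selected = int(np.sum(lasso.coef_ != 0))
print(n_selected)   # far fewer than 20: irrelevant variables are dropped
```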

Afterwards, the elastic net technique was explained. It was observed that the lasso can behave poorly with groups of strongly correlated variables, tending to select only one variable from each group, whereas ridge regression exhibits a grouping effect, that is, strongly correlated variables tend to be in or out of the model together. To combine both behaviors, elastic net uses a convex combination of the ridge and lasso penalization terms. Finally, an example of multiclass classification was presented, in which the performance of the shrinkage methods was compared.
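A sketch of the convex combination of penalties, using Python's scikit-learn rather than the Course's R (the mixing weight, penalty strength and the tiny synthetic dataset are all illustrative assumptions):

```python
# Sketch of elastic net on a pair of strongly correlated predictors.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(3)
x1 = rng.normal(size=300)
x2 = x1 + 0.01 * rng.normal(size=300)   # nearly identical to x1
x3 = rng.normal(size=300)
X = np.column_stack([x1, x2, x3])
y = 2 * x1 + 2 * x2 + rng.normal(size=300)

# l1_ratio mixes the lasso (L1) and ridge (L2) penalty terms:
# 1.0 would be pure lasso, 0.0 pure ridge.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(enet.coef_[0] != 0 and enet.coef_[1] != 0)  # True: the correlated pair stays together
```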

The day concluded with a series of hands-on exercises in R.

**Day 3**

The Course’s third day was devoted to studying decision tree methods, how they are implemented, and some examples. The first part of the ensemble methods topic was also introduced.

The day started with an introduction to decision trees, a family of greedy algorithms that make recursive partitions of the data in order to generate subsets that predominantly belong to one value of the dependent variable. The focus of the session was on classification trees, i.e., trees in which the dependent variable is categorical.

The session continued with an example of what a decision tree might look like, followed by the fundamental decisions involved in building one: the split criterion, how to select the independent variables for the splits, the depth the tree should have, and which value should be predicted in each part of the tree. Each of these was then discussed in depth: first, the main split criteria (CART, minimum value and maximum value) were mentioned; next, how the method finds the independent variable and the corresponding cut-off for each split; then, stopping and pre-pruning criteria to control the depth of a tree and avoid overfitting were presented; and finally, two alternatives to generate predictions at each leaf node were discussed: majority voting and predicted probabilities.
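The building blocks above can be sketched briefly; the Course used R, so this Python/scikit-learn example is only an illustration (the Iris dataset and the depth limit are assumed for the sketch):

```python
# Sketch of a classification tree with pre-pruning via a depth limit,
# showing the two leaf-prediction modes: majority vote and probabilities.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

print(tree.get_depth())           # 3: growth stopped at the pre-pruning limit
print(tree.predict(X[:1]))        # majority vote at the leaf
print(tree.predict_proba(X[:1]))  # class proportions at the same leaf
```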

It was then explained that one of the main goals in Machine Learning is to find models that minimize both bias and variance. One option to achieve this is to use cross-validation to estimate the validation error and test the model on new observations. Another possibility is to combine many models to generate averaged predictions, an approach called ensemble learning. In the last part of the day, an introduction to ensemble methods and bagging was given. Ensemble methods combine multiple weak models to generate a stronger one. In bagging, the idea is to generate multiple datasets from the original one, estimate a new model on each, and then aggregate the predictions of all the models.
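The bagging idea can be sketched as follows (Python/scikit-learn as a stand-in for the Course's R; the dataset and number of estimators are illustrative assumptions):

```python
# Sketch of bagging: many trees are fit on bootstrap re-samples of the
# same data, and their predictions are aggregated by voting.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
bagged = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                           n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Averaging over bootstrap re-samples typically stabilizes the prediction:
print(single.score(X_te, y_te), bagged.score(X_te, y_te))
```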

The day concluded with hands-on exercises implementing tree techniques in R.

**Day 4**

The fourth day was devoted to a deeper look at two ensemble methods that use decision trees as their base model: random forest and gradient boosting.

After a recapitulation of ensemble methods and bagging, the session focused on random forests. The main idea is first to generate a number of new datasets (through re-sampling), then to train a tree on each dataset while considering only a random subset of the independent variables, which introduces additional randomness into the methodology, and finally to aggregate the predictions of the individual trees into the final prediction. It was then noted that although random forests reduce variance thanks to their ensemble nature, attention must be paid to the search for the optimal set of parameters for the model in order to avoid high variance.
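The two sources of randomness, bootstrap re-sampling and variable subsetting, appear directly in a typical API; this Python/scikit-learn sketch stands in for the Course's R implementation (the dataset and hyper-parameter values are illustrative):

```python
# Sketch of a random forest: bootstrap re-sampling of the data plus a
# random subset of variables considered at each split (max_features).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200,
                            max_features="sqrt",  # variables tried per split
                            random_state=0).fit(X_tr, y_tr)
print(rf.score(X_te, y_te))
```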

The session continued with an implementation of random forests in R.

Then gradient boosting was reviewed. The main idea behind this technique is to sequentially train new models, giving more importance to the observations that are difficult to predict; the level of difficulty is reflected in weights or residuals linked to each observation. In this case, unlike random forest, the dataset used for each model is the same, and only the weights change at each iteration. Gradient boosting is often applied in the context of trees, but any weak model can be used. It was also observed that, on the one hand, this technique can solve any type of problem as long as the gradient of the loss function involved can be calculated; on the other hand, the implementation must be accompanied by careful hyper-parameter tuning in order to control bias, and it requires more computation time than the other techniques covered.
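The sequential fitting and the hyper-parameters mentioned above can be sketched as follows (Python/scikit-learn as a stand-in for the Course's R; the dataset and the parameter values shown are illustrative, and in practice would be tuned, for example by cross-validation):

```python
# Sketch of gradient boosting: weak trees are fit one after another,
# each stage correcting the errors of the ensemble built so far.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(n_estimators=100,   # sequential stages
                                learning_rate=0.1,  # shrinkage per stage
                                max_depth=3,        # weak base trees
                                random_state=0).fit(X_tr, y_tr)
print(gb.score(X_te, y_te))
```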

The day ended with a hands-on exercise on gradient boosting in R.

**Day 5**

The last day of the Course was devoted to presenting three Machine Learning projects developed by CEMLA together with regional central banks and University College London (UCL), followed by an overview of the topics covered during the Course and the participants’ final comments.

##### Monday, November 9

**Opening speech**

Dr. Serafín Martínez Jaramillo, Advisor to General Director, CEMLA

Introduction

**Gabriela Alves Werb and Dr. Jens Mehrhoff, Deutsche Bundesbank**

- Structure and Organization of the Course

**Prof. Stefan Bender - Deutsche Bundesbank**

- Machine Learning and Central Banking

**Gabriela Alves Werb and Dr. Jens Mehrhoff, Deutsche Bundesbank**

- Train, Test and Validation Samples

Introduction II

**Gabriela Alves Werb and Dr. Jens Mehrhoff, Deutsche Bundesbank**

- Cross-Validation
- Confusion Matrix
- Evaluation Measures (Precision, Recall, F1-Score, etc.)
- PR Curve

##### Tuesday, November 10

Shrinkage

**Gabriela Alves Werb and Dr. Jens Mehrhoff, Deutsche Bundesbank**

- Introduction to Shrinkage
- Lasso
- Ridge

Shrinkage II

**Gabriela Alves Werb and Dr. Jens Mehrhoff, Deutsche Bundesbank**

- Elastic Net
- Extension: Multiclass Problems

##### Wednesday, November 11

Decision Trees

**Gabriela Alves Werb and Dr. Jens Mehrhoff, Deutsche Bundesbank**

- Decision Trees (CART)
- Bootstrapping

Decision Trees II

**Gabriela Alves Werb and Dr. Jens Mehrhoff, Deutsche Bundesbank**

- Introduction to Ensemble Methods

##### Thursday, November 12

Ensemble Methods (Random Forest)

**Gabriela Alves Werb and Dr. Jens Mehrhoff, Deutsche Bundesbank**

- Random Forest

Ensemble Methods (Gradient Boosting)

**Gabriela Alves Werb and Dr. Jens Mehrhoff, Deutsche Bundesbank**

- Gradient Boosting

##### Friday, November 13

Machine Learning in Practice

Serafín Martínez-Jaramillo, CEMLA

**Wrap Up and Q&A**

Serafín Martínez-Jaramillo, CEMLA; Gabriela Alves Werb, Dr. Jens Mehrhoff, and Prof. Stefan Bender - Deutsche Bundesbank

**Dr. Serafín Martínez-Jaramillo**
Advisor, CEMLA

Serafin Martinez-Jaramillo is a senior financial researcher at the Financial Stability General Directorate at Banco de México and is currently an adviser at CEMLA. His research interests include financial stability, systemic risk, financial networks, bankruptcy prediction, genetic programming, multiplex networks and machine learning. Serafin has published book chapters, encyclopedia entries and papers in journals such as IEEE Transactions on Evolutionary Computation, the Journal of Financial Stability, Neurocomputing, the Journal of Economic Dynamics and Control, Computational Management Science and the Journal of Network Theory in Finance, among others. Additionally, he has co-edited two books and two special issues of the Journal of Financial Stability. Serafin holds a PhD in Computational Finance from the University of Essex, UK, and is a member of the editorial boards of the Journal of Financial Stability, the Journal of Network Theory in Finance and the Latin American Journal of Central Banking.

**Prof. Stefan Bender**
Head of the Research Data and Service Center of the Deutsche Bundesbank

Stefan Bender is Head of the Research Data and Service Center of the Deutsche Bundesbank and, since 2018, Honorary Professor at the School of Social Sciences, University of Mannheim. In his position at the Deutsche Bundesbank he was chair of INEXDA (the Granular Data Network) and vice-chair of the German Data Forum (www.ratswd.de). Before joining the Deutsche Bundesbank, Bender was head of the Research Data Center (RDC) of the Federal Employment Agency at the Institute for Employment Research (IAB), where he established an internationally oriented research data centre, including access to IAB data in the US (for example at Berkeley and Harvard). His research interests are data access, data quality, merging administrative, survey and/or big data, record linkage, management quality, and the mobility of inventors. He has published over 100 articles in journals including the American Economic Review and the Quarterly Journal of Economics.

**Dr Jens Mehrhoff**

Head of the Sustainable Finance Data Hub in the Directorate General Statistics of the Deutsche Bundesbank.

Prior to his current role, Jens was Head of the Section for Business Cycle, Price and Property Market Statistics for many years, and was recently on secondment to the statistical office of the European Union (Eurostat). Jens is a member of the UN Global Working Group on Big Data and conducts research on classification from a central bank's perspective. He has given talks at several high-profile international conferences as well as to major international organizations, and he is a lecturer in machine learning at Goethe University Frankfurt.

**Gabriela Alves Werb**
Deutsche Bundesbank

Gabriela Alves Werb is a Ph.D. candidate at the Goethe University Frankfurt within the structured doctoral program of the Graduate School of Economics, Finance, and Management (GSEFM). She holds an M.Sc. degree in Quantitative Management from GSEFM (Germany) and a degree in Production Engineering from PUC-Rio (Brazil). She started her career at IBM (2007-2011), where she worked in financial sales, financial analysis, controlling, pricing and process improvement. In 2011, she joined the consulting firm Hays, where she worked for more than three years; she initially led projects in the Engineering & Manufacturing sector and later led and restructured the Oil & Gas business division in Brazil.

During her Ph.D. studies, she worked at the Chair of Electronic Commerce at the Goethe University Frankfurt. Her teaching focused on machine learning methods and their application to solve substantive problems in several disciplines, including marketing. In August 2020, she joined the Research Data and Service Centre (RDSC) at the Deutsche Bundesbank as a Data Scientist.