
III Course on Machine Learning and Central Banking

September 12 - 15, 2022
Digital Format

 


The course was held in digital format, from September 12 to 15, 2022. The course aimed to present the building blocks of Machine Learning and analyze selected methods, establishing connections between them and conventional statistical methods. The event also addressed the practical challenges associated with their adoption, providing a forum for participants to present and discuss strategies to develop and implement Machine Learning models, enabling an exchange of knowledge among countries on this increasingly important topic.

The course opened with welcoming remarks from Dr. Gerardo Hernández del Valle, who highlighted the automation capabilities of machine learning while noting that these tools should be used as a complement to human work, not a substitute for it.

Day 1

The first session gave an introduction to the course. The speakers started by emphasizing that machine learning (ML) tools should be used with care: their goal is not to replace traditional tools, but to complement them. In the same spirit, an ML model should not fit the training data perfectly, because that would mean it is memorizing the data instead of learning the relationships between the variables, and it would therefore not be effective in the real world.

It was explained that the available data must be partitioned, since the model has to be evaluated to verify that this memorization does not occur. The standard practice is to randomly split the data, taking 60% to train the model, 20% to validate it, and 20% to test it.
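As an illustration, a minimal sketch of such a random split in base R is shown below; the synthetic credit_data frame and the exact 60/20/20 cut points are assumptions made only for illustration.

    set.seed(123)                                  # reproducible shuffle
    credit_data <- data.frame(x = rnorm(1000),     # synthetic data for illustration
                              y = rbinom(1000, 1, 0.5))
    n   <- nrow(credit_data)
    idx <- sample(seq_len(n))                      # random permutation of the rows
    train <- credit_data[idx[1:(0.6 * n)], ]               # 60% for training
    valid <- credit_data[idx[(0.6 * n + 1):(0.8 * n)], ]   # 20% for validation
    test  <- credit_data[idx[(0.8 * n + 1):n], ]           # 20% for testing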

The instructors mentioned that the validation data is used for trial and error with the model: seeing how it behaves, making changes, retraining it, and repeating this process until the results are satisfactory. The test data, in contrast, is a point of no return, because if the model is evaluated on the test set and changes are then made based on those results, the evaluation ceases to be unbiased. They also noted that ML models are often unable to predict events that have never happened before, and that having a lot of data does not necessarily help, since it can be very similar and not provide additional knowledge.

They then explained the cross-validation method, in which the data is randomly divided into K parts; one part is held out for evaluation and the others are used for training. For example, if the data is divided into 4 parts, in the first iteration part 1 is set aside and the model is trained with parts 2, 3, and 4; in the second iteration part 2 is set aside and the model is trained with parts 1, 3, and 4; and so on, obtaining 4 estimates of the error.

The disadvantage is that the model must be trained K times, which can take a lot of time. For time series, an expanding window is used instead: the window grows in each iteration and its final segment is held out for evaluation.
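A minimal sketch of 4-fold cross-validation in base R is given below; the synthetic data and the linear model are assumptions made only to keep the example self-contained.

    set.seed(1)
    df <- data.frame(x = rnorm(200))
    df$y <- 2 * df$x + rnorm(200)

    K     <- 4
    folds <- sample(rep(1:K, length.out = nrow(df)))   # random fold labels
    cv_error <- numeric(K)

    for (k in 1:K) {
      train <- df[folds != k, ]                        # K - 1 folds for training
      test  <- df[folds == k, ]                        # held-out fold
      fit   <- lm(y ~ x, data = train)
      pred  <- predict(fit, newdata = test)
      cv_error[k] <- mean((test$y - pred)^2)           # error on the held-out fold
    }
    mean(cv_error)                                     # cross-validated error estimate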

The second session continued the introduction, beginning with evaluation methods such as the confusion matrix, which compares false positives, false negatives, true negatives, and true positives. Some measures obtained from it are accuracy, sensitivity, and specificity.

Accuracy is calculated by dividing the correct predictions by the total. The problem is that classes are often imbalanced; for example, in loan classification 99% of loans are repaid, so a model with 99% accuracy would perform no better than simply declaring that all loans will be repaid, since that rule would also have 99% accuracy. Sensitivity measures the percentage of actual positives that were correctly predicted, and specificity does the same for actual negatives.

They then discussed the F1 score, which balances precision with sensitivity, as well as the ROC curve (receiver operating characteristic curve), which plots sensitivity against specificity across different probability thresholds.
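For illustration, a short R sketch computing these measures from a 2x2 confusion matrix; the counts below are made up and deliberately imbalanced, in the spirit of the loan example above.

    TP <- 40; FP <- 10; FN <- 5; TN <- 945      # hypothetical confusion-matrix counts

    accuracy    <- (TP + TN) / (TP + TN + FP + FN)
    sensitivity <- TP / (TP + FN)               # share of actual positives detected
    specificity <- TN / (TN + FP)               # share of actual negatives detected
    precision   <- TP / (TP + FP)
    f1          <- 2 * precision * sensitivity / (precision + sensitivity)

    c(accuracy = accuracy, sensitivity = sensitivity,
      specificity = specificity, F1 = f1)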

The day concluded with an introduction to the R language, along with some practical examples in it.

Day 2

The second day began with the presenters introducing decision trees. The idea is to make the best possible split at each step and thereby divide the problem into simpler parts, in a recursive way. At each point a decision criterion is formed, for example, whether the total amount of a loan is greater than a certain value or not, and the data is divided accordingly. Each split grows the tree.

They then explained that a good way to choose the splitting criterion is the Gini impurity, since it reaches its minimum when all the observations in a node belong to the same category (purity) and its maximum when the observations are distributed equally among all the categories. The aim is for the splits to be as clear-cut as possible, so the variable chosen is the one that produces the purest nodes.
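A small sketch of the Gini impurity of a node, one minus the sum of squared class proportions, illustrating the behaviour described above.

    # Gini impurity of a node given the class counts it contains
    gini <- function(counts) {
      p <- counts / sum(counts)
      1 - sum(p^2)
    }
    gini(c(50, 0))    # pure node  -> 0 (minimum)
    gini(c(25, 25))   # even split -> 0.5 (maximum for two classes)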

Another problem they mentioned is that, if it is not stopped, the algorithm can generate a leaf (an end point of the tree) for each observation, which makes it bad at evaluating new data. To avoid this, the tree must be pruned, either after it is fully grown or by fixing the size of the tree beforehand.
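A minimal sketch of growing a deliberately deep classification tree with the rpart package and then pruning it back; the built-in iris data and the complexity values are illustrative choices, not the course's own example.

    library(rpart)
    # Grow a deep tree by allowing very small splits and a tiny complexity penalty
    fit <- rpart(Species ~ ., data = iris, method = "class",
                 control = rpart.control(cp = 0.001, minsplit = 2))
    printcp(fit)                    # cross-validated error for each tree size
    pruned <- prune(fit, cp = 0.05) # prune back with a larger complexity penalty
    pruned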

They then talked about what to do when a variable used for a split is not available: surrogate variables are usually taken, which are different from the missing variable but whose splits resemble the original split as closely as possible.

The explanation concluded with some caveats, such as that the Gini criterion is biased toward variables with missing observations, and that the observations must be independent.

The next topic discussed was conditional inference trees, which use p-values instead of the Gini impurity. As a result, overfitting (memorizing the data) is reduced, they have a natural stopping criterion, and they do not give more weight to variables with missing observations. However, they are more complex computationally.
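A minimal sketch of a conditional inference tree with the partykit package, whose splits are chosen via permutation-test p-values and which stops automatically when no variable is significant; the iris data is an illustrative choice.

    library(partykit)
    ct <- ctree(Species ~ ., data = iris)
    ct          # splits stop when no variable is statistically significant
    plot(ct)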

In the second session, the bias-variance trade-off was discussed, since it is very common for complex models to have low bias but high variance. The idea is to have a model with a good balance between these two measures.

They then explained ensemble methods, which seek to combine a series of weak models to produce a stronger and more stable one. From all the predictions obtained, the average or the majority vote can be taken to reach a consensus. However, the results are harder to interpret, because having so many models turns the ensemble into a black box.

The first method of this type explained was bagging, which creates many small models from the same data set. The idea is to train each model on a slightly different database; to create these, the original data is sampled with replacement (bootstrapping). Another method is boosting, which starts with a simple model that is analyzed and improved in a second iteration; this process is repeated, improving the model in each iteration until a stopping criterion is reached.
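As an illustration of bagging, a hand-rolled sketch in R is shown below: each tree is trained on a bootstrap sample (drawn with replacement) and the predictions are combined by majority vote. The iris data and the choice of 25 trees are arbitrary.

    library(rpart)
    set.seed(42)
    B     <- 25
    trees <- vector("list", B)
    for (b in 1:B) {
      boot <- iris[sample(nrow(iris), replace = TRUE), ]   # bootstrap sample
      trees[[b]] <- rpart(Species ~ ., data = boot, method = "class")
    }
    # Majority vote across the B trees
    votes <- sapply(trees, function(t) as.character(predict(t, iris, type = "class")))
    pred  <- apply(votes, 1, function(v) names(which.max(table(v))))
    mean(pred == iris$Species)                             # in-sample accuracy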

The day concluded with a practical session on the topics seen.

Day 3

The first session of the day started with some problems of the previous day's methods, such as producing predictions that are very similar to each other. One solution is to use random forests, which hide a random subset of the variables from each tree, so different options are tried in each iteration; several trees are created, each using different variables.

They explained how this method reduces variance but can increase bias, since biased splits can be created, and the resulting models are more difficult to interpret.
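A minimal sketch of a random forest with the randomForest package; mtry is the number of randomly chosen variables each split is allowed to consider, and the data and parameter values are illustrative.

    library(randomForest)
    set.seed(7)
    rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
    rf                 # includes the out-of-bag error estimate
    importance(rf)     # variable importance measures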

The next method explained was causal forests, which are used to estimate cause-and-effect relationships. Instead of the random trees used before, causal trees are built, in which splits are chosen so that the treatment effect is roughly constant within each node but differs from that of other leaves.
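A sketch of a causal forest using the grf package on simulated data in which the treatment effect varies with the first covariate; the data-generating process and all parameter choices are assumptions made for illustration.

    library(grf)
    set.seed(1)
    n <- 2000
    X <- matrix(rnorm(n * 5), n, 5)            # covariates
    W <- rbinom(n, 1, 0.5)                     # random treatment assignment
    tau <- 1 + 0.5 * X[, 1]                    # heterogeneous treatment effect
    Y <- tau * W + X[, 2] + rnorm(n)           # outcome
    cf <- causal_forest(X, Y, W)
    head(predict(cf)$predictions)              # estimated effect per observation
    average_treatment_effect(cf)               # overall effect estimate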

In the second session they discussed gradient boosting, which optimizes a loss function by following its gradient. They also mentioned that a single iteration produces only a simple model, but iterating gives more robust results. Care must be taken with the step size in each iteration, because the method moves toward local optima: if the step is very small there will be little improvement, and if it is very large the method will overshoot the optimum. The weak models are usually trees.
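A minimal sketch of gradient boosting with the gbm package; the shrinkage parameter plays the role of the step size discussed above, and the synthetic data and settings are illustrative assumptions.

    library(gbm)
    set.seed(3)
    df <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
    df$y <- sin(df$x1) + 0.5 * df$x2 + rnorm(500, sd = 0.2)
    fit <- gbm(y ~ x1 + x2, data = df, distribution = "gaussian",
               n.trees = 1000, shrinkage = 0.05,   # small steps, many iterations
               interaction.depth = 2, cv.folds = 5)
    best <- gbm.perf(fit, method = "cv")           # iteration with lowest CV error
    pred <- predict(fit, df, n.trees = best)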

The session concluded with a practical exercise of the methods seen.

Day 4

After answering a few questions, in the first session the presenters discussed support vector machines (SVMs), which construct a separating boundary between two (or more) classes in a dataset. The boundary is chosen to maximize the distance (margin) between the classes, and the observations closest to it are the support vectors.

SVMs can be generalized to non-linear problems by rethinking the formulation and the function to be optimized, typically through kernel functions. There is also a variant for regression (SVR). To illustrate this, the presenters tested some variants in RStudio and compared the results.
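A minimal sketch with the e1071 package of a non-linear SVM classifier and a support vector regression; the radial kernel, the iris data, and the synthetic curve are illustrative choices rather than the examples used in the course.

    library(e1071)
    set.seed(5)

    # Classification with a radial (non-linear) kernel
    svc <- svm(Species ~ ., data = iris, kernel = "radial", cost = 1)
    mean(predict(svc, iris) == iris$Species)       # in-sample accuracy

    # Support vector regression (SVR) on a simple synthetic curve
    x <- seq(-3, 3, length.out = 200)
    y <- sin(x) + rnorm(200, sd = 0.1)
    svr <- svm(y ~ x, type = "eps-regression", kernel = "radial")
    plot(x, y); lines(x, predict(svr), col = "blue")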

In the last session of the course, some real applications were presented, such as modifying a payments database so that researchers can use it without losing information, while making it different enough to avoid privacy problems, that is, so that a particular bank cannot be identified from the data. The method used was a variational autoencoder with differential privacy, and a question-and-answer space was provided to discuss the model. Another application of autoencoders is detecting anomalies, such as out-of-the-ordinary payments.

A model used to classify self-proclaimed sustainable funds was then discussed, in which unsupervised models were used to detect clusters. In both cases there were problems with the initial results and adjustments had to be made, and the speakers mentioned that this is very common.

They concluded by recalling that machine learning should not replace existing tools, and that a large part of the subject remains to be explored.

 

Gerardo Hernández-del-Valle

Centro de Estudios Monetarios Latinoamericanos

Dr. Gerardo Hernández del Valle is an electrical engineer by training, with graduate studies in Probability and Statistics. After completing his doctorate at Columbia University in New York City, he was a professor at the same institution, as well as a consultant at the financial firm Algorithmic Trading Management. Upon returning to Mexico in 2012, he joined the Dirección General de Investigación Económica, where he worked on economic topics and their interactions with financial agents and assets. He later worked for several years as a Portfolio Manager and Quantitative Analyst at the brokerage house (Casa de Bolsa) of Actinver in Mexico City. In January of this year, he joined CEMLA.

Gabriela Alves Werb

Deutsche Bundesbank

Gabriela Alves Werb is a professor of Business Information Systems at the Frankfurt University of Applied Sciences. She has been affiliated with the Research Data and Service Centre of the Deutsche Bundesbank since August 2020. Prior to her PhD studies at Goethe University Frankfurt, she led cross-functional projects and teams in various functions at IBM and Hays for several years. Her research interests lie at the interface of marketing, finance, and information systems.

Sebastian Seltmann

Deutsche Bundesbank

Sebastian Seltmann has been a Data Scientist at the Research Data and Service Centre of the Deutsche Bundesbank since 2020. His responsibilities include the dataset of large loans in Germany as well as the development of tools and pipelines to improve data workflows.

Before that he was a software engineer and application owner at a private German bank.