

Seminar on Big Data and Data Science Applications and Development at Central Banks


This seminar was jointly organized by CEMLA and the Banco de España and was held in virtual format from June 1 to 3, 2021. Its main objective was to provide a forum where experts could present recent advances in Big Data and Data Science at central banks and exchange experiences and ideas.

In recent years, Big Data and Data Science have been used more and more frequently in different areas of central bank activity, such as accounting, administration, communication, economics and finance. In the latter two, they are intended to contribute to better monitoring of economic and financial activity, through timely indicators of its evolution and early warning indicators of possible risks, as well as indicators that allow authorities to evaluate the impact of their policy decisions on the economy and the financial system.

The seminar opened with a presentation by an IMF official, who provided a general framework on the relevance that the digital economy has acquired as a generator of huge volumes of digital information. The presentation highlighted how Internet platforms have become new information sources with very relevant characteristics in terms of coverage, granularity, geographic reference and variety of variables, among others, and how they capture the behavior of consumers, companies, financial institutions and government entities. Among the challenges facing the use of Big Data, it stressed the importance of identifying and establishing “best practices” for the statistical methodologies and techniques used, in order to achieve the quality, precision and timeliness required of the information obtained. The other presentations on the seminar's agenda were grouped into four main topics covering work carried out in various areas of central banks.

Applications of data science in statistical output and quality control
The Banco de México presentation highlighted how the use of online price information has grown, both for price measurement and for research. It analyzed the behavior of the prices of goods sold by retailers that operate both physical stores and online channels. Some of the results indicate that the prices of the goods considered change more frequently in stores than in the online channel; that, for a given price change, the magnitude of the change is greater in the online channel than in the store; and that these results were not affected by the COVID-19 pandemic.
The Banco de España presented the results of two studies: i) Using the Family Financial Survey, the first study seeks a statistical classification model capable of predicting whether an interviewed household needs to be re-contacted to ask again about some key parts of the survey, in order to avoid discarding entire questionnaires and to maintain the representativeness of the sample and the quality of the final data. Preliminary results suggest that machine learning techniques can provide a robust methodology for generating a re-contact score, making the work of the review team more efficient; and, ii) A use case in the Central de Balances (Central of Firms' Balance Sheets). The objective is to use machine learning techniques to classify and clean questionnaires and to impute missing values. The results indicate that it is feasible to design algorithms that achieve those objectives. One of the lessons learned is that the knowledge of accounting experts should be incorporated into the design of the algorithms.
The Banco Central del Uruguay presented what it identified as the first steps and challenges in deepening the use of Data Science in Uruguay's national accounts statistics. The work under way includes the evaluation of various data sources, such as the electronic invoices of the General Directorate of Taxes, Google Trends and mobility reports; the generation of data through web scraping; the acquisition of experience in handling and processing large amounts of data; and the establishment of a minimal, automated infrastructure. The main aim is to meet the need for high-frequency, real-time information for decision-making, in addition to meeting the challenges posed by the COVID-19 pandemic.
The Banco Central de Chile presented a paper on the automatic classification of entry descriptions (“glosas”) in firms' balance sheets. Using data processing techniques and machine learning algorithms, they obtain an automatic classification of company expenses into various categories of goods and services, which is used in updating sectoral production functions. The resulting cost database serves as a reference in evaluating the production functions being developed for the compilation of Chile's national accounts with base year 2018.

Applications of natural language processing techniques
The Banco de México presented the results of a study that uses data analytics and machine learning tools to generate indicators for monitoring the evolution of labor demand in Mexico. They develop two indicators: the Index of Jobs Print Ad, with quarterly frequency; and the Index of Jobs Electronic Ads, with weekly frequency. The results show a decreasing trend in the use of print media for job advertisements, with cyclical fluctuations around the trend.
The Banco de España presented three studies: i) Development of a new sentiment indicator based on newspaper reports (DENSI). Confidence indicators are used in exercises that forecast the evolution of economic activity. In Spain, those exercises use the Economic Sentiment Indicator (ESI) published by the European Commission. The COVID-19 pandemic affected the effectiveness of the ESI, creating the need for accurate and more timely indicators to predict economic activity in the short term (nowcast). The results indicate that the new DENSI indicator is better than the ESI as a leading indicator of economic activity in Spain; ii) Application of text mining to climate change risk data analysis. The Task Force on Climate-related Financial Disclosures (TCFD) developed a series of recommendations to promote the disclosure, in firms' annual financial statements, of information on the potential impacts of climate change, so that investors can assess climate-related risks and opportunities. The study uses natural language processing and machine learning techniques to develop an index of firms' compliance with the TCFD recommendations. The results indicate that firms have made progress in complying with the recommendations; iii) Applications of data science in central bank communication. This paper seeks to quantify two aspects of central bank communication: a) the attention paid by the central bank to international affairs; and, b) the alignment of interests between the central bank and the market. It uses the press conferences of the European Central Bank (ECB) and the US Federal Reserve (Fed), distinguishing between the topics of the executive summary of the conference (central bank) and those of the questions from journalists (market). Preliminary results suggest that the ECB is more oriented towards international issues than the Fed; and that, at particular economic junctures, the messages that central banks want to convey coincide with those of interest to the market, although at other times the market shows other interests.
The Banco Central de Chile presented the preliminary results of the development of a sentiment index based on press news (IS-NEWS), whose objective is to provide a real-time indicator that complements the “traditional” statistics in the analysis of economic activity. They used web scraping and text mining techniques to develop the indicator. The IS-NEWS shows high correlations with consumer and business confidence indices, as well as with global (GDP) and sectoral economic activity indicators. It also anticipates shocks to the Chilean economy by around 3 to 4 weeks.
The presentation of the Banco Central de Costa Rica focused on the use of microdata in the generation of macroeconomic indicators. In recent years, microdata databases covering various aspects of some economic variables have been developed in Costa Rica. This has made it possible to obtain information classified by institutional sector, activity sector, company size, export regime, number of workers, salaries paid and geo-referencing, among other characteristics. By linking the microdata databases, it has been possible to carry out studies on, among other topics, the effects of joining the supply chains of multinationals and the regionalization of the Costa Rican input-output matrix.

Data science developments in data labs
The Banco Central de Chile (BCCH) presented its process for adopting a Big Data platform. The BCCH receives two tax documents that constitute the largest administrative record databases it has ever received: the Electronic Invoice, which contains detailed information on transactions between all companies; and the Bill of Sale of Goods and Services, which provides details of sales to final consumers. The information available opens the door to compiling new and better statistics and to strengthening research activity. To obtain the tools for storing and exploiting large volumes of data, they prepared a staged development and implementation program covering the period from July 2021 to the third quarter of 2022.
The Banco de España presented on the use of software tools for confidentiality and output control. Due to national privacy laws, microdata that allow the re-identification of individuals or companies cannot be disclosed, as this would imply the disclosure of confidential information. The objective of statistical disclosure control is to minimize the risk of disclosure while maximizing the usefulness of the information when publishing microdata or tabular data. They presented the results of their confidentiality control exercise, highlighting that software packages are essential for the anonymization of data sets.
The Banco Central do Brasil presented its S-LAB project, which is aimed at organizing data science processes. The project involved the creation of Datalabs, the provision of development platforms, the creation of an Analytical Intelligence Laboratory, the provision of a dedicated server for analytical intelligence applications, and an employee training program.

Applications in other areas of central banking
The Banco de España presented three studies: i) Use of machine learning techniques in supervisory activities. In particular, they developed a tool for the treatment and review of files (TyREX). The results indicate that algorithms can automate the identification of files that do not comply with certain rules. Although this tool does not replace the supervisory analyst, it increases productivity by serving as a basis for obtaining evidence of non-compliance by supervised entities; ii) A tool for forecasting GDP in the context of the pandemic. To build it, they construct indicators of restrictions and mobility in each autonomous region (RA) based on press news. They select a set of demand, productive activity and foreign sector indicators for each RA, and the national indicator is calculated as a weighted average of the regional indicators. With these indicators, they estimate the relationship between mobility and the fall in economic activity during the pandemic. The results indicate that in 2021 mobility almost perfectly explains the behavior of economic activity. Based on scenarios for the evolution of restrictions and mobility, the model also generates forecasts for the economic activity indicator, which are then translated into GDP; and, iii) Application of machine learning techniques to the study of bank note quality. The European Central Bank (ECB) has established guidelines on the criteria to take into account (dirt, stains, wrinkles, tears and mutilations, among others) when deciding whether a banknote is "Suitable" or "Not suitable" to remain in circulation. This classification of banknotes by state of use is carried out by "sorting machines". Using machine learning techniques, they develop banknote image analysis modules that help verify that the classification made by the sorting machines complies with the ECB guidelines. The results indicate that the software tool developed can be used as a means of checking the classification carried out by the machines.
The work presented by the Banco Central de la República Dominicana (BCRD) aims to establish the impact on the expectations of economic agents of economic uncertainty, of an explicit communication of the balance of risks, and of the decisions of the monetary authorities. They use text mining algorithms to construct a metric of international uncertainty and to extract the underlying tone of the policy statements issued by the BCRD. The inputs used are high-frequency news (daily/hourly) and the BCRD's policy releases. The results indicate that aligning the central bank's communications with its policy decisions makes it possible to minimize monetary surprises, facilitating the transmission of monetary decisions to the objectives pursued.
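As an illustration of the aggregation step mentioned above for the Banco de España's mobility study, where the national indicator is a weighted average of the regional indicators, the following is a minimal sketch in Python. The summary does not say which weights or regions were used, so the function name, the region labels, the values and the weights below are all hypothetical.

```python
def national_indicator(regional_values, weights):
    """Return the weighted average of regional indicators.

    Both arguments are dicts keyed by region name. The weights are
    hypothetical (for example, regional shares of national GDP) and
    are normalized here, so they do not need to sum to one.
    """
    if regional_values.keys() != weights.keys():
        raise ValueError("values and weights must cover the same regions")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(regional_values[r] * weights[r] for r in regional_values)
    return weighted_sum / total_weight


# Hypothetical regional activity indicators (percent change) and weights.
regional_activity = {"Region A": -4.0, "Region B": -1.0, "Region C": -2.5}
regional_weights = {"Region A": 0.5, "Region B": 0.3, "Region C": 0.2}

national = national_indicator(regional_activity, regional_weights)
print(f"National indicator: {national:.2f}")
```

Normalizing by the sum of the weights keeps the result a proper average even when the weights are given as raw shares rather than fractions that add up to one.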


Tuesday, June 1st

Opening of the Seminar
CEMLA - Banco de España

Digital economy and Big Data
Gabriel Quirós, International Monetary Fund

Session 1: Applications of data science in statistical output and quality control
Moderator: Diego Solorzano, Banco de México

Stylised facts from consumer prices of multichannel retailers in Mexico
Diego Solorzano, Banco de México

Predicting the need to recontact in household survey data: a machine learning approach
Nicolás Forteza, Banco de España

The role of Data Science in National Accounts statistics. First steps and challenges for its further development in Uruguay
Fernando Barbeito Ruiz Díaz, Banco Central del Uruguay

Automatic classification of entry descriptions (“glosas”) in firms’ balance sheets using machine learning for the updating of sectoral production functions in Chile
Joaquín Pérez, Banco Central de Chile

Machine learning techniques applied to the allocation and quality control of accounting microdata
Natividad Pérez, Banco de España


Wednesday, June 2nd

Session 2: Applications of natural language processing techniques
Moderator: Alberto Urtasun, Banco de España

Job advertisement index based on online newspaper advertisements
León Fernández, Banco de México

A new sentiment indicator based on newspaper reports. Its use during the current crisis
Matías Pacce, Banco de España

Use of press releases as an indicator of activities in real time
Mª del Pilar Cruz Novoa, Banco Central de Chile
Hugo Peralta, Banco Central de Chile
Juan Pablo Cova, Banco Central de Chile

Application of text mining to climate change risk data analysis
Teresa Caminero, Banco de España
Ángel Iván Moreno, Banco de España

Applications of data science in central bank communication
Marina Diakonova, Banco de España

Use of microdata to generate macroeconomic indicators
Carlos Brenes Soto, Banco Central de Costa Rica


Thursday, June 3rd

Session 3: Data science developments in data labs
Moderator: Manuel Ortega, Banco de España

Adoption of a Big Data platform in the Central Bank of Chile
Viviana Rosales, Banco Central de Chile

Software tools for confidentiality and output control
Eugenia Koblents, Banco de España

Organisation of data science processes: S-Lab project
Marco F. Rocha Menezes, Banco Central do Brasil

Session 4: Applications in other areas of central banking
Moderator: Lisette J. Santana, Banco Central de la República Dominicana

Dossier processing and review (TyREX)
Bruno Coutinho, Banco de España
Pablo Yoldi, Banco de España

Uncertainty, monetary policy management and entropy of macroeconomic expectations: An approach based on text mining algorithms and neural networks
Lisette J. Santana, Banco Central de la República Dominicana
Juan Quiñonez Wu, Banco Central de la República Dominicana

Relationship between pandemic lockdown measures, mobility and economic activity
Samuel Hurtado, Banco de España
José Luis Herrera, Banco de España

Application of machine learning techniques to the study of bank note quality
Eduardo Kropnick, Banco de España

Closing of the Seminar


Banco de España
Nicolás Forteza
Natividad Pérez
Matías Pacce
Teresa Caminero
Ángel Iván Moreno
Marina Diakonova
Eugenia Koblents
Bruno Coutinho
Pablo Yoldi
Samuel Hurtado
José Luis Herrera
Eduardo Kropnic

Banco Central do Brasil
Marco F. Rocha Menezes

Banco Central de Chile
Joaquín Pérez
Ma. del Pilar Cruz Novoa
Hugo Peralta
Juan Pablo Cova
Viviana Rosales

Banco Central de Costa Rica
Carlos Brenes

Banco de México
Diego Solorzano
León Fernández

Banco Central de la República Dominicana
Lisette J. Santana
Juan Quiñonez Wu

Banco Central del Uruguay
Fernando Barbeito Ruiz Díaz

International Monetary Fund
Gabriel Quirós


Session 1: Diego Solorzano, Banco de México

Session 2: Alberto Urtasun, Banco de España

Session 3: Manuel Ortega, Banco de España

Session 4: Lisette J. Santana, Banco Central de la República Dominicana