Embracing CI/CD workflows for building ETL pipelines

How we will gather and monitor multi-source, spatially interpolated meteorological parameters in near-real time

10:00 11/11/2023

Up-to-date measurements of surface meteorological variables are essential for monitoring weather conditions, their spatio-temporal variability, and their potential effects on a wide range of sectors and applications. Moreover, when included in continuous records of historical observations spanning several decades, they become essential for assessing long-term climate variability and change at local and regional levels.

Automated pipelines capable of retrieving and processing near-real-time meteorological data satisfy a primary prerequisite for developing and advancing effective, operational climate services.

With a public and operational near-real-time monitoring web platform in mind, we present automated pipelines to collect and process up-to-date daily temperature and precipitation records for Trentino-South Tyrol (Italy) and surrounding areas, and to derive their spatially interpolated fields at sub-km scale. Our pipelines are composed of multiple steps: data download, sanity checks, reconstruction of missing daily records, integration into the historical archive, spatial interpolation, and publication to online FAIR catalogues as (openEO) “datacubes”. The different APIs, data formats and structures across the various data sources, and the need to merge the data into harmonized meteorological layers, make this a typical case of a so-called Extract, Transform and Load (ETL) pipeline. To follow the principles of data reproducibility and Open Science, we embraced open-source automated workflow management through GitLab’s Continuous Integration / Continuous Delivery (CI/CD) capabilities.
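As a rough illustration of the sanity-check and missing-record reconstruction steps, here is a minimal Python sketch. It is not the authors' implementation: the plausibility bounds, the function names `sanity_check` and `fill_gaps`, and the use of linear interpolation in time are all assumptions made for the example. An operational reconstruction would more likely draw on neighbouring stations.

```python
from datetime import date, timedelta

# Hypothetical plausibility bounds for daily mean temperature (deg C).
T_MIN, T_MAX = -60.0, 60.0


def sanity_check(records):
    """Return a copy of records with implausible values replaced by None."""
    return {
        day: (value if value is not None and T_MIN <= value <= T_MAX else None)
        for day, value in records.items()
    }


def fill_gaps(records, start, end):
    """Fill missing days between start and end (inclusive).

    Assumption: gaps are filled by linear interpolation between the nearest
    valid days on either side; gaps at the edges are left as None.
    """
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    values = [records.get(day) for day in days]
    valid = [i for i, value in enumerate(values) if value is not None]
    filled = list(values)
    for i, value in enumerate(values):
        if value is not None:
            continue
        prev_i = max((j for j in valid if j < i), default=None)
        next_i = min((j for j in valid if j > i), default=None)
        if prev_i is not None and next_i is not None:
            weight = (i - prev_i) / (next_i - prev_i)
            filled[i] = values[prev_i] + weight * (values[next_i] - values[prev_i])
    return dict(zip(days, filled))


# Made-up example values: the second day fails the range check and is rebuilt.
raw = {date(2023, 11, 1): 4.0, date(2023, 11, 2): 999.0, date(2023, 11, 3): 8.0}
checked = sanity_check(raw)
series = fill_gaps(checked, date(2023, 11, 1), date(2023, 11, 3))
```

Keeping each step as a small pure function makes it straightforward to run every step as a separate pipeline job and to test it in isolation.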

CI/CD workflows greatly help in managing the relatively complex graphs of tasks required for our climate application, ensuring seamless orchestration with thorough flow monitoring, application logs, transaction rollbacks, and exception handling in general. Native pipeline-oriented software development also fosters a clean separation of roles among the tasks and a more modular architecture. This effectively reduces barriers to collaborative development and paves the way for robust operational climate services for researchers and decision makers in the face of a changing climate.
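The idea of a task graph with ordered execution and failure handling can be sketched in plain Python. This is only an analogy for what a CI/CD runner does: the stage names in `STAGES` and the `run_pipeline` helper are hypothetical, chosen to mirror the steps listed in this abstract. In GitLab CI the same dependencies would typically be declared with the `stages` and `needs` keywords of the pipeline configuration rather than coded by hand.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task graph: each stage maps to the set of stages it depends on.
STAGES = {
    "download": set(),
    "sanity_checks": {"download"},
    "gap_filling": {"sanity_checks"},
    "archive_integration": {"gap_filling"},
    "interpolation": {"archive_integration"},
    "publication": {"interpolation"},
}


def run_pipeline(tasks, graph=STAGES):
    """Run tasks in dependency order and stop at the first failure.

    Returns (completed_stage_names, error_message_or_None).
    """
    completed = []
    for name in TopologicalSorter(graph).static_order():
        try:
            tasks[name]()
        except Exception as exc:
            # Downstream stages are skipped, so nothing is published
            # from a partially processed day.
            return completed, f"{name} failed: {exc}"
        completed.append(name)
    return completed, None


def failing_interpolation():
    raise RuntimeError("no valid stations")  # simulated failure for the demo


# Demo: every stage is a no-op except interpolation, which fails.
tasks = {name: (lambda: None) for name in STAGES}
tasks["interpolation"] = failing_interpolation
completed, error = run_pipeline(tasks)
```

Stopping before `publication` when an upstream stage fails reflects the same design goal as the rollback and exception handling described above: the public catalogue is never updated from an incomplete run.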
