Having fun with weather data

Weather data is one of the most interesting data sources available to a Data Scientist.

One of the first ETLs that I ever coded when starting to work as a Data Scientist was a daily extraction of current and predicted weather data. This data would later be processed into features for a model predicting brick-and-mortar shop sales.

Weather data is easily available. If you are based in Catalonia, for example, meteo.cat, the Catalan government's meteorological service, offers a very good REST API plus a very nice network of weather stations. The equivalent for the Spanish government is AEMET, also with an open REST API.

Other countries also have their own agencies, such as the Met Office for the UK or Météo France. Finally, when in need of worldwide data, OpenWeather offers an API that is a great choice.

All of the above offer either free access in exchange for fair use, sometimes after registration, or a pricing model for integrating data into commercial applications.

How much water in your pluviometer?

There are two limitations that appear soon after you start working with weather data:

  • Historic data going back several years is usually not available.
  • Observations are drawn from a network of stations that are often quite far apart from each other.

On one hand, historic data is important as training data.

On the other hand, proximity to the weather station is not important for many applications. For example, temperature and atmospheric pressure tend to be quite stable over a given geographic area. Rainfall, however, is incredibly volatile, at least in the corner of the Mediterranean where I live. It could be pouring down at home while the vegetable plot, a couple of kilometers away, stays absolutely dry.

Taking rainfall into account is critical for making agricultural decisions. Of course, we need to know how much rain has fallen before deciding how much water to supply to each crop. More subtly, rainfall is related to the development of common pests, so tracking it is important when deciding when to apply treatments.

For lack of a better option, after a rainy day it is very common to run to the pluviometer in the garden, plot or vineyard to find out how much rain fell last night; the amount will then be brought up in conversation, often with remarks on the differences between nearby locations.

Since I am passionate about data and have an itch to automate things, it was only a matter of time before I started looking for nearby stations to track local conditions.

That’s how I came across meteobeguda.cat, a website exposing data from a privately owned weather station located in La Beguda Alta, only 3 kilometers away from home. After seeing that current data can be downloaded as a text file from the website, and after obtaining permission from the owner, I set out to code a small ETL, both as a backup of historic weather data and as a way to easily access the data for my own tasks.

Designing a very small ETL

A single weather station will typically register one data point with instrument readings every 15 minutes. This adds up to 96 data points per day. By current standards, this is very small data. Therefore, there is no need for powerful cloud computing, big data warehouses or spreading workloads across large clusters.

For this reason, I went for the cheap options on the menu for building an ETL:

  • A Python script without any parallelization, using requests to fetch and pandas to process new data (see the extraction sketch after this list). While I made sure that the job would be idempotent, I added very limited support for recovering lost runs.

  • Storing extracted data in three formats by committing files to the same repository:

    • Raw: a copy of the original text file without any changes. Mostly for reproducibility purposes.

    • Parsed: the data for a given date, processed into one CSV file per date. For this type of data I would normally have a strong preference for the Apache Parquet format, which has many advantages. However, CSV is a text-based, future-proof format. In other words: in 30 years' time, I am a bit more certain about our ability to read CSV than Parquet.

    • Processed: a SQLite database saved to a file, with a raw data table holding individual data points every 15 minutes from 2020 onwards, plus hourly, daily and monthly aggregated views (a sketch of the daily view appears below). This is the format that enables easy and quick access to clean data via SQL queries. DuckDB would have been an excellent alternative, at the cost of adding an extra dependency to the project; the KISS principle dictated that SQLite was the better choice.

  • Batch execution via GitHub Actions. In other words, for such a small workload, the GitHub Actions runners provide enough free CPU time that there is no need to pay for compute elsewhere.
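
As an illustration of the extraction step, here is a minimal sketch, assuming a hypothetical download URL and local folder layout (the real endpoint and file paths differ):

from datetime import date
from pathlib import Path

import requests

# Hypothetical URL and folder; the real endpoint and repository layout differ.
DOWNLOAD_URL = "https://www.meteobeguda.cat/current.txt"
RAW_DIR = Path("data/raw")


def extract_raw(run_date: date) -> Path:
    # Writing to a date-keyed file keeps the job idempotent: re-running the
    # same day simply overwrites the file with the same content.
    response = requests.get(DOWNLOAD_URL, timeout=30)
    response.raise_for_status()

    target = RAW_DIR / f"{run_date:%Y-%m-%d}.txt"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(response.content)
    return target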

I also committed historic data to the repository, for example as several small SQLite files.
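
As for the aggregated views, here is a sketch of how the daily view could be defined on top of the raw table, with illustrative table, column and file names rather than the project's actual schema:

import sqlite3

# Illustrative schema: table, column and file names are assumptions.
DAILY_VIEW_DDL = """
CREATE VIEW IF NOT EXISTS daily AS
SELECT
    date,
    MIN(temperature_min) AS temperature_min,
    MAX(temperature_max) AS temperature_max,
    AVG(temperature)     AS temperature_mean,
    SUM(rain)            AS rain_total
FROM raw
GROUP BY date;
"""

with sqlite3.connect("weather.sqlite") as connection:
    connection.execute(DAILY_VIEW_DDL)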

The data parser

All of the code is as simple as possible and holds no surprises, with the exception of the data parser, which is perhaps the only part deserving a bit of explanation.

The beginning of a raw data file looks like this.

                 Temp.  Temp.  Temp.  Hum.   Punt   Vel.   Dir.   Rec.  Vel.  Dir.  Sens.   Índ.   Índ.                  Int. Graus D.Graus D.  Temp.  Hum. Rosada  In.Cal.  EMC     Dens.    Mostr   Tx   Recep.  Int.
  Data    Hora    Ext.   Màx.   Mín.  Ext. Rosada   Vent   Vent   Vent  Màx.  Màx.  Tèrm.  Calor   THW    Bar    Pluja   Pluja  Calor    Fred    Int.  Int.   Int.   Int.    Int.  Int.Aire    Vent   Vent   ISS   Arc.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
24/06/25   0:15   23.6   23.7   23.4    60   15.3    1.6    OSO   0.40   4.8     O   23.6   24.0   24.0  1016.1   0.00    0.0    0.000   0.054   28.9    47   16.5   29.7    8.57   1.1507     348     1    100.0   15 
24/06/25   0:30   23.7   23.8   23.7    60   15.5    1.6    OSO   0.40   4.8     O   23.7   24.2   24.2  1016.0   0.00    0.0    0.000   0.056   28.9    47   16.4   29.7    8.57   1.1509     352     1    100.0   15 
24/06/25   0:45   23.6   23.7   23.4    61   15.6    1.6     NO   0.40   4.8    NO   23.6   24.1   24.1  1016.1   0.00    0.0    0.000   0.055   28.8    47   16.4   29.6    8.57   1.1513     351     1    100.0   15

If you squint, after discarding the initial rows and assuming that the column order is preserved, this raw file looks like something that could be fed into a CSV parser. By exploiting the many parameters of pandas.read_csv, while being strict about column data types, it is indeed possible:

from io import BytesIO

import pandas as pd
import numpy as np


COLUMNS = {
    "date": np.dtype("O"),
    "time": np.dtype("O"),
    "temperature": pd.Float64Dtype(),
    "temperature_max": pd.Float64Dtype(),
    "temperature_min": pd.Float64Dtype(),
    "humidity": pd.Int64Dtype(),
    "dew": pd.Float64Dtype(),
    "windspeed": pd.Float64Dtype(),
    "wind_direction": pd.StringDtype(),
    "wind_rec": pd.Float64Dtype(),
    "windspeed_max": pd.Float64Dtype(),
    "windspeed_max_direction": pd.StringDtype(),
    "temperature_feeling": pd.Float64Dtype(),
    "heat_index": pd.Float64Dtype(),
    "thw_index": pd.Float64Dtype(),
    "pressure": pd.Float64Dtype(),
    "rain": pd.Float64Dtype(),
    "rain_intensity": pd.Float64Dtype(),
    "heat_degrees": pd.Float64Dtype(),
    "cold_degrees": pd.Float64Dtype(),
    "temperature_interior": pd.Float64Dtype(),
    "humidity_interior": pd.Int64Dtype(),
    "dew_interior": pd.Float64Dtype(),
    "heat_index_interior": pd.Float64Dtype(),
    "air_density_interior": pd.Float64Dtype(),
    "wind_direction_degrees": pd.Float64Dtype(),
    "tx_wind": pd.Int64Dtype(),
    "iss_reception": pd.Int64Dtype(),
    "arc_interior": pd.Float64Dtype(),
}


def load_bytes(raw: bytes) -> pd.DataFrame:
    # Parse the raw text export from the station into a typed DataFrame.
    return pd.read_csv(
        BytesIO(raw),
        sep=r"\s+",  # columns are separated by runs of whitespace
        header=None,  # no usable header row; column names are supplied below
        skiprows=3,  # two header rows plus the dashed separator line
        encoding="latin1",
        names=COLUMNS.keys(),
        index_col=False,
        dtype=COLUMNS,
        na_values=["---", "------"],  # strings to treat as missing values
    )
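
For a quick sanity check, the parser can be pointed at a saved copy of the raw file, for instance the sample shown above (assuming it has been saved verbatim as sample.txt):

# Assuming the sample above has been saved verbatim as sample.txt.
with open("sample.txt", "rb") as f:
    df = load_bytes(f.read())

print(df[["date", "time", "temperature", "humidity", "rain", "pressure"]])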

Results

This ETL has been running on a daily basis since spring 2022 and has been surprisingly robust.

On the odd day that the weather station is offline at the time of extraction, it is possible to manually re-run the failed pipeline in order to catch up on missing data. When this happens, GitHub sends me a notification email and I re-run the action at my convenience. There is an 8-day window to recover missing data, after which it stops being available on the website.

On a couple of rare occasions there has been slight data corruption related to mismatches between the raw txt file and the data parser. These have been treated as bugs and fixed after adding the relevant unit tests. Looking at the GitHub commit history, one could say that migrating Python package managers (from poetry to rye to uv) has been more of a headache than fixing bugs.

This solution is shaped by the constraints of coding a hobby ETL as cheaply as possible for a very small amount of data. I would carry very little of what has been implemented in this project into a professional production environment, with perhaps the exception of choosing a very simple initial architecture and then introducing complexity in small incremental steps as dictated by need.

A few obvious choices for a more serious setup:

  • Replacing requests with aiohttp to allow asynchronous parallelism (see the sketch after this list). This would be a good idea if we were extracting data from many different stations simultaneously.

  • Replacing pandas with more performant data processing options such as polars or even Apache Spark. The latter is a good idea if the volume of data to be processed warrants moving to a distributed cluster.

  • Replacing the on-disk SQLite database with a dedicated, hosted database server such as PostgreSQL. This would be useful in a variety of situations, for example when multiple users need to query the clean data.

  • Replacing GitHub Actions as a scheduler with a proper task scheduler such as Apache Airflow. This allows for reproducibility of batch runs, automatic re-runs in case of failure, and the introduction of more subtle logic by organizing tasks as a DAG.
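
For the first of these, here is a minimal sketch of what a multi-station extraction could look like with aiohttp (the station URLs are made up):

import asyncio

import aiohttp

# Hypothetical endpoints; the real project reads a single station.
STATION_URLS = [
    "https://station-a.example.com/current.txt",
    "https://station-b.example.com/current.txt",
]


async def fetch_all(urls: list[str]) -> list[bytes]:
    # Download the raw text files from several stations concurrently.
    async with aiohttp.ClientSession() as session:

        async def fetch(url: str) -> bytes:
            async with session.get(url) as response:
                response.raise_for_status()
                return await response.read()

        return await asyncio.gather(*(fetch(url) for url in urls))


raw_files = asyncio.run(fetch_all(STATION_URLS))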

Conclusion

Nowadays, after it rains, I have a choice:

  • Going to the pluviometer.

  • Throwing a SQL query against the SQLite database.


Header photo by Alex Dukhanov on Unsplash.