Building a very small ETL for local weather data

8 September 2025 · python · weather · etl · pandas · csv · sqlite · github actions

Having fun with weather data

Weather data is one of the most interesting data sources available to a Data Scientist.

One of the first ETLs that I ever coded when starting to work as a Data Scientist was a daily extraction of current and predicted weather data. This data would later be processed into features for a model predicting brick-and-mortar shop sales.

Weather data is easily available. If you are based in Catalonia, for example, meteo.cat, the Catalan government's meteorological service, has a very good REST API plus a very nice network of weather stations. The Spanish equivalent is AEMET, also with an open REST API.

Other countries also have their own agencies, such as the Met Office for the UK or Météo France. Finally, when in need of worldwide data, OpenWeather offers an API that is a great choice.

All of the above offer either free access in exchange for fair use, sometimes after registration, or a pricing model for integrating data into commercial applications.

How much water in your pluviometer?

There are two limitations that appear soon after you start working with weather data:

On one hand, historic data is important as training data, and it is not always easy to obtain.

On the other hand, proximity to the weather station is not important for many applications: temperature and atmospheric pressure tend to be quite stable over a given geographic area. Rainfall, however, is incredibly volatile, at least in the corner of the Mediterranean where I live. It can be pouring down at home while the vegetable plot, a couple of kilometers away, stays absolutely dry.

Taking rainfall into account is critical for making agricultural decisions. Of course, we need to know rainfall prior to deciding how much water needs to be supplied to each crop. More subtly, rainfall is related to the growth of common pests, so tracking it is important for deciding when to apply treatments.

For lack of a better option, after a rainy day it is very common to run to the pluviometer in the garden, plot or vineyard to find out how much rain fell last night; the amount is then brought up in conversation, often noting the differences between nearby locations.

Being passionate about data and having an itch to automate things, it was only a matter of time before I started to look for nearby stations to track local conditions.

That’s how I came across meteobeguda.cat, a website exposing data from a privately owned weather station located in La Beguda Alta, only 3 kilometers away from home. After seeing that current data is downloadable as a text file from the website and obtaining permission from the owner, I set out to code a small ETL, both as a backup of historic weather data and as a way to easily access the data for my own tasks.

Designing a very small ETL

A single weather station will typically register a single data point with instrument readings every 15 minutes. This adds up to 96 daily data points. By current standards, this is very small data. Therefore, there is no need for powerful cloud computing, big data warehouses or spreading workloads across large clusters.

For this reason, I went for the cheap options on the menu for building an ETL: plain Python with pandas for the transformations, a scheduled GitHub Actions workflow as the orchestrator, and the Git repository itself as storage.

I also committed historic data to the repository, for example as several small SQLite files.

The data parser

All of the code is as simple as possible and holds few surprises, with the exception of the data parser, which is perhaps the only part deserving a bit of explanation.

The beginning of a raw data file looks like this:

                 Temp.  Temp.  Temp.  Hum.   Punt   Vel.   Dir.   Rec.  Vel.  Dir.  Sens.   Índ.   Índ.                  Int. Graus D.Graus D.  Temp.  Hum. Rosada  In.Cal.  EMC     Dens.    Mostr   Tx   Recep.  Int.
  Data    Hora    Ext.   Màx.   Mín.  Ext. Rosada   Vent   Vent   Vent  Màx.  Màx.  Tèrm.  Calor   THW    Bar    Pluja   Pluja  Calor    Fred    Int.  Int.   Int.   Int.    Int.  Int.Aire    Vent   Vent   ISS   Arc.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
24/06/25   0:15   23.6   23.7   23.4    60   15.3    1.6    OSO   0.40   4.8     O   23.6   24.0   24.0  1016.1   0.00    0.0    0.000   0.054   28.9    47   16.5   29.7    8.57   1.1507     348     1    100.0   15

If you squint, after discarding the initial rows and assuming that column order is preserved, this raw file looks like something that could be fed into a CSV parser. By exploiting the many parameter options of pandas.read_csv while being strict about column data types, it indeed can be:

from io import BytesIO

import pandas as pd
import numpy as np


COLUMNS = {
    "date": np.dtype("O"),
    "time": np.dtype("O"),
    "temperature": pd.Float64Dtype(),
    "temperature_max": pd.Float64Dtype(),
    "temperature_min": pd.Float64Dtype(),
    "humidity": pd.Int64Dtype(),
    "dew": pd.Float64Dtype(),
    "windspeed": pd.Float64Dtype(),
    "wind_direction": pd.StringDtype(),
    "wind_rec": pd.Float64Dtype(),
    "windspeed_max": pd.Float64Dtype(),
    "windspeed_max_direction": pd.StringDtype(),
    "temperature_feeling": pd.Float64Dtype(),
    "heat_index": pd.Float64Dtype(),
    "thw_index": pd.Float64Dtype(),
    "pressure": pd.Float64Dtype(),
    "rain": pd.Float64Dtype(),
    "rain_intensity": pd.Float64Dtype(),
    "heat_degrees": pd.Float64Dtype(),
    "cold_degrees": pd.Float64Dtype(),
    "temperature_interior": pd.Float64Dtype(),
    "humidity_interior": pd.Int64Dtype(),
    "dew_interior": pd.Float64Dtype(),
    "heat_index_interior": pd.Float64Dtype(),
    "air_density_interior": pd.Float64Dtype(),
    "wind_direction_degrees": pd.Float64Dtype(),
    "tx_wind": pd.Int64Dtype(),
    "iss_reception": pd.Int64Dtype(),
    "arc_interior": pd.Float64Dtype(),
}


def load_bytes(raw: bytes) -> pd.DataFrame:
    return pd.read_csv(
        BytesIO(raw),
        sep=r"\s+",           # columns are separated by runs of whitespace
        header=None,          # ignore the file's own header...
        skiprows=3,           # ...by skipping its two lines plus the dashed rule
        encoding="latin1",    # Catalan column names use accented characters
        names=list(COLUMNS),  # impose our own column names
        index_col=False,      # do not turn the first column into an index
        dtype=COLUMNS,        # be strict about column dtypes
        na_values=["---", "------"],  # placeholders for missing sensor readings
    )
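Since date and time come out of the parser as plain strings, a natural follow-up step is to combine them into a proper timestamp. A minimal sketch of that step (the helper name is my own, and the format string is assumed from the sample row above):

```python
import pandas as pd


def with_timestamp(df: pd.DataFrame) -> pd.DataFrame:
    # Combine the raw "date" (DD/MM/YY) and "time" (H:MM) string columns
    # into a single datetime column.
    out = df.copy()
    out["timestamp"] = pd.to_datetime(
        out["date"] + " " + out["time"], format="%d/%m/%y %H:%M"
    )
    return out
```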

Results

This ETL has been working since Spring 2022 on a daily basis and has been surprisingly robust.

On the odd day that the weather station is offline at extraction time, the failed pipeline can be re-run manually to catch up on missing data. When this happens, GitHub sends me a notification email and I re-run the action at my convenience. There is an 8-day window to recover missing data, after which it stops being available on the website.
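That recovery window is easy to express in code. A small sketch of the idea, with names of my own invention (only the 8-day limit comes from the website's retention):

```python
from datetime import date, timedelta


def missing_days(stored: set[date], today: date, window: int = 8) -> list[date]:
    # Raw files only stay on the website for about 8 days, so only
    # dates inside that window can still be backfilled.
    candidates = (today - timedelta(days=n) for n in range(1, window + 1))
    return sorted(d for d in candidates if d not in stored)
```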

On a couple of rare occasions there has been slight data corruption caused by mismatches between the raw txt file and the data parser. These have been treated as bugs and fixed after adding the relevant unit tests. Looking at the GitHub commit history, one could say that migrating Python package managers, from poetry to rye to uv, has been more of a headache than fixing bugs.
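Such regression tests stay small. An illustrative example of the kind of test (not taken from the project's suite), checking that the station's placeholder values parse as missing rather than corrupting a column:

```python
from io import BytesIO

import pandas as pd


def test_missing_values_parse_as_na():
    # The station writes "---" when an instrument has no reading;
    # those cells must come out as NA, not as strings.
    raw = b"1.0 ---\n--- 2.0\n"
    df = pd.read_csv(
        BytesIO(raw),
        sep=r"\s+",
        header=None,
        names=["a", "b"],
        dtype=pd.Float64Dtype(),
        na_values=["---"],
    )
    assert df["a"].isna().tolist() == [False, True]
    assert df["b"].isna().tolist() == [True, False]
```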

This solution is shaped by the decision to go very cheap on a hobby ETL for a very small amount of data. I would carry very little of what has been implemented in this project over to a professional production environment, with perhaps the exception of starting from a very simple architecture and then introducing complexity in small incremental steps as dictated by need.

A few obvious, more serious choices:

Conclusion

Nowadays, after it rains, I have a choice:


Header photo by Alex Dukhanov on Unsplash.
