I gave a talk at the Barcelona PyDay 2020, which took place as an online event on December 12th, 2020.

The title of the talk is Counting votes: analyzing a large dataset with Dask.

The purpose of the talk is to show a use case where, in my opinion, Dask excels: working with datasets that do not fit into a single machine's memory but are still small enough not to require a cluster.

Dask is designed to parallelize widely used Python libraries such as numpy, pandas and scikit-learn, and it introduces very little overhead if you are already familiar with them. For this reason, it does an excellent job in this use case.
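To illustrate how little overhead there is, here is a minimal sketch (the file pattern and column names are made up) of a dask.dataframe computation that reads almost exactly like its pandas equivalent; the main difference is that Dask builds a lazy task graph and only materializes the result when `.compute()` is called.

```python
import dask.dataframe as dd

# Hypothetical CSV files split into several parts; dd.read_csv accepts a glob
# and creates one partition per file instead of loading everything into memory.
df = dd.read_csv("data/part-*.csv")

# The same groupby/sum you would write in pandas, evaluated lazily
# and in parallel across partitions.
result = df.groupby("category")["value"].sum()
print(result.compute())  # only here is any data actually read and processed
```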

Since November was a month in which the world closely watched the events surrounding an election in a faraway country, the opportunity was too good to miss. This is why I took the time to generate a large dataset, one that does not fit into my laptop's memory, to use as an example during the talk. The tasks? Figure out who won, who would have won if late votes were not counted, and check that nobody voted twice.
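The sketch below shows how those three tasks could be expressed with dask.dataframe. It assumes a made-up schema (one row per ballot, with `voter_id`, `candidate` and `timestamp` columns) and a hypothetical cutoff date; it is illustrative only, not the exact code from the talk.

```python
import dask.dataframe as dd
import pandas as pd

# Assumed schema: one ballot per row, with voter_id, candidate and timestamp.
df = dd.read_csv("ballots-*.csv", parse_dates=["timestamp"])

# 1. Who won: count ballots per candidate and take the largest.
winner = df["candidate"].value_counts().compute().idxmax()

# 2. Who won if late votes are not counted: filter on a (hypothetical) cutoff.
cutoff = pd.Timestamp("2020-11-04")
early_winner = (
    df[df["timestamp"] < cutoff]["candidate"].value_counts().compute().idxmax()
)

# 3. Did anybody vote twice? Compare distinct voter ids with total ballots.
total_ballots = len(df)                          # triggers a computation
distinct_voters = df["voter_id"].nunique().compute()
assert distinct_voters == total_ballots
```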

Materials