BCN PyDay 2020: Counting votes with Dask
I gave a talk at Barcelona PyDay 2020, which took place as an online event on December 12th, 2020.
The title of the talk is Counting votes: analyzing a large dataset with Dask.
The purpose of the talk is to show a use case in which Dask excels, in my opinion: working with datasets that do not fit into a single machine's memory but are still small enough not to require a cluster.
Dask is designed to parallelize widely used Python libraries such as numpy, pandas and scikit-learn, and it has very little learning overhead if you are already familiar with them. For this reason, it does an excellent job in this use case.
Since November was a month in which the world closely watched the events surrounding the election that took place in a faraway country, the opportunity was too good to miss. This is why I took the time to generate a large dataset, which does not fit into my laptop’s memory, to be used as an example during the talk. The tasks? Figure out who won, figure out who would have won if late votes were not counted, and check that nobody voted twice.
Materials
- The video recording, published on PyBCN’s YouTube channel.
- The slides, made with beautiful.ai.
- The notebooks for the demo, together with instructions for replicating the demo at home using JupyterLab and its Dask extension.
- The code for dataset generation.