In this post I gather a few comments and give some examples on the usage of data classes in Python.

Data classes

Data classes were introduced in Python 3.7. We could summarize them as a convenient way to represent data, since the @dataclass decorator generates methods such as __init__() and __repr__() for us, so we do not have to write them by hand.

Parameter types are indicated using type annotations, so a data class declaration will look like this:

from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: int
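
As a quick illustration of the generated methods, here is a minimal sketch. The Point class is hypothetical and only serves this example; it shows that __init__(), __repr__() and __eq__() are available without being written by hand.

```python
from dataclasses import dataclass


@dataclass
class Point:  # hypothetical class, used only for this illustration
    x: int
    y: int


p = Point(1, 2)          # __init__() was generated from the annotations
print(p)                 # Point(x=1, y=2), thanks to the generated __repr__()
print(p == Point(1, 2))  # True: the generated __eq__() compares field by field
```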

Type safety in Python

A significant amount of work has been done in recent Python versions in order to provide tools for type safety, and more is to come in future versions.

Personally, I have grown more and more concerned about this issue, especially after learning Scala. Providing type hints and using mypy alongside pytest has become a staple of my TDD, clean-coding workflow.

Python data classes could perhaps be described as a best effort to obtain something that looks like a Scala case class. The analogy works because both allow passing in typed parameters and have automated instantiation. However, the analogy stops here, and this is mostly due to the nature of both programming languages.

With the definition above, the instantiation below is correct.

In [1]: Person("Bertu", 18)
Out[1]: Person(name='Bertu', age=18)

And the following also works without raising an exception:

In [2]: Person("Bertu", "18")
Out[2]: Person(name='Bertu', age='18')

We would have to rely on mypy to point out that age is expected to be an integer.

At the end of the day, Python is dynamically typed and checking the types of attributes in data classes at runtime is just as bad an idea as checking types of variables elsewhere.

The point of the improvements in type hinting is that an external tool, such as mypy, can help here by doing some of the checks that a compiler does in other languages.

In conclusion: do not expect a data class to raise an exception during runtime because it received a string instead of an integer.
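
To make this concrete, the sketch below uses a hypothetical Item class (not part of the example above) and shows that the annotations are only stored as metadata on the fields. Nothing compares them with the values at runtime, whereas a static checker such as mypy should flag the call as passing an incompatible argument type.

```python
from dataclasses import dataclass, fields


@dataclass
class Item:  # hypothetical class, used only for this illustration
    name: str
    quantity: int


# The annotation says int, but nothing checks it at runtime.
item = Item("apple", "three")
print(type(item.quantity).__name__)  # str

# The declared types are kept as metadata and can be inspected.
declared = {f.name: f.type for f in fields(Item)}
print(declared["quantity"] is int)  # True
```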

Python objects are mutable

In Scala, and other functional languages, we speak of values instead of variables, and by this we mean that values are immutable. This is a must-have when manipulating large amounts of data and computing in parallel.

Python data classes can be made to look immutable. More precisely, they may be frozen:

from dataclasses import dataclass

@dataclass(frozen=True)
class Person:
    name: str
    age: int

Now, attribute updates after instantiation will raise a FrozenInstanceError exception. This is the next best thing to having an immutable value, but always remember that Python does not really enforce immutability on such objects.

How would we go about updating a frozen data class after instantiation? The field values live in the instance's underlying __dict__, so it is enough to access that dictionary and update it.

In [3]: b = Person("Bertu", 18)

In [4]: b.age = 20
-----------------------------------------------
FrozenInstanceError                       Traceback (most recent call last)
<ipython-input-4-590a00e1d903> in <module>
----> 1 b.age = 20

<string> in __setattr__(self, name, value)

FrozenInstanceError: cannot assign to field 'age'

In [5]: b.__dict__["age"] = 20

In [6]: b
Out[6]: Person(name='Bertu', age=20)

Admittedly, in practice it is a good idea to think of frozen data classes as immutable, given that one has to go well out of one's way in order to cheat.
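
When an updated value is genuinely needed, there is no need to cheat: the standard library provides dataclasses.replace(), which builds a new instance instead of modifying the existing one. A minimal sketch, using a hypothetical Coordinates class:

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Coordinates:  # hypothetical class, used only for this illustration
    x: int
    y: int


p = Coordinates(1, 2)

try:
    p.x = 10  # ordinary assignment is blocked on a frozen instance
except FrozenInstanceError as error:
    print("refused:", error)

# replace() returns a new instance and leaves the original untouched.
moved = replace(p, x=10)
print(p)      # Coordinates(x=1, y=2)
print(moved)  # Coordinates(x=10, y=2)
```

This is closer in spirit to how one works with values in a functional language: the old value stays valid and a new one is derived from it.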

Storing logic in data classes

Data classes generate __init__() automatically, and the generated method calls __post_init__(), if the class defines one, right after the fields have been set. This makes __post_init__() an interesting hook that runs on every instantiation.

For me, this method can be used to validate inputs and generate any extra data that can be deduced from the provided parameters.

In a recent project, I decided to use this method in several places. As an example, at some point I needed to encode which days in a month have been taken off by a person. These can be:

  • Weekends: given as a list of which days in the week are taken off regularly; typically Saturdays and Sundays.
  • Holidays: either bank holidays or vacations.

I defined a data class describing this structure and performing a post-init validation as follows:

from dataclasses import dataclass, field
from typing import List

@dataclass
class DatesOff:
    weekdays: List[int] = field(default_factory=list)
    holidays: List[int] = field(default_factory=list)

    def __post_init__(self):
        if any([wd not in range(1, 8) for wd in self.weekdays]):
            raise ValueError(
                "Weekdays off not between 1 and 7: %s" % str(self.weekdays)
            )
        if any([d not in range(1, 32) for d in self.holidays]):
            raise ValueError(
                "Holidays given are not between 1 and 31: %s" % str(self.holidays)
            )
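
To see the validation in action, the sketch below repeats the class so that it runs on its own (the checks are the same, only written with generator expressions) and then builds one valid and one invalid instance. The sample values are made up for the illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatesOff:
    weekdays: List[int] = field(default_factory=list)
    holidays: List[int] = field(default_factory=list)

    def __post_init__(self):
        # Same checks as above, written with generator expressions.
        if any(wd not in range(1, 8) for wd in self.weekdays):
            raise ValueError(
                "Weekdays off not between 1 and 7: %s" % str(self.weekdays)
            )
        if any(d not in range(1, 32) for d in self.holidays):
            raise ValueError(
                "Holidays given are not between 1 and 31: %s" % str(self.holidays)
            )


valid = DatesOff(weekdays=[6, 7], holidays=[1, 25])
print(valid)  # DatesOff(weekdays=[6, 7], holidays=[1, 25])

try:
    DatesOff(weekdays=[6, 8])  # 8 is not a valid weekday number
except ValueError as error:
    print(error)  # Weekdays off not between 1 and 7: [6, 8]
```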

One important caveat: if you freeze a data class, you will not be able to set attributes in __post_init__() with ordinary assignment such as self.x = value, since that also raises a FrozenInstanceError.
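
A common way around this, sketched below with a hypothetical Rectangle class, is to declare the derived attribute as a field with init=False and set it through object.__setattr__(), which bypasses the frozen check. Treat it as one possible approach rather than the only one.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rectangle:  # hypothetical class, used only for this illustration
    width: int
    height: int
    area: int = field(init=False)  # derived, so not an __init__() parameter

    def __post_init__(self):
        if self.width < 0 or self.height < 0:
            raise ValueError("Sides must be non-negative")
        # "self.area = ..." would raise FrozenInstanceError here, so we go
        # through object.__setattr__(), which skips the frozen check.
        object.__setattr__(self, "area", self.width * self.height)


r = Rectangle(3, 4)
print(r)  # Rectangle(width=3, height=4, area=12)
```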

Note: Python 3.9, coming out very soon, implements PEP 585, which adds support for generics in the built-in collection types. In particular, it will be possible to stop importing typing.List and use list directly as a type hint:

weekdays: list[int]

The same applies to dict and a few other frequently used types.
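
As a small sketch of what this looks like in a data class, assuming Python 3.9 or later (the Calendar class and its fields are hypothetical):

```python
from dataclasses import dataclass, field


@dataclass
class Calendar:  # hypothetical class; needs Python 3.9 or later
    days_off: list[int] = field(default_factory=list)
    notes: dict[int, str] = field(default_factory=dict)


calendar = Calendar(days_off=[6, 7], notes={25: "bank holiday"})
print(calendar)  # Calendar(days_off=[6, 7], notes={25: 'bank holiday'})
```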

Conclusion

I would say that data classes are a very handy tool when it comes to representing and manipulating data and, in many cases, they will be the right tool to get the job done.

Their limitations are defined by the limitations of Python as a programming language, in particular dynamic typing and mutability.

References