Hashing documents during the Cookiepocalypse
The FLoC proposal
As part of rolling out the Privacy Sandbox, Google have announced their intention to phase out third-party cookies in Chrome. The strategy is to replace each of the functions served by third-party cookies in web advertising, one by one, with individual mechanisms that preserve user privacy.
Federated Learning of Cohorts (FLoC) is the replacement for cross-site user tracking on Google Chrome. In order to enable targeting based on browsing interests without cookie identifiers, the user’s browser computes a cohort identifier, that is: a hash of the browsing history.
We expect the following to happen:

- Users with similar browsing behaviours end up having the same hash, so it is still possible to show relevant ads by targeting a given cohort id.
- Cohorts contain enough users to allow a "hiding in the pack" effect, where an individual's activity is not distinguishable.
Enter locality-sensitive hashing techniques
The candidate algorithm to achieve the above goal is SimHash, which is commonly used to detect near-duplicates by search engines.
After vectorizing a user's browsing history, the p-bit SimHash algorithm works in the following way:

- Choose p random unit vectors in feature space. These determine p hyperplanes passing through the origin, by relating each hyperplane to its normal vector.
- These p hyperplanes determine a partition of space into 2^p regions.
- Encode each of the above regions as an integer in base 2 (i.e. pick an ordering of the p unit vectors).
- The hash of a vector (i.e. a user's browsing history) is determined by the region to which it belongs.
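The random-hyperplane construction above can be sketched in a few lines of numpy. This is a toy illustration of the technique, not the floc-simhash implementation; the choice of p and the input vectorization are placeholders:

```python
import numpy as np

def simhash(vector, p=8, seed=0):
    """p-bit SimHash of a feature vector via random hyperplanes.

    Each of the p random directions defines a hyperplane through the
    origin; the side on which `vector` falls gives one bit of the
    hash, so the hash encodes one of the 2^p regions.
    """
    rng = np.random.default_rng(seed)
    normals = rng.standard_normal((p, len(vector)))  # p normal vectors
    bits = normals @ np.asarray(vector, dtype=float) >= 0
    # Interpret the p sign bits as an integer in base 2
    return sum(1 << i for i, b in enumerate(bits) if b)
```

Note that two vectors pointing in similar directions fall on the same side of most hyperplanes, so similar browsing histories tend to collide into the same cohort id.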
An implementation in Python
When we set out to study the impact of replacing cookies with cohorts in our Machine Learning pipelines at Hybrid Theory, I was unable to find a pure-Python implementation of the SimHash algorithm that served our purposes. So we coded and open-sourced one, publishing it as a Python package under the name floc-simhash.
This implementation has allowed us to conduct preliminary analysis on our own datasets while we wait for Google's Origin Trial to begin later this month (March 2021).
While waiting for details regarding the final form of the FLoC implementation, we have provided two implementations of SimHash:
- One based directly on bitwise arithmetic of the md5 hashes of individual tokens, aimed at hashing any given text document.
- One designed as a scikit-learn pipeline that works on a document vectorization such as one-hot or tf-idf, taking advantage of the vectorization power of numpy arrays and scipy sparse matrices.
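To give a flavour of the first, md5-based approach, here is a minimal sketch of the classic token-voting SimHash for text. It illustrates the general technique only; floc-simhash's actual tokenization and bit width may differ:

```python
import hashlib

def document_simhash(text, n_bits=64):
    """Token-based SimHash: each token's md5 digest casts a +1/-1 vote
    per bit position; the final hash keeps the majority sign of each
    bit. Near-duplicate documents share most tokens, hence most bits.
    """
    counts = [0] * n_bits
    for token in text.lower().split():
        digest = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(n_bits):
            counts[i] += 1 if (digest >> i) & 1 else -1
    # Majority vote per bit (ties resolve to 1)
    return sum(1 << i for i, c in enumerate(counts) if c >= 0)
```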
For more details and examples, please refer to the README.
Open questions
While we wait for implementation details, the most recent news point towards:

- A hashing algorithm combining SimHash with the so-called SortingLSH technique, which consists of cropping the number of bits in the hash in order to have fewer cohorts. The resulting hashes lie around 50-bit precision.
- User browsing history being constrained to the previous seven days.
- An implementation of FLoC computed directly in the user's browser, so that browsing history need not be shared with a central API for the purposes of computing cohort ids.
These, however, will be subject to change depending on the Origin Trial results.
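A rough sketch of the SortingLSH idea, under the (assumed) reading that sorted SimHash values are chopped into contiguous groups so that every cohort reaches a minimum size:

```python
def sorting_lsh_cohorts(simhashes, min_size):
    """Group sorted SimHash values into contiguous cohorts of at least
    min_size members. Nearby hashes share high-order bits, so a
    contiguous range of the sorted list approximates a set of users
    with similar browsing histories.
    """
    cohorts, current = [], []
    for h in sorted(simhashes):
        current.append(h)
        if len(current) >= min_size:
            cohorts.append(current)
            current = []
    if current:
        # Fold the undersized remainder into the last cohort
        if cohorts:
            cohorts[-1].extend(current)
        else:
            cohorts.append(current)
    return cohorts
```

The minimum cohort size would provide the "hiding in the pack" guarantee, at the cost of hash precision.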
More generally, the question regarding the eventual adoption of FLoC remains.
In the next few months, we aim to determine whether the cookiepocalypse forces a rethink of the whole approach to selling ads online or whether, as claimed in the original proposal, FLoCs are performant enough for the cross-targeting approach to survive on Google Chrome.
References
- GitHub repository, containing installation instructions and examples.
- PyPI package. Install with pip install floc-simhash.
- Post in our engineering blog at Hybrid Theory.
- Company announcement at LinkedIn.
- The repository for the FLoC proposal at GitHub.
- The proposal whitepaper.