HyperLogLog (HLL) Intersections
HyperLogLog intersections are interesting because they allow deriving more information than is normally available from union or cardinality alone. Based on intersections, it is possible to describe relationships between two sets quantitatively.
What this means is shown in the image above. In this example, two HLL sets are built from post ids that relate to different terms used on social media. Based on their intersection, it is possible to return the percentage of posts that the two lists of words have in common.
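HLL sets support union and cardinality directly, but not intersection. A common way to estimate an intersection is the inclusion-exclusion principle, which is assumed here because the text does not name the method. For two sets $A$ and $B$ of post ids, the estimate and the resulting percentage can be written as:

```latex
|A \cap B| \approx |A| + |B| - |A \cup B|,
\qquad
\text{share} = 100 \cdot \frac{|A \cap B|}{|A \cup B|}
```

Every term on the right-hand side is itself an estimate, so the absolute error of the intersection grows with the size of the union. Small intersections of large sets are therefore the least reliable.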
The two lists of words may first be combined by a union of the individual HLL sets for each term and only afterwards intersected, which allows studying arbitrary semantic relationships.
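The union-then-intersect workflow can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code from the notebooks linked below: `TinyHLL`, `hll_from`, and `intersection_estimate` are hypothetical names, the post ids are synthetic, and the intersection uses inclusion-exclusion rather than any built-in operation.

```python
import hashlib
import math


class TinyHLL:
    """Illustrative HyperLogLog sketch (hypothetical, not production code)."""

    def __init__(self, p=14):
        self.p = p                      # number of index bits
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # Take 64 bits of a SHA-1 digest as the hash value.
        digest = hashlib.sha1(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                       # first p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1   # position of the leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def union(self, other):
        # The union of two sketches is the register-wise maximum.
        merged = TinyHLL(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)          # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)   # small-range correction
        return raw


def hll_from(ids, p=14):
    """Build a sketch from an iterable of ids."""
    hll = TinyHLL(p)
    for i in ids:
        hll.add(i)
    return hll


def intersection_estimate(a, b):
    """Inclusion-exclusion: |A n B| ~ |A| + |B| - |A u B|, clamped at zero."""
    return max(0.0, a.cardinality() + b.cardinality() - a.union(b).cardinality())


# Synthetic post ids per term (made up for illustration).
# First union the per-term sets within each word list ...
evening = hll_from(range(0, 4000)).union(hll_from(range(3000, 6000)))
morning = hll_from(range(4000, 8000)).union(hll_from(range(7000, 10000)))

# ... then intersect the two combined sets (true overlap is 2000 ids by construction).
common = intersection_estimate(evening, morning)
share = 100 * common / evening.union(morning).cardinality()
print(f"~{common:.0f} common posts, ~{share:.0f}% of all posts")
```

Because the union is only a register-wise maximum, any number of per-term sets can be merged before the intersection step without access to the original post ids.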
I’ve discussed and shown HLL intersections on several occasions. Here is a list of links with further information:
- Link Jupyter Notebook HTML containing simple SQL code to perform HLL intersections, based on the sunset-sunrise posts example.
- Link Jupyter Notebook HTML containing sample code to spatially union HLL sets for different countries and then intersect them, to derive common visitor counts (YFCC dataset)
- Link Jupyter Notebook HTML testing whether intersections allow the identification of individual users, e.g. to compromise privacy (YFCC dataset)
- Slides HLL Union & Intersections, from a meeting with TU Berlin, from which the picture above is taken. Also see this post
- Code Snippets for HLL intersections