Interpretation of HLL data:
Comparison, Interactive Exploration, Benchmark Data

Introduction

This is the last notebook in a series of four notebooks.

In this notebook, we'll illustrate how to make further analyses based on the published hll benchmark dataset:

  • create an interactive map with geoviews; adapt visuals, styling and legend
  • combine results from raw and hll into an interactive map (on hover)
  • store the interactive map as standalone HTML
  • export benchmark data
  • intersect hll sets for frequentation analysis

Preparations

Parameters

Load global settings and methods from 03_yfcc_gridagg_hll notebook. This will also load all parameters and methods defined in 02_yfcc_gridagg_raw.

In [3]:
import sys
from pathlib import Path
module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    sys.path.append(module_path)

from _03_yfcc_gridagg_hll import *

Load additional dependencies

Load all dependencies at once, as a means to verify that everything required to run this notebook is available.

In [4]:
import geoviews as gv
import holoviews as hv
import geoviews.feature as gf
import pickle
import ipywidgets as widgets
from ipywidgets.embed import embed_minimal_html
from rtree import index
from geopandas.tools import sjoin
from geoviews import opts
from bokeh.models import HoverTool, FixedTicker
from matplotlib_venn import venn3, venn3_circles

Enable bokeh

In [3]:
preparations.init_imports()

Activate autoreload of changed python files:

In [4]:
%load_ext autoreload
%autoreload 2

Load memory profiler:

In [5]:
%load_ext memory_profiler

Print versions of packages used herein, for purposes of replication:

In [6]:
root_packages = [
        'ipywidgets', 'geoviews', 'geopandas', 'holoviews', 'rtree', 'matplotlib-venn', 'xarray']
preparations.package_report(root_packages)
package  xarray  Rtree  matplotlib-venn  ipywidgets  holoviews  geoviews  geopandas
version  0.16.1  0.9.4  0.11.5           7.5.1       1.13.4     1.8.1     0.8.1

Interactive Map with Holoviews/Geoviews

Geoviews and Holoviews can be used to create interactive maps, either inside Jupyter or stored externally as standalone HTML. The syntax of the plot methods must be slightly adapted from the matplotlib output. Given the advanced interactivity features, we can also add additional information that is shown e.g. on mouse hover over certain grid cells.

Load & plot pickled dataframe

Load the geodataframes using pickle. This is the easiest way to load intermediate data, but it may be incompatible if package versions change. If loading pickles does not work, a workaround is to load the data from CSV and re-create the pickles, which will then be compatible with the package versions in use. Have a look at 03_yfcc_gridagg_hll.ipynb to recreate pickles with current package versions.
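
A minimal sketch of that fallback, reusing grid_agg_fromcsv() from the previous notebooks with the benchmark CSV that is also loaded later in this notebook (restoring all metrics would require the full CSV export from notebook 03; the pickle name here is illustrative):

grid_csv = grid_agg_fromcsv(
    OUTPUT / "csv" / "yfcc_all_est_benchmark.csv",
    columns=["xbin", "ybin", "usercount_hll"],
    metrics=["usercount_est"],
    grid_size=GRID_SIZE_METERS)
# re-create pickle with the package versions in use
grid_csv.to_pickle(
    OUTPUT / "pickles" / "yfcc_all_est.pkl")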

Load pickle and merge results of raw and hll dataset:

In [7]:
grid_est = pd.read_pickle(OUTPUT / "pickles" / "yfcc_all_est.pkl")
grid_raw = pd.read_pickle(OUTPUT / "pickles" / "yfcc_all_raw.pkl")
In [8]:
grid = grid_est.merge(
    grid_raw[['postcount', 'usercount', 'userdays']],
    left_index=True, right_index=True)

Have a look at the numbers for exact and estimated values. Smaller values are exact in both hll and raw because Sparse Mode is used.

In [9]:
grid[grid["usercount_est"]>5].head()
Out[9]:
geometry postcount_est usercount_est userdays_est postcount usercount userdays
xbin ybin
-17640096 -2020048 POLYGON ((-17640096.000 -2020048.000, -1754009... 630 19 68 634 19 67
-17040096 -1620048 POLYGON ((-17040096.000 -1620048.000, -1694009... 456 17 82 449 17 81
-16940096 -1620048 POLYGON ((-16940096.000 -1620048.000, -1684009... 990 39 125 960 39 125
-1720048 POLYGON ((-16940096.000 -1720048.000, -1684009... 120 11 36 124 11 36
-2220048 POLYGON ((-16940096.000 -2220048.000, -1684009... 171 13 37 176 13 37

Prepare interactive mapping

Some updates to the plotting function are necessary, for compatibility with geoviews/holoviews interactive mapping.

In [10]:
def label_nodata_xr(
        grid: gp.GeoDataFrame, inverse: bool = None,
        metric: str = "postcount_est", scheme: str = None,
        label_nonesignificant: bool = None, cmap_name: str = None):
    """Create a classified colormap from pandas dataframe
    column with additional No Data-label
        
    Args:
        grid: A geopandas geodataframe with metric column to classify
        inverse: If True, colors are inverted (dark)
        metric: The metric column to classify values
        scheme: The classification scheme to use.
        label_nonesignificant: If True, adds an additional
            color label for non-significant values based
            on column "significant"
        cmap_name: The colormap to use.
            
    Adapted from:
        https://stackoverflow.com/a/58160985/4556479
    See available colormaps:
        http://holoviews.org/user_guide/Colormaps.html
    See available classification schemes:
        https://pysal.org/mapclassify/api.html
    """
    # exclude NaN values from classification
    # (zero is treated as no data)
    grid_nan = grid[metric].replace(0, np.nan)
    # get classifier scheme by name
    classifier = getattr(mc, scheme)
    # some classification schemes (e.g. HeadTailBreaks)
    # do not support specifying the number of classes returned
    # construct optional kwargs with k == number of classes
    optional_kwargs = {"k":9}
    if scheme == "HeadTailBreaks":
        optional_kwargs = {}
    # explicitly set dtype to float,
    # otherwise mapclassify will error out due
    # to unhashable type numpy ndarray
    # from object-column
    scheme_breaks = classifier(
        grid_nan.dropna().astype(float), **optional_kwargs)
    # set breaks category column;
    # increase by 1 to spare category 0
    # for NoData/non-significant values
    spare_cats = 1
    grid[f'{metric}_cat'] = scheme_breaks.find_bin(
        grid_nan) + spare_cats
    # set cat 0 to NaN values
    # to render NoData-cat as transparent
    grid.loc[grid_nan.isnull(), f'{metric}_cat'] = np.nan
    # get label bounds as flat array
    bounds = get_label_bounds(
        scheme_breaks, grid_nan.dropna().values, flat=True)
    cmap = plt.cm.get_cmap(cmap_name, scheme_breaks.k)
    # get hex values
    cmap_list = [colors.rgb2hex(cmap(i)) for i in range(cmap.N)]
    # prepend nodata color
    # shown as white in legend
    first_color_legend = '#ffffff'
    if inverse:
        first_color_legend = 'black'
    cmap_list = [first_color_legend] + cmap_list
    cmap_with_nodata = colors.ListedColormap(cmap_list)
    return cmap_with_nodata, bounds, scheme_breaks

Since interactive display of grid polygons is too slow, we convert the grid to an xarray object, which is then overlaid as a rasterized image. Also added here is the option to show additional columns on hover, which is later used to compare raw and hll values.
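
The conversion relies on the grid's (xbin, ybin) MultiIndex: pandas' to_xarray() unstacks the two index levels into coordinate dimensions. A toy sketch of this step, with names chosen for illustration:

import pandas as pd

toy = pd.DataFrame({
    "xbin": [0, 0, 100, 100],
    "ybin": [0, 100, 0, 100],
    "postcount_est_cat": [1, 2, 3, 4]}).set_index(["xbin", "ybin"])
xa = toy.to_xarray()
# <xarray.Dataset> with dimensions xbin: 2 and ybin: 2,
# ready to be wrapped in gv.Dataset and rendered as gv.Image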

In [11]:
def get_custom_tooltips(items: Dict[str, str]) -> List[Tuple[str, str]]:
    """Compile HoverTool tooltip formatting with items to show on hover"""
    # thousands delimiter formatting
    # will be applied to the following columns
    tdelim_format = [
        'usercount_est', 'postcount_est', 'userdays_est',
        'usercount', 'postcount', 'userdays']
    # in HoloViews, custom tooltip formatting can be
    # provided as a list of tuples (name, value)
    tooltips=[
        # f-string explanation:
        # - k: the item name, v: the item value
        # - @ means: get value from column
        # - {,f} means: add comma as thousand delimiter,
        #   applied only to columns listed in tdelim_format,
        #   otherwise bokeh's automatic format is used
        (k, 
         f'@{v}{"{,f}" if v.replace("_expected", "") in tdelim_format else ""}'
        ) for k, v in items.items()
    ]
    return tooltips
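
# a quick illustration of the return format (hypothetical call):
# get_custom_tooltips(
#     {'User Count': 'usercount',
#      'User Count (estimated)': 'usercount_est'})
# returns:
# [('User Count', '@usercount{,f}'),
#  ('User Count (estimated)', '@usercount_est{,f}')]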

def set_active_tool(plot, element):
    """Enable wheel_zoom in bokeh plot by default"""
    plot.state.toolbar.active_scroll = plot.state.tools[0]

def convert_gdf_to_gvimage(
        grid, metric, cat_count,
        additional_items: Dict[str, str] = None) -> gv.Image:
    """Convert GeoDataFrame to gv.Image using categorized
    metric column as value dimension
    
    Args:
        grid: A geopandas geodataframe with indexes x and y 
            (projected coordinates) and aggregate metric column
        metric: target column for value dimension.
            "_cat" will be added to retrieve classified values.
        cat_count: number of classes for value dimension
        additional_items: a dictionary with optional names 
            and column references that are included in 
            gv.Image to provide additional information
            (e.g. on hover)
    """
    if additional_items is None:
        additional_items_list = []
    else:
        additional_items_list = [
            v for v in additional_items.values()]
    # convert GeoDataFrame to xarray object
    # the first vdim is the value being used 
    # to visualize classes on the map
    # include additional_items (postcount and usercount)
    # to show exact information through tooltip
    xa_dataset = gv.Dataset(
        grid.to_xarray(),
        vdims=[
            hv.Dimension(
                f'{metric}_cat', range=(0, cat_count))]
            + additional_items_list,
        crs=crs.Mollweide())
    return xa_dataset.to(gv.Image, crs=crs.Mollweide())
    
def plot_interactive(grid: gp.GeoDataFrame, title: str,
    metric: str = "postcount_est", store_html: str = None,
    additional_items: Dict[str, str] = {
        'Post Count':'postcount',
        'User Count':'usercount',
        'User Days':'userdays',
        'Post Count (estimated)':'postcount_est', 
        'User Count (estimated)':'usercount_est',
        'User Days (estimated)':'userdays_est'},
    scheme: str = "HeadTailBreaks",
    cmap: str = "OrRd",
    inverse: bool = None,
    output: Path = OUTPUT):
    """Plot interactive map with holoviews/geoviews renderer

    Args:
        grid: A geopandas geodataframe with indexes x and y 
            (projected coordinates) and aggregate metric column
        metric: target column for aggregate. Default: postcount.
        store_html: Provide a name to store figure as interactive HTML.
        title: Title of the map
        additional_items: additional items to show on hover
        scheme: The classification scheme to use. Default "HeadTailBreaks".
        cmap: The colormap to use. Default "OrRd".
        inverse: plot other map elements inverse (black)
        output: main folder (Path) for storing output
    """
    # keep only additional items that are available as grid columns;
    # work on a copy to avoid mutating the shared default dict
    additional_items = {
        key: item for key, item in additional_items.items()
        if item in grid.columns}
    # work on a shallow copy to not modify original dataframe
    grid_plot = grid.copy()
    # classify metric column
    cmap_with_nodata, bounds, headtail_breaks = label_nodata_xr(
        grid=grid_plot, inverse=inverse, metric=metric,
        scheme=scheme, cmap_name=cmap)
    # construct legend labels for colormap
    label_dict = {}
    for i, s in enumerate(bounds):
        label_dict[i+1] = s
    label_dict[0] = "No data"
    cat_count = len(bounds)
    # create gv.image layer from gdf
    img_grid = convert_gdf_to_gvimage(
            grid=grid_plot,
            metric=metric, cat_count=cat_count,
            additional_items=additional_items)
    # define additional plotting parameters
    # width of static jupyter map,
    # 360° == 1200px
    width = 1200 
    # height of static jupyter map,
    # 360°/2 == 180° == 600px
    height = int(width/2) 
    responsive = False
    aspect = None
    # if stored as html,
    # override values
    if store_html:
        width = None
        height = None
        responsive = True
    # define width and height as optional parameters
    # only used when plotting inside jupyter
    optional_kwargs = dict(width=width, height=height)
    # compile only values that are not None into kwargs-dict
    # by using dict-comprehension
    optional_kwargs_unpack = {
        k: v for k, v in optional_kwargs.items() if v is not None}
    # prepare custom HoverTool
    tooltips = get_custom_tooltips(
        additional_items)
    hover = HoverTool(tooltips=tooltips) 
    # create image layer
    image_layer = img_grid.opts(
            color_levels=cat_count,
            cmap=cmap_with_nodata,
            colorbar=True,
            clipping_colors={'NaN': 'transparent'},
            colorbar_opts={
                # 'formatter': formatter,
                'major_label_text_align':'left',
                'major_label_overrides': label_dict,
                'ticker': FixedTicker(
                    ticks=list(range(0, len(label_dict)))),
                },
            tools=[hover],
            # optional unpack of width and height
            **optional_kwargs_unpack
        )
    edgecolor = 'black'
    bg_color = None
    fill_color = '#479AD4'
    alpha = 0.1
    if inverse:
        alpha = 1
        bg_color = 'black'
        edgecolor = 'white'
    # combine layers into single overlay
    # and set global plot options
    gv_layers = (gf.ocean.opts(alpha=alpha, fill_color=fill_color) * \
                 image_layer * \
                 gf.coastline.opts(line_color=edgecolor) * \
                 gf.borders.opts(line_color=edgecolor)
            ).opts(
        bgcolor=bg_color,
        # global_extent=True,
        projection=crs.Mollweide(),
        responsive=responsive,
        data_aspect=1, # maintain fixed aspect ratio during responsive resize
        hooks=[set_active_tool],
        title=title)
    if store_html:
        hv.save(gv_layers, output / "html" / f'{store_html}.html', backend='bokeh')
    else:
        return gv_layers

Plot the interactive map inline:

In [12]:
gv_plot = plot_interactive(
    grid, title=f'YFCC User Count (estimated) per {int(GRID_SIZE_METERS/1000):.0f}km grid', metric="usercount_est",)
gv_plot
Out[12]:

Store as HTML:

In [13]:
plot_interactive(
    grid, title=f'YFCC User Count (estimated) per {int(GRID_SIZE_METERS/1000):.0f}km grid', metric="usercount_est",
    store_html="yfcc_usercount_est")

Interactive map

View the interactive map here and hover over bins to compare exact (raw) vs. approximate (hll) results.

Compare RAW and HLL

We're almost at the end of the four-notebook series. The last step is to compare and interpret results.

Post Count, User Count and User Days

Load png plots as widgets.Image and display using widgets.Tab

In [14]:
# dictionary with filename and title
pathrefs = {
    0: ('yfcc_postcount.png', 'Post Count raw'),
    1: ('yfcc_postcount_est.png', 'Post Count hll'),
    2: ('yfcc_usercount.png', 'User Count raw'),
    3: ('yfcc_usercount_est.png', 'User Count hll'),
    4: ('yfcc_userdays.png', 'User Days raw'),
    5: ('yfcc_userdays_est.png', 'User Days hll'),
    6: ('postcount_sample.png', 'Post Count (Italy) raw'),
    7: ('postcount_sample_est.png', 'Post Count (Italy) hll'),
    8: ('usercount_sample.png', 'User Count (Italy) raw'),
    9: ('usercount_sample_est.png', 'User Count (Italy) hll'),
    10: ('userdays_sample.png', 'User Days (Italy) raw'),
    11: ('userdays_sample_est.png', 'User Days (Italy) hll'),
    }

def get_img_width(filename: str):
    if 'sample' in filename:
        return 700
    return 1300

widgets_images = [
    widgets.Image(
        value=open(OUTPUT / "figures" / pathref[0], "rb").read(),
        format='png',
        width=get_img_width(pathref[0])
     )
    for pathref in pathrefs.values()]

Configure tabs

In [15]:
children = widgets_images
tab = widgets.Tab()
tab.children = children
for i in range(len(children)):
    tab.set_title(i, pathrefs[i][1])

Display inside live notebook:

In [16]:
tab

The above tab display is not available in the static notebook HTML export. A standalone HTML with the above tabs can be generated with the following command:

In [17]:
embed_minimal_html(
    OUTPUT / 'html' / 'yfcc_compare_raw_hll.html',
    views=[tab], title='YFCC worldwide raw and hll metrics')

View the result here.

Working with the Benchmark data: Intersection Example

HLL sets are more than just statistical summaries. Apart from element removal, they support the regular set operations, such as lossless union. Based on the inclusion-exclusion principle, unions of two or more sets can be used to calculate the intersection, i.e. the common elements between all sets. In the following, the intersection is calculated for common users between France, Germany, and the UK.
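
As a plain-Python illustration of the principle (toy sets of user ids, not hll), the intersection count can be derived from the individual cardinalities and the union cardinality alone:

a = {"u1", "u2", "u3"}
b = {"u2", "u3", "u4"}
# |A ∩ B| = |A| + |B| - |A ∪ B|
intersection_count = len(a) + len(b) - len(a | b)
assert intersection_count == len(a & b) == 2

This is what makes hll sets attractive here: the union is lossless, so intersections can be estimated without access to the original elements.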

Load Benchmark data

In 03_yfcc_gridagg_hll.ipynb, aggregate grid data was stored to yfcc_all_est_benchmark.csv, including hll sets with usercount cardinality > 100. Load this data first, using functions from previous notebooks.

Read benchmark data, only loading usercount and usercount_hll columns.

In [18]:
grid = grid_agg_fromcsv(
    OUTPUT / "csv" / "yfcc_all_est_benchmark.csv",
    columns=["xbin", "ybin", "usercount_hll"],
    metrics=["usercount_est"],
    grid_size=GRID_SIZE_METERS)
In [19]:
grid[grid["usercount_est"]>5].head()
Out[19]:
geometry usercount_est usercount_hll
xbin ybin
-15340096 2779952 POLYGON ((-15340096.000 2779952.000, -15240096... 679.0 \x138b400001002100c2014101e303c2048104a7052405...
2679952 POLYGON ((-15340096.000 2679952.000, -15240096... 275.0 \x138b40002100c20161048105e107c208610b230e8210...
-15240096 2779952 POLYGON ((-15240096.000 2779952.000, -15140096... 419.0 \x138b40002100a200c20103014101e304810524054405...
2679952 POLYGON ((-15240096.000 2679952.000, -15140096... 1890.0 \x148b4000c00184001002241003084441080210022108...
-15140096 2679952 POLYGON ((-15140096.000 2679952.000, -15040096... 1095.0 \x148b4000000000001800208003084040080000002180...

Union hll sets for Countries UK, DE and FR

Selection of grid cells based on country geometry

Load country geometry:

In [20]:
world = gp.read_file(
    gp.datasets.get_path('naturalearth_lowres'),
    crs=CRS_WGS)
world = world.to_crs(CRS_PROJ)

Select geometry for DE, FR and UK

In [21]:
de = world[world['name'] == "Germany"]
uk = world[world['name'] == "United Kingdom"]
fr = world[world['name'] == "France"]

Drop French territory of French Guiana:

In [22]:
fr = fr.explode().iloc[1:].dissolve(by='name')
fr.plot()
Out[22]:
<AxesSubplot:>

Preview selection. Note that the territory of France includes Corsica, which is acceptable for the example use case.

In [23]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3)
fig.suptitle(
    'Areas to test for common visitors in the hll benchmark dataset')
for ax in (ax1, ax2, ax3):
    ax.set_axis_off()
ax1.title.set_text('DE')
ax2.title.set_text('UK')
ax3.title.set_text('FR')
de.plot(ax=ax1)
uk.plot(ax=ax2)
fr.plot(ax=ax3)
Out[23]:
<AxesSubplot:title={'center':'FR'}>

Intersection with grid

Since the grid size is 100 km, direct intersection of grid polygons with country geometry would introduce errors at borders. Use the centroids of grid cells to select bins based on country geometry.
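
The idea in a minimal, per-country sketch (with a hypothetical single-polygon country gdf; the sjoin-based function defined below does the same for arbitrary geometries):

# select cells whose centroid falls inside a country polygon
mask = grid.centroid.within(country.geometry.iloc[0])
grid_sel = grid[mask]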

Get centroids as Geoseries and turn into GeoDataFrame:

In [24]:
centroid_grid = grid.centroid.reset_index()
centroid_grid.set_index(["xbin", "ybin"], inplace=True)
In [25]:
grid.centroid
Out[25]:
xbin       ybin    
-18040096   8979952    POINT (-17990096.000 8929952.000)
            8879952    POINT (-17990096.000 8829952.000)
            8779952    POINT (-17990096.000 8729952.000)
            8679952    POINT (-17990096.000 8629952.000)
            8579952    POINT (-17990096.000 8529952.000)
                                     ...                
 17959904  -8620048    POINT (18009904.000 -8670048.000)
           -8720048    POINT (18009904.000 -8770048.000)
           -8820048    POINT (18009904.000 -8870048.000)
           -8920048    POINT (18009904.000 -8970048.000)
           -9020048    POINT (18009904.000 -9070048.000)
Length: 65341, dtype: geometry

Define a function for the intersection, using geopandas sjoin (spatial join):

In [26]:
def intersect_grid_centroids(
    grid: gp.GeoDataFrame, 
    intersect_gdf: gp.GeoDataFrame):
    """Return grid centroids from grid that 
    intersect with intersect_gdf
    """
    centroid_grid = gp.GeoDataFrame(
        grid.centroid)
    centroid_grid.rename(
        columns={0:'geometry'},
        inplace=True)
    centroid_grid.set_geometry(
        'geometry', crs=grid.crs, 
        inplace=True)
    grid_intersect = sjoin(
        centroid_grid, intersect_gdf, 
        how='right')
    grid_intersect.set_index(
        ["index_left0", "index_left1"],
        inplace=True)
    grid_intersect.index.names = ['xbin','ybin']
    return grid.loc[grid_intersect.index]

Run intersection for countries:

In [27]:
grid_de = intersect_grid_centroids(
    grid=grid, intersect_gdf=de)
grid_de.plot(edgecolor='white')
Out[27]:
<AxesSubplot:>
In [28]:
grid_fr = intersect_grid_centroids(
    grid=grid, intersect_gdf=fr)
grid_fr.plot(edgecolor='white')
Out[28]:
<AxesSubplot:>
In [29]:
grid_uk = intersect_grid_centroids(
    grid=grid, intersect_gdf=uk)
grid_uk.plot(edgecolor='white')
Out[29]:
<AxesSubplot:>

Plot preview of selected grid cells (bins)

Define colors:

In [30]:
color_de = "#fc4f30"
color_fr = "#008fd5"
color_uk = "#6d904f"

Define map boundary:

In [31]:
bbox_europe = (
    -9.580078, 41.571384,
    16.611328, 59.714117)
minx, miny = PROJ_TRANSFORMER.transform(
    bbox_europe[0], bbox_europe[1])
maxx, maxy = PROJ_TRANSFORMER.transform(
    bbox_europe[2], bbox_europe[3])
buf = 100000
In [32]:
def plot_map(
    grid: gp.GeoDataFrame, sel_grids: List[gp.GeoDataFrame],
    sel_colors: List[str],
    title: Optional[str] = None, save_fig: Optional[str] = None,
    ax = None):
    """Plot GeoDataFrame with matplotlib backend, optionaly export as png"""
    if not ax:
        fig, ax = plt.subplots(1, 1, figsize=(5, 6))
    ax.set_xlim(minx-buf, maxx+buf)
    ax.set_ylim(miny-buf, maxy+buf)
    if title:
        ax.set_title(title, fontsize=12)
    for ix, sel_grid in enumerate(sel_grids):
        sel_grid.plot(
            ax=ax,
            color=sel_colors[ix],
            edgecolor='white',
            alpha=0.9)
    grid.boundary.plot(
        ax=ax,
        edgecolor='black', 
        linewidth=0.1,
        alpha=0.9)
    # combine with world geometry
    world.plot(
        ax=ax, color='none', edgecolor='black', linewidth=0.3)
    # turn axis off
    ax.set_axis_off()
    if not save_fig:
        return
    fig.savefig(OUTPUT / "figures" / save_fig, dpi=300, format='PNG',
               bbox_inches='tight', pad_inches=1)
In [33]:
sel_grids=[grid_de, grid_uk, grid_fr]
sel_colors=[color_de, color_uk, color_fr]
plot_map(
    grid=grid, sel_grids=sel_grids, 
    sel_colors=sel_colors,
    title='Grid selection for DE, FR and UK',
    save_fig='grid_selection_countries.png')

Union of hll sets

Connect to hll worker:

In [5]:
db_user = "postgres"
db_pass = os.getenv('POSTGRES_PASSWORD')
# set connection variables
db_host = "hlldb"
db_port = "5432"
db_name = "hlldb"

db_connection = psycopg2.connect(
        host=db_host,
        port=db_port,
        dbname=db_name,
        user=db_user,
        password=db_pass
)
db_connection.set_session(readonly=True)
db_conn = tools.DbConn(db_connection)
db_conn.query("SELECT 1;")
Out[5]:
?column?
0 1

Use the union_all function from 03_yfcc_gridagg_hll.ipynb, but modified to union all hll sets in a pd.Series (union_all_hll), instead of a grouped union.

In [6]:
def union_all_hll(
    hll_series: pd.Series, cardinality: bool = True, db_conn = None) -> pd.Series:
    """HLL union and (optional) cardinality estimation from a series of hll sets

    Args:
        hll_series: Indexed series (bins) of hll sets.
        cardinality: If True, returns cardinality (counts). Otherwise,
            the unioned hll set will be returned.
        db_conn: Database connection with the hll extension available.
    """
    hll_values_list = ",".join(
        [f"(0::int,'{hll_item}'::hll)"
         for hll_item in hll_series.values.tolist()])
    return_col = "hll_union"
    hll_calc_pre = ""
    hll_calc_tail = "AS hll_union"
    if cardinality:
        return_col = "hll_cardinality"
        hll_calc_pre = "hll_cardinality("
        hll_calc_tail = ")::int"
    db_query = f"""
        SELECT sq.{return_col} FROM (
            SELECT s.group_ix,
                   {hll_calc_pre}
                   hll_union_agg(s.hll_set)
                   {hll_calc_tail}
            FROM (
                VALUES {hll_values_list}
                ) s(group_ix, hll_set)
            GROUP BY group_ix
            ORDER BY group_ix ASC) sq
        """
    df = db_conn.query(db_query)
    return df[return_col]

Calculate distinct users per country:

In [36]:
grid_sels = {
    "de": grid_de,
    "uk": grid_uk,
    "fr": grid_fr
}
distinct_users_total = {}
for country, grid_sel in grid_sels.items():
    # drop bins with no values
    cardinality_series = union_all_hll(
        grid_sel["usercount_hll"].dropna(),
        db_conn=db_conn)
    distinct_users_total[country] = cardinality_series[0]
    print(
        f"{distinct_users_total[country]} distinct users "
        f"who shared YFCC100M photos in {country.upper()}")
24318 distinct users who shared YFCC100M photos in DE
31290 distinct users who shared YFCC100M photos in UK
24947 distinct users who shared YFCC100M photos in FR

Calculate intersection (common visitors)

According to the inclusion-exclusion principle:

$|A \cup B| = |A| + |B| - |A \cap B|$

which can also be written as:

$|A \cap B| = |A| + |B| - |A \cup B|$

Therefore, unions can be used to calculate the intersection. Calculate $|DE \cup FR|$, $|DE \cup UK|$ and $|UK \cup FR|$, i.e.:

hll_cardinality(grid_de)::int +
hll_cardinality(grid_fr)::int -
hll_cardinality(hll_union(grid_de, grid_fr))::int = IntersectionCount
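
As a quick arithmetic check, with the country totals reported above and the union counts computed below:

# |DE ∩ FR| = |DE| + |FR| - |DE ∪ FR|
24318 + 24947 - 42879
# = 6386 distinct common users (cf. the intersection output below)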
In [37]:
union_de_fr = pd.concat([grid_de, grid_fr])
union_de_uk = pd.concat([grid_de, grid_uk])
union_uk_fr = pd.concat([grid_uk, grid_fr])
In [38]:
grid_sels = {
    "de-uk": union_de_uk,
    "de-fr": union_de_fr,
    "uk-fr": union_uk_fr
}
distinct_common = {}
for country_tuple, grid_sel in grid_sels.items():
    cardinality_series = union_all_hll(
        grid_sel["usercount_hll"].dropna(),
        db_conn=db_conn)
    distinct_common[country_tuple] = cardinality_series[0]
    print(
        f"{distinct_common[country_tuple]} distinct total users "
        f"who shared YFCC100M photos from either {country_tuple.split('-')[0]} "
        f"or {country_tuple.split('-')[1]} (union)")
50015 distinct total users who shared YFCC100M photos from either de or uk (union)
42879 distinct total users who shared YFCC100M photos from either de or fr (union)
48496 distinct total users who shared YFCC100M photos from either uk or fr (union)

Calculate intersection:

In [39]:
distinct_intersection = {}
for a, b in [("de", "uk"), ("de", "fr"), ("uk", "fr")]:
    a_total = distinct_users_total[a]
    b_total = distinct_users_total[b]
    common_ref = f'{a}-{b}'
    intersection_count = a_total + b_total - distinct_common[common_ref]
    distinct_intersection[common_ref] = intersection_count
    print(
        f"{distinct_intersection[common_ref]} distinct users "
        f"who shared YFCC100M photos from {a} and {b} (intersection)")
5593 distinct users who shared YFCC100M photos from de and uk (intersection)
6386 distinct users who shared YFCC100M photos from de and fr (intersection)
7741 distinct users who shared YFCC100M photos from uk and fr (intersection)

Finally, let's get the number of users who have shared pictures from all three countries, based on the inclusion-exclusion formula for three sets:

$|A \cup B \cup C| = |A| + |B| + |C| - |A \cap B| - |A \cap C| - |B \cap C| + |A \cap B \cap C|$

which can also be written as:

$|A \cap B \cap C| = |A \cup B \cup C| - |A| - |B| - |C| + |A \cap B| + |A \cap C| + |B \cap C|$

Calculate distinct users of all three countries:

In [40]:
union_de_fr_uk = pd.concat(
    [grid_de, grid_fr, grid_uk])
cardinality_series = union_all_hll(
    union_de_fr_uk["usercount_hll"].dropna(),
    db_conn=db_conn)
union_count_all = cardinality_series[0]
union_count_all
Out[40]:
63613
In [41]:
country_a = "de"
country_b = "uk"
country_c = "fr"

Calculate intersection:

In [42]:
intersection_count_all = union_count_all - \
    distinct_users_total[country_a] - \
    distinct_users_total[country_b] - \
    distinct_users_total[country_c] + \
    distinct_intersection[f'{country_a}-{country_b}'] + \
    distinct_intersection[f'{country_a}-{country_c}'] + \
    distinct_intersection[f'{country_b}-{country_c}']
    
print(intersection_count_all)
2778

Since we're going to visualize this with matplotlib-venn, we need the seven subset sizes, one per diagram region. The dummy diagram below illustrates which region each variable refers to:

In [43]:
v = venn3(
    subsets=(
        500,
        500, 
        100,
        500,
        100,
        100,
        10),
    set_labels = ('A', 'B', 'C'))
v.get_label_by_id('100').set_text('Abc')
v.get_label_by_id('010').set_text('aBc')
v.get_label_by_id('001').set_text('abC')
v.get_label_by_id('110').set_text('ABc')
v.get_label_by_id('101').set_text('AbC')
v.get_label_by_id('011').set_text('aBC')
v.get_label_by_id('111').set_text('ABC')
plt.show()

We already have ABC; the other values can be calculated:

In [44]:
ABC = intersection_count_all
In [45]:
ABc = distinct_intersection[f'{country_a}-{country_b}'] - ABC
In [46]:
aBC = distinct_intersection[f'{country_b}-{country_c}'] - ABC
In [47]:
AbC = distinct_intersection[f'{country_a}-{country_c}'] - ABC
In [48]:
Abc = distinct_users_total[country_a] - ABc - AbC + ABC
In [49]:
aBc = distinct_users_total[country_b] - ABc - aBC + ABC
In [50]:
abC = distinct_users_total[country_c] - aBC - AbC + ABC

Illustrate intersection (Venn diagram)

Order of values handed over: Abc, aBc, ABc, abC, AbC, aBC, ABC
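
For reference, the matplotlib-venn region ids used in plot_venn below map to these variables as follows (uppercase = contained in that set):

# '100' -> Abc (only A)    '010' -> aBc (only B)
# '110' -> ABc (A and B)   '001' -> abC (only C)
# '101' -> AbC (A and C)   '011' -> aBC (B and C)
# '111' -> ABC (all three)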

Define Function to plot Venn Diagram.

In [51]:
def plot_venn(
    subset_sizes: List[int],
    colors: List[str], 
    names: List[str],
    subset_sizes_raw: List[int] = None,
    total_sizes: List[Tuple[int, int]] = None,
    ax = None,
    title: str = None):
    """Plot Venn Diagram"""
    if not ax:
        fig, ax = plt.subplots(1, 1, figsize=(5,5))
    set_labels = (
        'A', 'B', 'C')
    v = venn3(
        subsets=(
            [subset_size for subset_size in subset_sizes]),
        set_labels = set_labels,
        ax=ax)    
    for ix, idx in enumerate(
        ['100', '010', '001']):
        v.get_patch_by_id(
            idx).set_color(colors[ix])
        v.get_patch_by_id(
            idx).set_alpha(0.8)
        v.get_label_by_id(
            set_labels[ix]).set_text(
            names[ix])
        if not total_sizes:
            continue
        raw_count = total_sizes[ix][0]
        hll_count = total_sizes[ix][1]
        difference = abs(raw_count-hll_count)
        v.get_label_by_id(set_labels[ix]).set_text(
            f'{names[ix]}, {hll_count},\n'
            f'{difference/(raw_count/100):+.1f}%')
    if subset_sizes_raw:
        for ix, idx in enumerate(
            ['100', '010', None, '001']):
            if not idx:
                continue
            dif_abs = subset_sizes[ix] - subset_sizes_raw[ix]
            dif_perc = dif_abs / (subset_sizes_raw[ix] / 100)
            v.get_label_by_id(idx).set_text(
                f'{subset_sizes[ix]}\n{dif_perc:+.1f}%')            
    label_ids = [
        '100', '010', '001',
        '110', '101', '011',
        '111', 'A', 'B', 'C']
    for label_id in label_ids:
        v.get_label_by_id(
            label_id).set_fontsize(14)
    # draw borders
    c = venn3_circles(
        subsets=(
            [subset_size for subset_size in subset_sizes]),
        linestyle='dashed',
        lw=1,
        ax=ax)
    if title:
        ax.title.set_text(title)
    # return the venn object, so its labels can be
    # annotated further outside of this function
    return v

Plot Venn Diagram:

In [52]:
subset_sizes = [
    Abc, aBc, ABc, abC, AbC, aBC, ABC]
colors = [
    color_de, color_uk, color_fr]
names = [
    'Germany', 'United Kingdom','France']
plot_venn(
    subset_sizes=subset_sizes,
    colors=colors,
    names=names,
    title="Common User Count")

Combine Map & Venn Diagram

In [53]:
# figure with subplot (1 row, 2 columns)
fig, ax = plt.subplots(1, 2, figsize=(10, 24))
plot_map(
    grid=grid, sel_grids=sel_grids, 
    sel_colors=sel_colors, ax=ax[0])
plot_venn(
    subset_sizes=subset_sizes,
    colors=colors,
    names=names,
    ax=ax[1])
# store as png
fig.savefig(
    OUTPUT / "figures" / "figure_2_intersection.png", dpi=300, format='PNG',
    bbox_inches='tight', pad_inches=1)

Get accurate intersection counts (raw)

We're going to work on the raw post data that was stored from the raw db using a GeoHash accuracy of 5 (~4 km granularity).

In [54]:
usecols = ['latitude', 'longitude', 'user_guid']
dtypes = {'latitude': float, 'longitude': float, 'user_guid': str}

Convert Europe bounding box to Mollweide and get buffer

In [55]:
minx, miny = PROJ_TRANSFORMER.transform(
    bbox_europe[0], bbox_europe[1])
maxx, maxy = PROJ_TRANSFORMER.transform(
    bbox_europe[2], bbox_europe[3])
# apply buffer and convert back to WGS1984
min_buf = PROJ_TRANSFORMER_BACK.transform(minx-buf, miny-buf)
max_buf = PROJ_TRANSFORMER_BACK.transform(maxx+buf, maxy+buf)
bbox_europe_buf = (min_buf[0], min_buf[1], max_buf[0], max_buf[1])
In [56]:
%%time
int_data = 'user_guids_sets_raw.pkl'
skip_read_from_raw = False
if (OUTPUT / "pickles" / int_data).exists():
    print(
        f"Intermediate data already exists. "
        f"Delete {int_data} to reload from raw.")
    skip_read_from_raw = True
else:
    chunked_df = read_project_chunked(
        filename=OUTPUT / "csv" / "yfcc_posts.csv",
        usecols=usecols,
        bbox=bbox_europe_buf,
        dtypes=dtypes)
    print(len(chunked_df))
    display(chunked_df[0].head())
    display(chunked_df[0]["x"])
Intermediate data already exists. Delete user_guids_sets_raw.pkl to reload from raw.
CPU times: user 878 µs, sys: 47 µs, total: 925 µs
Wall time: 631 µs

Prepare intersection with countries geometry

Explode country geometries and join them into a single GeoDataFrame.

A bug in geopandas prevents using explode() on France directly, but we can use the following workaround:

In [57]:
exploded_geom = fr.geometry.explode().reset_index(level=-1)
exploded_index = exploded_geom.columns[0]
fr_exploded = fr.drop(fr._geometry_column_name, axis=1).join(exploded_geom)
In [58]:
de_uk_fr = pd.concat(
    [de.explode(), uk.explode(), fr_exploded.reset_index()]).reset_index()
de_uk_fr.drop(
    de_uk_fr.columns.difference(['name','geometry']), axis=1, inplace=True)
In [59]:
de_uk_fr.head()
Out[59]:
name geometry
0 Germany POLYGON ((1018674.630 6261050.121, 1043430.938...
1 United Kingdom POLYGON ((-446404.432 6272269.474, -499277.322...
2 United Kingdom POLYGON ((-224388.341 6225201.172, -224261.698...
3 France POLYGON ((724251.097 5080989.368, 774574.287 5...
4 France POLYGON ((473842.953 5817205.732, 511702.548 5...

Populate R-tree index with bounds of polygons

In [60]:
idx = index.Index()
for pos, poly in enumerate(de_uk_fr.geometry):
    idx.insert(pos, poly.bounds)

Test with single point

In [61]:
coords = (10, 50)
coords = PROJ_TRANSFORMER.transform(
    coords[0], coords[1])
for ix in idx.intersection(coords):
    if de_uk_fr.geometry[ix].contains(Point(coords)):
        print(de_uk_fr.name[ix])
de_uk_fr.geometry[ix]
Germany
Out[61]:

Query post locations to see which polygon each point falls in, using the fast R-tree bounds check first, followed by Shapely's exact contains test:

In [62]:
USER_GUIDS_SETS = {
    "Germany": set(),
    "United Kingdom": set(),
    "France": set()
}

def intersect_coord(
    coordinates: Tuple[float, float],
    gdf_country: gp.GeoDataFrame,
    user_guid: str):
    """Intersect coordinates with GeoDataFrame geometry"""
    # rtree bounds intersection
    for ix in idx.intersection(coordinates):
        # exact shape intersection
        if not gdf_country.geometry[ix].contains(
            Point(coordinates)):
            continue
        country = gdf_country.name[ix]
        USER_GUIDS_SETS[country].add(
            user_guid)

This is likely not the most performant solution, but it is easy to read:

In [63]:
def apply_intersect_coord(
    gdf_country: gp.GeoDataFrame,
    chunked_df: List[pd.DataFrame]):
    """Assign user guids from posts from chunked_df to countries
    and union in global sets
    """
    processed_cnt = 0
    for df in chunked_df:
        for post in df.itertuples():
            processed_cnt += 1
            intersect_coord(
                coordinates=(post.x, post.y),
                gdf_country=gdf_country,
                user_guid=post.user_guid)
            # report
            if processed_cnt % 1000 == 0:
                print(f'Processed {processed_cnt:,.0f} posts')
                clear_output(wait=True)
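
As an aside, a vectorized alternative could delegate both the bounding-box filter and the exact point-in-polygon test to geopandas' spatial join. A sketch, assuming each chunk carries projected x/y columns (as prepared by read_project_chunked):

def apply_intersect_coord_sjoin(
    gdf_country: gp.GeoDataFrame,
    chunked_df: List[pd.DataFrame]):
    """Vectorized variant: sjoin post points with country polygons"""
    for df in chunked_df:
        points = gp.GeoDataFrame(
            df[["user_guid"]],
            geometry=gp.points_from_xy(df["x"], df["y"]),
            crs=CRS_PROJ)
        joined = sjoin(points, gdf_country, how="inner")
        # 'name' is the country column from gdf_country
        for country, group in joined.groupby("name"):
            USER_GUIDS_SETS[country].update(group["user_guid"])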

Run:

In [64]:
%%time
%%memit
if not skip_read_from_raw:
    apply_intersect_coord(
        chunked_df=chunked_df,
        gdf_country=de_uk_fr)
peak memory: 604.96 MiB, increment: 0.17 MiB
CPU times: user 533 ms, sys: 340 ms, total: 873 ms
Wall time: 1.13 s

Store intermediate data

In [65]:
if not skip_read_from_raw:
    with open(OUTPUT / 'pickles' / 'user_guids_sets_raw.pkl', 'wb') as handle:
        pickle.dump(
            USER_GUIDS_SETS,
            handle,
            protocol=pickle.HIGHEST_PROTOCOL)

Size:

In [66]:
user_guids_raw = Path(OUTPUT / 'pickles' / int_data).stat().st_size / (1024*1024)
print(f"Size: {user_guids_raw:.2f} MB")
Size: 1.13 MB

Load intermediate data:

In [67]:
with open(OUTPUT / 'pickles' / int_data, 'rb') as handle:
    USER_GUIDS_SETS = pickle.load(handle)

Output total number of distinct users per country:

In [68]:
for ix, (country, user_guids_set) in enumerate(USER_GUIDS_SETS.items()):
    raw_count = len(user_guids_set)
    hll_count = list(distinct_users_total.values())[ix]
    difference = abs(raw_count-hll_count)
    print(
        f'{raw_count} total users in {country} '
        f'({hll_count} estimated hll, '
        f'{difference/(raw_count/100):.02f}% total error)')
21597 total users in Germany (24318 estimated hll, 12.60% total error)
32452 total users in United Kingdom (31290 estimated hll, 3.58% total error)
25953 total users in France (24947 estimated hll, 3.88% total error)

Collect raw metrics

Apply Python set operations:

A.union(B)          # operator form: A | B | C
A.intersection(B)   # operator form: A & B & C
A.difference(B)     # operator form: A - B - C
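
A tiny three-set example with toy user ids, mirroring the calculations below:

A = {"u1", "u2", "u3"}
B = {"u2", "u3", "u4"}
C = {"u3", "u5"}
len(A & B & C)                 # 1, analogous to ABC_raw
len(A & B) - len(A & B & C)    # 1, analogous to ABc_raw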
In [69]:
ABC_raw = len(USER_GUIDS_SETS["Germany"] & USER_GUIDS_SETS["United Kingdom"] & USER_GUIDS_SETS["France"])
ABC_raw
Out[69]:
3297
In [70]:
ABc_raw = len(USER_GUIDS_SETS["Germany"] & USER_GUIDS_SETS["United Kingdom"]) - ABC_raw
ABc_raw
Out[70]:
2555
In [71]:
aBC_raw = len(USER_GUIDS_SETS["United Kingdom"] & USER_GUIDS_SETS["France"]) - ABC_raw
aBC_raw
Out[71]:
5216
In [72]:
AbC_raw = len(USER_GUIDS_SETS["Germany"] & USER_GUIDS_SETS["France"]) - ABC_raw
AbC_raw
Out[72]:
3099
In [73]:
Abc_raw = len(USER_GUIDS_SETS["Germany"]) - ABc_raw - AbC_raw + ABC_raw
Abc_raw
Out[73]:
19240
In [74]:
aBc_raw = len(USER_GUIDS_SETS["United Kingdom"]) - ABc_raw - aBC_raw + ABC_raw
aBc_raw
Out[74]:
27978
In [75]:
abC_raw = len(USER_GUIDS_SETS["France"]) - aBC_raw - AbC_raw + ABC_raw
abC_raw
Out[75]:
20935

Update Venn Diagram

With raw measurements and error rates for the hll measurements:

In [76]:
# figure with subplot (1 row, 2 columns)
fig, ax = plt.subplots(1, 2, figsize=(10, 30))
subset_sizes_raw = [
    Abc_raw, aBc_raw, ABc_raw, abC_raw, AbC_raw, aBC_raw, ABC_raw]
colors = [
    color_de, color_uk, color_fr]
names = [
    'Germany', 'United Kingdom','France']
plot_venn(
    subset_sizes=subset_sizes_raw,
    colors=colors,
    names=names,
    title="Common User Count (Raw)",
    ax=ax[0])
plot_venn(
    subset_sizes=subset_sizes,
    colors=colors,
    names=names,
    title="Common User Count (Hll)",
    ax=ax[1])
# adjust horizontal spacing
fig.tight_layout(pad=3.0)

Final plot (Map & Venn Diagram)

For the final plot, add raw counts and error rates as additional annotation to Venn Diagram.

In [77]:
# figure with subplot (1 row, 2 columns)
fig, ax = plt.subplots(1, 2, figsize=(10, 24))
names = [
    'DE', 'UK','FR']  
total_sizes = [
    (len(user_guids_set), list(distinct_users_total.values())[ix])
    for ix, user_guids_set in enumerate(USER_GUIDS_SETS.values())
    ]
plot_map(
    grid=grid, sel_grids=sel_grids, 
    sel_colors=sel_colors,
    ax=ax[0])
v = plot_venn(
    subset_sizes=subset_sizes,
    subset_sizes_raw=subset_sizes_raw,
    total_sizes=total_sizes,
    colors=colors,
    names=names,
    ax=ax[1])
# add annotations for inner error labels
# this is fuzzy work
label_pos = {
    '110': (10, 50),
    '101': (-90, -50),
    '011': (90, -50),
    '111': (-110, 10),
}
label_rad = {
    '110': 0.1,
    '101': 0.5,
    '011': 0.3,
    '111': -0.1,
}
arr_off = {
    '110': [0, -0.05],
    '101': [0.05, 0.05],
    '011': [-0.07, 0],
    '111': [0.15, 0],
}
for ix, label_id in enumerate(
    [None, None, '110', None, '101', '011', '111']):
    if not label_id:
        continue
    dif_abs = subset_sizes[ix] - subset_sizes_raw[ix]
    dif_perc = dif_abs / (subset_sizes_raw[ix] / 100)
    label = f'{dif_perc:+.1f}%'
    
    ax[1].annotate(
        label,
        xy=v.get_label_by_id(
            label_id).get_position() - np.array(arr_off[label_id]),
        xytext=label_pos[label_id],
        ha='center',
        textcoords='offset points',
        bbox=dict(
            boxstyle='round,pad=0.5',
            fc='gray',
            alpha=0.1),
        arrowprops=dict(
            arrowstyle='->', 
            connectionstyle=f'arc3,rad={label_rad[label_id]}',
            color='gray'))
# store as png
fig.savefig(
    OUTPUT / "figures" / "figure_2_intersection.png", dpi=300, format='PNG',
    bbox_inches='tight', pad_inches=1)
plt.show()

Close DB connection & Create notebook HTML

In [78]:
db_connection.close()
In [79]:
!jupyter nbconvert --to html_toc \
    --output-dir=../out/html ./04_interpretation.ipynb \
    --template=../nbconvert.tpl \
    --ExtractOutputPreprocessor.enabled=False # create single output file
[NbConvertApp] Converting notebook ./04_interpretation.ipynb to html_toc
/opt/conda/envs/jupyter_env/lib/python3.8/site-packages/nbconvert/filters/datatypefilter.py:39: UserWarning: Your element with mimetype(s) dict_keys([]) is not able to be represented.
  warn("Your element with mimetype(s) {mimetypes}"
[NbConvertApp] Writing 7446761 bytes to ../out/html/04_interpretation.html

Create release file

Create a release file that contains the ipynb notebooks, HTML, figures, converted python files, and the benchmark data (yfcc_all_est_benchmark.csv).

Check that 7z is available (apt-get install p7zip-full)
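
A minimal availability check (a sketch; prints a hint when 7z is missing):

!which 7z || echo "7z not found - install p7zip-full"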

In [80]:
!cd .. && 7z a -tzip out/release_v0.1.1.zip \
    md/* py/* out/html/* out/figures/* notebooks/01_preparations.ipynb \
    notebooks/02_yfcc_gridagg_raw.ipynb \
    notebooks/03_yfcc_gridagg_hll.ipynb \
    notebooks/04_interpretation.ipynb \
    README.md jupytext.toml nbconvert.tpl \
    out/csv/yfcc_all_est_benchmark.csv \
    -x!py/__pycache__ -x!py/modules/__pycache__ -x!py/modules/.ipynb_checkpoints
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=C.UTF-8,Utf16=on,HugeFiles=on,64 bits,4 CPUs Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz (306F0),ASM,AES-NI)

Scanning the drive:
1 folder, 45 files, 62797729 bytes (60 MiB)

Creating archive: out/release_v0.1.1.zip

Items to compress: 46

Files read from disk: 45
Archive size: 28297418 bytes (27 MiB)
Everything is Ok
In [ ]: