Summary: Aggregation of chi values per country

Alexander Dunkel, TU Dresden, Institute of Cartography; Maximilian Hartmann, Universität Zürich (UZH), Geocomputation


Out[1]:

Last updated: Jan-17-2023, Carto-Lab Docker Version 0.9.0

Introduction

In this notebook, aggregated data per grid bin is used to generate summary data (chi values) per country. We'll use the 50 km grid data to reduce errors from the Modifiable Areal Unit Problem (MAUP). Our goal is to see whether some countries feature a bias towards either sunset or sunrise, to support the discussion of possible context factors.
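As a rough reminder of the measure (the authoritative definition is given in the preceding chained notebooks), the signed chi value compares observed against expected counts per bin. A minimal sketch of one common formulation, assumed here for illustration:

def signed_chi(observed: float, expected: float) -> float:
    """Signed chi sketch: positive where observed exceeds expected
    (e.g. sunset over-represented), negative otherwise. This is one
    common formulation; the exact variant is defined in the
    preceding notebooks of this series."""
    if expected == 0:
        return 0.0
    return (observed - expected) / (expected ** 0.5)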

Preparations

Load dependencies

Import code from other Jupyter notebooks, synced to *.py with jupytext:

In [2]:
import sys
from pathlib import Path
module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    sys.path.append(module_path)
# import all previous chained notebooks
from _04_combine import *
Chromedriver loaded. Svg output enabled.

Load additional dependencies

In [3]:
import requests, zipfile, io

Parameters

Activate autoreload of changed python files:

In [4]:
%load_ext autoreload
%autoreload 2

Via gp.datasets.get_path('naturalearth_lowres'), country geometries are easily available. However, this dataset does not include separate spatial region subunits: all overseas regions of e.g. France are combined into a single geometry. The admin-0 Natural Earth subunits dataset provides these subunit areas separately.
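For comparison, the built-in low-resolution dataset would be loaded as follows (a quick sketch, not used below):

# for comparison only: in the lowres dataset, France is a single
# row, with overseas regions merged into one (multi-)geometry
world_lowres = gp.read_file(
    gp.datasets.get_path('naturalearth_lowres'))
world_lowres[world_lowres["name"] == "France"].plot()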

In [5]:
NE_PATH = Path.cwd().parents[0] / "resources" / "naturalearth"
NE_URI = "https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/50m/cultural/"
NE_FILENAME = "ne_50m_admin_0_map_subunits.zip"

Country aggregation

Load grid geometry

In [6]:
grid_empty = create_grid_df(
    grid_size=50000)
grid_empty = grid_to_gdf(grid_empty)

Load country geometry

In [7]:
def get_zip_extract(
    uri: str, filename: str, output_path: Path, 
    create_path: bool = True, skip_exists: bool = True,
    report: bool = False):
    """Get Zip file and extract to output_path.
    Create Path if not exists."""
    if create_path:
        output_path.mkdir(
            exist_ok=True)
    if skip_exists and Path(
        output_path / filename.replace(".zip", ".shp")).exists():
        if report:
            print("File already exists.. skipping download..")
        return
    r = requests.get(f'{uri}{filename}', stream=True)
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall(output_path)
In [8]:
get_zip_extract(
    uri=NE_URI,
    filename=NE_FILENAME,
    output_path=NE_PATH,
    report=True)
File already exists.. skipping download..

Read country shapefile to GeoDataFrame:

In [9]:
world = gp.read_file(
    NE_PATH / NE_FILENAME.replace(".zip", ".shp"))
In [10]:
world = world.to_crs(CRS_PROJ)
In [11]:
COLUMNS_KEEP = ['geometry','ADM0_A3','SOV_A3','ADMIN','SOVEREIGNT', 'ISO_A3', 'SU_A3']
In [12]:
def drop_cols_except(df: pd.DataFrame, columns_keep: List[str] = COLUMNS_KEEP):
    """Drop all columns from DataFrame except those specified in cols_except"""
    df.drop(
        df.columns.difference(columns_keep), axis=1, inplace=True)
In [13]:
drop_cols_except(world)
In [14]:
world.head()
Out[14]:
SOVEREIGNT SOV_A3 ADMIN ADM0_A3 SU_A3 ISO_A3 geometry
0 Zimbabwe ZWE Zimbabwe ZWE ZWE ZWE POLYGON ((2987278.542 -2742733.921, 2979383.40...
1 Zambia ZMB Zambia ZMB ZMB ZMB POLYGON ((2976200.722 -1924957.705, 2961959.54...
2 Yemen YEM Yemen YEM YEM YEM MULTIPOLYGON (((5181525.454 2047361.573, 51352...
3 Yemen YEM Yemen YEM YES -99 POLYGON ((5307347.563 1557616.990, 5313584.814...
4 Vietnam VNM Vietnam VNM VNM VNM MULTIPOLYGON (((10323687.558 1282070.654, 1032...
In [15]:
world[world["SOVEREIGNT"] == "France"].head(10)
Out[15]:
SOVEREIGNT SOV_A3 ADMIN ADM0_A3 SU_A3 ISO_A3 geometry
204 France FR1 France FRA FXC -99 POLYGON ((783658.015 5100500.165, 782655.467 5...
205 France FR1 France FRA FXX FRA MULTIPOLYGON (((597064.157 5619098.479, 587921...
206 France FR1 France FRA MYT MYT POLYGON ((4456352.999 -1599245.685, 4450078.32...
207 France FR1 France FRA REU REU POLYGON ((5351998.016 -2615032.781, 5337783.57...
208 France FR1 France FRA MTQ MTQ POLYGON ((-5975639.717 1784830.421, -5977605.9...
209 France FR1 France FRA GLP GLP MULTIPOLYGON (((-5993924.853 1996500.313, -600...
210 France FR1 France FRA GUF GUF POLYGON ((-5471007.672 287661.361, -5469831.69...
211 France FR1 Saint Pierre and Miquelon SPM SPM SPM MULTIPOLYGON (((-4445751.267 5530331.017, -444...
212 France FR1 Wallis and Futuna WLF WLF WLF MULTIPOLYGON (((-17360004.526 -1642824.256, -1...
213 France FR1 Saint Martin MAF MAF MAF POLYGON ((-6120916.004 2219901.617, -6131782.6...
In [16]:
world.plot()
Out[16]:
<AxesSubplot:>

Define column to use for country-aggregation:

In [17]:
COUNTRY_COL = 'SU_A3'

These SU_A3 country codes are extended ISO codes; see the Natural Earth reference table for the full mapping. For example, metropolitan France gets the subunit code FXX, while its ISO_A3 code is FRA (cf. the output above).

Only keep COUNTRY_COL column, SOVEREIGNT, and geometry:

In [18]:
columns_keep = ['geometry', 'SOVEREIGNT', COUNTRY_COL]
drop_cols_except(world, columns_keep)

Country overlay with grid

First, write the multi-index to columns, so the index can be re-created after the overlay:

In [19]:
grid_empty['xbin'] = grid_empty.index.get_level_values(0)
grid_empty['ybin'] = grid_empty.index.get_level_values(1)

Create an overlay, only storing the country-grid intersections:

In [20]:
%%time
grid_overlay = gp.overlay(
    grid_empty, world, 
    how='intersection')
CPU times: user 1min 10s, sys: 196 ms, total: 1min 10s
Wall time: 1min 10s
In [21]:
grid_overlay[grid_overlay[COUNTRY_COL] == "DEU"].head()
Out[21]:
xbin ybin SOVEREIGNT SU_A3 geometry
37091 409904 6079952 Germany DEU POLYGON ((459904.000 6031192.645, 457473.846 6...
37092 409904 6029952 Germany DEU POLYGON ((459904.000 5979952.000, 439777.346 5...
37093 409904 5979952 Germany DEU POLYGON ((459904.000 5979952.000, 459904.000 5...
37094 459904 6179952 Germany DEU POLYGON ((509904.000 6129952.000, 495314.233 6...
37095 459904 6129952 Germany DEU MULTIPOLYGON (((509904.000 6079952.000, 502808...
In [22]:
grid_overlay[
    grid_overlay[COUNTRY_COL].isin(["DEU", "FXX"])].plot(
    edgecolor='white', column=COUNTRY_COL, linewidth=0.3)
Out[22]:
<AxesSubplot:>

Calculate area:

In [23]:
grid_overlay["area"] = grid_overlay.area
In [24]:
grid_overlay[grid_overlay[COUNTRY_COL] == "DEU"].head()
Out[24]:
xbin ybin SOVEREIGNT SU_A3 geometry area
37091 409904 6079952 Germany DEU POLYGON ((459904.000 6031192.645, 457473.846 6... 3.886671e+08
37092 409904 6029952 Germany DEU POLYGON ((459904.000 5979952.000, 439777.346 5... 1.891485e+08
37093 409904 5979952 Germany DEU POLYGON ((459904.000 5979952.000, 459904.000 5... 2.931637e+08
37094 459904 6179952 Germany DEU POLYGON ((509904.000 6129952.000, 495314.233 6... 2.813983e+08
37095 459904 6129952 Germany DEU MULTIPOLYGON (((509904.000 6079952.000, 502808... 2.012630e+08
In [25]:
grid_overlay.groupby(["xbin", "ybin"], sort=False).head()
Out[25]:
xbin ybin SOVEREIGNT SU_A3 geometry area
0 -17640096 -1970048 Fiji FJI POLYGON ((-17596068.762 -1988932.933, -1759674... 1.375008e+07
1 -17590096 -2020048 Fiji FJI MULTIPOLYGON (((-17558604.458 -2070048.000, -1... 2.640252e+08
2 -17590096 -2070048 Fiji FJI POLYGON ((-17544904.267 -2070048.000, -1754482... 1.000075e+08
3 -17490096 -2070048 Fiji FJI POLYGON ((-17440096.000 -2113755.099, -1744249... 1.220128e+07
4 -17440096 -2070048 Fiji FJI POLYGON ((-17435352.110 -2120048.000, -1743544... 4.561355e+07
... ... ... ... ... ... ...
70868 17009904 929952 Marshall Islands MHL POLYGON ((17059904.000 883328.346, 17059165.99... 7.021293e+05
70869 17059904 929952 Marshall Islands MHL POLYGON ((17064020.522 879952.000, 17059904.00... 2.124275e+07
70870 17059904 879952 Marshall Islands MHL POLYGON ((17070021.406 879952.000, 17079826.49... 7.197785e+07
70871 17109904 879952 Marshall Islands MHL POLYGON ((17119886.235 867969.633, 17127193.92... 3.998398e+07
70872 16709904 -20048 Nauru NRU POLYGON ((16730605.928 -68100.872, 16728352.51... 2.785668e+07

70873 rows × 6 columns

Next steps:

  • group by xbin/ybin and select max area per group
  • get max id from area comparison per group
  • select the COUNTRY_COL (SU_A3) column to assign values back (based on max area per bin)
In [26]:
idx_maxarea = grid_overlay.groupby(
    ["xbin", "ybin"], sort=False)['area'].idxmax()
In [27]:
idx_maxarea.head()
Out[27]:
xbin       ybin    
-17640096  -1970048    0
-17590096  -2020048    1
           -2070048    2
-17490096  -2070048    3
-17440096  -2070048    4
Name: area, dtype: int64
In [28]:
bin_adm_maxarea = grid_overlay.loc[
    idx_maxarea, ["xbin", "ybin", COUNTRY_COL]]

Recreate the multi-index (note: grid_overlay contains duplicate xbin/ybin combinations, one row per intersecting country):

In [29]:
bin_adm_maxarea.set_index(
    ['xbin', 'ybin'], inplace=True)
In [30]:
bin_adm_maxarea.head()
Out[30]:
SU_A3
xbin ybin
-17640096 -1970048 FJI
-17590096 -2020048 FJI
-2070048 FJI
-17490096 -2070048 FJI
-17440096 -2070048 FJI

Assign back to grid:

In [31]:
grid_empty.loc[
    bin_adm_maxarea.index,
    COUNTRY_COL] = bin_adm_maxarea[COUNTRY_COL]

Set NaN values to the "Empty" class

In [32]:
grid_empty.loc[
    grid_empty[COUNTRY_COL].isna(),
    COUNTRY_COL] = "Empty"
In [33]:
grid_empty[grid_empty[COUNTRY_COL] != "Empty"].head()
Out[33]:
geometry xbin ybin SU_A3
xbin ybin
-17640096 -1970048 POLYGON ((-17640096.000 -1970048.000, -1759009... -17640096 -1970048 FJI
-17590096 -2020048 POLYGON ((-17590096.000 -2020048.000, -1754009... -17590096 -2020048 FJI
-2070048 POLYGON ((-17590096.000 -2070048.000, -1754009... -17590096 -2070048 FJI
-17540096 -1720048 POLYGON ((-17540096.000 -1720048.000, -1749009... -17540096 -1720048 WLF
-17490096 -570048 POLYGON ((-17490096.000 -570048.000, -17440096... -17490096 -570048 KIR

Check assignment

In [34]:
fig, ax = plt.subplots(1, 1, figsize=(10,12))
bbox_italy = (
    7.8662109375, 36.24427318493909,
    19.31396484375, 43.29320031385282)
buf = 1000000
# create bounds from WGS1984 Italy bbox and project to Mollweide
minx, miny = PROJ_TRANSFORMER.transform(
    bbox_italy[0], bbox_italy[1])
maxx, maxy = PROJ_TRANSFORMER.transform(
    bbox_italy[2], bbox_italy[3])
ax.set_xlim(minx-buf, maxx+buf)
ax.set_ylim(miny-buf, maxy+buf)
empty = grid_empty[grid_empty[COUNTRY_COL] == "Empty"].plot(
    ax=ax, edgecolor=None, facecolor='white', linewidth=0)
base = grid_empty[grid_empty[COUNTRY_COL] != "Empty"].plot(
    ax=ax, edgecolor='white', column=COUNTRY_COL, linewidth=0.3)
world.plot(
    ax=base, color='none', edgecolor='black', linewidth=0.2)
Out[34]:
<AxesSubplot:>
In [35]:
grid_empty.head()
Out[35]:
geometry xbin ybin SU_A3
xbin ybin
-18040096 8979952 POLYGON ((-18040096.000 8979952.000, -17990096... -18040096 8979952 Empty
8929952 POLYGON ((-18040096.000 8929952.000, -17990096... -18040096 8929952 Empty
8879952 POLYGON ((-18040096.000 8879952.000, -17990096... -18040096 8879952 Empty
8829952 POLYGON ((-18040096.000 8829952.000, -17990096... -18040096 8829952 Empty
8779952 POLYGON ((-18040096.000 8779952.000, -17990096... -18040096 8779952 Empty

Combine in a single method:

In [36]:
def grid_assign_country(
    grid: gp.GeoDataFrame, countries: gp.GeoDataFrame,
    country_col: Optional[str] = COUNTRY_COL):
    """Assign countries code based on max area overlay to grid"""
    # get index as column
    grid['xbin'] = grid.index.get_level_values(0)
    grid['ybin'] = grid.index.get_level_values(1)
    # intersect
    grid_overlay = gp.overlay(
        grid, countries, 
        how='intersection')
    # calculate area
    grid_overlay["area"] = grid_overlay.area
    # select indexes based on area overlay
    idx_maxarea = grid_overlay.groupby(
        ["xbin", "ybin"], sort=False)["area"].idxmax()
    bin_country_maxarea = grid_overlay.loc[
        idx_maxarea, ["xbin", "ybin", country_col]]
    # recreate grid index
    bin_country_maxarea.set_index(
        ['xbin', 'ybin'], inplace=True)
    # assign country back to grid
    grid.loc[
        bin_country_maxarea.index,
        country_col] = bin_country_maxarea[country_col]
    # drop index columns not needed anymore
    grid.drop(
        ["xbin", "ybin"], axis=1, inplace=True)

Optional: TFIDF country data

For exploring TFIDF per country, we'll also export country data for PostGIS here.

Grid to Postgis

For further work, we'll export the SQL syntax here to import the 100 km grid to PostGIS (optional).

CREATE TABLE spatial. "grid100km" (
    xbin int,
    ybin int,
    PRIMARY KEY (xbin, ybin),
    su_a3 char(3),
    geometry geometry(Polygon, 54009)
);
In [37]:
grid_empty[grid_empty[COUNTRY_COL] == "USB"].plot()
Out[37]:
<AxesSubplot:>
In [38]:
from shapely.wkt import dumps as wkt_dumps
    
def add_wkt_col(gdf: gp.GeoDataFrame):
    """Converts gdf.geometry to WKT (Well-Known-Text) as new column
    Requires `from shapely.wkt import dumps as wkt_dumps`"""
    gdf.loc[gdf.index, "geom_wkt"] = gdf["geometry"].apply(lambda x: wkt_dumps(x))
In [39]:
add_wkt_col(grid_empty)

value_list = ',\n'.join(
    f"({index[0]},{index[1]}, '{row[COUNTRY_COL]}', 'SRID=54009;{row.geom_wkt}')"
    for index, row in grid_empty[grid_empty[COUNTRY_COL] != "Empty"].iterrows())
with open(OUTPUT / "csv" / "grid_100km_Mollweide_WKT.sql", 'w') as wkt_file:
    wkt_file.write(
        f'''
        INSERT INTO spatial."grid100km" (
            xbin,ybin,su_a3,geometry)
        VALUES
        {value_list};
        ''')
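To actually run the generated file against a PostGIS instance, a sketch from Python could look like this (the connection parameters are illustrative; a writable connection is required, unlike the read-only hll worker connection used further below):

import psycopg2

# illustrative connection parameters for a writable PostGIS db
conn_write = psycopg2.connect(
    host="localhost", port=5432,
    dbname="gis", user="postgres", password="...")
with open(OUTPUT / "csv" / "grid_100km_Mollweide_WKT.sql") as f:
    sql_import = f.read()
# the connection context manager commits on success
with conn_write, conn_write.cursor() as cur:
    cur.execute(sql_import)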

Country (su_a3) to Postgis

CREATE TABLE spatial. "country_sua3" (
    su_a3 char(3),
    PRIMARY KEY (su_a3),
    geometry geometry(Geometry, 54009)
);
In [40]:
world_empty = world.copy().set_index("SU_A3").drop(columns=['SOVEREIGNT'])
add_wkt_col(world_empty)
In [41]:
world_empty.head()
Out[41]:
geometry geom_wkt
SU_A3
ZWE POLYGON ((2987278.542 -2742733.921, 2979383.40... POLYGON ((2987278.5419494602829218 -2742733.92...
ZMB POLYGON ((2976200.722 -1924957.705, 2961959.54... POLYGON ((2976200.7224715156480670 -1924957.70...
YEM MULTIPOLYGON (((5181525.454 2047361.573, 51352... MULTIPOLYGON (((5181525.4542204160243273 20473...
YES POLYGON ((5307347.563 1557616.990, 5313584.814... POLYGON ((5307347.5630814759060740 1557616.990...
VNM MULTIPOLYGON (((10323687.558 1282070.654, 1032... MULTIPOLYGON (((10323687.5583364013582468 1282...
In [42]:
value_list = ',\n'.join(
    f"('{index}', 'SRID=54009;{row.geom_wkt}')"
    for index, row in world_empty.iterrows())
with open(OUTPUT / "csv" / "country_sua3_Mollweide_WKT.sql", 'w') as wkt_file:
    wkt_file.write(
        f'''
        INSERT INTO spatial. "country_sua3" (
            su_a3,geometry)
        VALUES
        {value_list};
        ''')

Load benchmark data

In [43]:
grid = pd.read_pickle(
    OUTPUT / "pickles_50km" / "flickr_userdays_all_est_hll.pkl")
In [44]:
grid["userdays_hll"].dropna().head()
Out[44]:
xbin       ybin   
-18040096   29952     \x138b4006221e8426e32a853ee4512366876f017dc384...
           -20048                                          \x138b4038c3
           -70048                                          \x138b402f41
-17840096   129952                                         \x138b401ae1
           -20048                                          \x138b403fe2
Name: userdays_hll, dtype: object
In [45]:
grid.dropna().head()
Out[45]:
userdays_hll geometry
xbin ybin
-18040096 29952 \x138b4006221e8426e32a853ee4512366876f017dc384... POLYGON ((-18040096.000 29952.000, -17990096.0...
-20048 \x138b4038c3 POLYGON ((-18040096.000 -20048.000, -17990096....
-70048 \x138b402f41 POLYGON ((-18040096.000 -70048.000, -17990096....
-17840096 129952 \x138b401ae1 POLYGON ((-17840096.000 129952.000, -17790096....
-20048 \x138b403fe2 POLYGON ((-17840096.000 -20048.000, -17790096....

Assign Countries to grid:

The actual assignment takes quite long. Below, we therefore store the assignment (bin-idx, country-idx) in a pickle that can be re-loaded in later runs.
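The store and load pattern, implemented below in group_union_cardinality, boils down to the following sketch (the pickle file name is illustrative):

# store the per-bin country assignment once..
grid[COUNTRY_COL].to_pickle(
    NE_PATH / "grid_country_assignment.pickle")
# ..in later runs, load and merge it back,
# instead of repeating the slow overlay
grid_countries_tmp = pd.read_pickle(
    NE_PATH / "grid_country_assignment.pickle").to_frame()
grid.loc[
    grid_countries_tmp.index,
    COUNTRY_COL] = grid_countries_tmp[COUNTRY_COL]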

In [46]:
%%time
grid_assign_country(
    grid, world, country_col=COUNTRY_COL)
CPU times: user 1min 12s, sys: 95.7 ms, total: 1min 12s
Wall time: 1min 12s
In [47]:
grid.head()
Out[47]:
userdays_hll geometry SU_A3
xbin ybin
-18040096 8979952 NaN POLYGON ((-18040096.000 8979952.000, -17990096... NaN
8929952 NaN POLYGON ((-18040096.000 8929952.000, -17990096... NaN
8879952 NaN POLYGON ((-18040096.000 8879952.000, -17990096... NaN
8829952 NaN POLYGON ((-18040096.000 8829952.000, -17990096... NaN
8779952 NaN POLYGON ((-18040096.000 8779952.000, -17990096... NaN
In [48]:
grid.plot(edgecolor='white', column=COUNTRY_COL, figsize=(22,28), linewidth=0.3)
Out[48]:
<AxesSubplot:>

Merge hll sets per country and estimate cardinality

Merge hll sets per country id, connect to hll worker db:

In [49]:
DB_CONN = psycopg2.connect(
        host=DB_HOST,
        port=DB_PORT ,
        dbname=DB_NAME,
        user=DB_USER,
        password=DB_PASS
)
DB_CONN.set_session(
    readonly=True)
DB_CALC = tools.DbConn(
    DB_CONN)
CUR_HLL = DB_CONN.cursor()
In [50]:
db_conn = tools.DbConn(DB_CONN)
db_conn.query("SELECT 1;")
Out[50]:
?column?
0 1

HLL Union

In [51]:
def union_hll(
    hll_series: pd.Series, db_conn: tools.DbConn, cardinality: bool = True,
    group_by: Optional[pd.Series] = None) -> pd.Series:
    """HLL Union and (optional) cardinality estimation from series of hll sets
    based on group by composite index.

    Args:
        hll_series: Indexed series (bins) of hll sets. 
        cardinality: If True, returns cardinality (counts). Otherwise,
            the unioned hll set will be returned.
        group_by: Optional Provide Series to group hll sets by. If None,
            Index will be used.
            
    The method will combine all groups of hll sets first,
        in a single SQL command. Union of hll hll-sets belonging 
        to the same group (bin) and (optionally) returning the cardinality 
        (the estimated count) per group will be done in postgres.
    
    By utilizing Postgres´ GROUP BY (instead of, e.g. doing 
        the group with numpy), it is possible to reduce the number
        of SQL calls to a single run, which saves overhead 
        (establishing the db connection, initializing the SQL query 
        etc.). Also note that ascending integers are used for groups,
        instead of their full original bin-ids, which also reduces
        transfer time.
    
    cardinality = True should be used when calculating counts in
        a single pass.
        
    cardinality = False should be used when incrementally union
        of hll sets is required, e.g. due to size of input data.
        In the last run, set to cardinality = True.
    """
    if group_by is None:
        group_by_series = hll_series.index
    else:
        group_by_series = group_by
    # group all hll-sets per index (bin-id)
    series_grouped = hll_series.groupby(
        group_by_series).apply(list)
    # From grouped hll-sets,
    # construct a single SQL Value list;
    # if the following nested list comprehension
    # doesn't make sense to you, have a look at
    # spapas.github.io/2016/04/27/python-nested-list-comprehensions/
    # with a description on how to 'unnest'
    # nested list comprehensions to regular for-loops
    hll_values_list = ",".join(
        [f"({ix}::int,'{hll_item}'::hll)" 
         for ix, hll_items
         in enumerate(series_grouped.values.tolist())
         for hll_item in hll_items])
    # Compilation of SQL query,
    # depending on whether to return the cardinality
    # of unioned hll or the unioned hll
    return_col = "hll_union"
    hll_calc_pre = ""
    hll_calc_tail = "AS hll_union"
    if cardinality:
        # add sql syntax for cardinality 
        # estimation
        # (get count distinct from hll)
        return_col = "hll_cardinality"
        hll_calc_pre = "hll_cardinality("
        hll_calc_tail = ")::int"
    db_query = f"""
        SELECT sq.{return_col} FROM (
            SELECT s.group_ix,
                   {hll_calc_pre}
                   hll_union_agg(s.hll_set)
                   {hll_calc_tail}
            FROM (
                VALUES {hll_values_list}
                ) s(group_ix, hll_set)
            GROUP BY group_ix
            ORDER BY group_ix ASC) sq
        """
    df = db_conn.query(db_query)
    # to merge values back to grouped dataframe,
    # first reset index to ascending integers
    # matching those of the returned df;
    # this will turn series_grouped into a DataFrame;
    # the previous index will still exist in column 'index'
    df_grouped = series_grouped.reset_index()
    # drop hll sets not needed anymore
    df_grouped.drop(columns=[hll_series.name], inplace=True)
    # append hll_cardinality counts 
    # using matching ascending integer indexes
    df_grouped.loc[df.index, return_col] = df[return_col]
    # set index back to original bin-ids
    df_grouped.set_index(group_by_series.name, inplace=True)
    # return column as indexed pd.Series
    return df_grouped[return_col]
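To make the VALUES construction concrete, here is a toy example (no database needed) of the list that union_hll builds before submitting the query:

# toy hll sets (shortened) with country codes as group keys,
# mimicking hll_series and group_by in union_hll()
toy = pd.Series(
    [r"\x138b4038c3", r"\x138b402f41", r"\x138b401ae1"],
    index=pd.Index(["DEU", "DEU", "FRA"], name="SU_A3"),
    name="userdays_hll")
series_grouped = toy.groupby(toy.index).apply(list)
hll_values_list = ",".join(
    f"({ix}::int,'{hll_item}'::hll)"
    for ix, hll_items in enumerate(series_grouped.values.tolist())
    for hll_item in hll_items)
print(hll_values_list)
# (0::int,'\x138b4038c3'::hll),(0::int,'\x138b402f41'::hll),(1::int,'\x138b401ae1'::hll)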
In [52]:
%%time
cardinality_series = union_hll(
    hll_series=grid["userdays_hll"].dropna(),
    group_by=grid[COUNTRY_COL],
    db_conn=db_conn)
CPU times: user 225 ms, sys: 64.8 ms, total: 290 ms
Wall time: 1.08 s
In [53]:
cardinality_series.head()
Out[53]:
SU_A3
ABW    6548
ACA    4837
ACB     207
AFG    7472
AGO    4243
Name: hll_cardinality, dtype: int64

Assign HLL Cardinality to Countries

In [54]:
world.set_index(COUNTRY_COL, inplace=True)
In [55]:
world.loc[
    cardinality_series.index,
    "userdays_est"] = cardinality_series

Calculate area and normalize:

In [56]:
world["area"] = world.area
world["userdays_est_norm"] = ((world.userdays_est ** 2) / world.area)
In [57]:
world.head()
Out[57]:
SOVEREIGNT geometry userdays_est area userdays_est_norm
SU_A3
ZWE Zimbabwe POLYGON ((2987278.542 -2742733.921, 2979383.40... 4969.0 3.910570e+11 0.000063
ZMB Zambia POLYGON ((2976200.722 -1924957.705, 2961959.54... 8066.0 7.572867e+11 0.000086
YEM Yemen MULTIPOLYGON (((5181525.454 2047361.573, 51352... 2950.0 4.516298e+11 0.000019
YES Yemen POLYGON ((5307347.563 1557616.990, 5313584.814... 346.0 3.526072e+09 0.000034
VNM Vietnam MULTIPOLYGON (((10323687.558 1282070.654, 1032... 214834.0 3.296746e+11 0.139998

Preview plot:

In [58]:
fig, ax = plt.subplots(1, 1, figsize=(22,28))
world.plot(
    column='userdays_est_norm',
    cmap='OrRd',
    ax=ax,
    linewidth=0.2,
    edgecolor='grey',
    legend=True,
    scheme='headtail_breaks')
Out[58]:
<AxesSubplot:>

Prepare method:

In [59]:
def merge_countries_grid(
    grid_countries: gp.GeoDataFrame, grid: gp.GeoDataFrame, 
    usecols: List[str], mask: Optional[pd.Series] = None):
    """Merge temporary country assigned grid data to grid geodataframe
    
    Args:
        grid_countries: Indexed GeoDataFrame with country data
        grid: Indexed GeoDataFrame target
        usecols: Col names to merge from countries
        mask: Optional boolean mask (pd.Series)
              to partially merge country data
    """
    for col in usecols:
        if mask is None:
            grid.loc[grid_countries.index, col] = grid_countries[col]
            continue
        grid_countries_mask = grid_countries.loc[mask, col]
        grid.loc[grid_countries_mask.index, col] = grid_countries_mask
        
def group_union_cardinality(
    countries: gp.GeoDataFrame, grid: gp.GeoDataFrame, 
    db_conn: tools.DbConn, 
    metric_hll: str = "usercount_hll",
    country_col: str = COUNTRY_COL,
    grid_countries_pickle: Optional[Path] = None):
    """Group hll sets per country and assign cardinality to countries
    
    Args:
        grid: Indexed GeoDataFrame with hll data to use in union
        countries: Country GeoDataFrame to group grid-bins and assign
            cardinality.
        country_col: Name/ID of column in country to use in group by
        metric_hll: the name of HLL column that is used in hll union
        db_conn: A (read-only) DB connection to PG HLL Worker, for HLL
            calculation.
        grid_countries_pickle: Optional path to store and load intermediate
            grid-country assignment
    """
    if grid_countries_pickle and grid_countries_pickle.exists():
        grid_countries_tmp = pd.read_pickle(
            grid_countries_pickle).to_frame()
        merge_countries_grid(
            grid_countries=grid_countries_tmp,
            grid=grid,
            usecols=[country_col])
    else:
        grid_assign_country(
            grid, countries, country_col=country_col)
        # store intermediate, to speed up later runs    
        if grid_countries_pickle:
            grid[country_col].to_pickle(
                grid_countries_pickle)
            print("Intermediate country-grid assignment written..")
    # calculate cardinality by hll union
    cardinality_series = union_hll(
        hll_series=grid[metric_hll].dropna(),
        group_by=grid[country_col],
        db_conn=db_conn)
    # set index
    if not countries.index.name == country_col:
        countries.set_index(country_col, inplace=True)
    # assign cardinality
    countries.loc[
        cardinality_series.index,
        metric_hll.replace("_hll", "_est")] = cardinality_series
In [60]:
def country_agg_frompickle(
    grid_pickle: Path, db_conn: tools.DbConn,
    ne_path: Path = NE_PATH, ne_filename: str = NE_FILENAME, 
    ne_uri: str = NE_URI, country_col: str = COUNTRY_COL,
    metric_hll: str = "usercount_hll") -> gp.GeoDataFrame:
    """Load grid pickle, load country shapefile, join cardinality to country
    and return country GeoDataFrame"""
    # prepare country gdf
    get_zip_extract(
        uri=ne_uri,
        filename=ne_filename,
        output_path=ne_path)
    world = gp.read_file(
        ne_path / ne_filename.replace(".zip", ".shp"))
    world = world.to_crs(CRS_PROJ)
    columns_keep = ['geometry', country_col]
    drop_cols_except(world, columns_keep)
    # prepare grid gdf
    grid = pd.read_pickle(
        grid_pickle)
    # check if intermediate country agg file already exists
    # (km_size_str is expected to come from the chained notebook imports)
    grid_countries_pickle = Path(
        ne_path / f'{km_size_str}_{ne_filename.replace(".zip", ".pickle")}')
    group_union_cardinality(
        world, grid, country_col=country_col, db_conn=db_conn,
        grid_countries_pickle=grid_countries_pickle, metric_hll=metric_hll)
    return world

Test:

  • for userdays
  • for usercount
In [61]:
metrics = ["userdays", "usercount"]
In [62]:
%%time
for metric in metrics:
    world = country_agg_frompickle(
        grid_pickle=OUTPUT / "pickles_50km" / f"flickr_{metric}_all_est_hll.pkl",
        db_conn=db_conn, metric_hll=f"{metric}_hll")
    fig, ax = plt.subplots(1, 1, figsize=(22,28))
    ax.set_title(metric.capitalize())
    world.plot(
        column=f'{metric}_est',
        cmap='OrRd',
        ax=ax,
        linewidth=0.2,
        edgecolor='grey',
        legend=True,
        scheme='headtail_breaks')
CPU times: user 7.87 s, sys: 395 ms, total: 8.26 s
Wall time: 9.63 s