Relationships ¶

Alexander Dunkel, TU Dresden, Institute of Cartography; Maximilian Hartmann, Universität Zürich (UZH), Geocomputation; Ross Purves, Universität Zürich (UZH), Geocomputation

Introduction¶

In this notebook, we study relationships between different sets of information:

bias for sunset and sunrise per grid bin
bias for Instagram and Flickr
bias for different metrics (userdays, usercount, postcount)

See the introduction to Correlation With Python.

TODO: Notebook cleanup

Preparations¶

Load dependencies¶

Import code from other jupyter notebooks, synced to *.py with jupytext:

import sys
from pathlib import Path
module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    sys.path.append(module_path)
# import all previous chained notebooks
from _04_combine import *

Chromedriver loaded. Svg output enabled.

Activate autoreload of changed python files:

%load_ext autoreload
%autoreload 2

Load additional dependencies

import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

Parameters¶

Define which metric to use for the relationship study.
Define whether relationships are studied only for grid bins with significant chi-square values.

METRIC = 'usercount'
ONLY_SIGNIFICANT = False
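
A minimal sketch of how a flag such as ONLY_SIGNIFICANT could be applied before computing relationships. It assumes a boolean column named significant (the name appears in the list of dropped columns further down); the toy DataFrame and its values are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

ONLY_SIGNIFICANT = True

# toy stand-in for the grid (hypothetical values)
toy_grid = pd.DataFrame({
    "usercount_est_sunset": [10, 200, 35],
    "significant": [False, True, True],
})

# keep only significant bins if the flag is set, otherwise keep all bins
subset = toy_grid[toy_grid["significant"]] if ONLY_SIGNIFICANT else toy_grid
print(len(subset))
```

Filtering before dropping the significant column keeps the parameter and the data selection in one place.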

The column for this metric ends with 'est', due to HLL estimates:

METRIC_COL = f"{METRIC}_est"

Load data¶

Load grid data using the chimaps_fromcsv() method. This data includes the absolute measurements.

grid = chimaps_fromcsv(
    plot=False, chi_column=METRIC_COL, normalize=False)

Drop cols not needed (chi data, summary data for merged chi values sunset and sunrise):

drop_cols = [
    'chi_value_sunset', 'chi_value_sunrise', 'chi_value', 'significant',
    'underrepresented', 'significant_sunset', 'significant_sunrise',
    'usercount_est_expected', 'userdays_est_expected', 'postcount_est_expected']
grid.drop(columns=drop_cols, inplace=True, errors='ignore')

grid.columns

Index(['geometry', 'postcount_est_sunset', 'usercount_est_sunset',
       'userdays_est_sunset', 'usercount_est_sunrise', 'postcount_est_sunrise',
       'userdays_est_sunrise'],
      dtype='object')

Prepare Data¶

Metric Comparison: Usercount, Userdays, Postcount¶

Single very active users may have significant influence on userdays and postcount metrics.

One question here is, how large is the influence of very active, single users on aggregate metrics per bin?

To highlight this influence in graphics, calculate the ratio between postcount (or userdays) and usercount per bin, and classify the top 10 ratios.

Calculate the number of posts per user (postcount / usercount) per grid bin
Classify ratios (high/low)
Colorize relationship plots with classes
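
A minimal sketch of the first two steps, using made-up toy values rather than the actual grid. Interpreting "top10" as the top decile of ratios is an assumption, as are the column names posts_per_user and ratio_class.

```python
import numpy as np
import pandas as pd

# toy stand-in for the grid (values are made up for illustration)
toy = pd.DataFrame({
    "postcount_est_sunset": [100, 50, 400, 30, 0],
    "usercount_est_sunset": [50, 25, 20, 30, 0],
})

# posts per user; bins without users yield NaN instead of a division by zero
users = toy["usercount_est_sunset"].replace(0, np.nan)
toy["posts_per_user"] = toy["postcount_est_sunset"] / users

# classify: ratios at or above the 90th percentile are "high" (assumption)
threshold = toy["posts_per_user"].quantile(0.9)
toy["ratio_class"] = np.where(
    toy["posts_per_user"] >= threshold, "high", "low")
print(toy)
```

Bins with a NaN ratio end up in the "low" class here; a separate class for empty bins may be preferable in practice. The resulting class column could then be passed as hue to seaborn's scatterplot.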

grid[grid['postcount_est_sunset'] >= 1000].drop(columns=['geometry']).head()

SUR_NAME_LIST = [
    '_est_sunset',
    '_est_sunrise']

Create Plots¶

Plot relationship: Chi square sunset (x) and sunrise (y) per grid bin¶

Relationship between chi square values for sunset and sunrise for different grid bins.

sns.set_theme(style="whitegrid")

Prepare annotation:

Add plot annotation for r, p, covariance and r².
Adapted from source and r²

from matplotlib.lines import Line2D

def annotate(
    data, x_col, y_col, ranked=False, **kws):
    """Add r, p, covariance and r² to plot, format legend"""
    x = data[x_col]
    y = data[y_col]
    # exclude rows where either value is missing
    nas = np.logical_or(x.isna(), y.isna())
    x, y = x[~nas], y[~nas]
    cov = None
    if not ranked:
        r, p = stats.pearsonr(x, y)
        # covariance
        cov = np.cov(x, y)[0][1]
    else:
        r, p = stats.spearmanr(x, y)
    # r² (squared Pearson correlation of the plotted columns)
    correlation_matrix = np.corrcoef(x, y)
    correlation_xy = correlation_matrix[0, 1]
    r_squared = correlation_xy**2
    # update legend
    ax = plt.gca()
    handles, labels = ax.get_legend_handles_labels()
    # compare with None, so that values of exactly 0 are still shown
    label = ""
    if r is not None:
        label = f"{label} \nStatistics:\nr={r:.2f},"
    if p is not None:
        label = f"{label} \np={p:.2g},"
    if cov is not None:
        label = f"{label} \ncov={cov:.2f}"
    if r_squared is not None:
        label = f"{label} \nr²={r_squared:.2f}"
    patch = Line2D(
        [0], [0],
        color=None,
        linestyle="None",
        label=label)
    handles.append(patch)
    plt.legend(
        handles=handles, loc='upper left',
        bbox_to_anchor=(1.04,1), frameon=False)
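
Missing values matter for the statistics shown in the annotation: NumPy's correlation and covariance helpers propagate NaN, so rows with missing values need to be excluded first. A small sketch with made-up values, independent of the notebook's data:

```python
import numpy as np
import pandas as pd

# toy series with one missing value (made up for illustration)
x = pd.Series([1.0, 2.0, np.nan, 4.0])
y = pd.Series([2.0, 4.1, 6.0, 8.2])

# without masking, the NaN propagates into the result
unmasked = np.corrcoef(x, y)[0, 1]

# with masking, only complete pairs are used
mask = x.isna() | y.isna()
masked = np.corrcoef(x[~mask], y[~mask])[0, 1]
print(unmasked, masked)
```

This is relevant for the ranked columns used later, since bins with a value of 0 receive no rank and therefore appear as NaN.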

Plot

def relationship_plot(
    data: gp.GeoDataFrame,
    title: str,
    x_col: str = f'{METRIC_COL}_sunset_cbrt',
    y_col: str = f'{METRIC_COL}_sunrise_cbrt',
    x_label: str = f'{METRIC} Sunset (cube root)',
    y_label: str = f'{METRIC} Sunrise (cube root)',
    figsize: Tuple[int, int] = (7, 7),
    plot_context: str = "100 km grid bin"):
    """Create relationship plot"""
    fig, ax = plt.subplots(figsize=figsize)
    fig.suptitle(
        title,
        fontsize=12, y=0)
    scatterplot_kwarg = {
        "ax":ax,
        "edgecolors":"white",
        "linewidth":1,
        "x":x_col,
        "y":y_col,
    }

    g = sns.scatterplot(
        data=data,
        color='grey', **scatterplot_kwarg, label=plot_context)
    # get topic for y and x axis (e.g. sunset, sunrise)
    y_topic = y_label.split('(')[0].split()[-1]
    x_topic = x_label.split('(')[0].split()[-1]
    # add numbers to plot
    ranked = False
    if x_col.endswith('_rank'):
        ranked=True
    annotate(
        data=data,
        x_col=x_col,
        y_col=y_col,
        ranked=ranked)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)

Relationship metrics: Covariance, Correlation Coefficients¶

Covariance¶

Covariance can be calculated with or without weights. Since we are using absolute frequencies (here: the metric selected in METRIC), we do not use weights.

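See the NumPy documentation for np.cov. For two variables a and b, it returns the covariance matrix:

$$
\begin{pmatrix}
\operatorname{cov}(a,a) & \operatorname{cov}(a,b) \\
\operatorname{cov}(a,b) & \operatorname{cov}(b,b)
\end{pmatrix}
$$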

Without weights:

covariance = np.cov(
    grid[f'{METRIC_COL}_sunset'], grid[f'{METRIC_COL}_sunrise'])
print(covariance)

[[10180.45109038  2750.78933315]
 [ 2750.78933315   888.52663159]]

Output as a single number

print(covariance[0][1])

2750.789333146508

As expected, sunset and sunrise reactions have a positive relationship: In other words, typically, where people react to sunset, reactions to sunrise are also found, and vice versa.

A problem with covariance as a statistical tool alone is that it is challenging to interpret. This leads us to the Pearson’s correlation coefficient next.
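
One reason covariance is hard to interpret is that it depends on the scale of the input values. The sketch below uses made-up toy counts (not the notebook's data) to show that rescaling one variable changes the covariance, while the correlation coefficient stays the same.

```python
import numpy as np

# toy counts per bin (made up for illustration)
sunset = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
sunrise = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# covariance scales with the units of the input
cov_original = np.cov(sunset, sunrise)[0, 1]
cov_rescaled = np.cov(sunset * 100, sunrise)[0, 1]

# the correlation coefficient is unaffected by rescaling
r_original = np.corrcoef(sunset, sunrise)[0, 1]
r_rescaled = np.corrcoef(sunset * 100, sunrise)[0, 1]

print(cov_original, cov_rescaled)
print(r_original, r_rescaled)
```

For the same reason, covariance values computed for datasets of very different size (such as Flickr and Instagram counts) cannot be compared directly.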

Correlation Coefficients¶

Unlike covariance, correlation coefficients do not offer the ability to include weights (z-values).
The focus here is therefore only on the correlation between absolute values.

(Pearson correlation coefficient)

Userdays:

x = grid[f'{METRIC_COL}_sunset']
y = grid[f'{METRIC_COL}_sunrise']
r = np.corrcoef(x, y)
r

array([[1.        , 0.91461558],
       [0.91461558, 1.        ]])

Same as:

from scipy.stats import pearsonr
corr, _ = pearsonr(x, y)
print('Pearsons correlation: %.3f' % corr)

Pearsons correlation: 0.915
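
Pearson's r is the covariance divided by the product of both standard deviations. A short sketch with made-up values (independent of the grid data) that checks this against np.corrcoef; the variable names are chosen for illustration only.

```python
import numpy as np

# toy values (made up for illustration)
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# covariance normalized by both sample standard deviations (ddof=1,
# matching the default normalization of np.cov)
cov_ab = np.cov(a, b)[0, 1]
r_manual = cov_ab / (np.std(a, ddof=1) * np.std(b, ddof=1))

# reference value from NumPy
r_numpy = np.corrcoef(a, b)[0, 1]
print(r_manual, r_numpy)
```

This normalization is what bounds the coefficient and makes it comparable across datasets of different scale.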

The coefficient takes a value between -1 and 1, ranging from a full negative correlation (-1) to a full positive correlation (1). A value of 0 means no correlation. The value must be interpreted: often, a value below -0.5 or above 0.5 is taken to indicate a notable correlation, while values closer to 0 suggest a less notable correlation.

Interpretation for sunset/sunrise: There is a notable relationship between the two events, in that sunrise reactions tend to increase when sunset reactions increase. In other words, locations where people prefer to view sunsets also tend to be suitable for viewing sunrises, as already observed with the covariance above.

User Count:

Compare to user counts

x = grid[f"usercount_est_sunset"]
y = grid[f"usercount_est_sunrise"]
r = np.corrcoef(x, y)
display(pd.DataFrame(r))

The correlation is considerably stronger for usercounts than for userdays, indicating a higher variability of the userday measurement.

Post Count:

Compare to post counts

x = grid[f"postcount_est_sunset"]
y = grid[f"postcount_est_sunrise"]
r = np.corrcoef(x, y)
display(pd.DataFrame(r))

Surprisingly, post count correlation is higher than userday correlation, which would mean that the userday measurement has the highest variability of all measurements.

Spearman rank-order correlation:

Since we use ranked data, the Spearman rank-order correlation test is more appropriate here.

x = grid[f'{METRIC_COL}_sunset']
y = grid[f'{METRIC_COL}_sunrise']
corr, pval = stats.spearmanr(x, y)
print(f'Spearman correlation: {corr:.3f} with p={pval:.3f}')

Spearman correlation: 0.783 with p=0.000
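
Spearman's coefficient is Pearson's coefficient computed on ranks, which makes it sensitive to monotonic rather than strictly linear relationships. A sketch with made-up values that contain one extreme outlier:

```python
import numpy as np
from scipy import stats

# toy values: monotonic relationship with an extreme value in x
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
y_toy = np.array([1.0, 4.0, 9.0, 16.0, 25.0])

# Spearman on the raw values
rho, _ = stats.spearmanr(x_toy, y_toy)

# Pearson on the ranks gives the same result
r_on_ranks, _ = stats.pearsonr(
    stats.rankdata(x_toy), stats.rankdata(y_toy))

# Pearson on the raw values is pulled down by the outlier
r_raw, _ = stats.pearsonr(x_toy, y_toy)
print(rho, r_on_ranks, r_raw)
```

Because the toy relationship is perfectly monotonic, the rank-based coefficient is 1, while the coefficient on raw values is lower.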

Ranking Correlation

To focus on the relationship rather than the relative distribution of values, it is possible to compare ranks (e.g. of grid bins or countries).

def rank_series(series: pd.Series) -> pd.Series:
    """Create ranking for all values > 0 in series (1, 2..., x)
    and return as series of ranks (float, ties share their average rank).
    """
    return series[series > 0].rank()

def rank_cols(grid: pd.DataFrame, topic1="sunset", topic2="sunrise", metric_col = METRIC_COL):
    """Create ranks for two columns in grid, store as new cols"""
    grid[f'{metric_col}_{topic1}_rank'] =  rank_series(grid[f'{metric_col}_{topic1}'])
    grid[f'{metric_col}_{topic2}_rank'] =  rank_series(grid[f'{metric_col}_{topic2}'])
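
A small sketch of what the ranking does on made-up values. The function is repeated here only to keep the example self-contained; the column names v and v_rank are hypothetical.

```python
import pandas as pd

def rank_series(series: pd.Series) -> pd.Series:
    """Rank all values > 0 (same logic as in the cell above)"""
    return series[series > 0].rank()

# toy column with a zero and a tie
toy_ranks = pd.DataFrame({"v": [0, 10, 10, 30]})
toy_ranks["v_rank"] = rank_series(toy_ranks["v"])
print(toy_ranks)
```

Zero values are excluded from the ranking and become NaN on assignment, and tied values share their average rank (1.5 for both values of 10).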

Relationship between values from Sunset and Sunrise (100km grid)¶

We'll use ranked comparison for the relationship plots below.

grid_sunrise = grid_agg_fromcsv(OUTPUT / "csv" / "flickr_sunrise_est.csv")
grid_sunset = grid_agg_fromcsv(OUTPUT / "csv" / "flickr_sunset_est.csv")

grid_sunrise.head()

Calculate ranks from absolute numbers.

def rank_cols_dfs(df1, df2, topic1="sunrise", topic2="sunset", metric=METRIC):
    """Rank columns of df_sunset, df_sunrise"""
    metric_col = metric
    if metric != "chi_value":
        metric_col = f"{metric}_est"
    df1[f'{metric}_{topic1}_rank'] = rank_series(
       df1[metric_col])
    df2[f'{metric}_{topic2}_rank'] = rank_series(
       df2[metric_col])

rank_cols_dfs(grid_sunrise, grid_sunset, metric=METRIC)

Merge

def merge_df_topics(df1, df2, topic1="sunrise", topic2="sunset", metric=METRIC, ranked: bool = True) -> pd.DataFrame:
    """Merge sunset and sunrise/ flickr and instagram values"""
    _rank = ""
    if ranked:
        _rank = "_rank"
    df = df1[[f'{metric}_{topic1}{_rank}']].merge(
        df2[[f'{metric}_{topic2}{_rank}']],
        left_index=True, right_index=True)
    return df
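
The merge above joins on the index and uses pandas' default inner join, so only entries present in both DataFrames are kept. A sketch with made-up indexes and ranks:

```python
import pandas as pd

# toy frames with partially overlapping indexes (made up for illustration)
sunrise_toy = pd.DataFrame(
    {"usercount_sunrise_rank": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])
sunset_toy = pd.DataFrame(
    {"usercount_sunset_rank": [2.0, 1.0]}, index=["b", "d"])

# default (inner) join: only index "b" is in both frames
inner = sunrise_toy.merge(
    sunset_toy, left_index=True, right_index=True)

# outer join for comparison: all entries, missing values as NaN
outer = sunrise_toy.merge(
    sunset_toy, left_index=True, right_index=True, how="outer")
print(inner)
print(outer)
```

When both inputs share the same index, as with grids or country tables derived from the same base, both join types return the same rows.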

grid = merge_df_topics(grid_sunrise, grid_sunset, metric=METRIC)

title = (
    f'Relationship between {METRIC} per grid cell (ranked) \n'
    'for Sunset and Sunrise from Flickr. ')
x_col = f'{METRIC}_sunset_rank'
y_col = f'{METRIC}_sunrise_rank'
relationship_plot(
    data=grid, title=title, x_col=x_col, y_col=y_col,
    x_label=f'{METRIC.title()} Sunset (ranked)', y_label=f'{METRIC.title()} Sunrise (ranked)')

Relationship between values from Flickr and Instagram¶

Load Data: Combine Instagram and Flickr data¶

grid_flickr = chimaps_fromcsv(
    plot=False, chi_column=METRIC_COL, normalize=False)

instagram_args = {
    "csv_observed_plus":"instagram_sunset_est.csv",
    "csv_observed_minus":"instagram_sunrise_est.csv",
    "csv_expected":"instagram_random_est.csv"}
grid_instagram = chimaps_fromcsv(
    plot=False, chi_column=METRIC_COL, normalize=False, **instagram_args)

Remove cols not needed for the relationship plots.

grid_flickr.drop(columns=drop_cols, inplace=True, errors='ignore')
grid_instagram.drop(columns=drop_cols, inplace=True, errors='ignore')

COLMAP_FLICKR = {
    f'{METRIC_COL}_sunrise':f'{METRIC_COL}_sunrise_flickr',
    f'{METRIC_COL}_sunset':f'{METRIC_COL}_sunset_flickr'}
COLMAP_INSTAGRAM = {
    f'{METRIC_COL}_sunrise':f'{METRIC_COL}_sunrise_instagram',
    f'{METRIC_COL}_sunset':f'{METRIC_COL}_sunset_instagram'}

grid_rename_cols(grid_flickr, COLMAP_FLICKR)
grid_rename_cols(grid_instagram, COLMAP_INSTAGRAM)

grid_flickr.drop(columns=['geometry']).head()

Merge both grids and rename metric columns:

merge_cols = [
    f'{METRIC_COL}_sunrise_instagram',
    f'{METRIC_COL}_sunset_instagram']
grid = merge_df(grid_flickr, grid_instagram, merge_cols)

preview_mask = grid[f'{METRIC_COL}_sunrise_instagram']>1000
grid[preview_mask].drop(columns=['geometry']).head()

Calculate rank series for the selected metric:

grid[f'{METRIC_COL}_sunset_flickr_rank'] = rank_series(grid[f'{METRIC_COL}_sunset_flickr'])
grid[f'{METRIC_COL}_sunset_instagram_rank'] = rank_series(grid[f'{METRIC_COL}_sunset_instagram'])
grid[f'{METRIC_COL}_sunrise_flickr_rank'] = rank_series(grid[f'{METRIC_COL}_sunrise_flickr'])
grid[f'{METRIC_COL}_sunrise_instagram_rank'] = rank_series(grid[f'{METRIC_COL}_sunrise_instagram'])

Visualize relationship Flickr/Instagram¶

grid[preview_mask].drop(columns=['geometry']).head()

Sunset

title = (
    f'Relationship between {METRIC} per grid cell (ranked) \n'
    'from Flickr and Instagram for sunset. ')
x_col = f'{METRIC_COL}_sunset_flickr_rank'
y_col = f'{METRIC_COL}_sunset_instagram_rank'
relationship_plot(
    data=grid, title=title, x_col=x_col, y_col=y_col,
    x_label=f'{METRIC.title()} Flickr (ranked)', y_label=f'{METRIC.title()} Instagram (ranked)')

Covariance including non-significant:

Covariance = np.cov(
    grid[f'{METRIC_COL}_sunset_flickr'],
    grid[f'{METRIC_COL}_sunset_instagram'])[0][1]
print(Covariance)

73131.39742788515

Sunrise

title = (
    f'Relationship between {METRIC} per grid cell (ranked) \n'
    'from Flickr and Instagram for sunrise. ')
x_col = f'{METRIC_COL}_sunrise_flickr_rank'
y_col = f'{METRIC_COL}_sunrise_instagram_rank'
relationship_plot(
    data=grid, title=title, x_col=x_col, y_col=y_col,
    x_label=f'{METRIC.title()} Flickr (ranked)', y_label=f'{METRIC.title()} Instagram (ranked)')

Covariance = np.cov(
    grid[f'{METRIC_COL}_sunrise_flickr'],
    grid[f'{METRIC_COL}_sunrise_instagram'])[0][1]
print(Covariance)

4628.057034252609

Relationships on Country aggregate¶

Instead of using 100 km bins, relationships can also be studied for country level aggregate data (chi, total, expected etc.).

Load Flickr country data for sunset/sunrise¶

def load_country_csv(
        topic: str = "sunrise", source: str = "flickr",
        metric: str = METRIC, output: Path = OUTPUT) -> pd.DataFrame:
    """Load country hll cardinalities for metric"""
    df = pd.read_csv(
        output / "csv" / f"countries_{metric}_chi_{source}_{topic}.csv",
        index_col=["SU_A3"])
    return df
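
A sketch of the expected CSV layout, read from an in-memory string instead of a file. The two rows are copied from the preview output shown in this notebook; reading via io.StringIO is only for illustration.

```python
import io
import pandas as pd

# two rows in the layout of the country CSV files
csv_text = (
    "SU_A3,usercount_est_expected,usercount_est,chi_value,significant\n"
    "ZWE,1345.0,33.0,-7.157880,True\n"
    "ZMB,2511.0,62.0,-9.523682,True\n")

# same index_col argument as in load_country_csv
df_toy = pd.read_csv(io.StringIO(csv_text), index_col=["SU_A3"])
print(df_toy)
```

Using the country code as index is what allows the later index-based merge of two topics or two data sources.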

df_sunrise = load_country_csv(topic="sunrise", source="flickr", metric="usercount")

df_sunrise.head()

df_sunset = load_country_csv(topic="sunset", source="flickr", metric="usercount")

df_sunset.head()

rank_cols_dfs(df_sunrise, df_sunset, metric=METRIC)

df = merge_df_topics(df_sunrise, df_sunset, metric=METRIC)

Replace NaN values with 0:

df.fillna(0, inplace=True)

df.head()

Visualize¶

f, ax = plt.subplots(figsize=(7, 7))

x_col = f'{METRIC}_sunrise_rank'
y_col = f'{METRIC}_sunset_rank'

f.suptitle(
    f'Relationship between sunset and sunrise (usercount, ranked) for Flickr',
    fontsize=12, y=0)

scatterplot_kwarg = {
    "ax":ax,
    "edgecolors":"white",
    "linewidth":1,
    "x":x_col,
    "y":y_col,
}
    
g = sns.scatterplot(
    data=df, **scatterplot_kwarg, color='grey',
    label="Country (su_a3)")

annotate(
    data=df,
    x_col=x_col,
    y_col=y_col,
    ranked=True)

ax.set_xlabel(f'Usercount Sunrise (ranked)')
ax.set_ylabel(f'Usercount Sunset (ranked)')

Text(0, 0.5, 'Usercount Sunset (ranked)')

See if there is any clustering for European countries and US/Canada.

Get a list of European and North American countries:

world = gp.read_file(
    gp.datasets.get_path('naturalearth_lowres'),
    crs=CRS_WGS)
world = world.to_crs(CRS_PROJ)

cont_sel = world[(world["continent"].isin(
    ["Europe"])) | (world["iso_a3"] == "USA") | (world["iso_a3"] == "CAN")]

cont_sel.plot()

<AxesSubplot:>

cont_sel.head()

ne_path = Path.cwd().parents[0] / "resources" / "naturalearth"
ne_filename = "ne_50m_admin_0_map_subunits.zip"
world_su = gp.read_file(
    ne_path / ne_filename.replace(".zip", ".shp"))
world_su = world_su.to_crs(CRS_PROJ)

def drop_cols_except(df: pd.DataFrame, columns_keep: List[str]):
    """Drop all columns from DataFrame except those specified in columns_keep"""
    df.drop(
        df.columns.difference(columns_keep), axis=1, inplace=True)

columns_keep = ['geometry','ADMIN', 'SU_A3']
drop_cols_except(world_su, columns_keep)
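
The column selection relies on Index.difference, which returns all columns that are not in the keep-list. A sketch on a made-up DataFrame (the values and the POP_EST column are hypothetical):

```python
import pandas as pd

# toy frame (values made up for illustration)
toy_cols = pd.DataFrame(
    {"ADMIN": ["Example"], "SU_A3": ["AAA"], "POP_EST": [1]})

# columns not in the keep-list; names missing from the frame are ignored
cols_to_drop = toy_cols.columns.difference(["geometry", "ADMIN", "SU_A3"])
toy_cols.drop(columns=cols_to_drop, inplace=True)
print(list(toy_cols.columns))
```

Because the drop happens in place, the function returns nothing and modifies the DataFrame that is passed in.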

Classify dataframe chi countries based on country list:

from geopandas.tools import sjoin
cont_sel = sjoin(
    cont_sel, world_su, 
    how='left')

For some reason, there is one outlier (French Guiana) that is manually excluded; in the code below, this is done by excluding the index "BRA".

def spatial_join_area(df, cont_sel, area_context="Europe/North America"):
    "Classify dataframe chi countries based on country list"
    df[area_context] = np.where(
        ((df.index.isin(cont_sel["SU_A3"])) & (df.index != "BRA")), True, False)

spatial_join_area(df, cont_sel)
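
A sketch of the classification logic on made-up country codes. The codes, the column name in_selection, and the excluded code are hypothetical and only illustrate the combination of isin with a manual exclusion.

```python
import pandas as pd

# toy frame indexed by (made-up) country codes
toy_area = pd.DataFrame(
    {"usercount_sunrise_rank": [1.0, 2.0, 3.0]},
    index=pd.Index(["AAA", "BBB", "CCC"], name="SU_A3"))

# codes found by the spatial join, and one code to exclude manually
selected_codes = pd.Series(["AAA", "CCC"])
excluded_code = "CCC"

# True where the code is selected and not excluded
toy_area["in_selection"] = (
    toy_area.index.isin(selected_codes)
    & (toy_area.index != excluded_code))
print(toy_area)
```

The boolean expression already yields True/False values, so wrapping it in np.where(..., True, False) gives the same result.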

Standard annotate:

def annotate_countries(
    df: pd.DataFrame, x_col: str, y_col: str):
    """Annotate map based on a list of countries"""
    for idx, row in df.iterrows():
        plt.annotate(
            text=f'{idx}',
            xy=(row[x_col], row[y_col]),
            xytext=(-15, -15), textcoords='offset points',
            horizontalalignment='left',
            color="darkgrey")

There is a package called adjustText that tries to reduce overlapping annotations in matplotlib. This will take more time, however.

def annotate_countries_adjust(
    df: pd.DataFrame, x_col: str, y_col: str, ax):
    """Annotate map based on a list of countries"""
    texts = []
    for idx, row in df.iterrows():
        texts.append(
             plt.text(
                 s=f'{idx}',
                 x=row[x_col],
                 y=row[y_col],
                 horizontalalignment='center',
                 color="darkgrey"))
    adjust_text(
        texts, autoalign='y', ax=ax,
        arrowprops=dict(arrowstyle="simple, head_width=0.25, tail_width=0.05",
                        color='r', lw=0.5, alpha=0.5))

def country_rel_plot(
    df: pd.DataFrame, topic1="flickr", topic2="instagram",
    plot_context="Flickr",
    filename: str = None,
    output: Path = OUTPUT,
    metric = METRIC,
    annotate_countries: bool = None,
    mask_zero: bool = True,
    add_labels: bool = False,
    ranked: bool = True):
    """Country chi square relationship plot"""
    fig, ax = plt.subplots(figsize=(7, 7))
    
    _rank = ""
    if ranked:
        _rank = "_rank"
    x_col = f'{metric}_{topic1}{_rank}'
    y_col = f'{metric}_{topic2}{_rank}'
    
    fig.suptitle(
        f'Relationship between {metric}{" (ranked)" if ranked else ""} for {topic1} and '
        f'{topic2} per country for {plot_context}.',
        fontsize=12, y=0)

    scatterplot_kwarg = {
        "ax":ax,
        "edgecolors":"white",
        "linewidth":1,
        "x":x_col,
        "y":y_col,
    }
    
    if annotate_countries:
        df_anot = df
        if mask_zero:
            _mask_zero = ((df[x_col] == 0) & (df[y_col] == 0))
            df_anot = df[~_mask_zero]
        g = sns.scatterplot(
            data=df_anot[df_anot["Europe/North America"] == False],
            color="grey", label="Country (su_a3)",
            **scatterplot_kwarg)
        g = sns.scatterplot(
            data=df_anot[df_anot["Europe/North America"] == True],
            color="red", label="European Countries \n+ US/Canada",
            **scatterplot_kwarg)
    else:
        g = sns.scatterplot(
            data=df,
            color="grey", label="Country (su_a3)",
            **scatterplot_kwarg)        
    kws = {
    "ax":ax, "x":x_col, "y":y_col, "s": 100, 
    "facecolors": "none", "linewidth": 0.5,
    "color":"none"}

    ax.set_xlabel(f'{metric.capitalize()} {topic1.capitalize()} {"(ranked)" if _rank else ""}')
    ax.set_ylabel(f'{metric.capitalize()} {topic2.capitalize()} {"(ranked)" if _rank else ""}')
    
    if annotate_countries and add_labels:
        annotate_countries_adjust(
            df_anot[(df_anot["Europe/North America"] == True)],
            x_col=x_col,
            y_col=y_col,
            ax=ax)    
    annotate(
        data=df,
        x_col=x_col,
        y_col=y_col,
        ranked=True)
    if filename:
        print("Storing figure as png and svg..")
        fig.savefig(
            output / f"figures" / f"{filename}.png", dpi=300, format='PNG',
            bbox_inches='tight', pad_inches=1, facecolor="white")
        # also save as svg
        fig.savefig(
            output / "svg" / f"{filename}.svg", format='svg',
            bbox_inches='tight', pad_inches=1, facecolor="white")

def annotate_locations(
    df: pd.DataFrame):
    """Annotate map based on a list of locations"""
    for idx, row in df.iterrows():
        plt.annotate(
            text=f'{idx + 1}', # row['name']
            xy=row['coords'],
            xytext=np.subtract(row['coords'], 750000),
            horizontalalignment='left')

import matplotlib.patheffects as pe
from matplotlib.lines import Line2D
from adjustText import adjust_text
def annotate_records_adjust(
    df: pd.DataFrame, ax):
    """Annotate map based on a list of records"""
    texts = []
    for idx, row in df.iterrows():
        fcolor = 'black'
        if row.metric_outlier == True:
            fcolor = 'red'
        texts.append(
            plt.text(
                s=row['name'].replace(" ", "\n"), 
                x=row['coords'][0],
                y=row['coords'][1],
                horizontalalignment='center',
                color=fcolor,
                alpha=0.8,
                fontsize=10,
                path_effects=[pe.withStroke(linewidth=4, foreground="white")]))
    adjust_text(
        texts, autoalign='y', ax=ax,
        arrowprops=dict(arrowstyle="simple, head_width=0.25, tail_width=0.05",
                        color='r', lw=0.5, alpha=0.5))

df[df["Europe/North America"]].head()

Plot map

country_rel_plot(
    df, plot_context="Flickr", topic1="sunrise", topic2="sunset", annotate_countries=True,
    filename="sunrise_sunset_relationship_countries_flickr", metric=METRIC)

Storing figure as png and svg..

Repeat for Instagram¶

sunset/sunrise

METRIC = "usercount"
load_kwds = {"topic":"sunrise", "source":"instagram"}
df_sunrise = load_country_csv(metric=METRIC, **load_kwds)

load_kwds["topic"] = "sunset"
df_sunset = load_country_csv(metric=METRIC, **load_kwds)

def rename_cols(df1, df2, topic1="sunset", topic2="sunrise", metric=METRIC):
    """Rename columns of for two topic comparison
    E.g.: sunset, sunrise; flickr, instagram
    """
    df1.rename(columns={
        f'{metric}':f'{metric}_{topic1}'}, inplace=True)
    df2.rename(columns={
        f'{metric}':f'{metric}_{topic2}'}, inplace=True)

def join_dfs_apply(df1, df2, topic1="flickr", topic2="instagram", metric=METRIC, ranked: bool = True) -> pd.DataFrame:
    """Join sunset and sunrise df chi"""
    if ranked:
        rank_cols_dfs(df1, df2, topic1, topic2, metric=metric)
    else:
        rename_cols(df1, df2, topic1, topic2, metric=metric)
    df = merge_df_topics(df1, df2, topic1, topic2, metric=metric, ranked=ranked)    
    df.fillna(0, inplace=True)
    spatial_join_area(df, cont_sel)
    return df

df = join_dfs_apply(df_sunrise, df_sunset, topic1="sunrise", topic2="sunset", metric=METRIC)

country_rel_plot(
    df, plot_context=f"{load_kwds.get('source').title()}", topic1="sunrise", topic2="sunset", annotate_countries=True,
    filename=f"sunrise_sunset_relationship_countries_instagram", metric=METRIC)

Storing figure as png and svg..

Repeat for Instagram and Flickr¶

Here, we compare reliability for results with usercount for Instagram and Flickr

METRIC = "usercount"
load_kwds = {"topic":"sunrise", "source":"flickr"}
df_flickr = load_country_csv(metric=METRIC, **load_kwds)

load_kwds["source"] = "instagram"
df_instagram = load_country_csv(metric=METRIC, **load_kwds)

Repeat the same process as for Flickr, then plot:

df = join_dfs_apply(df_flickr, df_instagram, topic1="flickr", topic2="instagram", metric=METRIC)

country_rel_plot(
    df, plot_context=f"Sunrise reactions", annotate_countries=True, topic1="flickr", topic2="instagram",
    filename=f"instagram_flickr_relationship_countries_sunrise", metric=METRIC)

Storing figure as png and svg..

Repeat for Instagram/Flickr bias¶

METRIC = 'usercount'
METRIC_COL = 'usercount_est'
load_kwds = {"topic":"sunset", "source":"flickr"}
df_flickr = load_country_csv(metric="usercount", **load_kwds)

load_kwds["source"] = "instagram"
df_instagram = load_country_csv(metric="usercount", **load_kwds)

df = join_dfs_apply(
    df_flickr, df_instagram, topic1="flickr", topic2="instagram", metric=METRIC)

df.head()

country_rel_plot(
    df, topic1="flickr", topic2="instagram",
    plot_context="Sunset reactions",
    filename="instagram_flickr_relationship_countries_sunset", annotate_countries=True)

Storing figure as png and svg..

Relationships for Chi¶

Besides absolute values, also compare chi values for countries (sunset/sunrise and flickr/instagram)

METRIC = 'usercount'
METRIC_COL = 'usercount_est'
load_kwds = {"topic":"sunset", "source":"flickr"}
df_sunset = load_country_csv(metric='usercount', **load_kwds)

load_kwds["topic"] = "sunrise"
df_sunrise = load_country_csv(metric="usercount", **load_kwds)

df = join_dfs_apply(
    df_sunrise, df_sunset, topic1="sunrise", topic2="sunset", metric="chi_value", ranked=False)

df.head()

country_rel_plot(
    df, topic1="sunrise", topic2="sunset", metric='chi_value', ranked=False,
    plot_context="Chi value Flickr",
    filename="sunrise_sunset_relationship_countries_flickr_chi", annotate_countries=True)

Storing figure as png and svg..

METRIC = 'usercount'
METRIC_COL = 'usercount_est'
load_kwds = {"topic":"sunset", "source":"instagram"}
df_sunset = load_country_csv(metric='usercount', **load_kwds)

load_kwds["topic"] = "sunrise"
df_sunrise = load_country_csv(metric="usercount", **load_kwds)

df = join_dfs_apply(
    df_sunrise, df_sunset, topic1="sunrise", topic2="sunset", metric="chi_value", ranked=False)

country_rel_plot(
    df, topic1="sunrise", topic2="sunset", metric='chi_value', ranked=False,
    plot_context="Chi value Instagram",
    filename="sunrise_sunset_relationship_countries_instagram_chi", annotate_countries=True)

Storing figure as png and svg..

METRIC = 'usercount'
METRIC_COL = 'usercount_est'
load_kwds = {"topic":"sunrise", "source":"flickr"}
df_flickr = load_country_csv(metric='usercount', **load_kwds)

load_kwds["source"] = "instagram"
df_instagram = load_country_csv(metric="usercount", **load_kwds)

df = join_dfs_apply(
    df_flickr, df_instagram, topic1="flickr", topic2="instagram", metric="chi_value", ranked=False)

country_rel_plot(
    df, topic1="flickr", topic2="instagram", metric='chi_value', ranked=False,
    plot_context="Chi value Sunrise",
    filename="instagram_flickr_relationship_countries_sunrise_chi", annotate_countries=True)

Storing figure as png and svg..

METRIC = 'usercount'
METRIC_COL = 'usercount_est'
load_kwds = {"topic":"sunset", "source":"flickr"}
df_flickr = load_country_csv(metric='usercount', **load_kwds)

load_kwds["source"] = "instagram"
df_instagram = load_country_csv(metric="usercount", **load_kwds)

df = join_dfs_apply(
    df_flickr, df_instagram, topic1="flickr", topic2="instagram", metric="chi_value", ranked=False)

country_rel_plot(
    df, topic1="flickr", topic2="instagram", metric='chi_value', ranked=False,
    plot_context="Chi value Sunset",
    filename="instagram_flickr_relationship_countries_sunset_chi", annotate_countries=True)

Storing figure as png and svg..

Store generated graphics as tabbed HTML¶

import ipywidgets as widgets
# dictionary with filename and title
pathrefs = {
    0: ('sunrise_sunset_relationship_countries_flickr.png', 'F Sunrise + Sunset'),
    1: ('sunrise_sunset_relationship_countries_instagram.png', 'I Sunrise + Sunset'),
    2: ('instagram_flickr_relationship_countries_sunrise.png', 'I Sunrise + F Sunrise'),
    3: ('instagram_flickr_relationship_countries_sunset.png', 'I Sunset + F Sunset'),}

widgets_images = [
    widgets.Image(
        value=open(Path('OUT') / OUTPUT / f"figures" / pathref[0], "rb").read(),
        format='png',
        width=700
     )
    for pathref in pathrefs.values()]

from ipywidgets.embed import embed_minimal_html
children = widgets_images
tab = widgets.Tab()
tab.children = children
for i in range(len(children)):
    tab.set_title(i, pathrefs[i][1])
embed_minimal_html(
    Path('OUT') / OUTPUT / f'html{km_size_str}' / 'compare_relationships.html',
    views=[tab], title=f'Relationship plots for sunset, sunrise, flickr, and instagram absolute {METRIC} for the country level.')

Create notebook HTML¶

!jupyter nbconvert --to html_toc \
    --output-dir=../out/html ./08_relationships.ipynb \
    --template=../nbconvert.tpl \
    --ExtractOutputPreprocessor.enabled=False >&- 2>&- # create single output file

Copy single HTML file to resource folder

!cp ../out/html/08_relationships.html ../resources/html/

		postcount_est_sunset	usercount_est_sunset	userdays_est_sunset	usercount_est_sunrise	postcount_est_sunrise	userdays_est_sunrise
xbin	ybin
-15340096	2779952	2946	657	1036	134	363	367
-15240096	2679952	11300	2115	4259	369	1242	1175
-15140096	2679952	2946	728	1184	369	1758	1782
	2579952	8594	1811	3696	254	1034	996
	2479952	2291	672	1194	40	97	95

	0	1
0	1.000000	0.914616
1	0.914616	1.000000

	0	1
0	1.000000	0.794686
1	0.794686	1.000000

		geometry	postcount_est	usercount_est	userdays_est
xbin	ybin
-18040096	8979952	POLYGON ((-18040096.000 8979952.000, -17940096...	0	0	0
	8879952	POLYGON ((-18040096.000 8879952.000, -17940096...	0	0	0
	8779952	POLYGON ((-18040096.000 8779952.000, -17940096...	0	0	0
	8679952	POLYGON ((-18040096.000 8679952.000, -17940096...	0	0	0
	8579952	POLYGON ((-18040096.000 8579952.000, -17940096...	0	0	0

		postcount_est_sunset	usercount_est_sunset_flickr	userdays_est_sunset	usercount_est_sunrise_flickr	postcount_est_sunrise	userdays_est_sunrise
xbin	ybin
-18040096	8979952	0	0	0	0	0	0
	8879952	0	0	0	0	0	0
	8779952	0	0	0	0	0	0
	8679952	0	0	0	0	0	0
	8579952	0	0	0	0	0	0

	usercount_est_expected	usercount_est	chi_value	significant
SU_A3
ZWE	1345.0	33.0	-7.157880	True
ZMB	2511.0	62.0	-9.523682	True
YEM	740.0	12.0	-12.732712	True
YES	93.0	4.0	3.962289	True
VNM	31437.0	597.0	-66.855308	True

	usercount_est_expected	usercount_est	chi_value	significant
SU_A3
ZWE	1345.0	158.0	10.816126	True
ZMB	2511.0	260.0	7.085320	True
YEM	740.0	38.0	-11.804480	True
YES	93.0	11.0	2.929983	False
VNM	31437.0	1442.0	-87.653996	True

	usercount_sunrise_rank	usercount_sunset_rank
SU_A3
ZWE	118.0	139.5
ZMB	154.0	163.5
YEM	72.0	73.5
YES	34.5	38.5
VNM	233.0	233.0

	pop_est	continent	name	iso_a3	gdp_md_est	geometry
3	37589262.0	North America	Canada	CAN	1736425	MULTIPOLYGON (((-9464830.276 5768352.350, -947...
4	328239523.0	North America	United States of America	USA	21433226	MULTIPOLYGON (((-9464830.276 5768352.350, -924...
18	144373535.0	Europe	Russia	RUS	1699876	MULTIPOLYGON (((8769033.872 7865244.997, 87094...
21	5347896.0	Europe	Norway	NOR	403336	MULTIPOLYGON (((505060.605 8505913.475, 506529...
43	67059887.0	Europe	France	FRA	2715518	MULTIPOLYGON (((-5168883.083 513724.093, -5231...

	usercount_sunrise_rank	usercount_sunset_rank	Europe/North America
SU_A3
VAT	0.0	0.0	True
USK	218.0	221.0	True
USH	264.0	273.0	True
USB	276.0	285.0	True
WLS	255.0	263.0	True

	usercount_flickr_rank	usercount_instagram_rank	Europe/North America
SU_A3
ZWE	139.5	112.0	False
ZMB	163.5	131.0	False
YEM	73.5	50.0	False
YES	38.5	18.5	False
VNM	233.0	239.0	False

		postcount_est_sunset	usercount_est_sunset_flickr	userdays_est_sunset	usercount_est_sunrise_flickr	postcount_est_sunrise	userdays_est_sunrise	usercount_est_sunrise_instagram	usercount_est_sunset_instagram	usercount_est_sunset_flickr_rank	usercount_est_sunset_instagram_rank	usercount_est_sunrise_flickr_rank	usercount_est_sunrise_instagram_rank
xbin	ybin
-15240096	2679952	11300	2115	4259	369	1242	1175	2663	16941	9065.0	10653.0	6487.5	8620.5
-15140096	2679952	2946	728	1184	369	1758	1782	3345	9862	8943.0	10597.0	6487.5	8646.0
-15140096	2579952	8594	1811	3696	254	1034	996	1039	8642	9056.0	10575.0	6428.0	8446.0
-15040096	2579952	1637	474	747	642	2933	2855	1721	2531	8813.5	10280.0	6539.0	8551.0
-10640096	4579952	45923	7592	20481	2030	9931	9037	7613	33931	9092.0	10678.0	6566.0	8678.0