Part 3: Tag Maps Clustering and Topic Heat Maps

Workshop: Social Media, Data Science, & Cartograpy
Alexander Dunkel, Madalina Gugulica

This is the third notebook in a series of four notebooks:

  1. Introduction to Social Media data, jupyter and python spatial visualizations
  2. Introduction to privacy issues with Social Media data and possible solutions for cartographers
  3. Specific visualization techniques example: TagMaps clustering
  4. Specific data analysis: Topic Classification

Open these notebooks through the file explorer on the left side.

Introduction: Privacy & Social Media

The task in this notebook is to extract and visualize common areas or regions of interest for a given topic from Location Based Social Media data (LBSM).

On Social Media, people react to many things, and a significant share of these reactions can be seen as forms of valuation or attribution of meaning to to the physical environment.

Challenges

However, for visualizing such values for whole cities, there are several challenges involved:

  • Identify reactions that are related to a given topic:
    • Topics can be considered at different levels of granularity
    • and equal terms may convey different meanings in different contexts.
  • Social Media data is distributed very irregularly.
  • Data peaks in urban areas and highly frequented tourist hotspots,
  • while some less frequented areas do not feature any data at all.
  • When visualizing spatial reactions to topics, this unequal distribution must be taking into account, otherwise maps would always be biased towards areas with higher density of information.
  • For this reason, normalizing results is crucial.
  • The usual approach for normalization in such cases is that local distribution is evaluated against the global (e.g. spatial) distribution data.
Tag Maps Package
  • The Tag Maps package was developed to cluster tags according to their spatial area of use
  • Currently, the final visualization step of Tag Maps clustering (placing labels etc.) is only implemented in ArcMap
  • Herein, we instead explore cluster results using Heat Maps and other graphics in python
  • In the second part of the workshop, it will be shown how to visualize the actual Tag Maps
Where is my custom area/tag map?
  • Below, you can chose between different regions/areas to explore in this notebook, provided based on student input
  • The final Tag Map visualization step is currently not possible in Jupyter/Python
  • We will meet at the end of January in the GeoPool, to visualize custom areas using ArcMap

Preparations

Load the TagMaps package, which will serve as a base for filtering, cleaning and processing noisy social media data

In [2]:
from tagmaps import TagMaps
from tagmaps import EMOJI, TAGS, TOPICS
from tagmaps import LoadData
from tagmaps import BaseConfig

Load Data & Plot Overview

Create 01_Input folder, which will be used for loading data.

In [3]:
from pathlib import Path

INPUT = Path.cwd() / "01_Input"

Retrieve sample LBSN CSV data:

In [4]:
%load_ext autoreload
%autoreload 2
In [5]:
import sys

module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    sys.path.append(module_path)
from modules import tools
In [6]:
source = "GrosserGarten.zip"
# source = "Jadavpur.zip"
# source = "Istanbul.zip"
# source = "Tokio.zip"
# source = "Kazan.zip"
# source = "Lindau.zip"
# source = "Hasenheide.zip"
# source = "tumunich.zip"
In [7]:
%%time
# clean any data first
tools.clean_folders([INPUT])
sample_url = tools.get_sample_url()
lbsn_dd_csv_uri = f'{sample_url}/download?path=%2F&files='
tools.get_zip_extract(
    uri=lbsn_dd_csv_uri,
    filename=source,
    output_path=INPUT,
    write_intermediate=True)
Loaded 35.86 MB of 35.87 (100%)..
Extracting zip..
Retrieved GrosserGarten.zip, extracted size: 100.03 MB
CPU times: user 1.79 s, sys: 859 ms, total: 2.64 s
Wall time: 13.2 s

Initialize tag maps from BaseConfig

In [8]:
tm_cfg = BaseConfig()
Loaded 337 stoplist items.
Loaded 214 inStr stoplist items.
Loaded 4 stoplist places.
Loaded 3 place lat/lng corrections.

Optionally, filter data per origin or adjust the number of top terms to extract:

In [9]:
tm_cfg.filter_origin = None
tm_cfg.max_items = 3000
In [10]:
tm_opts = {
    'tag_cluster':tm_cfg.cluster_tags,
    'emoji_cluster':tm_cfg.cluster_emoji,
    'location_cluster':tm_cfg.cluster_locations,
    'output_folder':tm_cfg.output_folder,
    'remove_long_tail':tm_cfg.remove_long_tail,
    'limit_bottom_user_count':tm_cfg.limit_bottom_user_count,
    'max_items':tm_cfg.max_items,
    'topic_cluster':True}
tm = TagMaps(**tm_opts)
**kwargs syntax?
  • Some functions take a long list of (keyword) arguments
  • These keyword arguments can be passed to the function using a dictionary
  • The ** syntax tells python to unpack keyword arguments from a dictionary

a) Read from original data

Read input records from csv

In [11]:
from IPython.display import clear_output
In [12]:
%%time
input_data = LoadData(tm_cfg)
with input_data as records:
    for ix, record in enumerate(records):
        tm.add_record(record)
        if (ix % 1000) != 0:
            continue
        # report every 1000 records
        clear_output(wait=True)
        print(f'Loaded {ix} records')
input_data.input_stats_report()
tm.prepare_data()
Loaded 244000 records
Cleaned input to 18229 distinct locations from 244942 posts (File 2 of 2) - Skipped posts: 0 - skipped tags: 146435 of 1596294

Total post count (PC): 244942
Total tag count (PTC): 1596294
Total emoji count (PEC): 167916
Long tail removal: Filtered 385 Emoji that were used by less than 2 users.
Long tail removal: Filtered 157643 Topics that were used by less than 5 users.
CPU times: user 1min 14s, sys: 1.2 s, total: 1min 16s
Wall time: 1min 15s

b) Optional: Write (& Read) cleaned output to file

Output cleaned data to Output/Output_cleaned.csv,
clean terms (post_body and tags) based on the top (e.g.) 1000 hashtags found in dataset.

In [13]:
tm.lbsn_data.write_cleaned_data(panon=True)
Writing cleaned intermediate data to file (Output_cleaned.csv)..
 done.

Have a look at the output file.

In [14]:
import pandas as pd

file_path = Path.cwd() / "02_Output" / "Output_cleaned.csv"
display(pd.read_csv(file_path, nrows=5).head())
print(f'{file_path.stat().st_size / (1024*1024):.02f} MB')
origin_id lat lng guid user_guid loc_id post_create_date post_publish_date post_body hashtags emoji post_views_count post_like_count loc_name
0 1 51.033300 13.733300 3722a733575bfbd6c173dc2deadb795b 3722a733575bfbd6c173dc2deadb795b 51.0333:13.7333 NaN NaN fürstenzug NaN NaN 704 0 Fürstenzug
1 1 51.033300 13.733300 ac28e6f721e25624d76a9eb3afc61b4f ac28e6f721e25624d76a9eb3afc61b4f 51.0333:13.7333 NaN NaN NaN NaN NaN 194 0 Hochschulstraße
2 1 51.033300 13.733300 d71246d019402566900f0d59d0c9fbef d71246d019402566900f0d59d0c9fbef 51.0333:13.7333 NaN NaN dresdner;stadtfest NaN NaN 181 0 Dresdner Stadtfest
3 1 51.031945 13.730668 0ff8be1630ab8decae97efecfda9f5f3 0ff8be1630ab8decae97efecfda9f5f3 51.031945353:13.7306680064 NaN NaN club NaN NaN 228 0 Club 11
4 1 51.031224 13.730091 aab11415bd4546adef5f513ac00edc90 aab11415bd4546adef5f513ac00edc90 51.0312241:13.7300907 NaN NaN international;house NaN NaN 66 0 International Guest House
46.03 MB

Read from pre-filtered data

Read data form (already prepared and filtered) cleaned output

In [15]:
%%time
tm.prepare_data(
    input_path=file_path)
CPU times: user 2.48 ms, sys: 0 ns, total: 2.48 ms
Wall time: 2.49 ms
In [16]:
tm.global_stats_report()
tm.item_stats_report()
Total user count (UC): 103684
Total user post locations (UPL): 168815
Number of locations with names: 2294
Total distinct tags (DTC): 174260
Total distinct emoji (DEC): 2078
Total distinct locations (DLC): 18229
Total tag count for the 3000 most used tags in selected area: 1778500.
Total emoji count for the 1693 most used emoji in selected area: 167487.
Bounds are: Min 13.727952 51.0257536334 Max 13.78818 51.051867

Topic selection & Tag Clustering

The next goal is to select reactions to given topics. TagMaps allows selecting posts for 4 different types:

  • TAGS (i.e. single terms)
  • EMOJI (i.e. single emoji)
  • LOCATIONS (i.e. named POIs or coordinate pairs)
  • TOPICS (i.e. list of terms)

Set basic plot to notebook mode and disable interactive plots for now

In [17]:
import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline
mpl.rcParams['savefig.dpi'] = 120
mpl.rcParams['figure.dpi'] = 120

We can retrieve preview maps for the TOPIC dimension
by supplying a list of terms.

For example "park", "green" and "grass" should
give us an overview where such terms are used
in our sample area.

In [18]:
nature_terms = ["park", "grass", "nature"]
fig = tm.get_selection_map(
    TOPICS, nature_terms)
In [19]:
urban_terms = ["strasse", "city", "shopping"]
fig = tm.get_selection_map(
    TOPICS, urban_terms)

HDBSCAN Cluster

We can visualize clusters for the selected topic using HDBSCAN.

The important parameter for HDBSCAN is the cluster distance,
which is chosen automatically by Tag Maps given the current scale/extent of analysis.

In [20]:
# tm.clusterer[TOPICS].cluster_distance = 150
fig = tm.get_cluster_map(
    TOPICS, nature_terms)

We can get a map of clusters and cluster shapes (convex and concave hulls).

In [21]:
fig = tm.get_cluster_shapes_map(
    TOPICS, nature_terms)
--> 31 cluster.

Behind the scenes, Tag Maps utilizes the Single Linkage Tree from HDBSCAN to cut clusters at a specified distance. This tree shows the hierarchical structure for our topic and its spatial properties in the given area.

In [22]:
fig = tm.get_singlelinkagetree_preview(
    TOPICS, nature_terms)

Cluster centroids

Similarly, we can retrieve centroids of clusters. This shows again the unequal distribution of data:

In [23]:
fig = tm.clusterer[TOPICS].get_cluster_centroid_preview(
    nature_terms, single_clusters=True)
(1 of 16617) Found 2928 posts (UPL) for Topic 'park-grass-nature' (found in 6% of DLC in area) --> 31 cluster.

Heat Maps

  • Visualization of clustered tagmaps data is possible in several ways.
  • In the second workshop (end of January), we are using ArcMap, to create labeled maps from clustered data
  • The last part in this notebook will be to use Kernel Densitry Estimation (KDE) to create a Heat Map for selected topics.

Import dependencies

In [24]:
import numpy as np
from sklearn.neighbors import KernelDensity

Load Flickr data only

Instagram, Facebook and Twitter are based on place-accuracy, which is unsuitable for the Heat Map graphic.

We'll work with Flickr data only for the Heat Map.

  • 1 = Instagram
  • 2 = Flickr
  • 3 = Twitter

Reload data, filtering only Flickr

In [25]:
tm_cfg.filter_origin = "2"
tm = TagMaps(**tm_opts)
In [26]:
%%time
input_data = LoadData(tm_cfg)
with input_data as records:
    for ix, record in enumerate(records):
        tm.add_record(record)
        if (ix % 1000) != 0:
            continue
        # report every 1000 records
        clear_output(wait=True)
        print(f'Loaded {ix} records')
input_data.input_stats_report()
Loaded 31000 records
Cleaned input to 15951 distinct locations from 31309 posts (File 2 of 2) - Skipped posts: 213572 - skipped tags: 36969 of 232936

Total post count (PC): 31309
Total tag count (PTC): 232936
Total emoji count (PEC): 0
CPU times: user 8.42 s, sys: 297 ms, total: 8.72 s
Wall time: 8.63 s

Get Topic Coordinates

The distribution of these coordinates is what we want to visualize.

In [27]:
tm.prepare_data()
tm.init_cluster()
Long tail removal: Filtered 1959 Tags that were used by less than 5 users.
Long tail removal: Filtered 13164 Topics that were used by less than 5 users.

Topic selection

In [28]:
topic = "grass-nature-park"
points = tm.clusterer[TOPICS].get_np_points(
    item=topic,
    silent=True)
Why a different syntax? (term-term-term)
  • The API of tagmaps package is still in an early stage
  • The above function uses the internal representation of topics

Get All Points

In [29]:
all_points = tm.clusterer[TOPICS].get_np_points()

For normalizing our final KDE raster, we'll process both, topic selection points and global data distribution (e.g. all points in the dataset).

In [30]:
points_list = [points, all_points]
  • The input data is a simply list (as a numpy.ndarray) of decimal degree coordinates
  • each entry represents a single user that published one or more posts at a specific coordinate
In [31]:
print(points[:5])
print(f'Total coordinates: {len(points)}')
[[13.766544 51.036369]
 [13.755054 51.039289]
 [13.753187 51.043785]
 [13.764281 51.034242]
 [13.753069 51.038263]]
Total coordinates: 788

Data projection

  • For faster KDE, we project data from WGS1984 (epsg:4326) to UTM
  • this allows us to directly calculate in euclidian space.
  • The TagMaps package automatically detected the most suitable UTM coordinate system,
  • for the Großer Garten sample data, this is Zone 33N (epsg:32633)

Project lat/lng to UTM 33N (Dresden) using Transformer:

In [32]:
%%time
for idx, points in enumerate(points_list):
    points_list[idx] = np.array(
        [tm.clusterer[TOPICS].proj_transformer.transform(
                point[0], point[1]
        ) for point in points])
    
crs_wgs = tm.clusterer[TOPICS].crs_wgs
crs_proj = tm.clusterer[TOPICS].crs_proj
print(f'From {crs_wgs} \nto {crs_proj}')
print(points_list[0][:5])
From epsg:4326 
to epsg:32633
[[ 413517.95931985 5654593.12815275]
 [ 412717.86708307 5654931.37875778]
 [ 412595.43876854 5655433.54329967]
 [ 413355.33045049 5654359.26640775]
 [ 412576.773973   5654819.64201159]]
CPU times: user 137 ms, sys: 2.04 ms, total: 139 ms
Wall time: 135 ms

Calculating the Kernel Density

To summarize sklearn, a KDE is executed in two steps, training and testing:

Machine learning is about learning some properties of a data set and then testing those properties against another data set. A common practice in machine learning is to evaluate an algorithm by splitting a data set into two. We call one of those sets the training set, on which we learn some properties; we call the other set the testing set, on which we test the learned properties.

The only attributes we care about in training are lat and long.

  • Stack locations using np.vstack, extract specific columns with [:,column_id]
  • reverse order: lat, lng
  • Transpose to Rows (.T), easier in python ref
In [33]:
xy_list = list()
for points in points_list:
    y = points[:, 1]
    x = points[:, 0]
    xy_list.append([x, y])
In [34]:
xy_train_list = list()
for x, y in xy_list:
    xy_train = np.vstack([y, x]).T
    xy_train_list.append(xy_train)
In [35]:
print(xy_train)
[[5654951.90390813  411219.87889988]
 [5655119.32031662  411076.89589361]
 [5654938.86105683  411169.79390272]
 ...
 [5655302.64257714  411360.89388318]
 [5654864.8466037   411463.97303812]
 [5655317.87392501  411322.44768116]]

Get bounds from total data

Access min/max decimal degree bounds object of clusterer and project to UTM33N

In [36]:
lim_lng_max = tm.clusterer[TAGS].bounds.lim_lng_max
lim_lng_min = tm.clusterer[TAGS].bounds.lim_lng_min
lim_lat_max = tm.clusterer[TAGS].bounds.lim_lat_max
lim_lat_min = tm.clusterer[TAGS].bounds.lim_lat_min

# project WDS1984 to UTM
topright = tm.clusterer[TOPICS].proj_transformer.transform(
    lim_lng_max, lim_lat_max)
bottomleft = tm.clusterer[TOPICS].proj_transformer.transform(
    lim_lng_min, lim_lat_min)

# get separate min/max for x/y
right_bound = topright[0]
left_bound = bottomleft[0]
top_bound = topright[1]
bottom_bound = bottomleft[1]

Create Sample Mesh

Create a grid of points at which to predict.

In [37]:
xbins=100j
ybins=100j
xx, yy = np.mgrid[
    left_bound:right_bound:xbins, 
    bottom_bound:top_bound:ybins]
In [38]:
xy_sample = np.vstack([yy.ravel(), xx.ravel()]).T

Generate Training data for Kernel Density.

  • The largest effect on the final result comes from the chosen bandwidth for KDE - smaller bandwidth mean higher resultion,
  • but may not be suitable for the given density of data (e.g. results with low validity).
  • Higher bandwidth will produce a smoother raster result, which may be too inaccurate for interpretation.
In [39]:
kde_list = list()
for xy_train in xy_train_list:
    kde = KernelDensity(
        kernel='gaussian', bandwidth=200, algorithm='ball_tree')
    kde.fit(xy_train)
    kde_list.append(kde)

score_samples() returns the log-likelihood of the samples

In [40]:
%%time
z_list = list()
for kde in kde_list:
    z_scores = kde.score_samples(xy_sample)
    z = np.exp(z_scores)
    z_list.append(z)
CPU times: user 7.68 s, sys: 14.5 ms, total: 7.69 s
Wall time: 7.67 s

Remove values below zero,
these are locations where selected topic is underrepresented, given the global KDE mean

In [41]:
for ix in range(0, 2):
    z_list[ix] = z_list[ix].clip(min=0)

Normalize z-scores to 1 to 1000 range for comparison

In [42]:
for idx, z in enumerate(z_list):
    z_list[idx] = np.interp(
        z, (z.min(), z.max()), (1, 1000))

Subtact global zscores from local z-scores (our topic selection)

norm_val:

  • weight of global z-score
  • lower means less normalization effect, higher means stronger normalization effect
  • range: 0 to 1
In [43]:
norm_val = 0.5
z_orig = z_list[0]
z_is = z_list[0] - (z_list[1]*norm_val)
z_results = [z_orig, z_is]

Reshape results to fit grid mesh extent

In [44]:
for idx, z_result in enumerate(z_results):
    z_results[idx] = z_results[idx].reshape(xx.shape)

Plot original and normalized meshes

In [45]:
from matplotlib.ticker import FuncFormatter
def y_formatter(y_value, __):
    """Format UTM to decimal degrees y-labels for improved legibility"""
    xy_value = tm.clusterer[
        TOPICS].proj_transformer_back.transform(
            left_bound, y_value)
    return f'{xy_value[1]:3.2f}'

def x_formatter(x_value, __):
    """Format UTM to decimal degrees x-labels for improved legibility"""
    xy_value = tm.clusterer[
        TOPICS].proj_transformer_back.transform(
            x_value, bottom_bound)
    return f'{xy_value[0]:3.1f}'

# a figure with a 1x2 grid of Axes
fig, ax_lst = plt.subplots(1, 2, figsize=(10, 3))

for idx, zz in enumerate(z_results):
    axis = ax_lst[idx]
    # Change the fontsize of minor ticks label 
    axis.tick_params(axis='both', which='major', labelsize=7)
    axis.tick_params(axis='both', which='minor', labelsize=7)
    # set plotting bounds
    axis.set_xlim(
                [left_bound, right_bound])
    axis.set_ylim(
                [bottom_bound, top_bound])
    
    # plot contours of the density
    levels = np.linspace(zz.min(), zz.max(), 25)
    # Create Contours,
    # save to CF-variable so we can later reference
    # it in colorbar hook
    CF = axis.contourf(
        xx, yy, zz, levels=levels.sort(), cmap='viridis')
    # titles for plots
    if idx == 0:
        title = f'Original KDE for topic {topic}\n(Flickr Only)'
    else:
        title = f'Normalized KDE for topic {topic}\n(Flickr Only)'
    axis.set_title(title, fontsize=11)
    axis.set_xlabel('lng', fontsize=10)
    axis.set_ylabel('lat', fontsize=10)
    # plot points on top
    axis.scatter(
        xy_list[1][0], xy_list[1][1], s=1, facecolor='white', alpha=0.05)
    axis.scatter(
        xy_list[0][0], xy_list[0][1], s=1, facecolor='white', alpha=0.1)
    # replace x, y coords with decimal degrees text (instead of UTM dist)
    axis.yaxis.set_major_formatter(
        FuncFormatter(y_formatter))
    axis.xaxis.set_major_formatter(
        FuncFormatter(x_formatter))
    if idx > 0:
        break
    # Make a colorbar for the ContourSet returned by the contourf call.
    cbar = fig.colorbar(CF,fraction=0.046, pad=0.04)
    cbar.set_title = ""
    cbar.sup_title = ""
    cbar.ax.set_ylabel(
        "Number of \n User Post Locations (UPL)", fontsize=11)
    cbar.ax.set_title('')
    cbar.ax.tick_params(axis='both', labelsize=7)
    cbar.ax.yaxis.set_tick_params(width=0.5, length=4)