Workshop: Social Media, Data Science, & Cartography
Alexander Dunkel, Madalina Gugulica
This is the third notebook in a series of four notebooks:
Open these notebooks through the file explorer on the left side.
The task in this notebook is to extract and visualize common areas or regions of interest for a given topic from Location Based Social Media data (LBSM).
On Social Media, people react to many things, and a significant share of these reactions can be seen as forms of valuation or attribution of meaning to the physical environment.
However, for visualizing such values for whole cities, there are several challenges involved:
Important:
Define output directory
from pathlib import Path
OUTPUT = Path.cwd() / "out"
Load the TagMaps package, which will serve as a base for filtering, cleaning and processing noisy social media data
from tagmaps import TagMaps
from tagmaps import EMOJI, TAGS, TOPICS
from tagmaps import LoadData
from tagmaps import BaseConfig
Create the 01_Input folder, which will be used for loading data.
INPUT = Path.cwd() / "01_Input"
INPUT.mkdir(exist_ok=True)  # create the folder if it does not exist yet
Retrieve sample LBSN CSV data:
%load_ext autoreload
%autoreload 2
import sys
module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    sys.path.append(module_path)
from modules import tools
source = "GrosserGarten.zip"
# source = "Jadavpur.zip"
# source = "Istanbul.zip"
# source = "Tokio.zip"
# source = "Kazan.zip"
# source = "Lindau.zip"
# source = "Hasenheide.zip"
# source = "tumunich.zip"
%%time
# clean any data first
tools.clean_folders([INPUT])
sample_url = tools.get_sample_url()
lbsn_dd_csv_uri = f'{sample_url}/download?path=%2F&files='
tools.get_zip_extract(
    uri=lbsn_dd_csv_uri,
    filename=source,
    output_path=INPUT,
    write_intermediate=True)
Initialize tag maps from BaseConfig
tm_cfg = BaseConfig()
Optionally, filter data per origin or adjust the number of top terms to extract:
tm_cfg.filter_origin = None
tm_cfg.max_items = 3000
tm_opts = {
    'tag_cluster': tm_cfg.cluster_tags,
    'emoji_cluster': tm_cfg.cluster_emoji,
    'location_cluster': tm_cfg.cluster_locations,
    'output_folder': tm_cfg.output_folder,
    'remove_long_tail': tm_cfg.remove_long_tail,
    'limit_bottom_user_count': tm_cfg.limit_bottom_user_count,
    'max_items': tm_cfg.max_items,
    'topic_cluster': True}
tm = TagMaps(**tm_opts)
What is the **kwargs syntax? The ** syntax tells Python to unpack keyword arguments from a dictionary and pass them to a method as named parameters.
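A minimal sketch of this mechanism, independent of tagmaps (the function and names are purely illustrative):
def greet(name, greeting="Hello"):
    return f'{greeting}, {name}!'

kwargs = {'name': 'TagMaps', 'greeting': 'Hi'}
print(greet(**kwargs))  # equivalent to greet(name='TagMaps', greeting='Hi')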
Read input records from CSV
from IPython.display import clear_output
%%time
input_data = LoadData(tm_cfg)
with input_data as records:
    for ix, record in enumerate(records):
        tm.add_record(record)
        if (ix % 1000) != 0:
            continue
        # report every 1000 records
        clear_output(wait=True)
        print(f'Loaded {ix} records')
input_data.input_stats_report()
tm.prepare_data()
Output cleaned data to Output/Output_cleaned.csv, and clean terms (post_body and tags) based on the top (e.g. 1000) hashtags found in the dataset.
tm.lbsn_data.write_cleaned_data(panon=True)
Have a look at the output file.
import pandas as pd
file_path = Path.cwd() / "02_Output" / "Output_cleaned.csv"
display(pd.read_csv(file_path, nrows=5).head())
print(f'{file_path.stat().st_size / (1024*1024):.02f} MB')
Read from pre-filtered data
Read data from the (already prepared and filtered) cleaned output:
%%time
tm.prepare_data(
    input_path=file_path)
tm.global_stats_report()
tm.item_stats_report()
The next goal is to select reactions to given topics. TagMaps allows selecting posts for 4 different types: TAGS, EMOJI, LOCATIONS, and TOPICS.
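For illustration, selection works the same way across these dimensions; a sketch (only the TOPICS variant is actually used below, the TAGS and EMOJI items are assumed examples):
# fig = tm.get_selection_map(TAGS, ["park"])
# fig = tm.get_selection_map(EMOJI, ["🌸"])
# fig = tm.get_selection_map(TOPICS, ["park", "grass"])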
Set basic plot to notebook mode and disable interactive plots for now
import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline
mpl.rcParams['savefig.dpi'] = 120
mpl.rcParams['figure.dpi'] = 120
We can retrieve preview maps for the TOPIC dimension
by supplying a list of terms.
For example "park", "green" and "grass" should
give us an overview where such terms are used
in our sample area.
nature_terms = ["park", "grass", "nature"]
fig = tm.get_selection_map(
    TOPICS, nature_terms)
urban_terms = ["strasse", "city", "shopping"]
fig = tm.get_selection_map(
    TOPICS, urban_terms)
We can visualize clusters for the selected topic using HDBSCAN.
The important parameter for HDBSCAN is the cluster distance,
which is chosen automatically by Tag Maps given the current scale/extent of analysis.
# tm.clusterer[TOPICS].cluster_distance = 150
fig = tm.get_cluster_map(
    TOPICS, nature_terms)
We can get a map of clusters and cluster shapes (convex and concave hulls).
fig = tm.get_cluster_shapes_map(
    TOPICS, nature_terms)
Behind the scenes, Tag Maps utilizes the Single Linkage Tree from HDBSCAN to cut clusters at a specified distance. This tree shows the hierarchical structure for our topic and its spatial properties in the given area.
fig = tm.get_singlelinkagetree_preview(
    TOPICS, nature_terms)
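As background, here is a minimal sketch of cutting a single linkage tree at a fixed distance, using the hdbscan package directly on synthetic coordinates (the parameter values are assumptions, not what Tag Maps uses internally):
import hdbscan
import numpy as np

# two synthetic groups of projected (metric) coordinates
rng = np.random.default_rng(42)
xy = np.vstack([
    rng.normal((0, 0), 50, (100, 2)),
    rng.normal((1000, 1000), 50, (100, 2))])
clusterer = hdbscan.HDBSCAN(min_cluster_size=5).fit(xy)
# cut the tree at 150 m: points that merge below this distance share a cluster
labels = clusterer.single_linkage_tree_.get_clusters(
    cut_distance=150, min_cluster_size=5)
print(np.unique(labels))  # -1 denotes noise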
Similarly, we can retrieve centroids of clusters. This shows again the unequal distribution of data:
fig = tm.clusterer[TOPICS].get_cluster_centroid_preview(
    nature_terms, single_clusters=True)
Import dependencies
import numpy as np
from sklearn.neighbors import KernelDensity
Instagram, Facebook, and Twitter data is based on place accuracy (coordinates refer to places, not exact positions), which is unsuitable for the Heat Map graphic.
We'll work with Flickr data only for the Heat Map.
Reload data, filtering only Flickr
tm_cfg.filter_origin = "2"  # origin id "2" = Flickr
tm = TagMaps(**tm_opts)
%%time
input_data = LoadData(tm_cfg)
with input_data as records:
    for ix, record in enumerate(records):
        tm.add_record(record)
        if (ix % 1000) != 0:
            continue
        # report every 1000 records
        clear_output(wait=True)
        print(f'Loaded {ix} records')
input_data.input_stats_report()
The distribution of these coordinates is what we want to visualize.
tm.prepare_data()
tm.init_cluster()
Topic selection
topic = "grass-nature-park"
points = tm.clusterer[TOPICS].get_np_points(
    item=topic,
    silent=True)
Topics are formed of terms joined by hyphens (term-term-term). Note that the tagmaps package is still in an early stage of development. Also retrieve all points, which we will later use to normalize the topic selection:
all_points = tm.clusterer[TOPICS].get_np_points()
For normalizing our final KDE raster, we'll process both the topic selection points and the global data distribution (i.e. all points in the dataset).
points_list = [points, all_points]
get_np_points() returns an array (numpy.ndarray) of decimal degree coordinates:
print(points[:5])
print(f'Total coordinates: {len(points)}')
The data needs to be projected from WGS 1984 (epsg:4326) to UTM (epsg:32633). Project lat/lng to UTM 33N (Dresden) using the Transformer:
%%time
for idx, points in enumerate(points_list):
    points_list[idx] = np.array(
        [tm.clusterer[TOPICS].proj_transformer.transform(
            point[0], point[1]
        ) for point in points])
crs_wgs = tm.clusterer[TOPICS].crs_wgs
crs_proj = tm.clusterer[TOPICS].crs_proj
print(f'From {crs_wgs} \nto {crs_proj}')
print(points_list[0][:5])
To summarize the sklearn workflow, a KDE is executed in two steps, training and testing:
Machine learning is about learning some properties of a data set and then testing those properties against another data set. A common practice in machine learning is to evaluate an algorithm by splitting a data set into two. We call one of those sets the training set, on which we learn some properties; we call the other set the testing set, on which we test the learned properties.
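A minimal, self-contained sketch of this two-step pattern with synthetic data (the bandwidth value here is arbitrary):
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
xy_fit = rng.normal(0, 1, (500, 2))           # training set: learn the density
xy_eval = np.array([[0.0, 0.0], [5.0, 5.0]])  # testing set: evaluate it here
kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(xy_fit)
print(np.exp(kde.score_samples(xy_eval)))     # the point near the center scores far higher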
The only attributes we care about in training
are lat and long.
xy_list = list()
for points in points_list:
    y = points[:, 1]
    x = points[:, 0]
    xy_list.append([x, y])
xy_train_list = list()
for x, y in xy_list:
    xy_train = np.vstack([y, x]).T
    xy_train_list.append(xy_train)
print(xy_train)
Get bounds from total data
Access min/max decimal degree bounds object of clusterer and project to UTM33N
lim_lng_max = tm.clusterer[TAGS].bounds.lim_lng_max
lim_lng_min = tm.clusterer[TAGS].bounds.lim_lng_min
lim_lat_max = tm.clusterer[TAGS].bounds.lim_lat_max
lim_lat_min = tm.clusterer[TAGS].bounds.lim_lat_min
# project WGS1984 to UTM
topright = tm.clusterer[TOPICS].proj_transformer.transform(
    lim_lng_max, lim_lat_max)
bottomleft = tm.clusterer[TOPICS].proj_transformer.transform(
    lim_lng_min, lim_lat_min)
# get separate min/max for x/y
right_bound = topright[0]
left_bound = bottomleft[0]
top_bound = topright[1]
bottom_bound = bottomleft[1]
Create Sample Mesh
Create a grid of points at which to predict.
xbins = 100j
ybins = 100j
xx, yy = np.mgrid[
    left_bound:right_bound:xbins,
    bottom_bound:top_bound:ybins]
xy_sample = np.vstack([yy.ravel(), xx.ravel()]).T
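As a side note, the complex step (100j) in the slices is numpy's convention for "number of samples" rather than step size:
print(np.mgrid[0:1:3j])  # [0.  0.5 1. ] -> 3 evenly spaced samples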
Generate training data for the Kernel Density. The key setting here is the bandwidth parameter.
kde_list = list()
for xy_train in xy_train_list:
    kde = KernelDensity(
        kernel='gaussian', bandwidth=200, algorithm='ball_tree')
    kde.fit(xy_train)
    kde_list.append(kde)
score_samples() returns the log-likelihood of the samples:
%%time
z_list = list()
for kde in kde_list:
    z_scores = kde.score_samples(xy_sample)
    z = np.exp(z_scores)
    z_list.append(z)
Remove values below zero; these are locations where the selected topic is underrepresented, given the global KDE mean:
for ix in range(0, 2):
    z_list[ix] = z_list[ix].clip(min=0)
Normalize z-scores to a range of 1 to 1000 for comparison:
for idx, z in enumerate(z_list):
    z_list[idx] = np.interp(
        z, (z.min(), z.max()), (1, 1000))
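For intuition, a quick check of what np.interp does with these arguments (the input values are illustrative):
print(np.interp([0.0, 0.5, 1.0], (0, 1), (1, 1000)))  # approx. [1. 500.5 1000.]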
Subtract global z-scores from local z-scores (our topic selection). The norm_val parameter controls the degree of normalization (0 to 1):
norm_val = 0.5
z_orig = z_list[0]
z_is = z_list[0] - (z_list[1]*norm_val)
z_results = [z_orig, z_is]
Try different values for the norm_val parameter to see its effect. Reshape results to fit the grid mesh extent:
for idx, z_result in enumerate(z_results):
    z_results[idx] = z_result.reshape(xx.shape)
Plot original and normalized meshes
from matplotlib.ticker import FuncFormatter
def y_formatter(y_value, __):
    """Format UTM to decimal degrees y-labels for improved legibility"""
    xy_value = tm.clusterer[
        TOPICS].proj_transformer_back.transform(
            left_bound, y_value)
    return f'{xy_value[1]:3.2f}'

def x_formatter(x_value, __):
    """Format UTM to decimal degrees x-labels for improved legibility"""
    xy_value = tm.clusterer[
        TOPICS].proj_transformer_back.transform(
            x_value, bottom_bound)
    return f'{xy_value[0]:3.1f}'
# a figure with a 1x2 grid of Axes
fig, ax_lst = plt.subplots(1, 2, figsize=(10, 3))
for idx, zz in enumerate(z_results):
    axis = ax_lst[idx]
    # change the fontsize of tick labels
    axis.tick_params(axis='both', which='major', labelsize=7)
    axis.tick_params(axis='both', which='minor', labelsize=7)
    # set plotting bounds
    axis.set_xlim(
        [left_bound, right_bound])
    axis.set_ylim(
        [bottom_bound, top_bound])
    # plot contours of the density
    levels = np.linspace(zz.min(), zz.max(), 15)
    # create contours, saved to the CF variable
    # so we can later reference it in the colorbar hook
    CF = axis.contourf(
        xx, yy, zz, levels=levels, cmap='viridis')
    # titles for plots
    if idx == 0:
        title = f'Original KDE for topic {topic}\n(Flickr Only)'
    else:
        title = f'Normalized KDE for topic {topic}\n(Flickr Only)'
    axis.set_title(title, fontsize=11)
    axis.set_xlabel('lng', fontsize=10)
    axis.set_ylabel('lat', fontsize=10)
    # plot points on top
    axis.scatter(
        xy_list[1][0], xy_list[1][1], s=1, facecolor='white', alpha=0.05)
    axis.scatter(
        xy_list[0][0], xy_list[0][1], s=1, facecolor='white', alpha=0.1)
    # replace x, y coords with decimal degrees text (instead of UTM dist)
    axis.yaxis.set_major_formatter(
        FuncFormatter(y_formatter))
    axis.xaxis.set_major_formatter(
        FuncFormatter(x_formatter))
# Make a colorbar for the ContourSet returned by the contourf call
cbar = fig.colorbar(CF, fraction=0.046, pad=0.04)
cbar.ax.set_ylabel(
    "Number of \n User Post Locations (UPL)", fontsize=11)
cbar.ax.tick_params(axis='both', labelsize=7)
cbar.ax.yaxis.set_tick_params(width=0.5, length=4)