Extracting common areas of interest for given topics from LBSM

The task in this notebook is to extract and visualize common areas or regions of interest for a given topic from Location Based Social Media data (LBSM). On Social Media, people react to many things, and a significant share of these reactions can be seen as forms of valuation or attribution of meaning to to the physical environment. However, for visualizing such values for whole cities, there are several challenges involved:

  • Identify reactions that are related to a given topic: Topics can be considered at different levels of granularity and equal terms may convey different meanings in different contexts.
  • Social Media data is distributed very irregularly. Data peaks in urban areas and highly frequented tourist hotspots, while some less frequented areas do not have any data at all. When visualizing spatial reactions to topics, this inequal distribution must be taking into account, otherwise maps would always be biased towards areas with higher density of information. For this reason, normalizing results is crucial. The usual approach for normalization in such cases is that local distribution is evaluated against the global (e.g. spatial) distribution data.

For development purposes only: enable auto reload of develop dependencies, this will auto reload packages if code changes

In [1]:
%load_ext autoreload
%autoreload 2
#%autoreload tagmaps

Load the TagMaps package, which will serve as a base for filtering, cleaning and processing noisy social media data

In [2]:
from tagmaps import TagMaps
from tagmaps import EMOJI, TAGS, LOCATIONS, TOPICS
from tagmaps import LoadData
from tagmaps import BaseConfig
from tagmaps.classes.utils import Utils

1. Load Data & Overview

In [3]:
cfg = BaseConfig()
Loaded 320 stoplist items.
Loaded 214 inStr stoplist items.
Loaded 3 stoplist places.

Initialize tag maps from BaseConfig

In [4]:
tm = TagMaps(
    tag_cluster=cfg.cluster_tags, emoji_cluster=cfg.cluster_emoji,
    location_cluster=cfg.cluster_locations,
    output_folder=cfg.output_folder, remove_long_tail=cfg.remove_long_tail,
    limit_bottom_user_count=cfg.limit_bottom_user_count,
    max_items=cfg.max_items, topic_cluster=True)

a) Read from original data

# read input records from csv input_data = LoadData(cfg) with input_data as records: for record in records: tm.add_record(record) input_data.input_stats_report()tm.lbsn_data.write_cleaned_data(panon=False)

b) Read from pre-filtered intermediate data

Read data form (already prepared) cleaned output. Cleaned output is a filtered version of original data of type UserPostLocation (UPL), which is a reference in tagmaps as 'CleanedPost'. A UPL simply has all posts of a single user at a single coordinate merged, e.g. a reduced list of terms, tags and emoji based on global occurrence (i.e. no duplicates). there're other metrics used throughout the tagmaps package, which will become handy for measuring topic reactions here:

  • UPL - User post location
  • UC - User count
  • PC - Post count
  • PTC - Total tag count ("Post Tag Count")
  • PEC - Total emoji count
  • UD - User days (counting distinct users per day)
  • DTC - Total distinct tags
  • DEC - Total distinct emoji
  • DLC - Total distinct locations
In [5]:
tm.prepare_data(input_path='01_Input/Output_cleaned_flickr.csv')

Print stats for input data.

In [6]:
tm.global_stats_report()
tm.item_stats_report()
Total user count (UC): 5991
Total user post locations (UPL): 67434
Total distinct tags (DTC): 1000
Total distinct emoji (DEC): 0
Total distinct locations (DLC): 66093
Total tag count for the 1000 most used tags in selected area: 252510.
Total emoji count for the 1000 most used emoji in selected area: 0.
Bounds are: Min 13.581848 50.972288 Max 13.88668 51.138001

2. Topic Selection & Clustering

The next goal is to select reactions to given topics. TagMaps allows selecting posts for 4 different types:

  • TAGS (i.e. single terms)
  • EMOJI (i.e. single emoji)
  • LOCATIONS (i.e. named POIs or coordinate pairs)
  • TOPICS (i.e. list of terms)

Set basic plot to notebook mode and disable interactive plots for now

In [7]:
import matplotlib.pyplot as plt
#%matplotlib notebook
%matplotlib inline
import matplotlib as mpl
mpl.rcParams['savefig.dpi'] = 120
mpl.rcParams['figure.dpi'] = 120

We can retrieve preview maps for the TOPIC dimension by supplying a list of terms, for example "elbe", "fluss" and "river" should give us an overview where such terms are used in our sample area

In [8]:
fig = tm.get_selection_map(TOPICS,["elbe","fluss","river"])

This approach is good for terms that clearly relate to specific and named physical aspects such as the "elbe" river, which exists only once. Selecting reactions from a list of terms is more problematic for general topics such as park, nature and "green-related" activities. Have a look:

In [9]:
fig = tm.get_selection_map(TOPICS,["grass","nature","park"])

We can see some areas of increased coverage: the Großer Garten, some parts of the Elbe River. But also the city center, which may show a representative picture. However, there's a good chance that we catched many false-positives and false-negatives with this clear-cut selection.

Topic Vectors A better approach would be to select Topics based on "Topic-Vectors", e.g. after a list of terms is supplied, a vector is calculated that represents this topic. This topic vector can then be used to select posts that have similar vectors. Using a wide or narrow search angle, the broadness of selection can be affected. This is currently not implemented here but an important future improvement.

2a. HDBSCAN Cluster

We can visualize clusters for the selected topic using HDBSCAN. The important parameter for HDBSCAN is the cluster distance, which is chosen automatically by Tag Maps given the current scale of analysis. In the following, we manually reduce the cluster distance from 800 to 400, so we can a finer grained picture of topic clusters.

In [10]:
# tm.clusterer[TOPICS].autoselect_clusters = False
tm.clusterer[TOPICS].cluster_distance = 400
fig = tm.get_cluster_map(TOPICS,["grass","nature","park"])

Equally, we can get a map of clusters and cluster shapes (convex and concave hulls).

In [11]:
tm.clusterer[TOPICS].cluster_distance = 400
fig = tm.get_cluster_shapes_map(TOPICS,["grass","nature","park"])
--> 70 cluster.

Behind the scenes, Tag Maps utilizes the Single Linkage Tree from HDBSCAN to cut clusters at a specified distance. This tree shows the hierarchical structure for our topic and its spatial properties in the given area.

In [11]:
fig = tm.get_singlelinkagetree_preview(TOPICS,["grass", "park", "nature"])

The cluster distance can be manually set. If autoselect clusters is True, HDBSCAN will try to identify clusters based on its own algorithm.

In [10]:
tm.clusterer[TOPICS].cluster_distance = 400
# tm.clusterer[TOPICS].autoselect_clusters = False

2b Cluster centroids

Similarly, we can retrieve centroids of clusters. This shows again the unequal distribution of data:

In [19]:
fig = tm.clusterer[TOPICS].get_cluster_centroid_preview(["elbe","fluss","river"])
(1 of 1000) Found 4599 posts (UPL) for Topic 'elbe-fluss-river' (found in 7% of DLC in area) --> 132 cluster.

To get only the important clusters, we can drop all single cluster items:

In [11]:
# tm.clusterer[TOPICS].get_cluster_centroid_preview(["elbe", "fluss", "river"])
fig = tm.clusterer[TOPICS].get_cluster_centroid_preview(["elbe","fluss","river"], single_clusters=False)
(1 of 1000) Found 4599 posts (UPL) for Topic 'elbe-fluss-river' (found in 7% of DLC in area) --> 132 cluster.

3a. Datashader

In [47]:
# census example http://datashader.org/topics/census.html
# https://github.com/pyviz/datashader/issues/590
# http://datashader.org/topics/bay_trimesh.html
# https://github.com/pyviz/datashader/issues/582
# http://holoviews.org/reference/elements/bokeh/TriMesh.html
# Possible solution.
# 
# 1. Create the canvas with the size (per pixel) desired.
# 2. use datashader aggreate, flipud
# 3. use rasterio to create the geotiff.
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf
import dask.dataframe as dd
import numpy as np
from datashader.utils import lnglat_to_meters

background = "black"
In [48]:
from functools import partial
from datashader.utils import export_image
from datashader.colors import colormap_select, Greys9
from IPython.core.display import HTML, display
background = "black"
export = partial(export_image, background = background, export_path="export")
cm = partial(colormap_select, reverse=(background!="black"))

display(HTML("<style>.container { width:100% !important; }</style>"))

Select points

In [38]:
topic = "grass-nature-park"
points = tm.clusterer[TOPICS].get_np_points(
            item=topic,
            silent=True)
In [39]:
all_points = tm.clusterer[TOPICS].get_np_points()

project data to auto detected UTM Zone from TagMaps

In [40]:
import pyproj
# project lat/lng to UTM 33N (Dresden)
crs_wgs = tm.clusterer[TOPICS].crs_wgs
crs_proj = tm.clusterer[TOPICS].crs_proj

all_points = np.array([pyproj.transform(crs_wgs, crs_proj, point[0], point[1]) for point in all_points])
In [41]:
df = pd.DataFrame(data=all_points)
df.columns = ['longitude', 'latitude']
In [42]:
df.head()
Out[42]:
longitude latitude
0 411251.434448 5.656461e+06
1 413645.798027 5.654811e+06
2 412629.872662 5.654713e+06
3 411578.674882 5.656595e+06
4 411234.229767 5.656513e+06
df.loc[:, 'longitude'], df.loc[:, 'latitude'] = lnglat_to_meters(df.longitude,df.latitude)
In [43]:
# set plotting bounds (Dresden)
right = df.longitude.max()
left = df.longitude.min()
top = df.latitude.max()
bottom = df.latitude.min()
x_range, y_range = ((left,right), (bottom, top))
x_dist = x_range[1]-x_range[0]
y_dist = y_range[1]-y_range[0]
ratio = x_dist/y_dist
print(ratio)
plot_width  = int(750)
plot_height = int(plot_width//ratio)
1.1471341746537014
In [44]:
%%time
from colorcet import fire

cvs = ds.Canvas(plot_width=plot_width, plot_height=plot_height, x_range=x_range, y_range=y_range)
agg = cvs.points(df, 'longitude', 'latitude')
#img = tf.shade(agg)
#img = export(tf.shade(agg, cmap = cm(Greys9,0.2), how='eq_hist'), "census_gray_eq_hist")
#img = export(tf.shade(ds.Canvas().raster(agg, interpolate='nearest'), name="nearest-neighbor interpolation"))
img = export(tf.shade(agg, cmap = cm(fire,0.2), how='eq_hist'),"census_ds_fire_eq_hist")
Wall time: 956 ms
In [45]:
img
Out[45]:

Interactive DS Image:

In [59]:
import holoviews as hv
import datashader as ds
from holoviews.operation.datashader import aggregate, datashade

from datashader.bokeh_ext import InteractiveImage
import bokeh.plotting as bp

def image_callback(x_range, y_range, w, h, name=None):
    cvs = ds.Canvas(plot_width=plot_width, plot_height=plot_height, x_range=x_range, y_range=y_range)
    agg = cvs.points(df, 'longitude', 'latitude')
    img = tf.shade(agg, cmap = cm(fire,0.2), how='eq_hist')
    return export(tf.shade(agg, cmap = cm(fire,0.2), how='eq_hist'),"census_ds_fire_eq_hist")

bp.output_notebook()
p = bp.figure(tools='pan,wheel_zoom,reset', x_range=x_range, y_range=y_range, plot_width=plot_width, plot_height=plot_height)

InteractiveImage(p, image_callback)
Loading BokehJS ...
Out[59]:

3c. IDW Interpolation (Raster)

IDW (inverse distance weighting) has been tested andfound not suitable. It is shown here for reasons of comparison (to KDE/HDBSCAN). To summarize the IDW approach:

- use clustered centroids from HDBSCAN
- inverse interpolate usercounts per cluster to raster
In [15]:
import param
import panel as pn
import holoviews as hv
import numpy as np
import xmsinterp as interp
import pandas as pd
import xarray as xr
import hvplot.pandas
import math
import hvplot.xarray
from holoviews.streams import Params

Get data from TagMaps

In [60]:
tm.clusterer[TOPICS].cluster_distance = 400
points, usercount = tm.clusterer[TOPICS].get_cluster_centroid_data(
    ["grass","park","garden"], projected=True, single_clusters=False)
(1 of 1000) Found 985 posts (UPL) for Topic 'grass-park-garden' (found in 1% of DLC in area) --> 28 cluster.
In [61]:
# from datashader.utils import lnglat_to_meters as webm
df = pd.DataFrame(data=points)
df['usercount'] = usercount
# df = pd.DataFrame(data=points)
df.columns = ['longitude', 'latitude', 'usercount']
# coordinates in web mercator
# df.loc[:, 'longitude'], df.loc[:, 'latitude'] = webm(df.longitude,df.latitude)
# df.loc[:, 'longitude'], df.loc[:, 'latitude'] = lnglat_to_meters(df.longitude,df.latitude)

# reduce size of largest clusters
df['usercount'] = df['usercount'].apply(lambda x: math.sqrt(x))

#df.loc[:, 'usercount'] = math.sqrt(df.usercount).tolist()
df.head()
Out[61]:
longitude latitude usercount
0 413328.418103 5.654654e+06 9.591663
1 420729.205747 5.651568e+06 7.416198
2 411532.034409 5.656507e+06 8.717798
3 415736.755097 5.657597e+06 3.316625
4 412910.413566 5.658712e+06 4.358899

Rename columns and reverse order for plotting

In [62]:
#points = tm.clusterer[TOPICS]._get_np_points(
#            item=["green", "nature", "park"],
#            silent=True)
# df = pd.DataFrame(data=points, columns=["longitude", "latitude"])
# to study number of overlapping coordinates:
# df = df.groupby(["x", "y"]).size().to_frame('c').reset_index()

# to study number of users per cluster:
df = df.rename(columns={'longitude': 'x', 'latitude': 'y', 'usercount': 'c'})
# reverse order so that top points appear first on the plot
df = df.iloc[::-1]
In [63]:
df.head()
Out[63]:
x y c
27 412227.611383 5.653974e+06 1.000000
26 416961.447850 5.656817e+06 1.000000
25 412162.510344 5.657557e+06 1.414214
24 412269.786018 5.652359e+06 1.000000
23 415780.213229 5.650487e+06 1.000000

Visualize in hvplot

In [64]:
# plot preview in matplotlib
df.hvplot.scatter(x='x', y='y', c='c', height=480, width=800, cmap='jet', size=120, colorbar=True)
Out[64]:

bokeh_plot.png

Visualize in Geoviews

In [65]:
import xarray as xr
import numpy as np
import pandas as pd
import holoviews as hv
import geoviews as gv
import geoviews.feature as gf

import cartopy
from cartopy import crs as ccrs

import geoviews.tile_sources as gts
hv.notebook_extension('bokeh')
In [66]:
df.head()
Out[66]:
x y c
27 412227.611383 5.653974e+06 1.000000
26 416961.447850 5.656817e+06 1.000000
25 412162.510344 5.657557e+06 1.414214
24 412269.786018 5.652359e+06 1.000000
23 415780.213229 5.650487e+06 1.000000
In [67]:
#user_count = gv.Dataset(df)
user_count = hv.Dataset(df)
In [68]:
from tagmaps.classes.utils import Utils
bounds = Utils.get_rectangle_bounds(points)
#bounds = tm.clusterer[TAGS].bounds
In [69]:
#%%opts Overlay [width=600 height=300] 
#%%opts [tools=['save', 'pan', 'wheel_zoom', 'box_zoom', 'reset'] size_index=2 color_index=2 xaxis=None yaxis=None]

points = gv.Points(user_count, kdims=['x', 'y'],
              vdims=['c'], extents=(bounds.lim_lng_min, bounds.lim_lat_min, bounds.lim_lng_max, bounds.lim_lat_max),
                   crs=ccrs.UTM(zone='33N'))

gv.tile_sources.CartoLight.opts(width=800, height=480) *\
points.opts(size=6, width=800, height=480, cmap='viridis', tools=['save', 'pan', 'wheel_zoom', 'box_zoom', 'reset'], size_index=2, color_index=2)
Out[69]:

bokeh_plot.png

Generate grid for visualization

Get the min, max (x, y) of the data and generate a grid of points.

In [70]:
# get the (x,y) range of the data and expand by 10%
def buffer_range(range_min, range_max, buf_by=0.1):
    buf = (range_max - range_min) * buf_by
    return (range_min - buf, range_max + buf)

x_range = buffer_range(df.x.min(), df.x.max())
y_range = buffer_range(df.y.min(), df.y.max())

# adjust these values for a higher/lower resolution grid
number_of_x_points=480
number_of_y_points=800
x = np.linspace(x_range[0], x_range[1], number_of_x_points)
y = np.linspace(y_range[0], y_range[1], number_of_y_points)
grid = np.meshgrid(x, y)
pts = np.stack([i.flatten() for i in grid], axis=1)
In [71]:
pts
Out[71]:
array([[ 398517.84587   , 5647384.09196832],
       [ 398568.45686451, 5647384.09196832],
       [ 398619.06785903, 5647384.09196832],
       ...,
       [ 422659.29025427, 5665255.28668173],
       [ 422709.90124878, 5665255.28668173],
       [ 422760.5122433 , 5665255.28668173]])

Interpolate to the grid using linear interpolation and display the results with matplotlib.

In [72]:
linear = interp.interpolate.InterpLinear(df.values)
z = linear.interp_to_pts(pts)
# I would prefer to be able to set up an interpolator like this:
# linear = interp.Interpolator(kind='linear', values=df.values)
zz = np.reshape(z, (len(y), len(x)))
xa = xr.DataArray(coords={'x': x, 'y': y}, dims=['x', 'y'], data=zz.T)
xa.hvplot(x='x', y='y', height=480, width=800, cmap='viridis')
Out[72]: