The task in this notebook is to extract and visualize common areas or regions of interest for a given topic from location-based social media (LBSM) data. On social media, people react to many things, and a significant share of these reactions can be seen as forms of valuation or attribution of meaning to the physical environment. However, visualizing such values for whole cities involves several challenges.
For development purposes only: enable autoreload of development dependencies; this will automatically reload packages if their code changes.
%load_ext autoreload
%autoreload 2
#%autoreload tagmaps
Load the TagMaps package, which will serve as a base for filtering, cleaning and processing noisy social media data
from tagmaps import TagMaps
from tagmaps import EMOJI, TAGS, LOCATIONS, TOPICS
from tagmaps import LoadData
from tagmaps import BaseConfig
from tagmaps.classes.utils import Utils
cfg = BaseConfig()
Initialize tag maps from BaseConfig
tm = TagMaps(
tag_cluster=cfg.cluster_tags, emoji_cluster=cfg.cluster_emoji,
location_cluster=cfg.cluster_locations,
output_folder=cfg.output_folder, remove_long_tail=cfg.remove_long_tail,
limit_bottom_user_count=cfg.limit_bottom_user_count,
max_items=cfg.max_items, topic_cluster=True)
Read data from the (already prepared) cleaned output. Cleaned output is a filtered version of the original data of type UserPostLocation (UPL), which is referred to in tagmaps as 'CleanedPost'. A UPL simply merges all posts of a single user at a single coordinate, e.g. into a reduced list of terms, tags and emoji based on global occurrence (i.e. no duplicates). There are other metrics used throughout the tagmaps package, which will come in handy for measuring topic reactions here:
tm.prepare_data(input_path='01_Input/Output_cleaned_flickr.csv')
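To get a feel for the structure of the cleaned data, the first rows of the file can also be inspected directly. This is just a quick sketch; the actual column layout and separator are defined by tagmaps:
import pandas as pd
# quick peek at the cleaned output (assuming a comma-separated layout)
pd.read_csv('01_Input/Output_cleaned_flickr.csv', nrows=5)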
Print stats for input data.
tm.global_stats_report()
tm.item_stats_report()
The next goal is to select reactions to given topics. TagMaps allows selecting posts for four different types: TAGS, EMOJI, LOCATIONS and TOPICS.
Set up basic plotting for the notebook and disable interactive plots for now (inline mode):
import matplotlib.pyplot as plt
#%matplotlib notebook
%matplotlib inline
import matplotlib as mpl
mpl.rcParams['savefig.dpi'] = 120
mpl.rcParams['figure.dpi'] = 120
We can retrieve preview maps for the TOPICS dimension by supplying a list of terms. For example, "elbe", "fluss" and "river" should give us an overview of where such terms are used in our sample area:
fig = tm.get_selection_map(TOPICS,["elbe","fluss","river"])
This approach works well for terms that clearly relate to specific, named physical features such as the "elbe" river, which exists only once. Selecting reactions from a list of terms is more problematic for general topics such as park, nature and "green"-related activities. Have a look:
fig = tm.get_selection_map(TOPICS,["grass","nature","park"])
We can see some areas of increased coverage: the Großer Garten and some parts of the Elbe River, but also the city center, which may show a representative picture. However, there's a good chance that we caught many false positives and false negatives with this clear-cut selection.
Topic Vectors: A better approach would be to select topics based on "topic vectors", i.e. after a list of terms is supplied, a vector is calculated that represents this topic. This topic vector can then be used to select posts with similar vectors. Using a wide or narrow search angle, the breadth of the selection can be adjusted. This is not currently implemented here, but it is an important future improvement.
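A minimal sketch of what such a topic-vector selection could look like, assuming pre-trained word embeddings are available (e.g. a gensim KeyedVectors model); word_vectors, posts and post.hashtags are placeholders here, not part of tagmaps:
import numpy as np

def topic_vector(terms, word_vectors):
    """Average the embeddings of the supplied terms into a single topic vector."""
    vecs = [word_vectors[t] for t in terms if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else None

def select_posts(posts, terms, word_vectors, max_angle_deg=45):
    """Keep posts whose term vector lies within max_angle_deg of the topic vector."""
    topic_vec = topic_vector(terms, word_vectors)
    selected = []
    for post in posts:
        post_vec = topic_vector(post.hashtags, word_vectors)
        if post_vec is None:
            continue
        cos_sim = np.dot(topic_vec, post_vec) / (
            np.linalg.norm(topic_vec) * np.linalg.norm(post_vec))
        angle = np.degrees(np.arccos(np.clip(cos_sim, -1, 1)))
        # a wider search angle yields a broader selection
        if angle <= max_angle_deg:
            selected.append(post)
    return selected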
We can visualize clusters for the selected topic using HDBSCAN. The important parameter for HDBSCAN is the cluster distance, which is chosen automatically by Tag Maps given the current scale of analysis. In the following, we manually reduce the cluster distance from 800 to 400, so we get a finer-grained picture of topic clusters.
# tm.clusterer[TOPICS].autoselect_clusters = False
tm.clusterer[TOPICS].cluster_distance = 400
fig = tm.get_cluster_map(TOPICS,["grass","nature","park"])
Equally, we can get a map of clusters and cluster shapes (convex and concave hulls).
tm.clusterer[TOPICS].cluster_distance = 400
fig = tm.get_cluster_shapes_map(TOPICS,["grass","nature","park"])
Behind the scenes, Tag Maps utilizes the Single Linkage Tree from HDBSCAN to cut clusters at a specified distance. This tree shows the hierarchical structure for our topic and its spatial properties in the given area.
fig = tm.get_singlelinkagetree_preview(TOPICS,["grass", "park", "nature"])
The cluster distance can be set manually. If autoselect_clusters is True, HDBSCAN will instead try to identify clusters based on its own algorithm.
tm.clusterer[TOPICS].cluster_distance = 400
# tm.clusterer[TOPICS].autoselect_clusters = False
Similarly, we can retrieve the centroids of clusters. This again shows the unequal distribution of the data:
fig = tm.clusterer[TOPICS].get_cluster_centroid_preview(["elbe","fluss","river"])
To get only the important clusters, we can drop all single cluster items:
# tm.clusterer[TOPICS].get_cluster_centroid_preview(["elbe", "fluss", "river"])
fig = tm.clusterer[TOPICS].get_cluster_centroid_preview(["elbe","fluss","river"], single_clusters=False)
# census example http://datashader.org/topics/census.html
# https://github.com/pyviz/datashader/issues/590
# http://datashader.org/topics/bay_trimesh.html
# https://github.com/pyviz/datashader/issues/582
# http://holoviews.org/reference/elements/bokeh/TriMesh.html
# Possible solution:
#
# 1. Create the canvas with the desired size (per pixel).
# 2. Use the datashader aggregate, flipud.
# 3. Use rasterio to create the GeoTIFF.
# (a minimal sketch of these steps follows after the static image below)
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf
import dask.dataframe as dd
import numpy as np
from datashader.utils import lnglat_to_meters
from functools import partial
from datashader.utils import export_image
from datashader.colors import colormap_select, Greys9
from IPython.core.display import HTML, display
background = "black"
export = partial(export_image, background = background, export_path="export")
cm = partial(colormap_select, reverse=(background!="black"))
display(HTML("<style>.container { width:100% !important; }</style>"))
Select points
topic = "grass-nature-park"
points = tm.clusterer[TOPICS].get_np_points(
item=topic,
silent=True)
all_points = tm.clusterer[TOPICS].get_np_points()
Project the data to the UTM zone auto-detected by TagMaps:
import pyproj
# project lat/lng to UTM 33N (Dresden)
crs_wgs = tm.clusterer[TOPICS].crs_wgs
crs_proj = tm.clusterer[TOPICS].crs_proj
all_points = np.array([pyproj.transform(crs_wgs, crs_proj, point[0], point[1]) for point in all_points])
df = pd.DataFrame(data=all_points)
df.columns = ['longitude', 'latitude']
df.head()
# set plotting bounds (Dresden)
right = df.longitude.max()
left = df.longitude.min()
top = df.latitude.max()
bottom = df.latitude.min()
x_range, y_range = ((left,right), (bottom, top))
x_dist = x_range[1]-x_range[0]
y_dist = y_range[1]-y_range[0]
ratio = x_dist/y_dist
print(ratio)
plot_width = int(750)
plot_height = int(plot_width//ratio)
%%time
from colorcet import fire
cvs = ds.Canvas(plot_width=plot_width, plot_height=plot_height, x_range=x_range, y_range=y_range)
agg = cvs.points(df, 'longitude', 'latitude')
#img = tf.shade(agg)
#img = export(tf.shade(agg, cmap = cm(Greys9,0.2), how='eq_hist'), "census_gray_eq_hist")
#img = export(tf.shade(ds.Canvas().raster(agg, interpolate='nearest'), name="nearest-neighbor interpolation"))
img = export(tf.shade(agg, cmap = cm(fire,0.2), how='eq_hist'),"census_ds_fire_eq_hist")
img
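As noted in the comments above, the aggregate can also be written out as a GeoTIFF. A minimal sketch, assuming rasterio is available, the projected UTM 33N coordinates (EPSG:32633) as the target CRS, and an example output path:
import numpy as np
import rasterio
from rasterio.transform import from_bounds

# datashader aggregates start at the bottom row; GeoTIFF rows start at the top, so flip vertically
data = np.flipud(agg.values).astype('float32')
# georeference the raster with the same bounds used for the canvas (UTM 33N / EPSG:32633 assumed)
transform = from_bounds(x_range[0], y_range[0], x_range[1], y_range[1],
                        data.shape[1], data.shape[0])
with rasterio.open(
        'export/topic_raster.tif', 'w', driver='GTiff',
        height=data.shape[0], width=data.shape[1], count=1,
        dtype='float32', crs='EPSG:32633', transform=transform) as dst:
    dst.write(data, 1)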
Interactive datashader image:
import holoviews as hv
import datashader as ds
from holoviews.operation.datashader import aggregate, datashade
from datashader.bokeh_ext import InteractiveImage
import bokeh.plotting as bp
def image_callback(x_range, y_range, w, h, name=None):
    # re-aggregate for the current zoom extent and plot size
    cvs = ds.Canvas(plot_width=w, plot_height=h, x_range=x_range, y_range=y_range)
    agg = cvs.points(df, 'longitude', 'latitude')
    # shade once and return the exported image for InteractiveImage
    return export(tf.shade(agg, cmap=cm(fire, 0.2), how='eq_hist'), "census_ds_fire_eq_hist")
bp.output_notebook()
p = bp.figure(tools='pan,wheel_zoom,reset', x_range=x_range, y_range=y_range, plot_width=plot_width, plot_height=plot_height)
InteractiveImage(p, image_callback)
IDW (inverse distance weighting) has been tested and found not suitable. It is shown here for comparison with KDE/HDBSCAN (a minimal sketch follows after the list). To summarize the IDW approach:
- use clustered centroids from HDBSCAN
- inverse interpolate usercounts per cluster to raster
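A minimal IDW sketch in plain NumPy, assuming `centroids` is an (n, 2) array of projected cluster centroids and `values` their user counts (both placeholders for the data retrieved below):
import numpy as np

def idw_raster(centroids, values, x, y, power=2):
    """Inverse-distance-weighted interpolation of cluster values onto a grid."""
    xx, yy = np.meshgrid(x, y)
    grid = np.stack([xx.ravel(), yy.ravel()], axis=1)
    # distances from every grid cell to every centroid
    dist = np.linalg.norm(grid[:, None, :] - centroids[None, :, :], axis=2)
    dist = np.maximum(dist, 1e-9)  # avoid division by zero at centroid locations
    weights = 1.0 / dist ** power
    z = (weights * values).sum(axis=1) / weights.sum(axis=1)
    return z.reshape(xx.shape)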
import param
import panel as pn
import holoviews as hv
import numpy as np
import xmsinterp as interp
import pandas as pd
import xarray as xr
import hvplot.pandas
import math
import hvplot.xarray
from holoviews.streams import Params
Get data from TagMaps
tm.clusterer[TOPICS].cluster_distance = 400
points, usercount = tm.clusterer[TOPICS].get_cluster_centroid_data(
["grass","park","garden"], projected=True, single_clusters=False)
# from datashader.utils import lnglat_to_meters as webm
df = pd.DataFrame(data=points)
df['usercount'] = usercount
# df = pd.DataFrame(data=points)
df.columns = ['longitude', 'latitude', 'usercount']
# coordinates in web mercator
# df.loc[:, 'longitude'], df.loc[:, 'latitude'] = webm(df.longitude,df.latitude)
# df.loc[:, 'longitude'], df.loc[:, 'latitude'] = lnglat_to_meters(df.longitude,df.latitude)
# reduce size of largest clusters
df['usercount'] = df['usercount'].apply(lambda x: math.sqrt(x))
#df.loc[:, 'usercount'] = math.sqrt(df.usercount).tolist()
df.head()
Rename columns and reverse order for plotting
#points = tm.clusterer[TOPICS]._get_np_points(
# item=["green", "nature", "park"],
# silent=True)
# df = pd.DataFrame(data=points, columns=["longitude", "latitude"])
# to study number of overlapping coordinates:
# df = df.groupby(["x", "y"]).size().to_frame('c').reset_index()
# to study number of users per cluster:
df = df.rename(columns={'longitude': 'x', 'latitude': 'y', 'usercount': 'c'})
# reverse order so that top points appear first on the plot
df = df.iloc[::-1]
df.head()
# plot preview with hvplot (bokeh)
df.hvplot.scatter(x='x', y='y', c='c', height=480, width=800, cmap='jet', size=120, colorbar=True)
import xarray as xr
import numpy as np
import pandas as pd
import holoviews as hv
import geoviews as gv
import geoviews.feature as gf
import cartopy
from cartopy import crs as ccrs
import geoviews.tile_sources as gts
hv.notebook_extension('bokeh')
df.head()
#user_count = gv.Dataset(df)
user_count = hv.Dataset(df)
from tagmaps.classes.utils import Utils
bounds = Utils.get_rectangle_bounds(points)
#bounds = tm.clusterer[TAGS].bounds
#%%opts Overlay [width=600 height=300]
#%%opts [tools=['save', 'pan', 'wheel_zoom', 'box_zoom', 'reset'] size_index=2 color_index=2 xaxis=None yaxis=None]
points = gv.Points(user_count, kdims=['x', 'y'],
vdims=['c'], extents=(bounds.lim_lng_min, bounds.lim_lat_min, bounds.lim_lng_max, bounds.lim_lat_max),
crs=ccrs.UTM(zone='33N'))
gv.tile_sources.CartoLight.opts(width=800, height=480) *\
points.opts(size=6, width=800, height=480, cmap='viridis', tools=['save', 'pan', 'wheel_zoom', 'box_zoom', 'reset'], size_index=2, color_index=2)
Get the min, max (x, y) of the data and generate a grid of points.
# get the (x,y) range of the data and expand by 10%
def buffer_range(range_min, range_max, buf_by=0.1):
buf = (range_max - range_min) * buf_by
return (range_min - buf, range_max + buf)
x_range = buffer_range(df.x.min(), df.x.max())
y_range = buffer_range(df.y.min(), df.y.max())
# adjust these values for a higher/lower resolution grid
number_of_x_points=480
number_of_y_points=800
x = np.linspace(x_range[0], x_range[1], number_of_x_points)
y = np.linspace(y_range[0], y_range[1], number_of_y_points)
grid = np.meshgrid(x, y)
pts = np.stack([i.flatten() for i in grid], axis=1)
pts
Interpolate to the grid using linear interpolation and display the results with hvplot.
linear = interp.interpolate.InterpLinear(df.values)
z = linear.interp_to_pts(pts)
# I would prefer to be able to set up an interpolator like this:
# linear = interp.Interpolator(kind='linear', values=df.values)
zz = np.reshape(z, (len(y), len(x)))
xa = xr.DataArray(coords={'x': x, 'y': y}, dims=['x', 'y'], data=zz.T)
xa.hvplot(x='x', y='y', height=480, width=800, cmap='viridis')