Workshop: Social Media, Data Science, & Cartography
Alexander Dunkel, Madalina Gugulica
First step: Enable worker_env in Jupyter Lab:
!cd .. && sh activate_workshop_env.sh
This is the first notebook in a series of four notebooks. Open the other notebooks through the file explorer on the left side.
We are creating several output graphics and temporary files.
These will be stored in the subfolder notebooks/out/.
from pathlib import Path
OUTPUT = Path.cwd() / "out"
OUTPUT.mkdir(exist_ok=True)
To reduce the code shown in this notebook, some helper methods are made available in a separate file.
Load the helper module from ../py/modules/tools.py:
import sys

module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    sys.path.append(module_path)
from modules import tools
Activate autoreload of changed python files:
%load_ext autoreload
%autoreload 2
Load Instagram data for a specific hashtag.
hashtag = "park"
query_url = f'https://www.instagram.com/explore/tags/{hashtag}/?__a=1'
from IPython.core.display import HTML
display(HTML(tools.print_link(query_url, hashtag)))
First, try to get the json-data without login. This may or may not work:
import requests
json_text = None
response = requests.get(
    url=query_url, headers=tools.HEADER)
if response.status_code != 429 and "/login/" not in response.url:
    json_text = response.text
    print("Loaded live json")
Optionally, write to temporary file:
if json_text:
    with open(OUTPUT / f"live_{hashtag}.json", 'w') as f:
        f.write(json_text)
If the url refers to the "login" page (or the status code is 429), access is blocked. In this case, try to load a previously stored local json:
if not json_text:
    # check if manual json exists
    local_json = list(OUTPUT.glob('*.json'))
    if len(local_json) > 0:
        # read local json
        with open(local_json[0], 'r') as f:
            json_text = f.read()
        print("Loaded local json")
If neither live nor local json has been loaded, load sample json:
if not json_text:
    sample_url = tools.get_sample_url()
    sample_json_url = f'{sample_url}/download?path=%2F&files=park.json'
    response = requests.get(url=sample_json_url)
    json_text = response.text
    print("Loaded sample json")
Turn text into json format:
import json
json_data = json.loads(json_text)
Have a peek at the returned data.
print(json.dumps(json_data, indent=2)[0:550])
The json data is nested. Values can be accessed with dictionary keys.
total_cnt = json_data["graphql"]["hashtag"]["edge_hashtag_to_media"].get("count")
display(HTML(
f'''<details><summary>Working with the JSON Format</summary>
The json data is nested. Values can be accessed with dictionary keys. <br>For example,
for the hashtag <strong>{hashtag}</strong>,
the total count of available images on Instagram is <strong>{total_cnt:,.0f}</strong>.
</details>
'''))
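As an aside (a hypothetical illustration, not from the original notebook): chaining dict.get() with empty-dict defaults returns None instead of raising a KeyError when a key is missing in the nested json:

# .get() with a {} default makes nested access tolerant to missing keys
hashtag_dict = json_data.get("graphql", {}).get("hashtag", {})
print(hashtag_dict.get("name"))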
Another, more flexible data analytics interface is available with pandas.DataFrame().
import pandas as pd
pd.set_option("display.max_columns", 4)
df = pd.json_normalize(
    json_data["graphql"]["hashtag"]["edge_hashtag_to_media"]["edges"],
    errors="ignore")
pd.reset_option("display.max_columns")
df.transpose()
View the first few images

First, define a function. The function uses PIL's resize method and the ImageFilter.BLUR filter; the image is processed in-memory. Afterwards, plt.subplot() is used to plot the images in a row. Can you modify the code to plot the images in a multi-line grid? One possible answer is sketched after the usage example below.
from typing import List
import matplotlib.pyplot as plt
from PIL import Image, ImageFilter
from io import BytesIO
def image_grid_fromurl(url_list: List[str]):
    """Load and show images in a grid from a list of urls"""
    count = len(url_list)
    plt.figure(figsize=(11, 18))
    for ix, url in enumerate(url_list):
        r = requests.get(url=url)
        i = Image.open(BytesIO(r.content))
        resize = (150, 150)
        i = i.resize(resize)
        i = i.filter(ImageFilter.BLUR)
        ax = plt.subplot(1, count, ix + 1)
        ax.axis('off')
        plt.imshow(i)
Use the function to display images from the "node.thumbnail_src" column.

image_grid_fromurl(
    df["node.thumbnail_src"][:10])
Define the coordinates (latitude, longitude) for the location to query:

lat = 51.03711
lng = 13.76318
Get list of nearby places using commons.wikimedia.org's API:
query_url = 'https://commons.wikimedia.org/w/api.php'
params = {
    "action": "query",
    "list": "geosearch",
    "gsprimary": "all",
    "gsnamespace": 14,
    "gslimit": 50,
    "gsradius": 1000,
    "gscoord": f'{lat}|{lng}',
    "format": "json"
}
response = requests.get(
    url=query_url, params=params)
if response.status_code == 200:
    print(f"Query successful. Query url: {response.url}")
json_data = json.loads(response.text)
print(json.dumps(json_data, indent=2)[0:500])
Get the list of places.
location_dict = json_data["query"]["geosearch"]
Turn the list into a DataFrame.
df = pd.DataFrame(location_dict)
display(df.head())
df.shape
If we have queried 50 records, we have reached the limit specified in our query. There are likely more results available, which would need to be retrieved with subsequent queries (e.g. by grid/bounding box; a sketch follows below). However, for the workshop, 50 locations are enough.
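For illustration only, a sketch of such a follow-up query using the GeoData extension's gsbbox parameter instead of a radius (the cell size delta is an arbitrary assumption):

# query a single bounding-box cell; gsbbox format is "top|left|bottom|right"
delta = 0.01  # cell size in degrees, arbitrary assumption
bbox_params = dict(params)
bbox_params.pop("gscoord")
bbox_params.pop("gsradius")
bbox_params["gsbbox"] = f"{lat + delta}|{lng - delta}|{lat - delta}|{lng + delta}"
bbox_response = requests.get(url=query_url, params=bbox_params)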
Modify the data: replace "Category:" in the title column and rename the column to "name".
df["title"] = df["title"].str.replace("Category:", "")
df.rename(
columns={"title":"name"},
inplace=True)
Turn the DataFrame into a GeoDataFrame:
import geopandas as gp
gdf = gp.GeoDataFrame(
df, geometry=gp.points_from_xy(df.lon, df.lat))
Set projection, reproject
CRS_PROJ = "epsg:3857" # Web Mercator
CRS_WGS = "epsg:4326" # WGS1984
gdf.crs = CRS_WGS # Set projection
gdf = gdf.to_crs(CRS_PROJ) # Project
gdf.head()
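Since the data is now projected, coordinate units are meters, so (as a quick optional check, not part of the original workshop code) shapely's distance() yields the approximate distance between the first two places. Note that Web Mercator distorts distances away from the equator.

# distance between the first two places, in (Web Mercator) meters
print(f"{gdf.geometry[0].distance(gdf.geometry[1]):,.0f} m")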
Display location on a map
Import contextily, which provides static background tiles to be rendered with matplotlib.
import contextily as cx
1. Create a bounding box for the map
x = gdf.loc[0].geometry.x
y = gdf.loc[0].geometry.y
margin = 1000 # meters
bbox_bottomleft = (x - margin, y - margin)
bbox_topright = (x + margin, y + margin)
gdf.loc[0] is the loc-indexer from pandas. It means: access the first record of the (Geo)DataFrame.
.geometry.x is used to access the (projected) x coordinate of the geometry (point). This is only available for a GeoDataFrame (geopandas).

2. Create the point layer, annotate and plot.
from matplotlib.patches import ArrowStyle

# create the point-layer
ax = gdf.plot(
    figsize=(10, 15),
    alpha=0.5,
    edgecolor="black",
    facecolor="red",
    markersize=300)
# set display x and y limit
ax.set_xlim(
    bbox_bottomleft[0], bbox_topright[0])
ax.set_ylim(
    bbox_bottomleft[1], bbox_topright[1])
# turn off axes display
ax.set_axis_off()
# add callouts for the names of the places
for index, row in gdf.iterrows():
    # offset labels by odd/even
    label_offset_x = 30
    if (index % 2) == 0:
        label_offset_x = -100
    label_offset_y = -30
    if (index % 4) == 0:
        label_offset_y = 100
    ax.annotate(
        text=row["name"],
        xy=(row["geometry"].x, row["geometry"].y),
        xytext=(label_offset_x, label_offset_y),
        textcoords="offset points",
        bbox=dict(
            boxstyle='round,pad=0.5',
            fc='white',
            alpha=0.5),
        arrowprops=dict(
            mutation_scale=4,
            arrowstyle=ArrowStyle(
                "simple, head_length=2, head_width=2, tail_width=.2"),
            connectionstyle='arc3,rad=-0.3',
            color='black',
            alpha=0.2))
cx.add_basemap(
    ax, alpha=0.5,
    source=cx.providers.OpenStreetMap.Mapnik)
Have a look at the available basemaps:
cx.providers.keys()
And a look at the basemaps for a specific provider:
cx.providers.CartoDB.keys()
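For example, re-running the plot cell above with a different source renders the same map on CartoDB Positron tiles (a variation, not part of the original code):

# swap the tile provider in the basemap call
cx.add_basemap(
    ax, alpha=0.5,
    source=cx.providers.CartoDB.Positron)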
Plot with Holoviews / Geoviews (Bokeh)
import holoviews as hv
import geoviews as gv
from cartopy import crs as ccrs
hv.notebook_extension('bokeh')
Create point layer:
places_layer = gv.Points(
    df,
    kdims=['lon', 'lat'],
    vdims=['name', 'pageid'],
    label='Place')
Make an additional query to request pictures in the area from commons.wikimedia.org. Note that gsnamespace=6 (File) is used here, compared to 14 (Category) above:
query_url = 'https://commons.wikimedia.org/w/api.php'
params = {
    "action": "query",
    "list": "geosearch",
    "gsprimary": "all",
    "gsnamespace": 6,
    "gsradius": 1000,
    "gslimit": 500,
    "gscoord": f'{lat}|{lng}',
    "format": "json"
}
response = requests.get(
url=query_url, params=params)
print(response.url)
json_data = json.loads(response.text)
df_images = pd.DataFrame(json_data["query"]["geosearch"])
df_images.head()
Set the column type to integer:
df_images["pageid"] = df_images["pageid"].astype(int)
Set the index to pageid:
df_images.set_index("pageid", inplace=True)
df_images.head()
Load additional data from API: Place Image URLs
params = {
    "action": "query",
    "prop": "imageinfo",
    "iiprop": "timestamp|user|userid|comment|canonicaltitle|url",
    "iiurlwidth": 200,
    "format": "json"
}
See the full list of available attributes.
Query the API for a random sample of 50 images:
%%time
from IPython.display import clear_output
from datetime import datetime

count = 0
df_images["userid"] = 0  # set default value
for pageid, row in df_images.sample(n=50).iterrows():
    params["pageids"] = pageid
    response = requests.get(
        url=query_url, params=params)
    json_data = json.loads(response.text)
    image_json = json_data["query"]["pages"][str(pageid)]
    if not image_json:
        continue
    image_info = image_json.get("imageinfo")
    if image_info:
        thumb_url = image_info[0].get("thumburl")
        count += 1
        df_images.loc[pageid, "thumb_url"] = thumb_url
        clear_output(wait=True)
        display(HTML(
            f"Queried {count} image urls, "
            f"<a href='{response.url}'>last query-url</a>."))
        # assign additional attributes
        df_images.loc[pageid, "user"] = image_info[0].get("user")
        df_images.loc[pageid, "userid"] = image_info[0].get("userid")
        timestamp = pd.to_datetime(image_info[0].get("timestamp"))
        df_images.loc[pageid, "timestamp"] = timestamp
        df_images.loc[pageid, "title"] = image_json.get("title")
Jupyter provides a number of built-in "magic" commands; %%time is one of them. It will output the total execution time of a cell.
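For comparison (a small aside, not from the original notebook): the single-percent %time variant measures just one statement instead of the whole cell:

# %time measures a single statement, %%time the entire cell
%time _ = sum(range(1_000_000))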
df_images[
    df_images["userid"] != 0].head()

df_images["userid"] != 0 returns True for all records where "userid" is not 0 (the default value). Records can then be sliced using boolean indexing: df_images[Condition=True].
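As an aside (not in the original notebook), several conditions can be combined with & and |, each wrapped in parentheses:

# combine boolean masks: records with a userid and a thumbnail url
df_images[
    (df_images["userid"] != 0) & (df_images["thumb_url"].notna())].head()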
Next (optional) step: Save queried data to CSV. Use to_csv() for archive purposes and to_pickle() for intermediate, temporary files stored and loaded to/from disk.

df_images[df_images["userid"] != 0].to_csv(
    OUTPUT / "wikimedia_commons_sample.csv")
Create two point layers, one for images with url and one for those without:
images_layer_thumbs = gv.Points(
    df_images[df_images["thumb_url"].notna()],
    kdims=['lon', 'lat'],
    vdims=['thumb_url', 'user', 'timestamp', 'title'],
    label='Picture (with thumbnail)')
images_layer_nothumbs = gv.Points(
    df_images[df_images["thumb_url"].isna()],
    kdims=['lon', 'lat'],
    label='Picture')
margin = 500 # meters
bbox_bottomleft = (x - margin, y - margin)
bbox_topright = (x + margin, y + margin)
from bokeh.models import HoverTool
from typing import Dict, Optional

def get_custom_tooltips(
        items: Dict[str, str], thumbs_col: Optional[str] = None) -> str:
    """Compile HoverTool tooltip formatting with items to show on hover,
    including a thumbnail image from a url"""
    tooltips = ""
    if items:
        tooltips = "".join(
            f'<div><span style="font-size: 12px;">'
            f'<span style="color: #82C3EA;">{item}:</span> '
            f'@{item}'
            f'</span></div>' for item in items)
    tooltips += f'''
        <div><img src="@{thumbs_col}" alt="" style="height:170px"></img></div>
        '''
    return tooltips
def set_active_tool(plot, element):
    """Enable wheel_zoom in bokeh plot by default"""
    plot.state.toolbar.active_scroll = plot.state.tools[0]

# prepare custom HoverTool
tooltips = get_custom_tooltips(
    thumbs_col='thumb_url', items=['title', 'user', 'timestamp'])
hover = HoverTool(tooltips=tooltips)
gv_layers = hv.Overlay(
    gv.tile_sources.EsriImagery * \
    places_layer.opts(
        tools=['hover'],
        size=20,
        line_color='black',
        line_width=0.1,
        fill_alpha=0.8,
        fill_color='red') * \
    images_layer_nothumbs.opts(
        size=5,
        line_color='black',
        line_width=0.1,
        fill_alpha=0.8,
        fill_color='lightblue') * \
    images_layer_thumbs.opts(
        size=10,
        line_color='black',
        line_width=0.1,
        fill_alpha=0.8,
        fill_color='lightgreen',
        tools=[hover])
)
Store map as static HTML file
gv_layers.opts(
    projection=ccrs.GOOGLE_MERCATOR,
    title=df.loc[0, "name"],
    responsive=True,
    xlim=(bbox_bottomleft[0], bbox_topright[0]),
    ylim=(bbox_bottomleft[1], bbox_topright[1]),
    data_aspect=0.45,  # maintain fixed aspect ratio during responsive resize
    hooks=[set_active_tool])
hv.save(
    gv_layers, OUTPUT / 'geoviews_map.html', backend='bokeh')
Display in-line view of the map:
gv_layers.opts(
    width=800,
    height=480,
    responsive=False,
    hooks=[set_active_tool],
    title=df.loc[0, "name"],
    projection=ccrs.GOOGLE_MERCATOR,
    data_aspect=1,
    xlim=(bbox_bottomleft[0], bbox_topright[0]),
    ylim=(bbox_bottomleft[1], bbox_topright[1])
)
Steps: Convert the notebook to a standalone HTML file in the out folder; shell output is suppressed with >&- 2>&- (an alternative to &>/dev/null):

!jupyter nbconvert --to html_toc \
    --output-dir=./out/ ./01_raw_intro.ipynb \
    --template=../nbconvert.tpl \
    --ExtractOutputPreprocessor.enabled=False >&- 2>&-
For further visualization options for the df_images dataframe, see Pandas Visualization.