Workshop: Social Media, Data Science, & Cartography
Alexander Dunkel, Madalina Gugulica
First step: Enable worker_env in jupyter lab
!cd .. && sh activate_workshop_env.sh
This is the first notebook in a series of four notebooks.
Open these notebooks through the file explorer on the left side.
We are creating several output graphics and temporary files.
These will be stored in the subfolder notebooks/out/.
from pathlib import Path
OUTPUT = Path.cwd() / "out"
OUTPUT.mkdir(exist_ok=True)
To reduce the code shown in this notebook, some helper methods are made available in a separate file.
Load the helper module from ../py/modules/tools.py.
import sys
module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
sys.path.append(module_path)
from modules import tools
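For orientation, here is a minimal sketch of what two of the helpers used below (tools.HEADER and tools.print_link) might look like. This is an illustration only; the actual implementations in ../py/modules/tools.py may differ.
# Illustrative sketch only - see ../py/modules/tools.py for the real code.
HEADER = {
    # a browser-like user agent, so that requests are not rejected outright
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

def print_link(url: str, hashtag: str) -> str:
    """Return a small HTML snippet with a clickable link for display()."""
    return f'<a href="{url}">Instagram tag page for #{hashtag}</a>'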
Activate autoreload of changed python files:
%load_ext autoreload
%autoreload 2
Load Instagram data for a specific hashtag.
hashtag = "park"
query_url = f'https://www.instagram.com/explore/tags/{hashtag}/?__a=1'
from IPython.core.display import HTML
display(HTML(tools.print_link(query_url, hashtag)))
First, try to get the JSON data without login. This may or may not work:
import requests
json_text = None
response = requests.get(
url=query_url, headers=tools.HEADER)
if response.status_code != 429 and "/login/" not in response.url:
json_text = response.text
print("Loaded live json")
Optionally, write the data to a temporary file:
if json_text:
with open(OUTPUT / f"live_{hashtag}.json", 'w') as f:
f.write(json_text)
If the url refers to the "login" page (or the status code is 429), access is blocked. In this case, check for a previously stored local json:
if not json_text:
# check if a manually stored json exists
local_json = list(OUTPUT.glob('*.json'))
if local_json:
# read local json
with open(local_json[0], 'r') as f:
json_text = f.read()
print("Loaded local json")
If neither live nor local json has been loaded, load sample json:
if not json_text:
sample_url = tools.get_sample_url()
sample_json_url = f'{sample_url}/download?path=%2F&files=park.json'
response = requests.get(url=sample_json_url)
json_text = response.text
print("Loaded sample json")
Turn text into json format:
import json
json_data = json.loads(json_text)
Have a peek at the returned data.
print(json.dumps(json_data, indent=2)[0:550])
The json data is nested. Values can be accessed with dictionary keys.
total_cnt = json_data["graphql"]["hashtag"]["edge_hashtag_to_media"].get("count")
display(HTML(
f'''<details><summary>Working with the JSON Format</summary>
The json data is nested. Values can be accessed with dictionary keys. <br>For example,
for the hashtag <strong>{hashtag}</strong>,
the total count of available images on Instagram is <strong>{total_cnt:,.0f}</strong>.
</details>
'''))
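Individual posts are stored in the "edges" list below "edge_hashtag_to_media"; each entry contains a "node" dictionary with the post attributes. A quick illustration (the thumbnail_src key is assumed here, based on the columns used further below):
# access the first post ("edge") and print one of its attributes
first_node = json_data["graphql"]["hashtag"]["edge_hashtag_to_media"]["edges"][0]["node"]
print(first_node["thumbnail_src"])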
Another, more flexible data analytics interface is available with pandas.DataFrame().
import pandas as pd
pd.set_option("display.max_columns", 4)
df = pd.json_normalize(
json_data["graphql"]["hashtag"]["edge_hashtag_to_media"]["edges"],
errors="ignore")
pd.reset_option("display.max_columns")
df.transpose()
View the first few images
First, define a function. Each image is retrieved with requests, scaled down with the resize function, and blurred with the ImageFilter.BLUR filter. The image is processed in-memory. Afterwards, plt.subplot() is used to plot the images in a row. Can you modify the code to plot images in a multi-line grid?
from typing import List
import matplotlib.pyplot as plt
from PIL import Image, ImageFilter
from io import BytesIO
def image_grid_fromurl(url_list: List[str]):
"""Load and show images in a grid from a list of urls"""
count = len(url_list)
plt.figure(figsize=(11, 18))
for ix, url in enumerate(url_list):
r = requests.get(url=url)
i = Image.open(BytesIO(r.content))
resize = (150, 150)
i = i.resize(resize)
i = i.filter(ImageFilter.BLUR)
ax = plt.subplot(1, count, ix + 1)
ax.axis('off')
plt.imshow(i)
Use the function to display images from the "node.thumbnail_src" column.
image_grid_fromurl(
df["node.thumbnail_src"][:10])
Define the coordinates (WGS1984 latitude and longitude) of the location to query:
lat = 51.03711
lng = 13.76318
Get a list of nearby places using the commons.wikimedia.org API:
query_url = 'https://commons.wikimedia.org/w/api.php'
params = {
"action":"query",
"list":"geosearch",
"gsprimary":"all",
"gsnamespace":14,
"gslimit":50,
"gsradius":1000,
"gscoord":f'{lat}|{lng}',
"format":"json"
}
response = requests.get(
url=query_url, params=params)
if response.status_code == 200:
print(f"Query successful. Query url: {response.url}")
json_data = json.loads(response.text)
print(json.dumps(json_data, indent=2)[0:500])
Get the list of places.
location_dict = json_data["query"]["geosearch"]
Turn the list into a DataFrame.
df = pd.DataFrame(location_dict)
display(df.head())
df.shape
If we have queried 50 records, we have reached the limit specified in our query. More places are likely available; these would need to be retrieved with subsequent queries (e.g. by grid/bounding box). However, for the workshop, 50 locations are enough.
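For illustration only, here is one way such follow-up queries could look, reusing the parameters from above and querying a small grid of coordinates around the center, with duplicates removed by pageid. Grid size and spacing are arbitrary choices; the rest of the notebook continues with the original 50 records.
# Sketch: repeat the geosearch query on a 3x3 grid of coordinates around the
# center and merge the results, removing duplicates via their pageid.
step = 0.01  # roughly 1 km in latitude; arbitrary spacing for illustration
unique_places = {}
for lat_offset in (-step, 0, step):
    for lng_offset in (-step, 0, step):
        grid_params = dict(
            params, gscoord=f'{lat + lat_offset}|{lng + lng_offset}')
        grid_response = requests.get(url=query_url, params=grid_params)
        if grid_response.status_code != 200:
            continue
        for place in grid_response.json()["query"]["geosearch"]:
            unique_places[place["pageid"]] = place
print(f"{len(unique_places)} unique places found")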
Modify the data: remove the "Category:" prefix from the title column and rename the column to "name".
df["title"] = df["title"].str.replace("Category:", "")
df.rename(
columns={"title":"name"},
inplace=True)
Turn DataFrame into a GeoDataFrame
import geopandas as gp
gdf = gp.GeoDataFrame(
df, geometry=gp.points_from_xy(df.lon, df.lat))
Set projection, reproject
CRS_PROJ = "epsg:3857" # Web Mercator
CRS_WGS = "epsg:4326" # WGS1984
gdf.crs = CRS_WGS # Set projection
gdf = gdf.to_crs(CRS_PROJ) # Project
gdf.head()
Display location on a map
Import contextily, which provides static background tiles to be rendered with matplotlib.
import contextily as cx
1. Create a bounding box for the map
x = gdf.loc[0].geometry.x
y = gdf.loc[0].geometry.y
margin = 1000 # meters
bbox_bottomleft = (x - margin, y - margin)
bbox_topright = (x + margin, y + margin)
gdf.loc[0] is the loc-indexer from pandas. It means: access the first record of the (Geo)DataFrame.
.geometry.x is used to access the (projected) x coordinate of the geometry (point). This is only available for a GeoDataFrame (geopandas).
2. Create point layer, annotate and plot.
from matplotlib.patches import ArrowStyle
# create the point-layer
ax = gdf.plot(
figsize=(10, 15),
alpha=0.5,
edgecolor="black",
facecolor="red",
markersize=300)
# set display x and y limit
ax.set_xlim(
bbox_bottomleft[0], bbox_topright[0])
ax.set_ylim(
bbox_bottomleft[1], bbox_topright[1])
# turn off axes display
ax.set_axis_off()
# add callouts
# for the name of the places
for index, row in gdf.iterrows():
# offset labels by odd/even
label_offset_x = 30
if (index % 2) == 0:
label_offset_x = -100
label_offset_y = -30
if (index % 4) == 0:
label_offset_y = 100
ax.annotate(
text=row["name"],
xy=(row["geometry"].x, row["geometry"].y),
xytext=(label_offset_x, label_offset_y),
textcoords="offset points",
bbox=dict(
boxstyle='round,pad=0.5',
fc='white',
alpha=0.5),
arrowprops=dict(
mutation_scale=4,
arrowstyle=ArrowStyle(
"simple, head_length=2, head_width=2, tail_width=.2"),
connectionstyle='arc3,rad=-0.3',
color='black',
alpha=0.2))
cx.add_basemap(
ax, alpha=0.5,
source=cx.providers.OpenStreetMap.Mapnik)
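Optionally, store the finished map in the output folder created at the beginning of the notebook (the filename is an arbitrary choice):
# write the rendered map to notebooks/out/
ax.figure.savefig(
    OUTPUT / "wikimedia_places_map.png",
    dpi=150, bbox_inches="tight")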