Based on data from the YFCC100m dataset, this notebook series explores a privacy-aware processing example for visualizing frequentation patterns in a 100x100 km grid (worldwide).
This is the first notebook in a series of four notebooks:
This notebook includes the following steps:
For easier replication, the notebooks make use of several Docker containers. If you want to follow the notebooks without Docker, you need to set up the services (Postgres with Citus HLL, JupyterLab) based on your own system configuration.
Define global settings
# GeoHash precision level
# to pre-aggregate spatial data (coordinates)
GEOHASH_PRECISION = 5
from pathlib import Path
# define path to output directory (figures etc.)
OUTPUT = Path.cwd().parents[0] / "out"
Create paths
def create_paths(
    output: Path = OUTPUT,
    subfolders=("html", "pickles", "csv", "figures")):
    """Create subfolders for results to be stored"""
    output.mkdir(exist_ok=True)
    for subfolder in subfolders:
        Path(output / subfolder).mkdir(exist_ok=True)
create_paths()
Load keys from .env
import os
from dotenv import load_dotenv
dotenv_path = Path.cwd().parents[0] / '.env'
load_dotenv(dotenv_path)
CRYPT_KEY = os.getenv("CRYPT_KEY")
USER_KEY = os.getenv("USER_KEY")
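Optionally, a quick sanity check (a minimal sketch; both variable names are those defined above) to fail early if either key is missing from the .env file:

```python
# Optional: fail early if one of the keys was not found in .env
if CRYPT_KEY is None or USER_KEY is None:
    raise ValueError(
        f"CRYPT_KEY or USER_KEY not found in {dotenv_path}")
```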
Below is a summary of requirements to get this notebook running:
If you want to run the notebook yourself, either get the LBSN JupyterLab Docker container, or follow these steps to create a conda environment for running this notebook. Miniconda is suggested (in Windows, use WSL).
conda create -n yfcc_env -c conda-forge
conda activate yfcc_env
conda config --env --set channel_priority strict
conda config --show channel_priority # verify
# visualization dependencies
conda install -c conda-forge geopandas jupyterlab "geoviews-core=1.8.1" descartes mapclassify jupyter_contrib_nbextensions xarray
# only necessary when using data from db
conda install -c conda-forge python-dotenv psycopg2
To upgrade packages later, use:
conda upgrade -n yfcc_env --all -c conda-forge
Pinning geoviews to 1.8.1 should result in packages installed that are compatible with the code herein.
The YFCC100m dataset is a typical example of user-generated content that is made publicly available for anyone to use. It was published by Flickr in 2014 (Thomee et al. 2013). The core dataset is distributed as a compressed archive that contains only the metadata for about 100 Million photos and videos from Flickr published under a Creative Commons License. About 48 Million of the photos are geotagged.
Even if user-generated data is explicitly made public, like in this case, certain risks to privacy exist. Data may be re-purposed in contexts not originally anticipated by the users publishing the data. IBM, for example, re-purposed the YFCC100m dataset to fuel a facial-recognition project, without the consent of the people in the images.
The core dataset consists of two CSV files of about 14 GB in total, hosted in an Amazon AWS S3 bucket. It contains a list of photos and videos and related metadata (titles, tags, timestamps etc.).
An overview of the available columns in this dataset is provided by Deng & Li (2018). The table below contains a summary of the CSV columns.
Column | Metadata Description | Example |
---|---|---|
0 | row id | 0 |
1 | Photo/video identifier | 6185218911 |
2 | User NSID | 4e2f7a26a1dfbf165a7e30bdabf7e72a |
3 | User ID | 39019111@N00 |
4 | User nickname | guckxzs |
5 | Date taken | 2012-02-16 09:56:37.0 |
6 | Date uploaded | 1331840483 |
7 | Capture device | Canon+PowerShot+ELPH+310+HS |
8 | Title | IMG_0520 |
9 | Description | My vacation
10 | User tags (comma-separated) | canon,canon+powershot+hs+310 |
11 | Machine tags (comma-separated) | landscape, hills, water |
12 | Longitude | -81.804885 |
13 | Latitude | 24.550558 |
14 | Accuracy Level (see Flickr API) | 12
15 | Photo/video page URL | http://www.flickr.com/photos/39089491@N00/6985418911/ |
16 | Photo/video download URL | http://farm8.staticflickr.com/7205/6985418911_df7747990d.jpg |
17 | License name | Attribution-NonCommercial-NoDerivs License |
18 | License URL | http://creativecommons.org/licenses/by-nc-nd/2.0/ |
19 | Photo/video server identifier | 7205 |
20 | Photo/video farm identifier | 8 |
21 | Photo/video secret | df7747990d |
22 | Photo/video secret original | 692d7e0a7f |
23 | Extension of the original photo | jpg |
24 | Marker (0 = photo, 1 = video) | 0
Table 1: Summary of Metadata for each CSV column available in the core dataset (yfcc100m_dataset.csv).
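For a first look at the raw data (an optional sketch; the local path below is an assumption, adjust it to wherever the downloaded archive was extracted), the first rows of the tab-separated file can be inspected with pandas:

```python
import pandas as pd

# Hypothetical local path to the extracted core dataset
YFCC_CSV = "/data/flickr_yfcc100m/yfcc100m_dataset.csv"

# Peek at the first rows only; the full file is about 14 GB
# and has no header row
sample = pd.read_csv(YFCC_CSV, sep="\t", header=None, nrows=5)
sample.head()
```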
Next to this core dataset, several expansion packs have been released that provide additional data:
As a means to enrich the spatial information, the places expansion set is available (but is not needed to follow the guides herein).
Column | Metadata Description | Example |
---|---|---|
0 | Photo/video identifier | 6985418911 |
1 | Place reference (null to multiple) | 24703176:Admiralty:Suburb,24703128:Central+and+Western:Territory |
Table 2: Summary of Metadata for each CSV column available in the places expansion dataset (yfcc100m_places.csv).
To reproduce the examples used in this guide, follow the steps below.
Please follow the instructions provided on the official site.
Getting the YFCC100M: The dataset can be requested at Yahoo Webscope. You will need to create a Yahoo account if you do not have one already, and once logged in you will find it straightforward to submit the request for the YFCC100M. Webscope will ask you to tell them what your plans are with the dataset, which helps them justify the existence of their academic outreach program and allows them to keep offering datasets in the future. Unlike other datasets available at Webscope, the YFCC100M does not require you to be a student or faculty at an accredited university, so you will be automatically approved.
Any conversion from one structure to another requires the definition of mapping rules.
To demonstrate mapping of arbitrary LBSN data to the common LBSN structure scheme, we have built lbsntransform, a Python package that includes several pre-defined mapping sets.
Note: Have a look at the exact mapping criteria for the Flickr YFCC100M dataset. The package also contains examples for other mappings (e.g. Twitter, Facebook Places), which can be extended further.
You'll also need a Postgres Database with the SQL Implementation of the LBSN Structure.
The easiest way is to use full-stack-lbsn, a shell script that starts the following docker services:
Tip: If you're familiar with git and docker, you can also clone the above repositories separately and start individual services as needed.
Windows user? If you're working with Windows, full-stack-lbsn will only work in the Windows Subsystem for Linux (WSL). Even though it is possible to run Docker containers natively in Windows, we strongly recommend using WSL or WSL2.
After you have started the rawdb docker container, import Flickr YFCC CSVs to the database using lbsntransform.
lbsntransform --origin 21 \
--file_input \
--input_path_url "/data/flickr_yfcc100m/" \
--dbpassword_output "sample-password" \
--dbuser_output "postgres" \
--dbserveraddress_output "127.0.0.1:15432" \
--dbname_output "rawdb" \
--csv_delimiter $'\t' \
--file_type "csv" \
--zip_records
- --input_path_url: the folder where yfcc100m_places.csv and yfcc100m_dataset.csv are saved.
- --csv_delimiter: the CSV delimiter is a tab character, which needs to be passed as $'\t' via the command line.
- --zip_records: the number of records in yfcc100m_dataset.csv and yfcc100m_places.csv matches. This tells lbsntransform to concatenate both files on stream read.

Note: Reading the full dataset into the database will require at least 50 GB of hard drive space and, depending on your hardware, up to several days of processing. You can read the dataset partially by adding --transferlimit 10000, e.g. to only read the first 10000 entries.
We're using a two-database setup: rawdb stores the original (raw) data and hlldb stores the privacy-aware (HLL) version of the data.
To query original data from rawdb and convert it for hlldb, we're using a Materialized View.
Materialized View? Materialized views are static subsets extracted from larger PostgreSQL tables. MViews provide a performant filter for large queries. In a real-world example, data would typically be directly streamed and filtered, without the need for a Materialized View.
import os, sys
import psycopg2 # Postgres API
import geoviews as gv
import holoviews as hv
import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
from IPython.display import display, clear_output
Load the helper module from ../py/modules/tools.py. This also allows importing code from other Jupyter notebooks, synced to *.py with jupytext.
module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    sys.path.append(module_path)
from modules import tools, preparations
Set pandas colwidth.
pd.set_option('display.max_colwidth', 25)
Load credentials from environment variables and set connection parameters:
db_user = "postgres"
db_pass = os.getenv('POSTGRES_PASSWORD')
# set connection variables
db_host = "rawdb"
db_port = "5432"
db_name = "rawdb"
Connect to raw database:
db_connection = psycopg2.connect(
host=db_host,
port=db_port,
dbname=db_name,
user=db_user,
password=db_pass
)
Test connection:
db_query = """
SELECT 1;
"""
# create pandas DataFrame from database data
df = pd.read_sql_query(db_query, db_connection)
display(df.head())
Simplify query access:
db_conn = tools.DbConn(db_connection)
db_conn.query("SELECT 1;")
If any SQL results in an error, the cursor cannot be used again. In this case, run db_connection.rollback() once, to reset the cursor.
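A minimal example of this pattern (an optional sketch; the failing table name is made up), using the db_connection defined above:

```python
try:
    bad_cur = db_connection.cursor()
    # this table does not exist, so the statement will fail
    bad_cur.execute("SELECT * FROM nonexistent_table;")
except psycopg2.Error as e:
    print(f"Query failed: {e}")
    # reset the aborted transaction so the connection can be reused
    db_connection.rollback()
```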
Create a new schema called mviews and update the Postgres search_path to include the new schema:
sql_query = """
CREATE SCHEMA IF NOT EXISTS mviews;
ALTER DATABASE rawdb
SET search_path = "$user",
social,
spatial,
temporal,
topical,
interlinkage,
extensions,
mviews;"""
Since the above query will not return any result, we'll directly use the psycopg2 cursor object:
cur = db_connection.cursor()
cur.execute(sql_query)
print(cur.statusmessage)
HyperLogLog uses MurmurHash, which is a non-cryptographic hashing algorithm. This opens up some vulnerabilities, such as those described by Desfontaines et al. (2018).
As an additional measure to prevent re-identification of users in the final dataset, we are using the Postgres pgcrypto extension to hash any IDs with a secure, unique key.
References:
Desfontaines, D., Lochbihler, A., & Basin, D. (2018). Cardinality Estimators do not Preserve Privacy. 1–21.
Create the pgcrypto extension:
sql_query = "CREATE EXTENSION IF NOT EXISTS pgcrypto SCHEMA extensions;"
cur.execute(sql_query)
print(cur.statusmessage)
Prepare the cryptographic hash function. The following function takes an id and a key (the seed value) and produces a new, unique hash that is returned in base64 encoding.
sql_query = """
/* Produce pseudonymized hash of input id with skey
* - using skey as seed value
* - sha256 cryptographic hash function
 * - encode in base64 to reduce length of hash
* - remove trailing '=' from base64 string
* - return as text
* - optional: truncate x characters from string
*/
CREATE OR REPLACE FUNCTION
extensions.crypt_hash (IN id text, IN skey text)
RETURNS text
AS $$
SELECT
RTRIM(
ENCODE(
HMAC(
id::bytea,
skey::bytea,
'sha256'),
'base64'),
'=')
$$
LANGUAGE SQL
STRICT;
"""
cur.execute(sql_query)
print(cur.statusmessage)
Note: Cryptographic hashing alone will only produce pseudonymized data, since any id still relates to a single user (or post, etc.). Therefore, pseudonymization is considered a weak measure, which can be easily reversed, e.g. through rainbow tables or context lookup.
It is used here as an additional means to protect HLL sets from intersection attacks, e.g. as discussed by Desfontaines et al. (2018).
What is a seed value? The seed value (skey) is a secret that is used, together with the hash function (e.g. SHA256), to produce a unique, collision-free output value (the hashed id). The same ID will be converted to a different hash if the seed value is changed. Therefore, in our case, the seed value must remain the same during the entire processing of data. In this case, the seed is called a key. This key can be destroyed afterwards, if no subsequent updates are necessary.
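To illustrate this behavior, here is a minimal Python sketch of what the SQL function above computes (HMAC-SHA256, base64-encoded, padding stripped); the ids and keys below are made-up examples:

```python
import base64
import hashlib
import hmac

def crypt_hash_py(id_str: str, skey: str) -> str:
    """Python illustration of extensions.crypt_hash(id, skey)."""
    digest = hmac.new(
        key=skey.encode(), msg=id_str.encode(),
        digestmod=hashlib.sha256).digest()
    # base64-encode and strip the trailing '=' padding,
    # as in the SQL function
    return base64.b64encode(digest).decode().rstrip("=")

# the same id and key always yield the same hash ...
print(crypt_hash_py("6985418911", "example-key"))
print(crypt_hash_py("6985418911", "example-key"))
# ... while a different key yields a different hash
print(crypt_hash_py("6985418911", "another-key"))
```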
In this step, several filters are applied to significantly reduce data, as a provisional measure to reduce privacy risks.
The topic selection query below applies the following data reduction steps:

- Hash post_guid, user_guid and userday using the crypt_hash function defined earlier
- Truncate post_create_date to yyyy-MM-dd and concat with user_guid, to form the userday metric

First, set the secret key (CRYPT_KEY) for the crypt function, ideally from an external config file.
Optional cleanup step:
sql_query = f"""
DROP MATERIALIZED VIEW IF EXISTS mviews.spatiallatlng_raw_geohash_{GEOHASH_PRECISION:02};
"""
cur.execute(sql_query)
print(cur.statusmessage)
%%time
sql_query = f"""
CREATE MATERIALIZED VIEW IF NOT EXISTS mviews.spatiallatlng_raw_geohash_{GEOHASH_PRECISION:02} AS
SELECT extensions.crypt_hash(t1.post_guid, '{CRYPT_KEY}') as "post_guid",
ST_Y(ST_PointFromGeoHash(ST_GeoHash(t1.post_latlng, {GEOHASH_PRECISION}), {GEOHASH_PRECISION})) As "latitude",
ST_X(ST_PointFromGeoHash(ST_GeoHash(t1.post_latlng, {GEOHASH_PRECISION}), {GEOHASH_PRECISION})) As "longitude",
extensions.crypt_hash(t1.user_guid, '{CRYPT_KEY}') as "user_guid",
extensions.crypt_hash(t1.user_guid || to_char(t1.post_create_date, 'yyyy-MM-dd'), '{CRYPT_KEY}') as "userday",
t1.post_geoaccuracy
FROM topical.post t1
WHERE t1.post_geoaccuracy IN ('place', 'latlng', 'city');
"""
cur.execute(sql_query)
print(cur.statusmessage)
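Optionally, preview a few pseudonymized records from the new Materialized View (queried through the same connection, so the uncommitted view is visible):

```python
sql_query = f"""
SELECT *
FROM mviews.spatiallatlng_raw_geohash_{GEOHASH_PRECISION:02}
LIMIT 5;
"""
db_conn.query(sql_query)
```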
Info: GeoHash? A GeoHash precision of 5 means that coordinates are truncated to a GeoHash string of 5 characters, which reduces the positional accuracy of lat/lng to a few kilometers. Compare the following table:
Precision (number of digits) | Distance of Adjacent Cell in Meters |
---|---|
1 | 5003530 |
2 | 625441 |
3 | 123264 |
4 | 19545 |
5 | 3803 |
6 | 610 |
7 | 118 |
8 | 19 |
9 | 3.71 |
10 | 0.6 |
Table: GeoHash length and corresponding geoaccuracy in meters (Source: Wikipedia).
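To illustrate what this aggregation does to a single coordinate pair, the same PostGIS functions used in the Materialized View above can be called directly on rawdb (an optional sketch; the example lat/lng is taken from Table 1):

```python
sql_query = f"""
WITH example AS (
    SELECT ST_GeoHash(
        ST_SetSRID(ST_MakePoint(-81.804885, 24.550558), 4326),
        {GEOHASH_PRECISION}) AS gh
)
SELECT gh AS geohash,
       ST_Y(ST_PointFromGeoHash(gh, {GEOHASH_PRECISION})) AS latitude,
       ST_X(ST_PointFromGeoHash(gh, {GEOHASH_PRECISION})) AS longitude
FROM example;
"""
db_conn.query(sql_query)
```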
First, create a connection from hlldb to rawdb. On rawdb, create a user lbsn_reader with read-only privileges for the schema mviews.
sql_query = """
SELECT 1 FROM pg_roles WHERE rolname='lbsn_reader'
"""
result = db_conn.query(sql_query)
if result.empty:
    # if user does not exist
    sql_query = f"""
    CREATE USER lbsn_reader WITH
        LOGIN
        INHERIT
        PASSWORD '{USER_KEY}';
    GRANT CONNECT ON DATABASE rawdb TO lbsn_reader;
    GRANT USAGE ON SCHEMA mviews TO lbsn_reader;
    ALTER DEFAULT PRIVILEGES IN SCHEMA mviews GRANT SELECT ON TABLES TO lbsn_reader;
    """
    cur.execute(sql_query)
    print(cur.statusmessage)
By using a Foreign Table, this step establishes the connection from hlldb to rawdb.
On hlldb, install postgres_fdw extension:
sql_query = """
CREATE EXTENSION IF NOT EXISTS postgres_fdw SCHEMA extensions;
"""
cur.execute(sql_query)
print(cur.statusmessage)
Create Foreign Server on hlldb:
db_connection_hll = psycopg2.connect(
host="hlldb",
port=db_port,
dbname="hlldb",
user=db_user,
password=db_pass
)
db_conn_hll = tools.DbConn(db_connection_hll)
cur_hll = db_connection_hll.cursor()
cur_hll.execute("SELECT 1;")
print(cur_hll.statusmessage)
sql_query = f"""
CREATE SERVER IF NOT EXISTS lbsnraw
FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (
host 'rawdb',
dbname 'rawdb',
port '5432',
keepalives '1',
keepalives_idle '30',
keepalives_interval '10',
keepalives_count '5',
fetch_size '500000');
CREATE USER MAPPING IF NOT EXISTS for postgres
SERVER lbsnraw
OPTIONS (user 'lbsn_reader', password '{USER_KEY}');
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
Import the foreign table definition on hlldb:
sql_query = f"""
SELECT EXISTS (
SELECT FROM information_schema.tables
WHERE table_schema = 'extensions'
AND table_name = 'spatiallatlng_raw_geohash_{GEOHASH_PRECISION:02}'
);
"""
result = db_conn_hll.query(sql_query)
if not result["exists"][0]:
    # only import table
    # if it hasn't been imported already
    sql_query = f"""
    IMPORT FOREIGN SCHEMA mviews
        LIMIT TO (spatiallatlng_raw_geohash_{GEOHASH_PRECISION:02})
        FROM SERVER lbsnraw
        INTO extensions;
    """
    cur_hll.execute(sql_query)
    print(cur_hll.statusmessage)
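Optionally, confirm that the foreign table is now visible on hlldb:

```python
sql_query = """
SELECT foreign_table_schema, foreign_table_name
FROM information_schema.foreign_tables;
"""
db_conn_hll.query(sql_query)
```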
Tip: Optionally optimize chunk size. Depending on your hardware, adjusting the Postgres fetch_size may increase processing speed:
ALTER SERVER lbsnraw
OPTIONS (SET fetch_size '50000');
We're going to use spatial.latlng from the HLL Structure definition. The structure for this table is already available, by default, in hlldb.
Create table structure:
sql_query = """
CREATE TABLE IF NOT EXISTS spatial.latlng (
latitude float,
longitude float,
PRIMARY KEY (latitude, longitude),
latlng_geom geometry(Point, 4326) NOT NULL)
INHERITS (
social.user_hll, -- e.g. number of users/latlng (=upl)
topical.post_hll, -- e.g. number of posts/latlng
temporal.date_hll -- e.g. number of dates/latlng
);
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
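To verify that the HLL columns were inherited from the parent tables as expected, the column definitions can be listed (optional check):

```python
sql_query = """
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'spatial'
    AND table_name = 'latlng';
"""
db_conn_hll.query(sql_query)
```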
The HyperLogLog extension for Postgres from Citus that we're using here contains several tweaks to optimize performance that can affect the sensitivity of data.
From a privacy perspective, it is recommended, for example, to disable the explicit mode.
Explicit mode? When explicit mode is active, full IDs will be stored for small sets. In our case, any coordinates frequented by few users (outliers) would store full user and post IDs.
To disable explicit mode:
sql_query = """
SELECT hll_set_defaults(11, 5, 0, 1);
"""
db_conn_hll.query(sql_query)
From now on, HLL sets will directly be promoted to sparse.
This is the actual data collection and aggregation step. In the query below, different metrics are collected that are typical for LBSM visual analytics (postcount, usercount, userdays).
Make sure that no data exists in table spatial.latlng. Optional cleanup step:
sql_query = """
TRUNCATE spatial.latlng;
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
sql_query = """
SELECT * from spatial.latlng limit 10;
"""
results = db_conn_hll.query(sql_query)
results.empty
To test, uncomment --LIMIT 100 below.
%%time
if results.empty:
    sql_query = f"""
    INSERT INTO spatial.latlng(
        latitude,
        longitude,
        user_hll,
        post_hll,
        date_hll,
        latlng_geom)
    SELECT latitude,
        longitude,
        hll_add_agg(hll_hash_text(user_guid)) as user_hll,
        hll_add_agg(hll_hash_text(post_guid)) as post_hll,
        hll_add_agg(hll_hash_text(userday)) as date_hll,
        ST_SetSRID(ST_MakePoint(longitude, latitude), 4326) as latlng_geom
    FROM extensions.spatiallatlng_raw_geohash_{GEOHASH_PRECISION:02}
    --LIMIT 100
    GROUP BY latitude, longitude;
    """
    cur_hll.execute(sql_query)
    print(cur_hll.statusmessage)
else:
    print("Table already contains data. Please clean up first (TRUNCATE spatial.latlng;)")
Commit changes to db and close connection:
db_connection_hll.commit()
db_connection_hll.close()
This query will take some time, depending on your machine.
If, for some reason, the query does not work, try connecting directly to the psql console:
docker exec -it lbsn-hlldb /bin/bash
psql -h localhost -p 5432 -U postgres hlldb
Convert notebook to HTML
!jupyter nbconvert --to html_toc \
--output-dir=../out/html ./01_preparations.ipynb \
--template=../nbconvert.tpl \
--ExtractOutputPreprocessor.enabled=False # create single output file
In the raw and hll notebooks, we'll aggregate data per 100 km grid. Continue with 02_yfcc_gridagg_raw.ipynb and 03_yfcc_gridagg_hll.ipynb.