Based on data from the YFCC100m dataset, this notebook series explores a privacy-aware processing example for visualizing frequentation patterns in a 100x100 km grid (worldwide).
This is the first notebook in a series of four notebooks:
This notebook includes the following steps:
For easier replication, the notebooks make use of several Docker containers. If you want to follow the notebooks without Docker, you need to set up the services (Postgres with Citus HLL, JupyterLab) based on your own system configuration.
Define global settings
# GeoHash precision level
# to pre-aggregate spatial data (coordinates)
GEOHASH_PRECISION = 5
from pathlib import Path
# define path to output directory (figures etc.)
OUTPUT = Path.cwd().parents[0] / "out"
Create paths
def create_paths(
    output: Path = OUTPUT,
    subfolders=("html", "pickles", "csv", "figures")):
    """Create subfolders for results to be stored"""
    output.mkdir(exist_ok=True)
    for subfolder in subfolders:
        Path(output / subfolder).mkdir(exist_ok=True)
create_paths()
Load keys from .env
import os
from dotenv import load_dotenv
dotenv_path = Path.cwd().parents[0] / '.env'
load_dotenv(dotenv_path)
CRYPT_KEY = os.getenv("CRYPT_KEY")
USER_KEY = os.getenv("USER_KEY")
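Optionally, a quick sanity check (a minimal sketch; both variable names are those defined above) to fail early if either key is missing from the .env file:

```python
# Optional: fail early if one of the keys was not found in .env
if CRYPT_KEY is None or USER_KEY is None:
    raise ValueError(
        f"CRYPT_KEY or USER_KEY not found in {dotenv_path}")
```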
Below is a summary of requirements to get this notebook running:
If you want to run the notebook yourself, either get the LBSN JupyterLab Docker container, or follow these steps to create a conda environment for running this notebook. Miniconda is suggested (in Windows, use WSL).
conda create -n yfcc_env -c conda-forge
conda activate yfcc_env
conda config --env --set channel_priority strict
conda config --show channel_priority # verify
# visualization dependencies
conda install -c conda-forge geopandas jupyterlab "geoviews-core=1.8.1" descartes mapclassify jupyter_contrib_nbextensions xarray
# only necessary when using data from db
conda install -c conda-forge python-dotenv psycopg2
To upgrade packages later, use:
conda upgrade -n yfcc_env --all -c conda-forge
Pinning geoviews to 1.8.1 should result in packages installed that are compatible with the code herein.
The YFCC100m dataset is a typical example of user-generated content that is made publicly available for anyone to use. It was published by Flickr in 2014 (Thomee et al. 2013). The core dataset is distributed as a compressed archive that contains only the metadata for about 100 Million photos and videos from Flickr published under a Creative Commons License. About 48 Million of the photos are geotagged.
Even if user-generated data is explicitly made public, like in this case, certain risks to privacy exist. Data may be re-purposed in contexts not originally anticipated by the users publishing the data. IBM, for example, re-purposed the YFCC100m dataset to fuel a facial-recognition project, without the consent of the people in the images.
The core dataset consists of two CSV files of about 14 GB in total, hosted in an Amazon AWS S3 bucket. It contains a list of photos and videos and related metadata (titles, tags, timestamps etc.).
An overview of the available columns in this dataset is provided by Deng & Li (2018). The table below contains a summary of the CSV columns.
Column | Metadata Description | Example |
---|---|---|
0 | row id | 0 |
1 | Photo/video identifier | 6185218911 |
2 | User NSID | 4e2f7a26a1dfbf165a7e30bdabf7e72a |
3 | User ID | 39019111@N00 |
4 | User nickname | guckxzs |
5 | Date taken | 2012-02-16 09:56:37.0 |
6 | Date uploaded | 1331840483 |
7 | Capture device | Canon+PowerShot+ELPH+310+HS |
8 | Title | IMG_0520 |
9 | Description | My vacation
10 | User tags (comma-separated) | canon,canon+powershot+hs+310 |
11 | Machine tags (comma-separated) | landscape, hills, water |
12 | Longitude | -81.804885 |
13 | Latitude | 24.550558 |
14 | Accuracy Level (see Flickr API) | 12
15 | Photo/video page URL | http://www.flickr.com/photos/39089491@N00/6985418911/ |
16 | Photo/video download URL | http://farm8.staticflickr.com/7205/6985418911_df7747990d.jpg |
17 | License name | Attribution-NonCommercial-NoDerivs License |
18 | License URL | http://creativecommons.org/licenses/by-nc-nd/2.0/ |
19 | Photo/video server identifier | 7205 |
20 | Photo/video farm identifier | 8 |
21 | Photo/video secret | df7747990d |
22 | Photo/video secret original | 692d7e0a7f |
23 | Extension of the original photo | jpg |
24 | Marker (0 = photo, 1 = video) | 0
Table 1: Summary of Metadata for each CSV column available in the core dataset (yfcc100m_dataset.csv).
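For a first look at the raw data (an optional sketch; the local path below is an assumption, adjust it to wherever the downloaded archive was extracted), the first rows of the tab-separated file can be inspected with pandas:

```python
import pandas as pd

# Hypothetical local path to the extracted core dataset
YFCC_CSV = "/data/flickr_yfcc100m/yfcc100m_dataset.csv"

# Peek at the first rows only; the full file is about 14 GB
# and has no header row
sample = pd.read_csv(YFCC_CSV, sep="\t", header=None, nrows=5)
sample.head()
```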
Next to this core dataset, several expansion packs have been released that provide additional data:
As a means to enrich the spatial information, the places expansion set is available (but is not needed to follow the guides herein).
Column | Metadata Description | Example |
---|---|---|
0 | Photo/video identifier | 6985418911 |
1 | Place reference (null to multiple) | 24703176:Admiralty:Suburb,24703128:Central+and+Western:Territory |
Table 2: Summary of Metadata for each CSV column available in the places expansion dataset (yfcc100m_places.csv).
To reproduce the examples used in this guide, follow the steps below.
Please follow the instructions provided on the official site.
Getting the YFCC100M: The dataset can be requested at Yahoo Webscope. You will need to create a Yahoo account if you do not have one already, and once logged in you will find it straightforward to submit the request for the YFCC100M. Webscope will ask you to tell them what your plans are with the dataset, which helps them justify the existence of their academic outreach program and allows them to keep offering datasets in the future. Unlike other datasets available at Webscope, the YFCC100M does not require you to be a student or faculty at an accredited university, so you will be automatically approved.
Any conversion from one structure to another requires the definition of mapping rules.
To demonstrate mapping of arbitrary LBSN data to the common LBSN structure scheme, we have built lbsntransform, a Python package that includes several pre-defined mapping sets.
Note: Have a look at the exact mapping criteria for the Flickr YFCC100M dataset. The package also contains examples for other mappings (e.g. Twitter, Facebook Places), which can be extended further.
You'll also need a Postgres Database with the SQL Implementation of the LBSN Structure.
The easiest way is to use full-stack-lbsn, a shell script that starts the following docker services:
Tip: If you're familiar with git and docker, you can also clone the above repositories separately and start individual services as needed.
Windows user? If you're working with Windows, full-stack-lbsn will only work in the Windows Subsystem for Linux (WSL). Even though it is possible to run Docker containers natively in Windows, we strongly recommend using WSL or WSL2.
After you have started the rawdb docker container, import Flickr YFCC CSVs to the database using lbsntransform.
lbsntransform --origin 21 \
--file_input \
--input_path_url "/data/flickr_yfcc100m/" \
--dbpassword_output "sample-password" \
--dbuser_output "postgres" \
--dbserveraddress_output "127.0.0.1:15432" \
--dbname_output "rawdb" \
--csv_delimiter $'\t' \
--file_type "csv" \
--zip_records
- --input_path_url: the folder where yfcc100m_places.csv and yfcc100m_dataset.csv are saved.
- --csv_delimiter: the CSV delimiter is a tab character, which needs to be passed as $'\t' via the command line.
- --zip_records: the number of records in yfcc100m_dataset.csv and yfcc100m_places.csv matches. This tells lbsntransform to concatenate both files on stream read.

Note: Reading the full dataset into the database will require at least 50 GB of hard drive space and, depending on your hardware, up to several days of processing. You can read the dataset partially by adding --transferlimit 10000, e.g. to only read the first 10000 entries.
We're using a two-database setup: rawdb stores the original (raw) data and hlldb stores the privacy-aware (HLL) version of the data.
To query original data from rawdb and convert it for hlldb, we're using a Materialized View.
Materialized View? Materialized views are static subsets extracted from larger PostgreSQL tables. MViews provide a performant filter for large queries. In a real-world example, data would typically be directly streamed and filtered, without the need for a Materialized View.
import os, sys
import psycopg2 # Postgres API
import geoviews as gv
import holoviews as hv
import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
from IPython.display import display, clear_output
Load the helper module from ../py/modules/tools.py. This also allows importing code from other Jupyter notebooks, synced to *.py with jupytext.
module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    sys.path.append(module_path)
from modules import tools, preparations
Set pandas colwidth.
pd.set_option('display.max_colwidth', 25)
Load credentials from environment variables and set connection parameters:
db_user = "postgres"
db_pass = os.getenv('POSTGRES_PASSWORD')
# set connection variables
db_host = "rawdb"
db_port = "5432"
db_name = "rawdb"
Connect to raw database:
db_connection = psycopg2.connect(
host=db_host,
port=db_port,
dbname=db_name,
user=db_user,
password=db_pass
)
Test connection:
db_query = """
SELECT 1;
"""
# create pandas DataFrame from database data
df = pd.read_sql_query(db_query, db_connection)
display(df.head())
Simplify query access:
db_conn = tools.DbConn(db_connection)
db_conn.query("SELECT 1;")
If any SQL results in an error, the cursor cannot be used again. In this case, run db_connection.rollback() once, to reset the cursor.
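A minimal example of this pattern (an optional sketch; the failing table name is made up), using the db_connection defined above:

```python
try:
    bad_cur = db_connection.cursor()
    # this table does not exist, so the statement will fail
    bad_cur.execute("SELECT * FROM nonexistent_table;")
except psycopg2.Error as e:
    print(f"Query failed: {e}")
    # reset the aborted transaction so the connection can be reused
    db_connection.rollback()
```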
Create a new schema called mviews and update the Postgres search_path to include the new schema:
sql_query = """
CREATE SCHEMA IF NOT EXISTS mviews;
ALTER DATABASE rawdb
SET search_path = "$user",
social,
spatial,
temporal,
topical,
interlinkage,
extensions,
mviews;"""
Since the above query will not return any result, we'll directly use the psycopg2 cursor object:
cur = db_connection.cursor()
cur.execute(sql_query)
print(cur.statusmessage)
HyperLogLog uses MurmurHash, which is a non-cryptographic hashing algorithm. This opens up some vulnerabilities, such as those described by Desfontaines et al. (2018).
As an additional measure to prevent re-identification of users in the final dataset, we are using the Postgres pgcrypto extension to hash any IDs with a secure, unique key.
References:
Desfontaines, D., Lochbihler, A., & Basin, D. (2018). Cardinality Estimators do not Preserve Privacy. 1–21.
Create the pgcrypto extension:
sql_query = "CREATE EXTENSION IF NOT EXISTS pgcrypto SCHEMA extensions;"
cur.execute(sql_query)
print(cur.statusmessage)
Prepare the cryptographic hash function. The following function takes an id and a key (the seed value) and produces a new, unique hash that is returned in base64 encoding.
sql_query = """
/* Produce pseudonymized hash of input id with skey
* - using skey as seed value
* - sha256 cryptographic hash function
 * - encode in base64 to reduce length of hash
* - remove trailing '=' from base64 string
* - return as text
* - optional: truncate x characters from string
*/
CREATE OR REPLACE FUNCTION
extensions.crypt_hash (IN id text, IN skey text)
RETURNS text
AS $$
SELECT
RTRIM(
ENCODE(
HMAC(
id::bytea,
skey::bytea,
'sha256'),
'base64'),
'=')
$$
LANGUAGE SQL
STRICT;
"""
cur.execute(sql_query)
print(cur.statusmessage)
Note: Cryptographic hashing alone will only produce pseudonymized data, since any id still relates to a single user (or post, etc.). Therefore, pseudonymization is considered a weak measure, which can be easily reversed, e.g. through rainbow tables or context lookup.
It is used here as an additional means to protect HLL sets from intersection attacks, e.g. as discussed by Desfontaines et al. (2018).
What is a seed value? The seed value (skey) is a secret that is used, together with the hash function (e.g. SHA256), to produce a unique, collision-free output value (the hashed id). The same ID will be converted to a different hash if the seed value is changed. Therefore, in our case, the seed value must remain the same during the entire processing of data. In this case, the seed is called a key. This key can be destroyed afterwards, if no subsequent updates are necessary.
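To illustrate this behavior, here is a minimal Python sketch of what the SQL function above computes (HMAC-SHA256, base64-encoded, padding stripped); the ids and keys below are made-up examples:

```python
import base64
import hashlib
import hmac

def crypt_hash_py(id_str: str, skey: str) -> str:
    """Python illustration of extensions.crypt_hash(id, skey)."""
    digest = hmac.new(
        key=skey.encode(), msg=id_str.encode(),
        digestmod=hashlib.sha256).digest()
    # base64-encode and strip the trailing '=' padding,
    # as in the SQL function
    return base64.b64encode(digest).decode().rstrip("=")

# the same id and key always yield the same hash ...
print(crypt_hash_py("6985418911", "example-key"))
print(crypt_hash_py("6985418911", "example-key"))
# ... while a different key yields a different hash
print(crypt_hash_py("6985418911", "another-key"))
```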
In this step, several filters are applied to significantly reduce data, as a provisional measure to reduce privacy risks.
The topic selection query below applies the following data reduction steps:

- Hash post_guid, user_guid and userday using the crypt_hash function defined earlier
- Truncate post_create_date to yyyy-MM-dd and concat with user_guid, to form the userday metric

First, set the secret key (CRYPT_KEY) for the crypt function, ideally from an external config file.
Optional cleanup step:
sql_query = f"""
DROP MATERIALIZED VIEW IF EXISTS mviews.spatiallatlng_raw_geohash_{GEOHASH_PRECISION:02};
"""
cur.execute(sql_query)
print(cur.statusmessage)
%%time
sql_query = f"""
CREATE MATERIALIZED VIEW IF NOT EXISTS mviews.spatiallatlng_raw_geohash_{GEOHASH_PRECISION:02} AS
SELECT extensions.crypt_hash(t1.post_guid, '{CRYPT_KEY}') as "post_guid",
ST_Y(ST_PointFromGeoHash(ST_GeoHash(t1.post_latlng, {GEOHASH_PRECISION}), {GEOHASH_PRECISION})) As "latitude",
ST_X(ST_PointFromGeoHash(ST_GeoHash(t1.post_latlng, {GEOHASH_PRECISION}), {GEOHASH_PRECISION})) As "longitude",
extensions.crypt_hash(t1.user_guid, '{CRYPT_KEY}') as "user_guid",
extensions.crypt_hash(t1.user_guid || to_char(t1.post_create_date, 'yyyy-MM-dd'), '{CRYPT_KEY}') as "userday",
t1.post_geoaccuracy
FROM topical.post t1
WHERE t1.post_geoaccuracy IN ('place', 'latlng', 'city');
"""
cur.execute(sql_query)
print(cur.statusmessage)
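Optionally, preview a few pseudonymized records from the new Materialized View (queried through the same connection, so the uncommitted view is visible):

```python
sql_query = f"""
SELECT *
FROM mviews.spatiallatlng_raw_geohash_{GEOHASH_PRECISION:02}
LIMIT 5;
"""
db_conn.query(sql_query)
```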
Info: GeoHash? A GeoHash precision of 5 means that coordinates are truncated to a GeoHash string of 5 characters, which reduces the positional accuracy of lat/lng to a few kilometers. Compare the following table:
Precision (number of digits) | Distance of Adjacent Cell in Meters |
---|---|
1 | 5003530 |
2 | 625441 |
3 | 123264 |
4 | 19545 |
5 | 3803 |
6 | 610 |
7 | 118 |
8 | 19 |
9 | 3.71 |
10 | 0.6 |
Table: GeoHash length and corresponding geoaccuracy in meters (Source: Wikipedia).
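To illustrate what this aggregation does to a single coordinate pair, the same PostGIS functions used in the Materialized View above can be called directly on rawdb (an optional sketch; the example lat/lng is taken from Table 1):

```python
sql_query = f"""
WITH example AS (
    SELECT ST_GeoHash(
        ST_SetSRID(ST_MakePoint(-81.804885, 24.550558), 4326),
        {GEOHASH_PRECISION}) AS gh
)
SELECT gh AS geohash,
       ST_Y(ST_PointFromGeoHash(gh, {GEOHASH_PRECISION})) AS latitude,
       ST_X(ST_PointFromGeoHash(gh, {GEOHASH_PRECISION})) AS longitude
FROM example;
"""
db_conn.query(sql_query)
```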
First, create a connection from hlldb to rawdb. On rawdb, create a user lbsn_reader with read-only privileges for the schema mviews.
sql_query = """
SELECT 1 FROM pg_roles WHERE rolname='lbsn_reader'
"""
result = db_conn.query(sql_query)
if result.empty:
    # if user does not exist
    sql_query = f"""
    CREATE USER lbsn_reader WITH
        LOGIN
        INHERIT
        PASSWORD '{USER_KEY}';
    GRANT CONNECT ON DATABASE rawdb TO lbsn_reader;
    GRANT USAGE ON SCHEMA mviews TO lbsn_reader;
    ALTER DEFAULT PRIVILEGES IN SCHEMA mviews GRANT SELECT ON TABLES TO lbsn_reader;
    """
    cur.execute(sql_query)
    print(cur.statusmessage)
By using a Foreign Table, this step establishes the connection from hlldb to rawdb.
On hlldb, install postgres_fdw extension:
sql_query = """
CREATE EXTENSION IF NOT EXISTS postgres_fdw SCHEMA extensions;
"""
cur.execute(sql_query)
print(cur.statusmessage)
Create Foreign Server on hlldb:
db_connection_hll = psycopg2.connect(
host="hlldb",
port=db_port,
dbname="hlldb",
user=db_user,
password=db_pass
)
db_conn_hll = tools.DbConn(db_connection_hll)
cur_hll = db_connection_hll.cursor()
cur_hll.execute("SELECT 1;")
print(cur_hll.statusmessage)
sql_query = f"""
CREATE SERVER IF NOT EXISTS lbsnraw
FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (
host 'rawdb',
dbname 'rawdb',
port '5432',
keepalives '1',
keepalives_idle '30',
keepalives_interval '10',
keepalives_count '5',
fetch_size '500000');
CREATE USER MAPPING IF NOT EXISTS for postgres
SERVER lbsnraw
OPTIONS (user 'lbsn_reader', password '{USER_KEY}');
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
Import the foreign table definition on hlldb:
sql_query = f"""
SELECT EXISTS (
SELECT FROM information_schema.tables
WHERE table_schema = 'extensions'
AND table_name = 'spatiallatlng_raw_geohash_{GEOHASH_PRECISION:02}'
);
"""
result = db_conn_hll.query(sql_query)
if not result["exists"][0]:
    # only import table
    # if it hasn't been imported already
    sql_query = f"""
    IMPORT FOREIGN SCHEMA mviews
        LIMIT TO (spatiallatlng_raw_geohash_{GEOHASH_PRECISION:02})
        FROM SERVER lbsnraw
        INTO extensions;
    """
    cur_hll.execute(sql_query)
    print(cur_hll.statusmessage)
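Optionally, confirm that the foreign table is now visible on hlldb:

```python
sql_query = """
SELECT foreign_table_schema, foreign_table_name
FROM information_schema.foreign_tables;
"""
db_conn_hll.query(sql_query)
```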
Tip: Optionally optimize chunk size. Depending on your hardware, adjusting the Postgres fetch_size may increase processing speed:
ALTER SERVER lbsnraw
OPTIONS (SET fetch_size '50000');
We're going to use spatial.latlng from the HLL Structure definition. The structure for this table is already available, by default, in hlldb.
Create table structure:
sql_query = """
CREATE TABLE IF NOT EXISTS spatial.latlng (
latitude float,
longitude float,
PRIMARY KEY (latitude, longitude),
latlng_geom geometry(Point, 4326) NOT NULL)
INHERITS (
social.user_hll, -- e.g. number of users/latlng (=upl)
topical.post_hll, -- e.g. number of posts/latlng
temporal.date_hll -- e.g. number of dates/latlng
);
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
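To verify that the HLL columns were inherited from the parent tables as expected, the column definitions can be listed (optional check):

```python
sql_query = """
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'spatial'
    AND table_name = 'latlng';
"""
db_conn_hll.query(sql_query)
```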
The HyperLogLog extension for Postgres from Citus that we're using here contains several tweaks to optimize performance that can affect the sensitivity of data.
From a privacy perspective, it is recommended, for example, to disable the explicit mode.
Explicit mode? When explicit mode is active, full IDs will be stored for small sets. In our case, any coordinates frequented by few users (outliers) would store full user and post IDs.
To disable explicit mode:
sql_query = """
SELECT hll_set_defaults(11, 5, 0, 1);
"""
db_conn_hll.query(sql_query)
From now on, HLL sets will directly be promoted to sparse.
This is the actual data collection and aggregation step. In the query below, different metrics are collected that are typical for LBSM visual analytics (postcount, usercount, userdays).
Make sure that no data exists in table spatial.latlng. Optional cleanup step:
sql_query = """
TRUNCATE spatial.latlng;
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
sql_query = """
SELECT * from spatial.latlng limit 10;
"""
results = db_conn_hll.query(sql_query)
results.empty
To test, uncomment --LIMIT 100 below.
%%time
if results.empty:
    sql_query = f"""
    INSERT INTO spatial.latlng(
        latitude,
        longitude,
        user_hll,
        post_hll,
        date_hll,
        latlng_geom)
    SELECT latitude,
        longitude,
        hll_add_agg(hll_hash_text(user_guid)) as user_hll,
        hll_add_agg(hll_hash_text(post_guid)) as post_hll,
        hll_add_agg(hll_hash_text(userday)) as date_hll,
        ST_SetSRID(ST_MakePoint(longitude, latitude), 4326) as latlng_geom
    FROM extensions.spatiallatlng_raw_geohash_{GEOHASH_PRECISION:02}
    --LIMIT 100
    GROUP BY latitude, longitude;
    """
    cur_hll.execute(sql_query)
    print(cur_hll.statusmessage)
else:
    print("Table already contains data. Please clean up first (TRUNCATE spatial.latlng;)")
Commit changes to db and close connection:
db_connection_hll.commit()
db_connection_hll.close()
This query will take some time, depending on your machine.
If, for some reason, the query does not work, try connecting directly to the psql console:
docker exec -it lbsn-hlldb /bin/bash
psql -h localhost -p 5432 -U postgres hlldb
Convert notebook to HTML
!jupyter nbconvert --to html_toc \
--output-dir=../out/html ./01_preparations.ipynb \
--template=../nbconvert.tpl \
--ExtractOutputPreprocessor.enabled=False # create single output file
In the raw and hll notebooks, we'll aggregate data per 100 km grid. Continue with 02_yfcc_gridagg_raw.ipynb and 03_yfcc_gridagg_hll.ipynb.