Actvity Comparison ATKIS-Intersection LBSM-DE

In this notebook, we'll explore ways to visualize activity rankings across different types of land use. Land use data is derived from ATKIS Basis-DLM (Selected categories) and intersected with geolocated Social Media Posts (Flickr, Instagram Twitter). From Originally 35 Million Social Media Posts, about 8 Million are found in the subset of chosen categories. This data is the base for the analysis in this notebook. The process for intersecting ATKIS and LBSM-Data is shown here.

Imports and logging

First, we start with our imports and get logging established:

In [19]:
# imports needed and set up logging
# import gensim 
import os
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
import holoviews as hv
import re
from collections import defaultdict
from collections import namedtuple
import csv
from pathlib import Path
import numpy as np
import pandas as pd
hv.extension('bokeh')

Dataset preview

The Dataset(s) we will be loading have already been intersected with land-use data - therefore, we can dive straight into analysis, without prior classification. Have a look first:

In [20]:
Post = namedtuple('Post','origin_id post_guid user_guid post_body post_title hashtags emoji, post_time')
data_file = '03_Output_LBSM/Germany_LBSM_weinbau.csv'

def get_post(post_line):
    """Concatenate topic info from post columns"""
    origin_guid = post_line.get('origin_id')
    post_guid = post_line.get('post_guid')
    user_guid = post_line.get('user_guid')
    post_title = post_line.get('post_title')
    post_body = post_line.get('post_body')
    hashtags = post_line.get('tags').split(';')
    emoji = post_line.get('emoji').split(';')
    post_time_hr = post_line.get('post_time')[:10]
    return Post(origin_guid, post_guid, user_guid, post_body, post_title, hashtags, emoji, post_time_hr)

with open(data_file, 'r', encoding="utf-8") as file_handle:
    post_reader = csv.DictReader(
                file_handle,
                delimiter=',',
                quotechar='"',
                quoting=csv.QUOTE_MINIMAL)
    for ix, post in enumerate(post_reader):
        print(f'{post}\n')
        lbsn_post = get_post(post)
        print(f'{lbsn_post}')
        break
OrderedDict([('origin_id', '2'), ('post_guid', 'd1defaaee172f757e6791926db8c910a'), ('user_guid', '1ee8a2d889cda05b776f4bbd976b266a'), ('origin_dist', '321'), ('atkis_cat', 'weinbau'), ('gemeinde_typ', 'Größere Kleinstadt'), ('post_time', '2010-04-11 17:43:42'), ('post_thumbnail_url', ''), ('post_views_count', '20'), ('post_like_count', ''), ('post_url', ''), ('tags', 'carygreisch;deu;fellerich;geo:lat=4968855300;geo:lon=650608200;geotagged;germany;rheinlandpfalz'), ('emoji', ''), ('post_title', 'Fellerich'), ('post_body', 'Fellerich, Rheinland-Pfalz, Deutschland'), ('post_geoaccuracy', 'latlng'), ('post_comment_count', ''), ('post_type', 'image'), ('post_filter', ''), ('place_guid', 'b88b0ff28ebfb4b1c15bb3e16d54c9f0'), ('place_name', '')])

Post(origin_id='2', post_guid='d1defaaee172f757e6791926db8c910a', user_guid='1ee8a2d889cda05b776f4bbd976b266a', post_body='Fellerich, Rheinland-Pfalz, Deutschland', post_title='Fellerich', hashtags=['carygreisch', 'deu', 'fellerich', 'geo:lat=4968855300', 'geo:lon=650608200', 'geotagged', 'germany', 'rheinlandpfalz'], emoji=[''], post_time='2010-04-11')

Read files into a list

Now that we've had a sneak peak of our dataset, we can read it into a list so that we can pass this on to the Ranking model. We'll stream files and only process post by post to reduce memory burden.

In [21]:
def scan_local_files():
    """Read Local Files according to config parameters"""
    pathname = Path.cwd()
    input_path = pathname / '03_Output_LBSM'
    filelist = list(input_path.glob(
        f'*.csv'))
    return filelist
def read_input_file(input_file):
    """Read Input file lines and convert to post"""
    logging.info(f"Reading file {os.path.basename(input_file)}..")
    with open(input_file, 'r', encoding="utf-8") as file_handle:
        post_reader = csv.DictReader(
                file_handle,
                delimiter=',',
                quotechar='"',
                quoting=csv.QUOTE_MINIMAL)
        for ix, post_line in enumerate(post_reader):
            lbsn_post = get_post(post_line)
            if (ix % 100000 == 0):
                logging.info (f"read {ix} posts")
            # do some pre-processing and return a list of words for each review text
            yield lbsn_post  

Topic Selection and Ranks

First, we'll define our topics. A topic is defined as a list of terms. Note that an "activity" can be defined specific or broad:

  • hiking would be a specific activity, which can be described by a longer walk, slow pace, perhaps done in groups and on planned occation (e.g. a day trip). Someone would not describe a 5 minute walk as hiking.
  • sports on the other hand is group of activities; sports usually has a defined purpose regarding fitness or health; there are many sports such as jogging, hiking, playing football etc. which all have their individual benefits for certain health aspects
  • friends can be described as another activity group with the main purose of socializing: one meets with freinds to talk, interact and communicate. Some specific activities are more suitable for socializing than others, e.g. game or group-activities that are usually done together

As a conclusion, we can say that we want to define our topics as diverse as possible. Some activities or groups of activities might overlap, while other might describe opposide ends of a continuum of possible activity-groups. The goal here is not to be holistic, but to get a cross-section of a selected list of relevant green space activities.

Furthermore:

  • related terms can be queried using relatedwords.org
  • for an overview of important activity categories, see wikipedia
  • future goal: replace by topic vector
In [22]:
topics = dict()  
topics['hiking'] = ('hike', 'hiking', 'wandern', 'wanderung', 'wanderer', 'wanderweg', 'wanderroute', '🥾') # optional: 🚶 (person walking)
# biking, this is a very specific atcivity
topics['biking'] = ('bike', 'biking', 'bicycle', 'cycling', 'fahrrad', 'velo', '🚲', '🚴')
# just plain walking
topics['walking'] = ('walk', 'walking', 'spazieren', 'stroll', 'fußweg', 'spazierweg', 'spaziergang') # optional: 🚶 (person walking)
# broad category with a bias towards jogging
topics['sport']  = ('sport', 'jogging', 'running', 'exercise', 'run', 'workout', 'rennen', 'dauerlauf', '🏃')
topics['relaxing'] = ('relaxing', 'sitting', 'relaxation', 'entspannen', 'innehalten', 'erholen', 'ausruhen', 'recreation')
# meeting with friends, this can encompass a group of activities
# note that we use 'meeting'; in green-space land use, this likely hints to meeting with friends, not within work environment
topics['friends'] = ('friends', 'friends', 'meeting', 'socialize', 'freunde', 'treffen', 'hang around', 'abhängen')
# anything related to family and kinder/kids
topics['family'] = ('family', 'familie', 'kinder', 'baby', 'familienausflug', 'familytrip', '👪')
# tourist/sighseeing group
topics['tourist'] = ('tourist', 'sighseeing', 'sehenswürdigkeit', 'excursion', 'exkursion', 'sight-seeing', 'tour', 'travel', 'reise', '🌇')
# very general: spielen/playing
topics['playing'] = ('spielen', 'playing', 'play', 'spiel', 'game', '🎲', '🎮')
# lets add some specific activities: picknick-grillen, soccer ..
topics['picnic']  = ('picnic', 'barbecure', 'picknick', 'picknickkorb', 'grillen', 'grill')
topics['soccer'] = ('soccer', 'fussball', 'fußball', 'football', '⚽')

Counting Posts/ Userdays based on matching topics

For selecting posts and counting userdays based on topic-terms, we define the following rules:

  • search terms in title, post_body, tags and emoji because some people might not use tags at all, others might only provide titles and yet other mainly communicate using emoji
  • when searching terms in post_body or title, only match full words - e.g. "walkman" is ignored when searching for "walk"; this allows for more specific semantic disambiguation; we use the regex module for full word matching
  • emoji sometimes accurately encapsulate specific activity-meaning, therefore we can make an exception to full-word-search when using emoji - e.g. it is quite clear that 🚴 is usually used in the context of biking
  • we ignore case of characters when searching (upper or lower)
  • since tags might repeat often in posts of a single user, we user UserDays as the most appropiate metric to measure activities: this will count repititive behaviour only once, e.g. someone might upload 500 pictures with the same tags of their single sunday picknick, this will only count as one; however, if parks are visited by the same person at multiple days, each day will be counted once: the allows us to include typical patterns of behaviuour for green-land-use such as parks beeing visited often during a month for recreation (but shopping is perhaps done less frequently).
In [24]:
%%time
from IPython.display import clear_output

def word_in_text(word, text_value):
    """Checks whether full word is in string"""
    if re.search(r'\b' + word + r'\b', text_value, re.IGNORECASE):
        return True

def check_topic(topic, lbsn_post):
    """Checks whether topic is in post"""
    for term in topic:
        if \
        term in lbsn_post.hashtags or \
        term in lbsn_post.emoji or \
        word_in_text(term, lbsn_post.post_title) or \
        word_in_text(term, lbsn_post.post_body):
            return True
def get_userday(post):
    return f'{post.user_guid}{post.post_time}'

# init count structures
total_counts = dict()
total_userdaycounts = defaultdict(set)
cnt_dict = dict()
userday_cnt_dict = dict()

# init dicts for each topic
for activity_name in topics.keys():
    # use default dict to init int:
    cnt_dict[activity_name] = defaultdict(int)
    # use set for counting userdays:
    userday_cnt_dict[activity_name] = defaultdict(set)
    
# perform topic matching
for file_name  in scan_local_files():
    # get land use type from filename
    f_name = os.path.basename(file_name)
    if f_name == 'all_intersected_guids.csv':
        # skip
        continue
    # strip leading 'Germany_LBSM_'
    type_text = f_name[13:].rstrip('.csv')
    total_counts[type_text] = 0
    # loop posts
    for lbsn_post in read_input_file(file_name):
        # count post
        total_counts[type_text] += 1
        # count userday
        userday = get_userday(lbsn_post)
        total_userdaycounts[type_text].add(userday)
        for activity_name, topic_terms in topics.items():
            if check_topic(topic_terms, lbsn_post):
                # count post
                cnt_dict[activity_name][type_text] += 1
                # count userday
                userday_cnt_dict[activity_name][type_text].add(userday)
    # count distinct userdays
    for activity_name in topics.keys():
        userday_cnt_dict[activity_name][type_text] = len(userday_cnt_dict[activity_name].get(type_text))


clear_output(wait=True)
selected_cnt = sum([sum(x.values()) for x in cnt_dict.values()])
total_cnt = sum(total_counts.values())
perc_cnt = selected_cnt/(total_cnt/100)
print(
    f'Done. Found topic matches in {selected_cnt} posts '
    f'of {total_cnt} total posts ({perc_cnt:.2f}%)')

for land_use in total_userdaycounts.keys():
        total_userdaycounts[land_use] = len(total_userdaycounts[land_use])
selected_userdays = sum([sum(x.values()) for x in userday_cnt_dict.values()])
total_userdays  = sum(total_userdaycounts.values())
perc_userdays  = selected_userdays /(total_userdays/100)
print(
    f'Done. Found topic matches in {selected_userdays} userdays '
    f'of {total_userdays} total userdays ({perc_userdays:.2f}%)')
Done. Found topic matches in 1124397 posts of 7948315 total posts (14.15%)
Done. Found topic matches in 777017 userdays of 4559082 total userdays (17.04%)
Wall time: 1h 17min 40s

Convert dict to pandas dataframe for easier handling. We can choose to analyse absolute post counts here (prone to error but fast) or userdays (less prone but slow to calculate)

In [25]:
#df = pd.DataFrame.from_dict(cnt_dict)
df = pd.DataFrame.from_dict(userday_cnt_dict)
# get preview
df.style.background_gradient(cmap='viridis')
Out[25]:
hiking biking walking sport relaxing friends family tourist playing picnic soccer
ackerland 4576 4921 6277 8210 1540 9960 10573 11451 2441 1076 1698
friedhof 272 261 834 426 141 825 1004 1072 189 47 77
gartenland 13 34 45 87 20 56 59 137 18 8 14
gehoelz 4427 1144 1599 1732 295 2097 1729 4104 439 181 252
golfplatz 75 43 111 544 68 306 207 225 335 20 31
gruenland 11482 5184 7308 7777 1647 9049 9707 14524 2019 1067 1190
heide 393 108 302 149 48 101 146 265 37 13 10
kleingarten 153 426 640 896 170 1052 1034 782 278 224 209
laubholz 6747 2446 4192 3952 644 3158 3691 6042 769 406 430
mischholz 9737 2389 3664 3361 632 2857 3011 8208 649 277 345
moor 205 62 193 90 36 95 124 174 21 5 7
nadelholz 15100 2371 3743 3615 757 3221 3439 11012 616 317 265
obstbau 54 36 45 42 15 73 85 70 28 15 17
parkgruenanlage 3906 12931 30943 32676 6031 32426 22242 51202 6672 7091 2508
sonstlandwirt 43 50 42 38 4 45 30 41 15 5 6
sonstsiedlungsfreifl 110 65 131 88 13 119 117 216 25 13 17
sportfreizeiterholung 5853 8619 9129 44879 5517 41079 35400 33764 26670 2320 63960
streuobst 222 127 246 190 45 229 213 236 76 53 31
sumpf 125 104 131 90 33 153 266 260 32 11 16
weinbau 684 171 471 183 64 273 277 497 66 42 38
wochenendferienhau 328 185 345 404 245 498 771 1044 127 59 34
In [26]:
# post counts:
#df_total = pd.DataFrame.from_dict(
#    total_counts.items())
# user days:
df_total = pd.DataFrame.from_dict(
    total_userdaycounts.items())

Optional: store intermediate results (pandas dataframe pickle)

In [27]:
# write:
#df.to_pickle("activity_intermediate_userdays.pkl")
#df_total.to_pickle("activity_total_userdays.pkl")
# load:
df = pd.read_pickle("activity_intermediate_userdays.pkl")
df_total = pd.read_pickle("activity_total_userdays.pkl")

Compare to total post counts:

In [28]:
print('Post count per topic')
df_postcount = pd.DataFrame.from_dict(cnt_dict)
df_postcount.style.background_gradient(cmap='viridis')
Post count per topic
Out[28]:
hiking biking walking sport relaxing friends family tourist playing picnic soccer
ackerland 7172 5967 8082 10801 1627 11612 12627 14027 2761 1231 1986
friedhof 362 457 2434 632 184 1254 2070 1560 203 56 162
gartenland 20 55 59 370 25 58 65 278 24 8 14
gehoelz 6631 1789 2004 5358 312 2290 1982 5951 910 208 321
golfplatz 119 48 127 1112 68 319 265 863 530 21 32
gruenland 19661 8169 9325 15254 1945 11597 11958 19743 2451 1327 2914
heide 648 128 419 175 53 127 181 324 40 14 25
kleingarten 200 542 834 980 171 1475 1205 892 349 258 982
laubholz 12612 3614 5694 5478 936 4027 4611 9055 1078 445 765
mischholz 16294 3171 5305 5068 715 3396 3781 11173 1073 343 440
moor 382 72 260 93 82 96 134 289 22 6 7
nadelholz 23047 3298 5401 5476 888 3539 4403 14222 1185 371 322
obstbau 83 49 60 50 15 79 97 97 34 16 19
parkgruenanlage 5250 16721 39408 39553 6819 39425 27206 69110 7788 9329 3761
sonstlandwirt 61 54 51 39 4 47 41 46 16 6 7
sonstsiedlungsfreifl 150 80 170 110 14 127 135 310 31 14 24
sportfreizeiterholung 6978 13663 15229 92245 6507 53643 43391 47869 42556 2713 120049
streuobst 428 175 395 438 47 274 323 360 127 87 37
sumpf 154 118 167 2426 35 159 286 343 32 11 16
weinbau 1424 778 792 436 65 598 315 792 68 100 44
wochenendferienhau 366 228 369 433 278 546 855 1327 137 87 34

These are absolute values with little meaning because some land use types appear more often, similarly, some activities have typica a higher frequency of matches. To normalize these values, we'll therefore calculate absolute percentages first for each landuse category. Afterwards, we can nornmalize (i.e. stretch) results to 0-100 range.

replace index names for display

In [41]:
name_ref = {
'gruenland':'Gruenland',
'ackerland':'Ackerland',
'laubholz':'Laubholz',
'nadelholz':'Nadelholz',
'gehoelz':'Gehoelz',
'mischholz':'Mischholz',
'sportfreizeiterholung':'sonst. Sport-, Freizeit-, Erholungsfl.',
'streuobst':'Streuobst',
'parkgruenanlage':'Park, Gruenanlage',
'friedhof':'Friedhof',
'kleingarten':'Kleingarten',
'moor':'Moor',
'weinbau':'Weinbau',
'obstbau':'Obstbau',
'sonstlandwirt':'sonst. Landwirtschaftsfl.',
'sumpf':'Sumpf',
'wochenendferienhau':'Wochenend-, Ferienhaussiedl.',
'gartenland':'Gartenland',
'heide':'Heide',
'sonstsiedlungsfreifl':'sonstige Siedlungsfreifl.',
'golfplatz':'Golfplatz',
}
df.rename(index=name_ref, inplace=True)
df.index
#for dict_key, value_count in total_counts.items():
#    total_counts[name_ref.get(dict_key)] = value_count
#    total_counts.pop(dict_key)
Out[41]:
Index(['Ackerland', 'Friedhof', 'Gartenland', 'Gehoelz', 'Golfplatz',
       'Gruenland', 'Heide', 'Kleingarten', 'Laubholz', 'Mischholz', 'Moor',
       'Nadelholz', 'Obstbau', 'Park, Gruenanlage',
       'sonst. Landwirtschaftsfl.', 'sonstige Siedlungsfreifl.',
       'sonst. Sport-, Freizeit-, Erholungsfl.', 'Streuobst', 'Sumpf',
       'Weinbau', 'Wochenend-, Ferienhaussiedl.'],
      dtype='object')
In [30]:
df_total.columns = ['Land use', 'Post Count']
df_total = df_total.set_index(['Land use'])
df_total.rename(index=name_ref, inplace=True)
df_total['Percentage'] = df_total['Post Count']/(total_cnt/100)
df_total.style.background_gradient(cmap='summer')
Out[30]:
Post Count Percentage
Land use
Ackerland 438550 5.51752
Friedhof 47857 0.602102
Gartenland 4507 0.0567038
Gehoelz 112909 1.42054
Golfplatz 12116 0.152435
Gruenland 440041 5.53628
Heide 9163 0.115282
Kleingarten 49571 0.623667
Laubholz 182892 2.30102
Mischholz 172552 2.17093
Moor 6650 0.0836655
Nadelholz 185159 2.32954
Obstbau 4414 0.0555338
Park, Gruenanlage 1392178 17.5154
sonst. Landwirtschaftsfl. 2264 0.028484
sonstige Siedlungsfreifl. 7201 0.0905978
sonst. Sport-, Freizeit-, Erholungsfl. 1428244 17.9691
Streuobst 11849 0.149076
Sumpf 8221 0.103431
Weinbau 17709 0.222802
Wochenend-, Ferienhaussiedl. 25035 0.314972
In [44]:
# transpose
df_perc = df.T
# normalize using total counts for each land use cat
for type_text, total_count in total_userdaycounts.items():
    type_text = name_ref.get(type_text)
    df_perc[type_text] = df_perc[type_text]/(total_count/100)
# transpose again
df_perc = df_perc.T
In [45]:
# show percentages
df_perc.style.background_gradient(cmap='summer')
#df.index
#df.columns
#df.shape
Out[45]:
hiking biking walking sport relaxing friends family tourist playing picnic soccer
Ackerland 1.04344 1.12211 1.43131 1.87208 0.351157 2.27112 2.4109 2.6111 0.556607 0.245354 0.387185
Friedhof 0.56836 0.545375 1.74269 0.890152 0.294628 1.72389 2.09792 2.24001 0.394927 0.0982092 0.160896
Gartenland 0.28844 0.754382 0.998447 1.93033 0.443754 1.24251 1.30907 3.03972 0.399379 0.177502 0.310628
Gehoelz 3.92086 1.01321 1.41618 1.53398 0.261272 1.85725 1.53132 3.63479 0.388809 0.160306 0.223189
Golfplatz 0.619016 0.354903 0.916144 4.48993 0.561241 2.52559 1.70848 1.85705 2.76494 0.165071 0.25586
Gruenland 2.6093 1.17807 1.66075 1.76734 0.374283 2.0564 2.20593 3.3006 0.458821 0.242477 0.270429
Heide 4.28899 1.17865 3.29586 1.6261 0.523846 1.10226 1.59336 2.89207 0.403798 0.141875 0.109135
Kleingarten 0.308648 0.859373 1.29108 1.80751 0.342942 2.12221 2.0859 1.57754 0.560812 0.451877 0.421617
Laubholz 3.68906 1.3374 2.29206 2.16084 0.35212 1.7267 2.01813 3.30359 0.420467 0.221989 0.235111
Mischholz 5.64294 1.38451 2.12342 1.94782 0.366266 1.65573 1.74498 4.75683 0.376119 0.160531 0.19994
Moor 3.08271 0.932331 2.90226 1.35338 0.541353 1.42857 1.86466 2.61654 0.315789 0.075188 0.105263
Nadelholz 8.15515 1.28052 2.02151 1.95238 0.408838 1.73959 1.85732 5.94732 0.332687 0.171204 0.14312
Obstbau 1.22338 0.815587 1.01948 0.951518 0.339828 1.65383 1.92569 1.58586 0.634345 0.339828 0.385138
Park, Gruenanlage 0.280568 0.928832 2.22263 2.34711 0.433206 2.32916 1.59764 3.67783 0.479249 0.509346 0.180149
sonst. Landwirtschaftsfl. 1.89929 2.20848 1.85512 1.67845 0.176678 1.98763 1.32509 1.81095 0.662544 0.220848 0.265018
sonstige Siedlungsfreifl. 1.52757 0.902652 1.81919 1.22205 0.18053 1.65255 1.62477 2.99958 0.347174 0.18053 0.236078
sonst. Sport-, Freizeit-, Erholungsfl. 0.409804 0.603468 0.639176 3.14225 0.386279 2.87619 2.47857 2.36402 1.86733 0.162437 4.47823
Streuobst 1.87358 1.07182 2.07612 1.60351 0.379779 1.93265 1.79762 1.99173 0.641404 0.447295 0.261625
Sumpf 1.5205 1.26505 1.59348 1.09476 0.401411 1.86109 3.23562 3.16263 0.389247 0.133804 0.194624
Weinbau 3.86244 0.965611 2.65966 1.03337 0.361398 1.54159 1.56418 2.80648 0.372692 0.237168 0.21458
Wochenend-, Ferienhaussiedl. 1.31017 0.738965 1.37807 1.61374 0.97863 1.98922 3.07969 4.17016 0.50729 0.23567 0.13581

We'll use Holoviews Heatmap to display data

# adjust outlier for post count df_perc['sport'].Sumpf = 1
In [63]:
from holoviews import opts
hv.HeatMap({'x': df_perc.columns, 'y': df_perc.index, 'z': df_perc}, ['x', 'y'], 'z'
          ).opts(opts.HeatMap(tools=['hover'], colorbar=True, width=700, height=400,cmap='greens'))
Out[63]:

To improve legebility and colorization, we stretch values for each actitity to 1-100 range. Furthermore, we use log-values to reduce peaks and highlight information in the long tail.

from sklearn import preprocessing x = df.values #returns a numpy array min_max_scaler = preprocessing.MinMaxScaler() x_scaled = min_max_scaler.fit_transform(x) df = pd.DataFrame(x_scaled) ## normalize using total counts for each land use cat #for type_text, total_count in total_counts.items(): # df[type_text] = df[type_text]/(total_count/100) #np.interp(z, (z.min(), z.max()), (1, 1000))
In [47]:
def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result
# log scale (reduce peaks) and normalize (0-1 range)
df_norm = normalize(np.log(df_perc))

Calculate Alpha Values (transparency of cells) from total available posts (= accuracy)

In [48]:
# log-scale and normalize between 0.5 and 1 (= final transparency)
df_total['Log-Norm. Percentage'] = np.interp(
    np.log(df_total['Percentage']), np.log((df_total['Percentage'].min(), df_total['Percentage'].max())), (0.5, 1))
df_alpha = df_total['Log-Norm. Percentage']
np_alpha = df_alpha.values
np_alpha = np.tile(np_alpha, (len(topics), 1)).transpose()
df_alpha = pd.DataFrame(np_alpha)
df_alpha.index = df.index
In [62]:
from holoviews import dim, opts 
from bokeh.models import HoverTool

def hook(plot, element):
    # remove axis for plot
    plot.handles['xaxis'].visible = False
    plot.handles['yaxis'].visible = False
    plot.outline_line_color = None
    plot.border_fill_color = None
    plot.background_fill_color = None
    plot.outline_line_width = 0
    plot.outline_line_alpha = 0
    #plot.axis.visible = False

# explicitly declare hover tool so we can add "%" sign
TOOLTIPS = [
    ('Activity (LBSM)', '@x'),
    ('Land Use (ATKIS)', '@y'),
    ('Relative importance (Log & Norm 0-1)', '@z{1.1111}'),
    ('Percentage of userdays (abs)', '@z2{1.11}%'),
    ('Total userdays (abs)', '@z3'),
]
hover = HoverTool(tooltips=TOOLTIPS)

hv.HeatMap({'x': df.columns, 'y': df.index, 'z': df_norm, 'z2': df_perc, 'z3': df, 'z4': df_alpha}, 
           kdims=[('x', 'Activity (LBSM)'), ('y', 'Land Use (ATKIS)')], 
           vdims=['z', 'z2', 'z3', 'z4'], 
    ).opts(
           opts.HeatMap(
           title_format="Heatmap for selected ATKIS categories and LBSM activities",
           tools=[hover], 
           colorbar=True, 
           width=720, 
           height=520,
           cmap='greens'
           #alpha='z4' # dim cells based on total available posts (=accuracy)
           )
        )
# use http://tools.zenverse.net/word-wrap/ for word wrap
Out[62]:

Measuring similarity betwen different activities / betwen land uses

Dot Product (Skalarprodukt) can be used to compute vectors between all values of two columns (i.e. activities) or two rows (i.e. land uses). This can be used to compare patterns based on cosine similarity. A cosine similarity of 1 means identical, where 0 means completely different.

In [50]:
from scipy.spatial.distance import cosine
from pandas import DataFrame

print(f'hiking/biking: {1 - cosine(df["hiking"], df["biking"])}')
print(f'walking/soccer: {1 - cosine(df["walking"], df["soccer"])}')
print(f'sport/soccer: {1 - cosine(df["sport"], df["soccer"])}')
      
print(f'Park, Gruenanlage/Friedhof: {1 - cosine(df.loc["Park, Gruenanlage"], df.loc["Friedhof"])}')
print(f'Golfplatz/Nadelholz: {1 - cosine(df.loc["Golfplatz"], df.loc["Nadelholz"])}')
hiking/biking: 0.6152442856715981
walking/soccer: 0.3112305944068454
sport/soccer: 0.8156649130861259
Park, Gruenanlage/Friedhof: 0.9435236031345471
Golfplatz/Nadelholz: 0.5082803251293814

Clustered Heatmap

We can use these similarity scores to re-order the heatmap based on similarity measures. Seaborn, for example, offers clustermap, which allows specifiying different cluster and metrics methods. There are many other ways to created clustered heatmaps (see links below).

In [59]:
import seaborn as sns
heatmap_sns = sns.clustermap(df_norm, metric="correlation", standard_scale=1, method="ward", cmap="Greens")
In [61]:
heatmap_sns.savefig("clusterheatmap_userdays_greens.png")
heatmap_sns.savefig("clusterheatmap_userdays_greens.svg",format="svg") 

Reorder columns/rows

We can access the reordered rows/columns from the seaborn plot using this suggestion. Afterwards, the original dataframes can be updated using the new ordering.

In [428]:
print(f'rows: {heatmap_sns.dendrogram_row.reordered_ind}')
print(f'columns: {heatmap_sns.dendrogram_col.reordered_ind}')
rows: [3, 9, 11, 5, 8, 6, 10, 15, 19, 2, 4, 16, 1, 18, 20, 13, 0, 7, 14, 12, 17]
columns: [1, 0, 2, 7, 4, 6, 9, 10, 5, 3, 8]
In [53]:
#columnsTitles = ['onething', 'secondthing', 'otherthing']

# get col and row names by ID
colname_list = [df.columns[col_id] for col_id in heatmap_sns.dendrogram_col.reordered_ind]
rowname_list = [df.index[row_id] for row_id in heatmap_sns.dendrogram_row.reordered_ind]
# change row/col order 
df_ro = df.reindex(rowname_list)
df_ro = df_ro[colname_list]
df_norm_ro = df_norm.reindex(rowname_list)
df_norm_ro = df_norm_ro[colname_list]
df_perc_ro = df_perc.reindex(rowname_list)
df_perc_ro = df_perc_ro[colname_list]
In [430]:
print(rowname_list)
print(colname_list)
['Gehoelz', 'Mischholz', 'Nadelholz', 'Gruenland', 'Laubholz', 'Heide', 'Moor', 'sonstige Siedlungsfreifl.', 'Weinbau', 'Gartenland', 'Golfplatz', 'sonst. Sport-, Freizeit-, Erholungsfl.', 'Friedhof', 'Sumpf', 'Wochenend-, Ferienhaussiedl.', 'Park, Gruenanlage', 'Ackerland', 'Kleingarten', 'sonst. Landwirtschaftsfl.', 'Obstbau', 'Streuobst']
['biking', 'hiking', 'walking', 'tourist', 'relaxing', 'family', 'picnic', 'soccer', 'friends', 'sport', 'playing']

Compose final output and store to html

In [64]:
%%output filename="meingruen_activities_userdays" # uncomment for output to file 
heatm = hv.HeatMap({'x': df_ro.columns, 'y': df_ro.index, 'z': df_norm_ro, 'z2': df_perc_ro, 'z3': df_ro, 'z4': df_alpha}, 
           kdims=[('x', 'Activity (LBSM)'), ('y', 'Land Use (ATKIS)')], 
           vdims=['z', 'z2', 'z3', 'z4'], 
    ).opts(
           opts.HeatMap(
           title_format="Heatmap for selected ATKIS categories and LBSM activities",
           tools=[hover], 
           colorbar=True, 
           width=720, 
           height=520,
           cmap='greens'
           #alpha='z4' # dim cells based on total available posts (=accuracy)
           )
        ) 
heatm + \
hv.Text(x=0.01, y=0.5, 
        text='Geotagged Social Media posts (Twitter,\n'
             'Instagram, Flickr) have first been\n'
             'intersected with ATKIS geometries for\n'
             'Germany. This heatmap shows the correlation\n'
             'between selected activities expressed in\n'
             'intersected Social Media posts and the bias\n'
             'for certain land use types (ATKIS).\n'
             'Dark-green colors mean high correlation (1),\n'
             'whereas lighter colors mean low correlation\n'
             '(0) between land use and activity. Columns\n'
             '(activities) and rows (land use) have been\n'
             'ordered using 2-D cosine-similarity\n'
             'clustering, with the goal to group cells of\n'
             'similar patterns. The base measure here is\n'
             'Userdays (see Wood, Guerry, Silver, & Lacayo,\n'
             '2013. Each user is counted once per day and\n'
             'activity.'
       ).opts(
    height=450, show_frame=False, hooks=[hook], text_align='left', text_font_size='13px')
Out[64]:

Export to svg

In [55]:
from bokeh.io import export_svgs
p =  hv.render(heatm, backend='bokeh')
p.output_backend = "svg"
export_svgs(p, filename="heatmap_userdays_greens.svg")
Out[55]:
['heatmap_userdays.svg']