Part 4: Topic Classification of Social Media

Workshop: Social Media, Data Science, & Cartography
Alexander Dunkel, Madalina Gugulica

This is the fourth notebook in a series of four notebooks:

  1. Introduction to Social Media data, jupyter and python spatial visualizations
  2. Introduction to privacy issues with Social Media data and possible solutions for cartographers
  3. Specific visualization techniques example: TagMaps clustering
  4. Specific data analysis: Topic Classification

Open these notebooks through the file explorer on the left side.

For this notebook, we use another environment that must be linked first.

In [ ]:
!cd .. && sh activate_topic_env.sh

Introduction: Social Media & Topic-Based Text Classification

The content social media users share on different platforms is extremely diverse, encompassing a very wide range of topics, including valuable information related to the way people perceive, relate to, and use different environments. In order to harness these large volumes of data, specific tools and techniques to organize, search, and understand these vast quantities of information are needed.

Text classification is a Natural Language Processing task that aims at mapping documents (in our case social media posts) into a set of predefined categories. Supervised machine learning classifiers have shown great success in performing these tasks. Nevertheless, they require large volumes of labeled data for training, which are generally not available for social media data and can be very time-consuming and expensive to obtain.

This notebook introduces a practical and unsupervised approach (which requires no labeled data) to thematically classify social media posts into specific categories (topics) simply described by a label. The underlying assumption of this approach is that Word Embeddings can be used to classify documents when no labeled training data is available.

The method is based on comparing the textual semantic similarity between the most relevant words in each social media post and a list of keywords for each targeted category reflecting its semantic field (in linguistics, a semantic field is a set of words grouped by meaning, referring to a specific subject). The strength of this approach lies in its simplicity; however, its success depends on a good definition of each topic, reflected in the list of keywords.

Methodology

How do we make machines understand text data? Machines excel at working with numerical data, but their performance decreases when they are fed raw text.

The idea is to create numerical representations of words that capture their meanings, semantic relationships, and the different contexts they are used in. For the conversion of raw text into numbers, there are a few options. The simplest method is to create a word frequency matrix that simply counts the occurrence of each word (bag-of-words). An enhanced version of this method estimates the log-scaled frequency of each word while considering its occurrence in all documents (tf-idf). Nevertheless, these methods capture only word frequencies and no contextual information or higher-level semantics of the text.
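As a minimal toy illustration of these two weighting schemes (a made-up mini-corpus, not the workshop data), raw counts and tf-idf weights can be computed by hand:

import math
from collections import Counter

# toy corpus of three "posts", already tokenized
corpus = [
    ["concert", "music", "dresden"],
    ["music", "festival", "music"],
    ["coffee", "dresden"],
]

# bag-of-words: raw term counts per document
bow = [Counter(doc) for doc in corpus]
print(bow[1]["music"])  # 2

# tf-idf: term frequency scaled by how rare the word is across all documents
n_docs = len(corpus)
doc_freq = Counter(w for doc in corpus for w in set(doc))
def tfidf(word, doc):
    return doc.count(word) * math.log(n_docs / doc_freq[word])

print(tfidf("music", corpus[1]))    # 2 * log(3/2) ≈ 0.81
print(tfidf("dresden", corpus[2]))  # 1 * log(3/2) ≈ 0.41

Neither representation knows that "concert" and "festival" are related in meaning; that is what the word embeddings introduced below add.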

A recent advance in the field of Natural Language Processing proposed the use of word embeddings for the numerical representation of text.

The method adopted here computes the semantic similarity between different words or groups of words and determines which words are semantically related to each other and belong to the same semantic field. Furthermore, computing the distance in the vector space (cosine distance) between the centroid of the word vectors that belong to a certain topic (semantic field) and the centroid of the word vectors that compose a social media post allows us to verify whether the textual metadata associated with the post is related to a specific category (binary classification).
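The following small sketch illustrates the centroid idea with made-up 3-dimensional vectors (the real word vectors from the model used below have 300 dimensions): the topic and post centroids are plain averages, and the cosine distance between them is the classification criterion.

import numpy as np
from scipy.spatial.distance import cosine

# made-up word vectors; in practice they come from the trained Word2Vec model
topic_vectors = np.array([[0.90, 0.10, 0.00],   # e.g. "concert"
                          [0.80, 0.20, 0.10]])  # e.g. "festival"
post_vectors = np.array([[0.85, 0.15, 0.05],    # e.g. "gig"
                         [0.10, 0.90, 0.30]])   # e.g. "coffee"

topic_centroid = topic_vectors.mean(axis=0)
post_centroid = post_vectors.mean(axis=0)

# cosine distance = 1 - cosine similarity; a small distance means related content
print(cosine(topic_centroid, post_centroid))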

In this notebook we will use the Word2Vec word embedding technique, as implemented in Python's Gensim library, to create word vectors.

1. Preparations

Load Dependencies

In [1]:
import pandas as pd
import pickle
import scipy
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from IPython.display import clear_output, Markdown, display

We're creating several output graphics and temporary files.

These will be stored in the subfolder notebooks/out/.

In [2]:
from pathlib import Path
OUTPUT = Path.cwd() / "out"
OUTPUT.mkdir(exist_ok=True)

1.1. Load the pre-trained Word2Vec model and the idf-scores dictionary


The model was trained on a corpus that was previously prepared by filtering, cleaning, and normalizing over 1.5 M Instagram, Flickr, and Twitter posts geolocated within Dresden and Heidelberg.

Parameters were chosen according to the semantic similarity performance results reported in *Efficient Estimation of Word Representations in Vector Space* - T. Mikolov et al. (2013); a short training sketch using these parameters follows the list below.

  • size (vector size): 300
  • alpha (initial learning rate): 0.025
  • window: 5
  • min_count: 5
  • min_alpha: 0.0001 (the learning rate will linearly drop to min_alpha as training progresses)
  • sg: 1 (skip-gram architecture: predicting context words based on the current one)
  • negative (negative samples): 5 (if > 0, negative sampling will be used; the value specifies how many “noise words” should be drawn, usually between 5 and 20; if set to 0, no negative sampling is used)
  • ns_exponent: 0.75 (the exponent used to shape the negative sampling distribution; a value of 1.0 samples exactly in proportion to the frequencies, 0.0 samples all words equally, while a negative value samples low-frequency words more than high-frequency words; the popular default of 0.75 was chosen by the original Word2Vec paper)
  • iter: 15 (number of iterations, i.e. epochs, over the corpus)
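For orientation, training such a model with gensim 3.x (the version whose parameter names are listed above) would look roughly like the sketch below; the corpus file name is a placeholder, and the actual training was performed beforehand, outside this notebook.

from gensim.models import Word2Vec

class PostCorpus:
    """Stream the preprocessed corpus: one post per line, tokens separated
    by spaces ("preprocessed_posts.txt" is a hypothetical file name)."""
    def __iter__(self):
        with open("preprocessed_posts.txt", encoding="utf-8") as f:
            for line in f:
                yield line.split()

model = Word2Vec(
    sentences=PostCorpus(),
    size=300,           # renamed to vector_size in gensim >= 4
    alpha=0.025,
    window=5,
    min_count=5,
    min_alpha=0.0001,
    sg=1,               # skip-gram
    negative=5,
    ns_exponent=0.75,
    iter=15)            # renamed to epochs in gensim >= 4
model.save("word2vec.model")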
In [3]:
%load_ext autoreload
%autoreload 2

Prepare paths..

In [4]:
import sys

INPUT = Path.cwd() / "input"

module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    sys.path.append(module_path)
from modules import tools
In [5]:
source = "topic_data.zip"

Download sample data. This may take some time.

In [6]:
%%time
sample_url = tools.get_sample_url()
zip_uri = f'{sample_url}/download?path=%2F&files='
tools.get_zip_extract(
    uri=zip_uri,
    filename=source,
    output_path=INPUT,
    write_intermediate=True)
Loaded 478.88 MB of 478.88 (100%)..
Extracting zip..
Retrieved topic_data.zip, extracted size: 549.67 MB
CPU times: user 22.6 s, sys: 12.4 s, total: 35 s
Wall time: 1min 51s

Load the pre-trained word2vec model using gensim's Word2Vec

In [7]:
from gensim import utils
from gensim.models import Word2Vec
model_w2v = Word2Vec.load(
    str(INPUT / "word2vec.model"))

For the creation of the post embeddings, we need idf-score weights, which were prepared beforehand and stored in the input folder as a serialized (pickled) dictionary.

In [8]:
#idf-scores dictionary deserialization
with open(INPUT / 'idf_scores_dict.pkl', 'rb') as handle:
    idf_scores_dict = pickle.load(handle)
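The dictionary maps each vocabulary word to its inverse document frequency, idf(w) = log(N / df(w)). It could have been built from the same preprocessed corpus with a few lines like the following sketch (the variable `corpus`, a list of token lists, is assumed here and is not part of this notebook):

import math
from collections import Counter

def build_idf_dict(corpus):
    """Map each word to log(N / document frequency) over a tokenized corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(word for post in corpus for word in set(post))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

# idf_scores_dict = build_idf_dict(corpus)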

1.2. Define the functions that will help compute the average topic and post vectors

Functions to compute the average topic and post vectors

In [9]:
def avg_topic_vector(lang_model, tokens_list):
    """Average the word vectors of all in-vocabulary topic keywords."""
    # remove out-of-vocabulary words
    tokens = [token for token in tokens_list if token in lang_model.wv.vocab]
    return np.average(lang_model[tokens], axis=0)


def avg_post_vector(lang_model, tokens_list, idf):
    """Average the word vectors of a post, weighted by each word's tf-idf score."""
    # remove out-of-vocabulary words and collect tf-idf weights
    tokens = []
    weights = []
    for token in tokens_list:
        if token in lang_model.wv.vocab:
            tokens.append(token)
            tf = tokens_list.count(token)
            weights.append(tf * idf[token])
    return np.average(lang_model[tokens], weights=weights, axis=0)


def has_vector_representation(lang_model, upl):
    """Check if at least one word of the document is in the
    word2vec dictionary"""
    n = len([w for w in upl if w in lang_model.wv.vocab])
    return n > 0

2. Load Preprocessed Data

The textual content of social media data has a low degree of formal semantic and syntactic accuracy. In order to provide only significant information for the text classification task, the text (post title, post body, and tags) was preprocessed with the following steps (a simplified cleaning sketch follows the list):

  • lowercasing
  • extract hashtags and individual words (tokenization)
  • remove mentions (@username)
  • remove punctuation
  • remove URLs (http:// as well as www.)
  • remove html tags (<>)
  • remove digits
  • identify and select only English and German posts
  • remove stopwords (commonly used words such as “the”, “a”, “an”, “in”, etc.)
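The sketch below shows what such a cleaning step could look like for a single post; it covers only part of the list above (language identification and stopword removal are omitted) and is not the exact pipeline used to prepare the data.

import re

def basic_clean(text):
    """Simplified cleaning: lowercase, remove mentions, URLs, HTML tags,
    punctuation and digits, then tokenize."""
    text = text.lower()
    text = re.sub(r"@\w+", " ", text)                     # mentions
    text = re.sub(r"(https?://\S+|www\.\S+)", " ", text)  # URLs
    text = re.sub(r"<[^>]+>", " ", text)                  # HTML tags
    text = re.sub(r"[^\w\s#]|\d", " ", text)              # punctuation and digits
    return text.split()

print(basic_clean("Party @club tonight!! 2015 #newyear www.example.com"))
# ['party', 'tonight', '#newyear']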
In [10]:
filename = "DD_Neustadt_NormalizedInstagramPosts.pickle"
df = pd.read_pickle(INPUT / filename)
print(len(df))
115370
In [11]:
df.head()
Out[11]:
latitude longitude post_date post_text post_thumbnail_url post_views_count post_like_count post_url post_geoaccuracy post_comment_count post_type place_guid place_name
0 51.088468 13.765694 1/1/2015 0:12 party newyear fucus goals threegetbig abfuckcl... NaN NaN 131.0 NaN place 11.0 image 32b8350c4ddb8da9ecc30c341035a469 Sektor Evolution
1 51.088468 13.765694 1/1/2015 0:12 party newyear fucus goals threegetbig abfuckcl... NaN NaN 131.0 NaN place 11.0 image 32b8350c4ddb8da9ecc30c341035a469 Sektor Evolution
2 51.056450 13.741490 1/1/2015 0:32 happynewyear party friends frohes neues jahr h... NaN NaN 35.0 NaN place NaN image 3defd67cdc22dba58e7a849a02a5f3bd Elbufer
3 51.056164 13.740268 1/1/2015 0:43 happynewyear silvester augustusbrücke love hau... NaN NaN 21.0 NaN place 0.0 image fef2b329a4f6b78c1fd341c94fa5ec73 Augustusbrücke
4 51.056450 13.741490 1/1/2015 2:02 happy new year everyone world gesundes neues j... NaN 0.0 30.0 NaN place 1.0 image 3defd67cdc22dba58e7a849a02a5f3bd Elbufer

3. Topic-Based Classification of Social Media Posts

Workflow

The classification of the social media posts is based on the calculation of a similarity score (cosine similarity) between the topic embedding and the post embeddings, and follows the workflow outlined below:

  1. for each label (topic), a list of relevant keywords is defined and enhanced by seeking further semantically similar words, identified through the most similar word vectors (i.e., those located close by in the vector space)

  2. a topic embedding is created by averaging all the vectors representing the keywords in the previously defined list

  3. the vector representation of each social media post is created by averaging the weighted word embeddings of all its words, where the weight of a word is given by its tf-idf score

  4. the classification follows from a similarity score computed between each post vector and the topic vector using the cosine distance; an empirically identified similarity threshold (70%) is considered decisive (see the note below this list)
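Note: the cosine distance used in the code below is one minus the cosine similarity, so requiring a similarity of at least 70% is the same as requiring a cosine distance below 0.3. A compact sketch of the resulting per-post decision rule, which the loop further below applies to the whole dataframe (`is_topic_related` is just an illustrative name):

from scipy.spatial import distance

def is_topic_related(post_tokens, topic_embedding, threshold=0.3):
    """Return True if the post embedding is within the cosine-distance
    threshold of the topic embedding (i.e., similarity of at least 70%)."""
    post_embedding = avg_post_vector(model_w2v, post_tokens, idf_scores_dict)
    return distance.cosine(topic_embedding, post_embedding) < threshold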
In [12]:
topic_list = ['event','music','festival','concert']
In [13]:
enhanced_list = []
for keyword in topic_list:
    similar_words = model_w2v.wv.most_similar(
        positive = [keyword], topn = 50)
    enhanced_list += ([w[0] for w in similar_words])

Some words might repeat; therefore, we save the list as a set of unique strings.

In [14]:
topic_list = topic_list + enhanced_list
topic_list = set(topic_list)
In [15]:
topic_embedding = avg_topic_vector(model_w2v,topic_list)

WordCloud representing the topic selected

To visualize the enhanced list of keywords that represents the chosen topic and is used for the calculation of the topic embedding, we use the WordCloud library.

Note: Some of the tags identified might refer to the city of Heidelberg since the word2vec model was trained on social media posts that were published within both Dresden and Heidelberg.

In [16]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud        
words = ' '.join(topic_list) 
wordcloud = WordCloud(background_color="white").generate(words)
# Display the generated image:
plt.figure(figsize = (10,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
In [17]:
%%time
df = df.reindex(df.columns.tolist() + ['classification','cos_dist'], axis=1)
x = 0
total_records = len(df)
for index, row in df.iterrows():
    x+=1
    msg_text = (
        f'Processed records: {x} ({x/(total_records/100):.2f}%). ')
    if x % 100 == 0:
        clear_output(wait=True)
        print(msg_text)
        
    text = row['post_text'].split(' ')

    if has_vector_representation(model_w2v, text):
        # create the post embedding and compare it with the topic embedding
        post_embedding = avg_post_vector(model_w2v, text, idf_scores_dict)
        cos_dist = scipy.spatial.distance.cosine(topic_embedding, post_embedding, w=None)
        # posts within the 0.3 cosine-distance threshold count as topic-related
        df.at[index, 'classification'] = 1 if cos_dist < 0.3 else 0
        df.at[index, 'cos_dist'] = cos_dist
            

# final status
clear_output(wait=True)
print(msg_text)

df.to_pickle(OUTPUT/ 'DD_Neustadt_ClassifiedInstagramPosts.pickle')
Processed records: 115370 (100.00%). 
CPU times: user 1min 6s, sys: 2.65 s, total: 1min 9s
Wall time: 1min 5s
In [18]:
df_classified = df[df['classification'] == 1]
print ("The algorithm identified", len(df_classified), "social media posts related to music events in Dresden Neustadt")
The algorithm identified 3294 social media posts related to music events in Dresden Neustadt

4. Interactive visualization of the classified posts using bokeh

Load dependencies

In [19]:
import geopandas as gp
import holoviews as hv
import geoviews as gv
from cartopy import crs as ccrs
hv.notebook_extension('bokeh')

Convert the pandas dataframe into a geopandas dataframe

df_classified is a subset of the original dataframe, so its index values still correspond to the ones in the original dataframe. We reset the index values so that the first record of the subset gets the index 0.

In [20]:
df_classified.reset_index()
Out[20]:
index latitude longitude post_date post_text post_thumbnail_url post_views_count post_like_count post_url post_geoaccuracy post_comment_count post_type place_guid place_name classification cos_dist
0 154 51.055945 13.744381 1/12/2015 15:08 sunriseave samuhaber rikurajamaa samiosala rau... NaN 0.0 25.0 NaN place 0.0 image f81888714c4f19084d1e55d8084ee720 Dresden Elbufer 1.0 0.238382
1 315 51.072870 13.737680 1/23/2015 19:04 moocher minnithemoocher ska skapunk rock punk ... NaN 0.0 19.0 NaN place 2.0 image 11caaa1f7061e3ed78b97dd3fe6789af Chemiefabrik 1.0 0.278942
2 316 51.072870 13.737680 1/23/2015 19:04 moocher minnithemoocher ska skapunk rock punk ... NaN 0.0 19.0 NaN place 2.0 image 11caaa1f7061e3ed78b97dd3fe6789af Chemiefabrik 1.0 0.278942
3 323 51.072870 13.737680 1/23/2015 23:14 ska skapunk berlin minnithemoocher distemper c... NaN 0.0 179.0 NaN place 0.0 image 11caaa1f7061e3ed78b97dd3fe6789af Chemiefabrik 1.0 0.249914
4 324 51.072870 13.737680 1/23/2015 23:14 ska skapunk berlin minnithemoocher distemper c... NaN 0.0 179.0 NaN place 0.0 image 11caaa1f7061e3ed78b97dd3fe6789af Chemiefabrik 1.0 0.249914
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3289 115152 51.084180 13.764640 8/13/2018 16:48 tama drums tamadrums strongestnameindrums drum... NaN 0.0 56.0 NaN place 0.0 image e843df4c911cbab26356b97a41dd8a35 Zoundhouse Dresden 1.0 0.283054
3290 115153 51.084180 13.764640 8/13/2018 16:48 tama drums tamadrums strongestnameindrums drum... NaN 0.0 56.0 NaN place 0.0 image e843df4c911cbab26356b97a41dd8a35 Zoundhouse Dresden 1.0 0.283054
3291 115177 51.056001 13.744443 8/13/2018 19:16 rolandkaiser kaisermania kaisermania schlager ... NaN 0.0 29.0 NaN place 1.0 image 5733680fc06018d772f976fe20fe40fb Filmnächte am Elbufer 1.0 0.299322
3292 115206 51.069635 13.733809 8/14/2018 6:50 discodicedj discodice unitedmusicfestival hous... NaN 0.0 25.0 NaN place 0.0 image 87295a86f48af1da33fc571c5cec39aa Alter Schlachthof 1.0 0.252726
3293 115219 51.069635 13.733809 8/14/2018 10:05 nightlife konzert johannesoerding johannesoerd... NaN 0.0 2.0 NaN place 0.0 image 87295a86f48af1da33fc571c5cec39aa Alter Schlachthof 1.0 0.284293

3294 rows × 16 columns

In [21]:
gdf = gp.GeoDataFrame(
    df_classified, geometry=gp.points_from_xy(df_classified.longitude, df_classified.latitude))
In [22]:
CRS_PROJ = "epsg:3857" # Web Mercator
CRS_WGS = "epsg:4326" # WGS1984
gdf.crs = CRS_WGS # Set projection
gdf = gdf.to_crs(CRS_PROJ) # Project

Have a look at the geodataframe

In [23]:
gdf.head()
Out[23]:
latitude longitude post_date post_text post_thumbnail_url post_views_count post_like_count post_url post_geoaccuracy post_comment_count post_type place_guid place_name classification cos_dist geometry
154 51.055945 13.744381 1/12/2015 15:08 sunriseave samuhaber rikurajamaa samiosala rau... NaN 0.0 25.0 NaN place 0.0 image f81888714c4f19084d1e55d8084ee720 Dresden Elbufer 1.0 0.238382 POINT (1530017.463 6631195.754)
315 51.072870 13.737680 1/23/2015 19:04 moocher minnithemoocher ska skapunk rock punk ... NaN 0.0 19.0 NaN place 2.0 image 11caaa1f7061e3ed78b97dd3fe6789af Chemiefabrik 1.0 0.278942 POINT (1529271.542 6634193.718)
316 51.072870 13.737680 1/23/2015 19:04 moocher minnithemoocher ska skapunk rock punk ... NaN 0.0 19.0 NaN place 2.0 image 11caaa1f7061e3ed78b97dd3fe6789af Chemiefabrik 1.0 0.278942 POINT (1529271.542 6634193.718)
323 51.072870 13.737680 1/23/2015 23:14 ska skapunk berlin minnithemoocher distemper c... NaN 0.0 179.0 NaN place 0.0 image 11caaa1f7061e3ed78b97dd3fe6789af Chemiefabrik 1.0 0.249914 POINT (1529271.542 6634193.718)
324 51.072870 13.737680 1/23/2015 23:14 ska skapunk berlin minnithemoocher distemper c... NaN 0.0 179.0 NaN place 0.0 image 11caaa1f7061e3ed78b97dd3fe6789af Chemiefabrik 1.0 0.249914 POINT (1529271.542 6634193.718)
In [30]:
x = gdf.loc[gdf.first_valid_index()].geometry.x
y = gdf.loc[gdf.first_valid_index()].geometry.y

margin = 1000 # meters
bbox_bottomleft = (x - margin, y - margin)
bbox_topright = (x + margin, y + margin)
gdf.loc[0]?
  • gdf.loc[0] is the loc-indexer from pandas. It means: access the first record of the (Geo)DataFrame.
  • .geometry.x is used to access the (projected) x coordinate of the geometry (point). This is only available for a GeoDataFrame (geopandas).
In [31]:
posts_layer = gv.Points(
    df_classified,
    kdims=['longitude', 'latitude'],
    vdims=['post_text'],
    label='Instagram Post')
In [32]:
from bokeh.models import HoverTool
from typing import Dict, Optional
def get_custom_tooltips(
        items: Dict[str, str]) -> str:
    """Compile HoverTool tooltip formatting with items to show on hover"""
    tooltips = ""
    if items:
        tooltips = "".join(
            f'<div><span style="font-size: 12px;">'
            f'<span style="color: #82C3EA;">{item}:</span> '
            f'@{item}'
            f'</span></div>' for item in items)
    return tooltips
In [33]:
def set_active_tool(plot, element):
    """Enable wheel_zoom in bokeh plot by default"""
    plot.state.toolbar.active_scroll = plot.state.tools[0]

# prepare custom HoverTool
tooltips = get_custom_tooltips(items=['post_text'])
hover = HoverTool(tooltips=tooltips) 
    
gv_layers = hv.Overlay(
    gv.tile_sources.CartoDark * \
    posts_layer.opts(
        tools=['hover'],
        size=8,
        line_color='black',
        line_width=0.1,
        fill_alpha=0.8,
        fill_color='#ccff00') 
    )

Store map as static HTML file

In [34]:
gv_layers.opts(
    projection=ccrs.GOOGLE_MERCATOR,
    title= "Music Festivals and Concerts in Dresden Neustadt according to Instagram Posts",
    responsive=True,
    xlim=(bbox_bottomleft[0], bbox_topright[0]),
    ylim=(bbox_bottomleft[1], bbox_topright[1]),
    data_aspect=0.45, # maintain fixed aspect ratio during responsive resize
    hooks=[set_active_tool])
hv.save(
    gv_layers, OUTPUT / 'topic_map.html', backend='bokeh')

Display in-line view of the map:

In [35]:
gv_layers.opts(
    width=800,
    height=480,
    responsive=False,
    hooks=[set_active_tool],
    title= "Music Festivals and Concerts in Dresden Neustadt according to Instagram Posts" ,
    projection=ccrs.GOOGLE_MERCATOR,
    data_aspect=1,
    xlim=(bbox_bottomleft[0], bbox_topright[0]),
    ylim=(bbox_bottomleft[1], bbox_topright[1])
    )
Out[35]:

Create Notebook HTML

In [ ]:
!jupyter nbconvert --to html \
    --output-dir=./out/ ./04_topic_classification.ipynb \
    --template=../nbconvert.tpl \
    --ExtractOutputPreprocessor.enabled=False >&- 2>&-

Clean up input folder

In [ ]:
tools.clean_folders(
    [Path.cwd() / "input"])

Summary