Part 4: Topic Classification of Social Media

Workshop: Social Media, Data Science, & Cartography
Alexander Dunkel, Madalina Gugulica

This is the fourth notebook in a series of four notebooks:

  1. Introduction to Social Media data, jupyter and python spatial visualizations
  2. Introduction to privacy issues with Social Media data and possible solutions for cartographers
  3. Specific visualization techniques example: TagMaps clustering
  4. Specific data analysis: Topic Classification

Open these notebooks through the file explorer on the left side.

For this notebook, we use another environment that must be linked first.

In [ ]:
!cd .. && sh

Introduction: Social Media & Topic-Based Text Classification

The content social media users share on different platforms is extremely diverse, encompassing a wide range of topics, including valuable information about the way people perceive, relate to, and use different environments. To harness these large volumes of data, specific tools and techniques are needed to organize, search, and understand these vast quantities of information.

Text classification is a Natural Language Processing task that aims at mapping documents (in our case social media posts) into a set of predefined categories. Supervised machine learning classifiers have shown great success in performing these tasks. Nevertheless, they require large volumes of labeled data for training, which are generally not available for social media data and can be very time-consuming and expensive to obtain.

This notebook introduces a practical and unsupervised approach (which requires no labeled data) to thematically classify social media posts into specific categories (topics), each simply described by a label. The underlying assumption of this approach is that word embeddings can be used to classify documents when no labeled training data is available.

The method is based on comparing the textual semantic similarity between the most relevant words in each social media post and a list of keywords for each targeted category, reflecting its semantic field (in linguistics, a semantic field is a set of words grouped by meaning that refers to a specific subject). The strength of this approach is its simplicity; however, its success depends on a good definition of each topic, reflected in the list of keywords.


How do we make machines understand text data? Machines excel at working with numerical data, but their performance drops when they are fed raw text.

The idea is to create numerical representations of words that capture their meanings, semantic relationships, and the different contexts they are used in. For the conversion of raw text into numbers, there are a few options. The simplest is to create a word frequency matrix that counts the occurrence of each word (bag-of-words). An enhanced version of this method estimates the log-scaled frequency of each word considering its occurrence in all documents (tf-idf). Nevertheless, these methods capture only word frequencies and no contextual information or higher-level semantics of the text.
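The difference between the two weighting schemes can be sketched on a toy corpus (the three example posts below are made up for illustration):

```python
import math
from collections import Counter

docs = [
    "concert at the park".split(),
    "music festival in the park".split(),
    "sunset at the river".split(),
]

# bag-of-words: raw term counts per document
bow = [Counter(doc) for doc in docs]

# tf-idf: term frequency scaled by the log inverse document frequency
n_docs = len(docs)
def idf(term):
    doc_freq = sum(1 for doc in docs if term in doc)
    return math.log(n_docs / doc_freq)

tfidf_doc0 = {term: tf * idf(term) for term, tf in bow[0].items()}
# "the" occurs in every document, so its idf (and tf-idf weight) is 0,
# while the rarer "concert" gets the highest weight in the first post
```

Note how tf-idf discounts words that occur everywhere, while bag-of-words treats all words equally.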

A recent advance in the field of Natural Language Processing proposed the use of word embeddings for the numerical representation of text.

The adopted method computes the semantic similarity between different words or groups of words and determines which words are semantically related and belong to the same semantic field. Furthermore, computing the distance in the vector space (cosine distance) between the centroid of the word vectors that belong to a certain topic (semantic field) and the centroid of the word vectors that compose a social media post allows us to verify whether the textual metadata associated with a post is related to a specific category (binary classification).
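This centroid comparison can be sketched with plain NumPy; the toy 3-dimensional vectors below are made up for illustration (a real model uses e.g. 300 dimensions):

```python
import numpy as np

def cosine_distance(u, v):
    # cosine distance = 1 - cosine similarity
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# toy word vectors: two words of a semantic field, two words of a post
topic_vectors = np.array([[1.0, 0.2, 0.0],
                          [0.9, 0.1, 0.1]])
post_vectors = np.array([[0.8, 0.3, 0.0],
                         [1.0, 0.0, 0.2]])

# centroids: the average of the word vectors in each group
topic_centroid = topic_vectors.mean(axis=0)
post_centroid = post_vectors.mean(axis=0)

dist = cosine_distance(topic_centroid, post_centroid)
# a small distance suggests the post is related to the topic
```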

In this notebook we will implement the Word2Vec word embedding technique for creating word vectors with Python's Gensim library.

1. Preparations

Load Dependencies

In [1]:
import pandas as pd
import pickle
import scipy.spatial
import numpy as np
import matplotlib.pyplot as plt
import warnings
from IPython.display import clear_output, Markdown, display

We're creating several output graphics and temporary files.

These will be stored in the subfolder notebooks/out/.

In [2]:
from pathlib import Path
OUTPUT = Path.cwd() / "out"
OUTPUT.mkdir(exist_ok=True)

1.1. Load the pre-trained Word2Vec model and the idfscores dictionary

The model was trained on a corpus that was prepared by filtering, cleaning, and normalizing over 1.5 million Instagram, Flickr, and Twitter posts geolocated within Dresden and Heidelberg.

Parameters were chosen according to the semantic similarity performance results reported in *Efficient Estimation of Word Representations in Vector Space* - T. Mikolov et al. (2013)

  • size (vector size): 300
  • alpha (initial learning rate): 0.025
  • window: 5
  • min_count: 5
  • min_alpha: 0.0001 (the learning rate drops linearly to min_alpha as training progresses)
  • sg: 1 (skip-gram architecture: predicting context words based on the current one)
  • negative (negative samples): 5 (if > 0, negative sampling is used; the value specifies how many “noise words” are drawn, usually between 5 and 20; 0 disables negative sampling)
  • ns_exponent: 0.75 (the exponent shaping the negative sampling distribution: 1.0 samples exactly in proportion to word frequencies, 0.0 samples all words equally, and a negative value samples low-frequency words more than high-frequency words; the default of 0.75 was chosen by the original Word2Vec paper)
  • iter: 15 (number of iterations (epochs) over the corpus)
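Assuming gensim 3.x keyword names (gensim 4 renamed size to vector_size and iter to epochs), the parameters above could be collected and passed to the Word2Vec constructor as sketched below; tokenized_corpus stands in for the preprocessed training sentences:

```python
# training parameters as listed above (gensim 3.x keyword names)
w2v_params = dict(
    size=300,          # vector size
    alpha=0.025,       # initial learning rate
    window=5,
    min_count=5,
    min_alpha=0.0001,  # final learning rate
    sg=1,              # skip-gram architecture
    negative=5,        # negative samples
    ns_exponent=0.75,
    iter=15,           # epochs over the corpus
)

# sketch of the training call (requires gensim and a tokenized corpus):
# from gensim.models import Word2Vec
# model = Word2Vec(sentences=tokenized_corpus, **w2v_params)
```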
In [3]:
%load_ext autoreload
%autoreload 2

Prepare paths..

In [4]:
import sys

INPUT = Path.cwd() / "input"

module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    sys.path.append(module_path)
from modules import tools
In [5]:
source = ""

Download sample data. This may take some time.

In [6]:
sample_url = tools.get_sample_url()
zip_uri = f'{sample_url}/download?path=%2F&files='
Loaded 478.88 MB of 478.88 (100%)..
Extracting zip..
Retrieved, extracted size: 549.67 MB
CPU times: user 22.6 s, sys: 12.4 s, total: 35 s
Wall time: 1min 51s

Load the pretrained Word2Vec model using gensim's Word2Vec class.

In [7]:
from gensim import utils
from gensim.models import Word2Vec
model_w2v = Word2Vec.load(
    str(INPUT / "word2vec.model"))

For the creation of the post_embedding we will need idf-score weights, which were prepared beforehand and stored in the input folder as a serialized (pickled) dictionary.

In [8]:
#idf-scores dictionary deserialization
with open(INPUT / 'idf_scores_dict.pkl', 'rb') as handle:
    idf_scores_dict = pickle.load(handle)

1.2. Define the functions that will help compute the average topic and post vectors

Functions to compute the average topic and post vectors

In [9]:
def avg_topic_vector(lang_model, tokens_list):
    # remove out-of-vocabulary words
    tokens = []
    for token in tokens_list:
        if token in lang_model.wv.vocab:
            tokens.append(token)
    return np.average(lang_model[tokens], axis=0)

def avg_post_vector(lang_model, tokens_list, idf):
    # remove out-of-vocabulary words and weight the rest by tf-idf
    tokens = []
    weights = []
    for token in tokens_list:
        if token in lang_model.wv.vocab:
            tokens.append(token)
            tf = tokens_list.count(token)
            tfidf = tf * idf[token]
            weights.append(tfidf)
    return np.average(lang_model[tokens], weights=weights, axis=0)

def has_vector_representation(lang_model, upl):
    """check if at least one word of the document is in the
    word2vec dictionary"""
    n = len([w for w in upl if w in lang_model.wv.vocab])
    if n > 0:
        return True
    return False
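The weighting scheme in avg_post_vector can be illustrated without a trained model. Everything below (toy_vectors, toy_idf, the example tokens) is hypothetical toy data; the computation mirrors the function above, where each in-vocabulary occurrence contributes its vector, weighted by tf * idf:

```python
import numpy as np

# hypothetical 2-d word vectors and idf scores (toy data)
toy_vectors = {"concert": np.array([1.0, 0.0]),
               "tonight": np.array([0.0, 1.0])}
toy_idf = {"concert": 2.0, "tonight": 0.5}

tokens = ["concert", "concert", "tonight", "xyzunknown"]
# keep only in-vocabulary tokens, as avg_post_vector does
in_vocab = [t for t in tokens if t in toy_vectors]
# weight each occurrence by tf * idf
weights = [tokens.count(t) * toy_idf[t] for t in in_vocab]

post_embedding = np.average([toy_vectors[t] for t in in_vocab],
                            weights=weights, axis=0)
# "concert" (tf=2, idf=2.0, weight 4.0 per occurrence) dominates
# "tonight" (tf=1, idf=0.5, weight 0.5), pulling the embedding
# toward the "concert" direction
```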

2. Load Preprocessed Data

The textual content of social media data has a low degree of formal semantic and syntactic accuracy. In order to provide only significant information for the text classification task to be performed, the text (post title, post body, and tags) was preprocessed with the following steps:

  • lowercasing
  • extract hashtags and individual words (tokenization)
  • remove mentions (@username)
  • remove punctuation
  • remove URLs (http:// as well as www.)
  • remove html tags (<>)
  • remove digits
  • identify and select only English and German posts
  • remove stopwords (commonly used words such as “the”, “a”, “an”, “in”, etc.)
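A minimal sketch of such a preprocessing pipeline, using only Python's re module (the stopword set here is a tiny illustrative subset, and the language identification step is omitted):

```python
import re

STOPWORDS = {"the", "a", "an", "in", "at", "und"}  # tiny illustrative set

def preprocess(text):
    text = text.lower()                                # lowercasing
    text = re.sub(r"@\w+", " ", text)                  # remove mentions
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # remove URLs
    text = re.sub(r"<[^>]+>", " ", text)               # remove html tags
    text = re.sub(r"\d+", " ", text)                   # remove digits
    text = re.sub(r"[^\w\s#]", " ", text)              # remove punctuation
    tokens = [t.lstrip("#") for t in text.split()]     # tokenize, keep hashtag words
    return [t for t in tokens if t and t not in STOPWORDS]

preprocess("Party @dd_club at the #Elbufer 2015! www.example.com")
```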
In [10]:
filename = "DD_Neustadt_NormalizedInstagramPosts.pickle"
df = pd.read_pickle(INPUT / filename)
In [11]:
df.head()
latitude longitude post_date post_text post_thumbnail_url post_views_count post_like_count post_url post_geoaccuracy post_comment_count post_type place_guid place_name
0 51.088468 13.765694 1/1/2015 0:12 party newyear fucus goals threegetbig abfuckcl... NaN NaN 131.0 NaN place 11.0 image 32b8350c4ddb8da9ecc30c341035a469 Sektor Evolution
1 51.088468 13.765694 1/1/2015 0:12 party newyear fucus goals threegetbig abfuckcl... NaN NaN 131.0 NaN place 11.0 image 32b8350c4ddb8da9ecc30c341035a469 Sektor Evolution
2 51.056450 13.741490 1/1/2015 0:32 happynewyear party friends frohes neues jahr h... NaN NaN 35.0 NaN place NaN image 3defd67cdc22dba58e7a849a02a5f3bd Elbufer
3 51.056164 13.740268 1/1/2015 0:43 happynewyear silvester augustusbrücke love hau... NaN NaN 21.0 NaN place 0.0 image fef2b329a4f6b78c1fd341c94fa5ec73 Augustusbrücke
4 51.056450 13.741490 1/1/2015 2:02 happy new year everyone world gesundes neues j... NaN 0.0 30.0 NaN place 1.0 image 3defd67cdc22dba58e7a849a02a5f3bd Elbufer

3. Topic-Based Classification of Social Media Posts


The classification of the social media posts is based on the calculation of a similarity score (cosine similarity) between a topic embedding and the post embeddings, and follows the workflow below:

  1. for each label (topic), a list of relevant keywords is defined and enhanced by seeking further semantically similar words through the identification of the most similar word vectors (which are closely located in the vector space)

  2. a topic embedding is created by averaging all the vectors representing the keywords in the previously defined list

  3. the vector representation of each social media post is created by averaging the weighted word embeddings of all words, where the weight of a word is given by its tf-idf score

  4. the classification follows from the calculation of a similarity score between each post vector and the topic vector using cosine distance; an empirically identified similarity threshold (70%) is considered decisive
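The 70% similarity threshold translates directly into a cosine distance threshold, since cosine distance = 1 - cosine similarity:

```python
SIMILARITY_THRESHOLD = 0.70
# requiring a cosine similarity above 70% is the same as requiring
# a cosine distance below 0.3
distance_threshold = 1.0 - SIMILARITY_THRESHOLD
```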
In [12]:
topic_list = ['event','music','festival','concert']
In [13]:
enhanced_list = []
for keyword in topic_list:
    similar_words = model_w2v.wv.most_similar(
        positive = [keyword], topn = 50)
    enhanced_list += ([w[0] for w in similar_words])

Some words might repeat; therefore we save the list as a set of unique strings.

In [14]:
topic_list = topic_list + enhanced_list
topic_list = set(topic_list)
In [15]:
topic_embedding = avg_topic_vector(model_w2v,topic_list)

WordCloud representing the topic selected

To visualize the enhanced list of keywords representative for the chosen topic and used for the calculation of the topic embedding, we use the WordCloud library.

Note: Some of the tags identified might refer to the city of Heidelberg since the word2vec model was trained on social media posts that were published within both Dresden and Heidelberg.

In [16]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud        
words = ' '.join(topic_list) 
wordcloud = WordCloud(background_color="white").generate(words)
# Display the generated image:
plt.figure(figsize = (10,10))
plt.imshow(wordcloud, interpolation='bilinear')
In [17]:
%%time
df = df.reindex(df.columns.tolist() + ['classification', 'cos_dist'], axis=1)
x = 0
total_records = len(df)
for index, row in df.iterrows():
    if x % 100 == 0:
        msg_text = (
            f'Processed records: {x} ({x/(total_records/100):.2f}%). ')
        clear_output(wait=True)
        print(msg_text)
    text = row['post_text'].split(' ')
    if has_vector_representation(model_w2v, text):
        # create the post embedding
        post_embedding = avg_post_vector(model_w2v, text, idf_scores_dict)
        cos_dist = scipy.spatial.distance.cosine(
            topic_embedding, post_embedding, w=None)
        if cos_dist < 0.3:
            df.at[index, 'classification'] = 1
            df.at[index, 'cos_dist'] = cos_dist
        else:
            df.at[index, 'classification'] = 0
            df.at[index, 'cos_dist'] = cos_dist
    x += 1

# final status
print(f'Processed records: {x} ({x/(total_records/100):.2f}%). ')

df.to_pickle(OUTPUT / 'DD_Neustadt_ClassifiedInstagramPosts.pickle')
Processed records: 115370 (100.00%). 
CPU times: user 1min 6s, sys: 2.65 s, total: 1min 9s
Wall time: 1min 5s
In [18]:
df_classified = df[df['classification'] == 1]
print("The algorithm identified", len(df_classified), "social media posts related to music events in Dresden Neustadt")
The algorithm identified 3294 social media posts related to music events in Dresden Neustadt

4. Interactive visualization of the classified posts using bokeh

Load dependencies

In [19]:
import geopandas as gp
import holoviews as hv
import geoviews as gv
from cartopy import crs as ccrs