Term frequency-inverse document frequency (TFIDF) and Cosine Similarity

Alexander Dunkel, TU Dresden, Institute of Cartography; Maximilian Hartmann and Ross Purves, Universität Zürich (UZH), Geocomputation


Out[1]:

Last updated: Jan-17-2023, Carto-Lab Docker Version 0.9.0

Visualization of TFIDF and Cosine Similarity Values

The values loaded here have been generated outside Jupyter, in a separate process. This notebook only visualizes data.
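
The external process that produced these values is not shown in this notebook. As a rough orientation only, the following minimal sketch computes TF-IDF weights and a cosine similarity on toy data. It assumes one common TF-IDF variant (raw term count times log(N/df)); the weighting actually used for the loaded files may differ (the `binary` suffix in the cosine file name hints at a different term weighting, which is not reproduced here). The function names `tf_idf` and `cosine_similarity` and the codes `AAA`, `BBB`, `CCC` are hypothetical.

```python
from __future__ import annotations

import math
from collections import Counter


def tf_idf(docs: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Return a TF-IDF weight per term for each document (here: one per country)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for terms in docs.values() for term in set(terms))
    weights = {}
    for name, terms in docs.items():
        counts = Counter(terms)
        # Assumed variant: raw count * log(N / df)
        weights[name] = {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy data, not taken from the Flickr sample
docs = {
    "AAA": ["sunset", "beach", "beach", "palm"],
    "BBB": ["sunset", "mountains", "clouds"],
    "CCC": ["sunset", "beach", "clouds"],
}
weights = tf_idf(docs)
sim_ab = cosine_similarity(weights["AAA"], weights["BBB"])
sim_ac = cosine_similarity(weights["AAA"], weights["CCC"])
```

In this toy example, "sunset" occurs in every document and therefore receives a weight of zero, so only the more distinctive terms contribute to the similarity between two documents.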

Preparations

Load dependencies

We continue from the notebook 05_countries.ipynb, importing all previously defined methods and top-level variables.

In [2]:
import sys
from pathlib import Path
module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    sys.path.append(module_path)
# import all previous chained notebooks
from _05_countries import *
Chromedriver loaded. Svg output enabled.

Activate autoreload of changed Python files:

In [3]:
%load_ext autoreload
%autoreload 2

Load aggregate topic data

Data is stored as aggregate HyperLogLog (HLL) data (postcount) for each term.

In [4]:
root = Path.cwd().parents[1] / "00_topic_data"
TERMS_FLICKR_TFIDF = root / "20210202_FLICKR_SUNSET_random_country_tf_idf.csv"
TERMS_FLICKR_COSINE = root / "20211029_FLICKR_SUNSET_random_country_cosine_similarity_binary.csv"

Some statistics for these files:

In [5]:
%%time
data_files = {
    "TERMS_FLICKR_TFIDF":TERMS_FLICKR_TFIDF,
    "TERMS_FLICKR_COSINE":TERMS_FLICKR_COSINE,
    }
tools.display_file_stats(data_files)
name     TERMS_FLICKR_TFIDF  TERMS_FLICKR_COSINE
size     57.85 KB            974.43 KB
records  226                 226
CPU times: user 33.9 ms, sys: 244 µs, total: 34.1 ms
Wall time: 104 ms
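
The helper `tools.display_file_stats` comes from the chained notebooks and is not shown here. As a rough idea of what such statistics involve, the following sketch reports the size in KB and the number of lines of a text file. The function `file_stats` is hypothetical and not the helper used above; whether the helper counts the header line as a record is an assumption not verified here.

```python
from pathlib import Path
import tempfile


def file_stats(path: Path) -> dict:
    """Return size in KB and number of lines for a text file."""
    size_kb = path.stat().st_size / 1024
    with path.open(encoding="utf-8") as handle:
        n_lines = sum(1 for _ in handle)
    return {"size_kb": round(size_kb, 2), "lines": n_lines}


# Demonstration on a throwaway file with a header and two data rows
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.csv"
    demo.write_text("code,value\nAAA,1\nBBB,2\n", encoding="utf-8")
    stats = file_stats(demo)
```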

Load Cosine Similarity

Load as a pandas DataFrame

In [6]:
def load_cosine_df(csv: Path = TERMS_FLICKR_COSINE) -> pd.DataFrame:
    """Load CSV with cosine similarity values per country"""
    df = pd.read_csv(csv, encoding='utf-8', skiprows=0, index_col=0)
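    # The CSV holds a square matrix of similarity values: the first row
    # is read as the header and the first column as the index. Reuse the
    # index labels as column names so rows and columns share the same labels.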
    df.columns = df.index
    return df
In [7]:
df_cos = load_cosine_df()
In [8]:
df_cos.head()
Out[8]:
BFR INX CHE IDN USB ITX ZAX MEX CAN ENG ... CPV AZE HND MDA ALD INA BLM LIE ITP GUF
BFR 1.000000 0.176804 0.194586 0.192786 0.088246 0.138522 0.201482 0.175484 0.164773 0.118088 ... 0.101952 0.097359 0.125799 0.095063 0.082795 0.105904 0.074893 0.087826 0.051295 0.105382
INX 0.176804 1.000000 0.166689 0.185919 0.109039 0.142907 0.187739 0.162928 0.166713 0.138435 ... 0.074961 0.070127 0.095374 0.066170 0.060736 0.087176 0.050504 0.058790 0.036975 0.069739
CHE 0.194586 0.166689 1.000000 0.171150 0.096301 0.157435 0.178351 0.159385 0.164147 0.122183 ... 0.080351 0.072332 0.099305 0.070520 0.063698 0.077771 0.054618 0.074115 0.043080 0.083936
IDN 0.192786 0.185919 0.171150 1.000000 0.085915 0.127157 0.193810 0.171063 0.156392 0.112681 ... 0.096579 0.087764 0.118133 0.083325 0.078152 0.099819 0.063213 0.075742 0.049091 0.093572
USB 0.088246 0.109039 0.096301 0.085915 1.000000 0.110259 0.091638 0.091926 0.130047 0.143619 ... 0.026635 0.023089 0.037211 0.023858 0.019907 0.025114 0.018317 0.020239 0.014332 0.026182

5 rows × 225 columns
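
Because rows and columns share the same labels, the matrix can be queried by label. The following sketch, on a hypothetical 3×3 matrix with made-up codes and values, ranks the entries most similar to a given one. The function `most_similar` is an illustration and not part of the notebook's helpers; it assumes unique row labels.

```python
import pandas as pd

# Hypothetical 3x3 similarity matrix; codes and values are made up
codes = ["AAA", "BBB", "CCC"]
toy_cos = pd.DataFrame(
    [[1.0, 0.2, 0.7],
     [0.2, 1.0, 0.4],
     [0.7, 0.4, 1.0]],
    index=codes, columns=codes)


def most_similar(df: pd.DataFrame, code: str, n: int = 5) -> pd.Series:
    """Return the n highest-scoring other entries for one row label."""
    # Drop the self-similarity on the diagonal before ranking
    return df.loc[code].drop(code).sort_values(ascending=False).head(n)


top = most_similar(toy_cos, "AAA", n=2)
```

With the real data, a call such as `most_similar(df_cos, "CHE")` should work the same way, provided the code is present in the index.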

Load TFIDF

In [9]:
def load_tfidf_df(csv: Path = TERMS_FLICKR_TFIDF) -> pd.DataFrame:
    """Load CSV with TFIDF ranking for country"""
    df = pd.read_csv(csv, encoding='utf-8', header=0, index_col=0)
    return df
In [10]:
df_tfidf = load_tfidf_df()
In [11]:
df_tfidf.head()
Out[11]:
TERM_1 TF_IDF_1 TERM_2 TF_IDF_2 TERM_3 TF_IDF_3 TERM_4 TF_IDF_4 TERM_5 TF_IDF_5 ... TERM_16 TF_IDF_16 TERM_17 TF_IDF_17 TERM_18 TF_IDF_18 TERM_19 TF_IDF_19 TERM_20 TF_IDF_20
COUNTRY_CODE
ABW sunset 237.04 beach 81.75 ocean 75.27 sun 71.35 palm 69.74 ... sand 40.54 and 36.71 arubasunset 36.53 boats 36.00 island 34.75
ACA sunset 103.00 heights 31.31 shirley 31.29 english 27.28 clouds 25.68 ... bay 16.95 shirleyheights 16.34 englishharbour 16.28 from 16.18 this 16.05
AFG sunset 124.64 sun 27.91 mountains 24.86 clouds 21.84 over 20.47 ... war 10.08 army 10.08 evening 9.93 shadow 9.85 light 9.85
AGO sunset 89.23 sun 28.24 sky 20.02 the 18.21 landscape 16.94 ... nature 10.54 okavango 10.30 near 10.30 luanda 9.90 and 9.76
AIA sunset 28.30 caribbean 11.88 ocean 7.87 the 6.86 sun 6.81 ... beautiful 4.84 landscape 4.84 sea 4.04 anguilla 3.98 our 3.94

5 rows × 40 columns

Combine the top terms into a single column and drop all other columns

In [12]:
# TERM_1 .. TERM_20: the end of range() is exclusive
cols = [f'TERM_{ix}' for ix in range(1, 21)]
df_tfidf['tfidf'] = df_tfidf[cols].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
drop_cols_except(df_tfidf, ['tfidf'])
df_tfidf.head()
Out[12]:
tfidf
COUNTRY_CODE
ABW sunset beach ocean sun palm sky sea clouds the...
ACA sunset heights shirley english clouds sun the ...
AFG sunset sun mountains clouds over sky dusk land...
AGO sunset sun sky the landscape namibia africa ri...
AIA sunset caribbean ocean the sun clouds sky trav...
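
As an alternative to listing the term columns by number, they can be selected by name pattern. The following sketch uses a hypothetical two-row frame that mimics the TERM_n / TF_IDF_n layout. Unlike the cell above, it drops missing terms before joining, in case a row has fewer terms than columns; whether this occurs in the loaded data is not verified here.

```python
import pandas as pd

# Hypothetical frame mimicking the TERM_n / TF_IDF_n column layout
toy = pd.DataFrame(
    {
        "TERM_1": ["sunset", "sunset"], "TF_IDF_1": [10.0, 8.0],
        "TERM_2": ["beach", "sun"], "TF_IDF_2": [4.0, 3.0],
        "TERM_3": ["palm", None], "TF_IDF_3": [2.0, None],
    },
    index=pd.Index(["AAA", "BBB"], name="COUNTRY_CODE"))

# Select every TERM_<number> column by pattern, keeping the file order
term_cols = toy.filter(regex=r"^TERM_\d+$")
# Join the non-missing terms of each row into one space-separated string
joined = term_cols.apply(
    lambda row: " ".join(row.dropna().astype(str)), axis=1)
```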

Combine with country shapes

Load country geometries

In [13]:
def load_country_geom(
    ne_path: Path = NE_PATH, ne_uri: str = NE_URI, ne_filename: str = NE_FILENAME,
    crs_proj: str = CRS_PROJ, country_col: str = COUNTRY_COL) -> gp.GeoDataFrame:
    """Load country geometry and set SU_A3 column as index"""
    world = gp.read_file(
        ne_path / ne_filename.replace(".zip", ".shp"))
    world = world.to_crs(crs_proj)
    columns_keep = ['geometry', country_col, 'ADMIN']
    drop_cols_except(world, columns_keep)
    world.set_index(country_col, inplace=True)
    return world
In [14]:
world = load_country_geom()
world.head()
Out[14]:
ADMIN geometry
SU_A3
ZWE Zimbabwe POLYGON ((2987278.542 -2742733.921, 2979383.40...
ZMB Zambia POLYGON ((2976200.722 -1924957.705, 2961959.54...
YEM Yemen MULTIPOLYGON (((5181525.454 2047361.573, 51352...
YES Yemen POLYGON ((5307347.563 1557616.990, 5313584.814...
VNM Vietnam MULTIPOLYGON (((10323687.558 1282070.654, 1032...

This GeoDataFrame can be visualized interactively with HoloViews:

In [15]:
gv.Polygons(world, crs=crs.Mollweide())
Out[15]: