Reddit API queries in Jupyter

PRAW is the Python Reddit API Wrapper. In this notebook, I use the Code Flow with Refresh Tokens to authenticate in Jupyter (running in Docker) and request sample submissions and comments.
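As a minimal sketch of authenticating with a refresh token, the snippet below collects credentials from environment variables and passes them to praw.Reddit. The environment variable names, the default user agent string and the helper names are assumptions for illustration and are not taken from the notebook.

```python
import os
from typing import Dict, Mapping


def reddit_kwargs(env: Mapping[str, str]) -> Dict[str, str]:
    """Build keyword arguments for praw.Reddit from environment variables.

    The variable names are hypothetical; a missing credential raises KeyError.
    """
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        "refresh_token": env["REDDIT_REFRESH_TOKEN"],
        "user_agent": env.get("REDDIT_USER_AGENT", "jupyter-praw-example/0.1"),
    }


def make_client(env: Mapping[str, str] = os.environ):
    """Create an authenticated client; praw is imported lazily."""
    import praw  # third-party package: pip install praw

    return praw.Reddit(**reddit_kwargs(env))
```

With the variables set, reddit = make_client() followed by reddit.subreddit("yosemite").new(limit=5) would iterate over a few recent submissions.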

Since the submissions.new() API endpoint is limited to the latest 1000 submissions, I use pmaw to collect all original submissions for a single subreddit (r/yosemite) from pushshift.io, a website and database that logs all posts on Reddit at the time they are posted [1]. Additional attributes are fetched alongside using the original Reddit API.
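One way to fetch additional attributes for submissions collected from Pushshift is to convert their IDs to Reddit fullnames (prefix t3_ for submissions) and pass them to PRAW's Reddit.info. This is a sketch under assumptions, not necessarily the notebook's method; the helper names and the chosen attributes (score, num_comments) are illustrative.

```python
from typing import Dict, Iterable, List


def to_fullnames(submission_ids: Iterable[str]) -> List[str]:
    """Prefix base36 submission ids with 't3_', Reddit's type prefix for links."""
    return [sid if sid.startswith("t3_") else f"t3_{sid}" for sid in submission_ids]


def fetch_attributes(reddit, submission_ids: Iterable[str]) -> Dict[str, dict]:
    """Look up current attributes for the given submission ids.

    `reddit` is expected to be an authenticated praw.Reddit instance;
    Reddit.info(fullnames=...) yields one object per id it can resolve.
    """
    result: Dict[str, dict] = {}
    for submission in reddit.info(fullnames=to_fullnames(submission_ids)):
        result[submission.id] = {
            "score": submission.score,
            "num_comments": submission.num_comments,
        }
    return result
```

Any IDs that Reddit can no longer resolve are simply absent from the returned dictionary, so the caller can compare keys against the Pushshift IDs to spot gaps.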

Interestingly, about 200 new submissions are shared in the r/yosemite subreddit each month on average, which is more than I expected. Our goal is to study this data for differences and commonalities in how people communicate about national parks in the US and Europe.

In the second notebook, I also show how to gradually turn a Jupyter Notebook into a package. With the help of Jupytext, all methods defined in the notebook can be imported into a standalone Python script (get_all_submissions.py, get_all_comments.py) that is directly accessible from the command line. The trick here is to add active-ipynb tags to all cells that should not be available in the Python version of the notebook.
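To illustrate the tagging: a notebook file is JSON in which each cell can carry a list of tags under metadata.tags. The sketch below uses only the standard library to list which code cells carry the active-ipynb tag and would therefore be inactive in the paired Python script. The function name and the trimmed sample notebook are made up for illustration; Jupytext itself handles the actual conversion.

```python
import json
from typing import List


def notebook_only_cells(notebook_json: str, tag: str = "active-ipynb") -> List[int]:
    """Return the indices of code cells that carry the given tag."""
    notebook = json.loads(notebook_json)
    return [
        index
        for index, cell in enumerate(notebook.get("cells", []))
        if cell.get("cell_type") == "code"
        and tag in cell.get("metadata", {}).get("tags", [])
    ]


# A tiny, made-up notebook: one reusable definition and one exploratory cell.
SAMPLE = json.dumps({
    "nbformat": 4, "nbformat_minor": 5, "metadata": {},
    "cells": [
        {"cell_type": "code", "metadata": {}, "source": "def f(): ..."},
        {"cell_type": "code", "metadata": {"tags": ["active-ipynb"]}, "source": "f()"},
    ],
})

tagged = notebook_only_cells(SAMPLE)
```

Here only the second cell is tagged, so the definition in the first cell stays importable while the exploratory call remains notebook-only.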

Everything else defined in the notebook can then be imported with from _pmaw import *; see get_all_submissions.py below.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Load all submissions for a list of subreddits.
"""
__author__ = "Alexander Dunkel"
__license__ = "GNU GPLv3"

from typing import Dict
from _pmaw import *  # import all names defined in the notebook

def main():
    """Main CLI method to query all submissions for a list of subreddits"""
    total_dict: Dict[str, int] = {}
    API.shards_down_behavior = "stop"  # default: warn
    for subreddit in PARKS_SUBREDDITS:
        # one output folder per subreddit
        (OUTPUT / subreddit).mkdir(parents=True, exist_ok=True)
        total_queried = query_time(
            start_year=2010, start_month=1,
            end_year=2023, end_month=4, subreddit=subreddit)
        print(f'Finished {subreddit} with {total_queried} submissions queried')
        total_dict[subreddit] = total_queried
    # summary across all subreddits
    print(f'Queried {sum(total_dict.values())} submissions in total')

if __name__ == "__main__":
    main()
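
query_time is defined in the notebook and not shown here. As a rough sketch of how a month-by-month query range like the one above could be generated, the hypothetical helper below yields (after, before) epoch timestamps in UTC for each month from the start month to the end month. Treating the end month as inclusive is an assumption, as is the helper's name.

```python
from datetime import datetime, timezone
from typing import Iterator, Tuple


def month_windows(
    start_year: int, start_month: int, end_year: int, end_month: int
) -> Iterator[Tuple[int, int]]:
    """Yield (after, before) epoch seconds for each month, end month inclusive."""
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        # first day of the following month closes the current window
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        after = int(datetime(year, month, 1, tzinfo=timezone.utc).timestamp())
        before = int(datetime(next_year, next_month, 1, tzinfo=timezone.utc).timestamp())
        yield after, before
        year, month = next_year, next_month


windows = list(month_windows(2010, 1, 2023, 4))
```

Each pair could then serve as the lower and upper time bound of one query, which keeps individual requests small and makes it easy to resume after an interruption.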