Accessing information silos with Playwright

Social discourse is increasingly siloed behind paywalls, login barriers, and other access restrictions. At the same time, there is a growing need to summarize communication patterns in order to understand and monitor social cohesion, and to use this knowledge to protect and strengthen democracy.

This experiment was a test of expanding that knowledge beyond the big networks (Reddit, Facebook, Instagram) to smaller boards that host discussion of particular topics.

I wanted to archive a long discussion thread from photovoltaikforum.com for a test LLM analysis. The thread spans more than 1,000 pages and cannot be easily exported. Simple HTTP scraping does not work because the forum requires a fully rendered browser session.

The solution was to use Playwright with Python.

The script opens the thread in a headless browser, iterates through all pages and extracts:

  • author
  • timestamp
  • post content
  • page number

The results are streamed into a JSON file so progress is not lost if the script stops.
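The persistence pattern can be sketched in isolation: each batch of posts is appended to an in-memory list, and the whole list is rewritten to disk after every page, so a restart picks up whatever was already saved. A minimal sketch (the function name is illustrative, not part of the script below):

```python
import json
from pathlib import Path

def append_and_save(new_posts, json_path="thread.json"):
    """Append a batch of posts to the JSON file, rewriting it in full.

    A minimal sketch of the streaming-save pattern: if the file already
    exists, previous results are loaded first, so an interrupted run
    loses at most the page currently being scraped.
    """
    path = Path(json_path)
    posts = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    posts.extend(new_posts)
    path.write_text(json.dumps(posts, indent=2, ensure_ascii=False), encoding="utf-8")
    return posts
```

Rewriting the entire file on every page is not the fastest approach, but for a few thousand posts it keeps the on-disk file valid JSON at all times, which is what matters for safe resumption.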

TL;DR

Here is the script:

import asyncio
import json
from pathlib import Path
from playwright.async_api import async_playwright

THREAD_URL = "https://www.photovoltaikforum.com/thread/241171-marstek-venus-c-e-ac-speicher-5-12-kwh-erfahrungen-installation-leistung-im-allt/"

OUTPUT_MD = "thread.md"
OUTPUT_JSON = "thread.json"


async def scrape_thread():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()

        page_number = 1
        max_pages = 1481

        # Open JSON file in streaming mode
        json_path = Path(OUTPUT_JSON)

        # If the file exists, resume from previous progress
        if json_path.exists():
            with open(json_path, "r", encoding="utf-8") as f:
                all_posts = json.load(f)
            if all_posts:
                # Skip pages that were already saved on a previous run,
                # otherwise they would be scraped and appended twice
                page_number = max(p["page"] for p in all_posts) + 1
        else:
            all_posts = []

        while page_number <= max_pages:
            url = f"{THREAD_URL}?pageNo={page_number}"
            print(f"Scraping page {page_number}")

            try:
                await page.goto(url, wait_until="domcontentloaded", timeout=60000)
                # Wait until at least one post is rendered before extracting;
                # kept inside the try block so a timeout skips the page
                # instead of crashing the whole run
                await page.wait_for_selector(".message", timeout=30000)
            except Exception as e:
                print(f"Page {page_number} failed: {e}")
                page_number += 1
                continue

            posts = await page.locator(".message").all()

            page_posts = []

            for post in posts:
                try:
                    author_locator = post.locator(".username").first

                    if await author_locator.count() > 0:
                        author = (await author_locator.inner_text()).strip()
                    else:
                        author = "unknown"

                    times = await post.locator("time").all()
                    timestamp = "unknown"

                    for t in times:
                        dt = await t.get_attribute("datetime")
                        if dt:
                            timestamp = dt
                            break

                    content_locator = post.locator(".messageContent").first
                    content = ""

                    if await content_locator.count() > 0:
                        content = (await content_locator.inner_text()).strip()

                    page_posts.append({
                        "author": author,
                        "timestamp": timestamp,
                        "content": content,
                        "page": page_number
                    })
                except Exception as e:
                    print(f"Skipped broken post on page {page_number}: {e}")
                    
            # stream save results to file
            all_posts.extend(page_posts)

            with open(json_path, "w", encoding="utf-8") as f:
                json.dump(all_posts, f, indent=2, ensure_ascii=False)

            print(f"Saved page {page_number}")

            page_number += 1

        await browser.close()

    return all_posts


def save_markdown(posts):
    with open(OUTPUT_MD, "w", encoding="utf-8") as f:
        for post in posts:
            f.write(f"## {post['author']}\n")
            f.write(f"*{post['timestamp']}*\n\n")
            f.write(post["content"])
            f.write("\n\n---\n\n")


def save_json(posts):
    with open(OUTPUT_JSON, "w", encoding="utf-8") as f:
        json.dump(posts, f, indent=2, ensure_ascii=False)


async def main():
    posts = await scrape_thread()
    print(f"Collected {len(posts)} posts")

    save_markdown(posts)
    save_json(posts)

    print("Saved thread.md and thread.json")


if __name__ == "__main__":
    asyncio.run(main())

Setup

Requirements:

  • Linux / WSL2 recommended
  • Python 3.8–3.11
  • Playwright

Create and activate a virtual environment:

python -m venv venv
source venv/bin/activate

Install dependencies:

pip install playwright
playwright install
playwright install-deps

Usage

Edit the thread URL in download_thread.py if needed, then run:

python download_thread.py

The script will:

  • Iterate through all pages of the thread
  • Extract posts (author, timestamp, content)
  • Continuously save results to thread.json

Progress is written after each page to avoid data loss if the script stops.

Output:

thread.json

Structure example:

{
  "author": "username",
  "timestamp": "2025-08-17T14:17:53.000Z",
  "content": "post text",
  "page": 5
}
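With that flat per-post structure, post-hoc analysis of thread.json is straightforward. A small sketch (assuming the fields shown above; the function name and sample data are illustrative) that counts posts per author:

```python
from collections import Counter

def top_authors(posts, n=5):
    # Count posts per author and return the n most active authors.
    # Assumes each post dict carries the "author" field shown above.
    return Counter(p["author"] for p in posts).most_common(n)

# Example with the structure shown above:
sample = [
    {"author": "anna", "timestamp": "2025-08-17T14:17:53.000Z", "content": "a", "page": 1},
    {"author": "ben",  "timestamp": "2025-08-17T15:02:11.000Z", "content": "b", "page": 1},
    {"author": "anna", "timestamp": "2025-08-18T09:40:00.000Z", "content": "c", "page": 2},
]
print(top_authors(sample, n=2))  # [('anna', 2), ('ben', 1)]
```

The same pattern works for posts per page or per day, which is useful for spotting activity spikes before handing the text to an LLM.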

Notes

  • The scraper uses Playwright with Chromium.
  • If a page fails to load, it will be skipped and the script continues.
  • Scraping the roughly 1,500-page discussion took about 15 minutes.