Accessing information silos with Playwright
Social discourse is increasingly siloed behind paywalls, login walls, and other access restrictions. At the same time, there is a growing need to summarize communication patterns in order to understand and monitor social cohesion, and to use that knowledge to protect and strengthen democracy.
This experiment extends that effort beyond the big networks (Reddit, Facebook, Instagram) to smaller boards that host discourse on particular topics.
I wanted to archive a long discussion thread from photovoltaikforum.com for an exploratory LLM analysis. The thread spans more than 1000 pages and cannot easily be exported, and simple HTTP scraping does not work because the forum requires a fully rendered browser session.
The solution was to use Playwright with Python.
The script opens the thread in a headless browser, iterates through all pages and extracts:
- author
- timestamp
- post content
- page number
The results are streamed into a JSON file so progress is not lost if the script stops.
TL;DR
Here is the script:
```python
import asyncio
import json
from pathlib import Path

from playwright.async_api import async_playwright

THREAD_URL = "https://www.photovoltaikforum.com/thread/241171-marstek-venus-c-e-ac-speicher-5-12-kwh-erfahrungen-installation-leistung-im-allt/"
OUTPUT_MD = "thread.md"
OUTPUT_JSON = "thread.json"


async def scrape_thread():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()

        page_number = 1
        max_pages = 1481

        # If the output file exists, load previous progress
        json_path = Path(OUTPUT_JSON)
        if json_path.exists():
            with open(json_path, "r", encoding="utf-8") as f:
                all_posts = json.load(f)
        else:
            all_posts = []

        while page_number <= max_pages:
            url = f"{THREAD_URL}?pageNo={page_number}"
            print(f"Scraping page {page_number}")

            try:
                await page.goto(url, wait_until="domcontentloaded", timeout=60000)
            except Exception as e:
                print(f"Page {page_number} failed: {e}")
                page_number += 1
                continue

            await page.wait_for_selector(".message", timeout=30000)
            posts = await page.locator(".message").all()

            page_posts = []
            for post in posts:
                try:
                    author_locator = post.locator(".username").first
                    if await author_locator.count() > 0:
                        author = (await author_locator.inner_text()).strip()
                    else:
                        author = "unknown"

                    times = await post.locator("time").all()
                    timestamp = "unknown"
                    for t in times:
                        dt = await t.get_attribute("datetime")
                        if dt:
                            timestamp = dt
                            break

                    content_locator = post.locator(".messageContent").first
                    content = ""
                    if await content_locator.count() > 0:
                        content = (await content_locator.inner_text()).strip()

                    page_posts.append({
                        "author": author,
                        "timestamp": timestamp,
                        "content": content,
                        "page": page_number,
                    })
                except Exception as e:
                    print(f"Skipped broken post on page {page_number}: {e}")

            # Stream results to file after every page
            all_posts.extend(page_posts)
            with open(json_path, "w", encoding="utf-8") as f:
                json.dump(all_posts, f, indent=2, ensure_ascii=False)
            print(f"Saved page {page_number}")

            page_number += 1

        await browser.close()
        return all_posts


def save_markdown(posts):
    with open(OUTPUT_MD, "w", encoding="utf-8") as f:
        for post in posts:
            f.write(f"## {post['author']}\n")
            f.write(f"*{post['timestamp']}*\n\n")
            f.write(post["content"])
            f.write("\n\n---\n\n")


def save_json(posts):
    with open(OUTPUT_JSON, "w", encoding="utf-8") as f:
        json.dump(posts, f, indent=2, ensure_ascii=False)


async def main():
    posts = await scrape_thread()
    print(f"Collected {len(posts)} posts")
    save_markdown(posts)
    save_json(posts)
    print("Saved thread.md and thread.json")


if __name__ == "__main__":
    asyncio.run(main())
```
Setup
Requirements:
- Linux / WSL2 recommended
- Python 3.8–3.11
- Playwright
Create and activate a virtual environment:
```bash
python -m venv venv
source venv/bin/activate
```
Install dependencies:
```bash
pip install playwright
playwright install
playwright install-deps
```
Usage
Edit the thread URL in download_thread.py if needed, then run:
```bash
python download_thread.py
```
The script will:
- Iterate through all pages of the thread
- Extract posts (author, timestamp, content)
- Continuously save results to thread.json
Progress is written after each page to avoid data loss if the script stops.
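One gap worth noting: the script reloads thread.json on restart, but still begins at page 1, so a restarted run re-scrapes pages it already has. A minimal resume sketch, using the `page` field stored with each post (`resume_state` is a hypothetical helper, not part of the script above):

```python
# Hypothetical resume helper: decide where a restarted run should continue,
# based on the "page" field stored with every saved post.
def resume_state(all_posts):
    """Return (posts to keep, page number to start scraping from)."""
    last = max((p["page"] for p in all_posts), default=0)
    if last == 0:
        return [], 1  # nothing saved yet, start from the beginning
    # Re-scrape the last saved page in case it was only partially captured,
    # dropping its posts first to avoid duplicates.
    kept = [p for p in all_posts if p["page"] != last]
    return kept, last

# In scrape_thread(), after loading thread.json, one could write:
#   all_posts, page_number = resume_state(all_posts)
```

This keeps the append-and-rewrite saving logic unchanged and only adjusts the starting point.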
Output: thread.json
Structure example:
```json
{
  "author": "username",
  "timestamp": "2025-08-17T14:17:53.000Z",
  "content": "post text",
  "page": 5
}
```
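Once thread.json exists, a first analysis pass needs nothing beyond the standard library. A quick sketch counting posts per author (field names match the structure above; the sample data is made up):

```python
from collections import Counter

# Sketch of a first analysis pass over thread.json entries
# (field names as in the structure example; sample data is invented).
def top_authors(posts, n=5):
    """Return the n most active authors as (name, post count) pairs."""
    return Counter(p["author"] for p in posts).most_common(n)

posts = [
    {"author": "anna", "timestamp": "2025-08-17T14:17:53.000Z", "content": "…", "page": 1},
    {"author": "bert", "timestamp": "2025-08-17T14:20:01.000Z", "content": "…", "page": 1},
    {"author": "anna", "timestamp": "2025-08-17T14:25:40.000Z", "content": "…", "page": 2},
]
print(top_authors(posts))  # [('anna', 2), ('bert', 1)]
```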
Notes
- The scraper uses Playwright with Chromium.
- If a page fails to load, it will be skipped and the script continues.
- Scraping the full discussion of roughly 1500 pages took about 15 minutes.
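Instead of skipping a failed page on the first timeout, a retry with exponential backoff often recovers transient errors. A generic sketch (`with_retries` is a hypothetical helper, not in the script above):

```python
import asyncio

# Hypothetical retry helper: retry a failing async operation a few times
# with exponential backoff before giving up, rather than skipping the
# page after a single timeout.
async def with_retries(coro_factory, attempts=3, base_delay=1.0):
    for attempt in range(attempts):
        try:
            return await coro_factory()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the error
            await asyncio.sleep(base_delay * 2 ** attempt)

# Usage inside the scraping loop (sketch):
#   await with_retries(lambda: page.goto(url, wait_until="domcontentloaded"))
```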