Accessing information silos with Playwright
Social discourse is increasingly siloed behind paywalls, login walls, and other access restrictions. At the same time, there is a growing need to summarize communication patterns in order to understand and monitor social cohesion, and to use that knowledge to protect and strengthen democracy.
This experiment extends that effort beyond the big networks (Reddit, Facebook, Instagram) to smaller boards that host discourse on particular topics.
I wanted to archive a long discussion thread from photovoltaikforum.com for an exploratory LLM analysis. The thread spans more than 1000 pages and cannot easily be exported, and simple HTTP scraping does not work because the forum requires a fully rendered browser session.
The solution was to use Playwright with Python.
The script opens the thread in a headless browser, iterates through all pages and extracts:
- author
- timestamp
- post content
- page number
The results are streamed into a JSON file so progress is not lost if the script stops.
TL;DR
Here is the script:
```python
import asyncio
import json
from pathlib import Path

from playwright.async_api import async_playwright

THREAD_URL = "https://www.photovoltaikforum.com/thread/241171-marstek-venus-c-e-ac-speicher-5-12-kwh-erfahrungen-installation-leistung-im-allt/"
OUTPUT_MD = "thread.md"
OUTPUT_JSON = "thread.json"


async def scrape_thread():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()

        page_number = 1
        max_pages = 1481

        # If the output file exists, load previous progress
        json_path = Path(OUTPUT_JSON)
        if json_path.exists():
            with open(json_path, "r", encoding="utf-8") as f:
                all_posts = json.load(f)
        else:
            all_posts = []

        while page_number <= max_pages:
            url = f"{THREAD_URL}?pageNo={page_number}"
            print(f"Scraping page {page_number}")

            try:
                await page.goto(url, wait_until="domcontentloaded", timeout=60000)
            except Exception as e:
                print(f"Page {page_number} failed: {e}")
                page_number += 1
                continue

            await page.wait_for_selector(".message", timeout=30000)
            posts = await page.locator(".message").all()

            page_posts = []
            for post in posts:
                try:
                    author_locator = post.locator(".username").first
                    if await author_locator.count() > 0:
                        author = (await author_locator.inner_text()).strip()
                    else:
                        author = "unknown"

                    times = await post.locator("time").all()
                    timestamp = "unknown"
                    for t in times:
                        dt = await t.get_attribute("datetime")
                        if dt:
                            timestamp = dt
                            break

                    content_locator = post.locator(".messageContent").first
                    content = ""
                    if await content_locator.count() > 0:
                        content = (await content_locator.inner_text()).strip()

                    page_posts.append({
                        "author": author,
                        "timestamp": timestamp,
                        "content": content,
                        "page": page_number,
                    })
                except Exception as e:
                    print(f"Skipped broken post on page {page_number}: {e}")

            # Stream results to file after every page
            all_posts.extend(page_posts)
            with open(json_path, "w", encoding="utf-8") as f:
                json.dump(all_posts, f, indent=2, ensure_ascii=False)
            print(f"Saved page {page_number}")

            page_number += 1

        await browser.close()
        return all_posts


def save_markdown(posts):
    with open(OUTPUT_MD, "w", encoding="utf-8") as f:
        for post in posts:
            f.write(f"## {post['author']}\n")
            f.write(f"*{post['timestamp']}*\n\n")
            f.write(post["content"])
            f.write("\n\n---\n\n")


def save_json(posts):
    with open(OUTPUT_JSON, "w", encoding="utf-8") as f:
        json.dump(posts, f, indent=2, ensure_ascii=False)


async def main():
    posts = await scrape_thread()
    print(f"Collected {len(posts)} posts")
    save_markdown(posts)
    save_json(posts)
    print("Saved thread.md and thread.json")


if __name__ == "__main__":
    asyncio.run(main())
```
Setup
Requirements:
- Linux / WSL2 recommended
- Python 3.8–3.11
- Playwright
Create and activate a virtual environment:
```bash
python -m venv venv
source venv/bin/activate
```
Install dependencies:
```bash
pip install playwright
playwright install
playwright install-deps
```
Usage
Edit the thread URL in download_thread.py if needed, then run:
```bash
python download_thread.py
```
The script will:
- Iterate through all pages of the thread
- Extract posts (author, timestamp, content)
- Continuously save results to thread.json
Progress is written after each page to avoid data loss if the script stops.
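One gap worth noting: the script reloads thread.json on restart, but still begins at page 1, so a restarted run re-scrapes pages it already has. A minimal resume sketch, using the `page` field stored with each post (`resume_state` is a hypothetical helper, not part of the script above):

```python
# Hypothetical resume helper: decide where a restarted run should continue,
# based on the "page" field stored with every saved post.
def resume_state(all_posts):
    """Return (posts to keep, page number to start scraping from)."""
    last = max((p["page"] for p in all_posts), default=0)
    if last == 0:
        return [], 1  # nothing saved yet, start from the beginning
    # Re-scrape the last saved page in case it was only partially captured,
    # dropping its posts first to avoid duplicates.
    kept = [p for p in all_posts if p["page"] != last]
    return kept, last

# In scrape_thread(), after loading thread.json, one could write:
#   all_posts, page_number = resume_state(all_posts)
```

This keeps the append-and-rewrite saving logic unchanged and only adjusts the starting point.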
Output: thread.json
Structure example:
```json
{
  "author": "username",
  "timestamp": "2025-08-17T14:17:53.000Z",
  "content": "post text",
  "page": 5
}
```
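Once thread.json exists, a first analysis pass needs nothing beyond the standard library. A quick sketch counting posts per author (field names match the structure above; the sample data is made up):

```python
from collections import Counter

# Sketch of a first analysis pass over thread.json entries
# (field names as in the structure example; sample data is invented).
def top_authors(posts, n=5):
    """Return the n most active authors as (name, post count) pairs."""
    return Counter(p["author"] for p in posts).most_common(n)

posts = [
    {"author": "anna", "timestamp": "2025-08-17T14:17:53.000Z", "content": "…", "page": 1},
    {"author": "bert", "timestamp": "2025-08-17T14:20:01.000Z", "content": "…", "page": 1},
    {"author": "anna", "timestamp": "2025-08-17T14:25:40.000Z", "content": "…", "page": 2},
]
print(top_authors(posts))  # [('anna', 2), ('bert', 1)]
```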
Notes
- The scraper uses Playwright with Chromium.
- If a page fails to load, it will be skipped and the script continues.
- Scraping the full discussion of roughly 1500 pages took about 15 minutes.
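Instead of skipping a failed page on the first timeout, a retry with exponential backoff often recovers transient errors. A generic sketch (`with_retries` is a hypothetical helper, not in the script above):

```python
import asyncio

# Hypothetical retry helper: retry a failing async operation a few times
# with exponential backoff before giving up, rather than skipping the
# page after a single timeout.
async def with_retries(coro_factory, attempts=3, base_delay=1.0):
    for attempt in range(attempts):
        try:
            return await coro_factory()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the error
            await asyncio.sleep(base_delay * 2 ** attempt)

# Usage inside the scraping loop (sketch):
#   await with_retries(lambda: page.goto(url, wait_until="domcontentloaded"))
```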