Data4Society Conference (Berlin)

Earlier this week, I attended the Data4Society conference in Berlin, hosted by KonsortSWD. The event brought together researchers and infrastructure providers to discuss the future of social, behavioral, and economic data. Platform data was a central theme throughout the conference and was also highlighted in the closing remarks by Christof Wolf (President of GESIS) as one of the most important future data sources for the social sciences.

The Responsible Geosocial Data Pipeline

I contributed a presentation titled “Responsible Geosocial Data Pipeline: A Privacy-by-Design Approach for Spatial Development.”

My goal was to present the infrastructural concept that emerged from my recent habilitation (submitted, but not yet finished!). Integrating user-generated geosocial data into spatial planning offers an important opportunity to understand collective perception. However, processing this data requires difficult trade-offs between scientific insight and data protection.

Title Slide: Responsible Geosocial Data Pipeline

Instead of storing raw, personally identifiable information (PII), our pipeline uses a probabilistic data structure called HyperLogLog (HLL). It immediately transforms raw data into anonymous statistical abstractions (sketches) upon ingestion. This prevents re-identification while still allowing us to perform quantitative estimations and intersection analyses (e.g., distinguishing tourists from locals, as demonstrated in our recent mapping of Germany).
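To make the idea of a sketch more concrete, here is a minimal, self-contained HyperLogLog illustration in Python. It is not the code of our pipeline: the class name, the `estimate_intersection` helper, the SHA-1 hash, and the precision of 12 bits are all assumptions chosen for readability, and a real deployment would rely on a vetted HLL implementation. The point is only to show that the sketch keeps small integer registers instead of user IDs, that two sketches can be merged, and that an intersection can be approximated with the inclusion-exclusion principle.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HLL sketch; stores only small integer registers, never IDs."""

    def __init__(self, p: int = 12):
        self.p = p                      # precision: number of index bits (assumption)
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # Hash the item to 64 bits; the raw item is discarded afterwards.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                    # first p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def union(self, other: "HyperLogLog") -> "HyperLogLog":
        # Merging is a register-wise maximum; no raw data is needed.
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)       # bias constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


def estimate_intersection(a: HyperLogLog, b: HyperLogLog) -> float:
    # Inclusion-exclusion: |A ∩ B| ≈ |A| + |B| - |A ∪ B|.
    return a.count() + b.count() - a.union(b).count()


if __name__ == "__main__":
    # Hypothetical user IDs observed at two places; the ID sets overlap.
    place_a, place_b = HyperLogLog(), HyperLogLog()
    for i in range(10_000):
        place_a.add(f"user-{i}")
    for i in range(5_000, 15_000):
        place_b.add(f"user-{i}")
    print("estimated users at A:", round(place_a.count()))
    print("estimated users at both:", round(estimate_intersection(place_a, place_b)))
```

All results are estimates. With 4096 registers the typical relative error of a single count is around 1 to 2 percent, and the error of an intersection estimate grows when the overlap is small compared to the two sets.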

Notes

The discussions following the presentation and throughout the conference were very insightful. They confirmed the need for robust data infrastructures. A few key takeaways I found particularly interesting:

  • While the EU’s Digital Services Act (DSA) theoretically grants researchers access to platform data, a presentation by LK Seiling highlighted that the DSA currently acts more like a toothless tiger: official access routes are hampered by platform delays and restrictions. This reinforces the need for independent, curated research data infrastructures that can secure and provide access to existing public datasets before they disappear.

  • I had great discussions with colleagues from GESIS (Katrin Weller), who recently started an infrastructure development project (RIDLOP) to store and provide secure access to sensitive raw platform data within Trusted Research Environments (TREs). RIDLOP focuses on secure access to raw data for complex individual analyses. This is the main difference from our approach: by abstracting data early, we aim to provide anonymous, global baseline metrics. Without these baseline metrics, it is not possible to normalize local phenomena or reliably measure complex concepts in individual and local case studies.

  • A highlight was a session by Jakob Napiontek on AI-supported data synthesis. He demonstrated how Generative Adversarial Networks (GANs) can be used to create synthetic twins of sensitive population statistics. The synthetic data imitates the statistical properties of the original dataset while guaranteeing anonymity. This approach elegantly avoids GDPR issues and is therefore very relevant for many other areas (just think of German health statistics and all the issues connected to their collection and sharing!).