Multilingual scraper of privacy policies and terms of service

March, 2025

David Bernhard

Luka Nenadic

Stefan Bechtold

Karel Kubicek

Abstract

Websites’ privacy policies and terms of service constitute valuable resources for scholars in various disciplines. Nonetheless, there exists no large, multilingual database collecting these documents over the long term. As a result, researchers spend considerable time collecting them for individual projects, and the heterogeneous collection methods they use impede the reproducibility and comparability of research findings. As a solution, we introduce a long-term scraper of privacy policies and terms of service that supports 37 languages. We run our scraper on a monthly basis on 800 000 websites, and we publish the dataset for the twelve crawls in 2024. Our manual evaluation of the end-to-end extraction of the documents demonstrates F1 scores of 79% for privacy policies and 75% for terms of service in five sample languages (English, German, French, Italian, and Croatian). We present several broad potential applications of our database for future research.

Type

Conference paper

Publication

Proceedings of the Symposium on Computer Science and Law, 55–63

Last updated on March, 2025

Authors

Luka Nenadic

PhD Student
