Preprint, working paper. Year: 2013

Two comparable corpora of German newspaper text gathered on the web: Bild & Die Zeit

Abstract

This technical report documents the creation of two comparable corpora of German newspaper text, drawn from the daily tabloid Bild and the weekly newspaper Die Zeit. Two specialized crawlers and corpus builders were designed to crawl the domains bild.de and zeit.de, with the objective of gathering as many complete articles as possible. Purpose-built boilerplate removal and metadata recording code ensured high content quality. As a result, two separate corpora were created. Currently, the latest version for Bild is from 2011 and the latest version for Die Zeit is from early 2013. The corpora contain 60,476 and 134,222 articles, respectively. The crawler designed for Bild has been discontinued due to frequent layout changes on the website, whereas the one for Die Zeit is still actively maintained, and its code has been made available under an open source license.
Main file
Barbaresi_comparable_newspaper_corpora.pdf (40.32 KB)
Origin: files produced by the author(s)

Dates and versions

halshs-00844541, version 1 (22-07-2013)

Identifiers

  • HAL Id: halshs-00844541, version 1

Cite

Adrien Barbaresi. Two comparable corpora of German newspaper text gathered on the web: Bild & Die Zeit: Technical report. 2013. ⟨halshs-00844541⟩
263 views
264 downloads
Last updated on 6 April 2024