Challenges in the linguistic exploitation of specialized republishable web corpora

Adrien Barbaresi

Abstract

I would like to present work on German text corpora gathered on the Web and processed in order to be made available to linguists and a broader user community via a web interface. The corpora are specialized in the sense that each addresses only one particular text genre or source. Web crawling techniques are used to download the documents, which are then stored roughly the way web archives store them. More precisely, I would like to discuss two cases where texts are expected to be republishable: a "standard" case, political speeches, and a "borderline" case, German blogs under CC license. The work is performed in the context of a digital dictionary of German. The primary user base consists of lexicographers, who need valuable or at least exploitable evidence in the form of precise quotes or definition elements. The actual gathering and processing of the corpora is described elsewhere (anonymized references).

In this talk I would like to focus on a series of challenges that have to be solved in order to make data from web archives accessible to researchers and to study web text corpora: metadata extraction, quality assurance, licensing, and "scientificity".

1. Proper metadata extraction is needed to make downstream applications possible. It has to be performed meticulously, since experience shows that even small or rare mistakes, in date encoding for instance, may cause researchers in the humanities to disregard or discard the application: linguistic trends cannot be identified properly if the content is not ordered in time. The easily available metadata of the speeches contrasts with the varied content types, encodings, and markup patterns of the blogs. Compromises have to be made without sacrificing recall, since republishable texts are rather rare.

2. Regarding the content, quality assurance is paramount: users expect high quality, all the more so since they may be reluctant to use web texts for their studies. In fact, providing "Hi-Fi" web corpora also means promoting the cause of web sources and the modernization of research methodology.

3. The results are hosted in Germany, and thus German copyright law applies, which can be considered more restrictive than that of other countries. Additionally, there are a number of issues with licensing in general and CC licenses in particular, even with manual verification: the CC ND and (to a lesser extent) NC predicates can hinder proper republication. There are also potential copyright issues regarding blog comments.

To sum up the issues described above, much work goes into ensuring the "scientificity" of web texts and making the texts not only available but also citable in a scholarly sense.
