Crawling microblogging services to gather language-classified URLs. Workflow and case study

Adrien Barbaresi

Conference paper. Year: 2013

Role: Author
PersonId: 1134
IdHAL: adrien-barbaresi
ORCID: 0000-0002-8079-8694

Interactions, Corpus, Apprentissages, Représentations

Abstract

This paper presents a way to extract links from messages published on microblogging platforms and to classify them according to the language and possible relevance of their target, in order to build a text corpus. Three platforms are taken into consideration: FriendFeed, identi.ca and Reddit, as they provide a relative diversity of user profiles and, more importantly, user languages. In order to explore them, I introduce a traversal algorithm based on user pages. As I target lesser-known languages, I try to focus on non-English posts by filtering out English text. Using mature open-source software from the NLP research field, namely a spell checker (aspell) and a language identification system (langid.py), my case study and benchmarks give an insight into the linguistic structure of the considered services.
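The workflow summarized in the abstract (traverse user pages, drop English posts, keep the remaining links) can be illustrated with a minimal sketch. It is not the author's implementation, and everything in it is an assumption made for illustration: the function names, the breadth-first order of the traversal, the 0.5 threshold, and above all the tiny `ENGLISH_WORDS` set, which merely stands in for a real English dictionary such as the one aspell provides. Language identification of the link targets, which the paper delegates to langid.py, is left out so that the example runs with the standard library only.

```python
import re
from collections import deque

# Hypothetical stand-in for an English spell-checker dictionary (e.g. aspell).
ENGLISH_WORDS = frozenset(
    "the a an and or of to in is are was it this that for on with "
    "you i my we not be have has at by from as new just check out".split()
)

URL_RE = re.compile(r"https?://\S+")
TOKEN_RE = re.compile(r"[^\W\d_]+")  # runs of letters, Unicode-aware


def extract_urls(message):
    """Return the URLs found in a message, without trailing punctuation."""
    return [url.rstrip(".,;:!?)\"'") for url in URL_RE.findall(message)]


def english_ratio(message):
    """Share of word tokens (URLs excluded) found in the English word list."""
    tokens = [t.lower() for t in TOKEN_RE.findall(URL_RE.sub(" ", message))]
    if not tokens:
        return 0.0
    return sum(t in ENGLISH_WORDS for t in tokens) / len(tokens)


def collect_non_english_urls(messages, threshold=0.5):
    """Keep URLs from messages whose English ratio is below the threshold."""
    seen, kept = set(), []
    for message in messages:
        if english_ratio(message) >= threshold:
            continue  # message looks English: filtered out
        for url in extract_urls(message):
            if url not in seen:
                seen.add(url)
                kept.append(url)
    return kept


def traverse_users(seed, get_contacts, max_users=100):
    """Breadth-first walk over user pages (assumed order), starting at seed.

    get_contacts is a hypothetical callback returning the users linked
    from a given user page.
    """
    visited, seen, queue = [], {seed}, deque([seed])
    while queue and len(visited) < max_users:
        user = queue.popleft()
        visited.append(user)
        for contact in get_contacts(user):
            if contact not in seen:
                seen.add(contact)
                queue.append(contact)
    return visited


if __name__ == "__main__":
    posts = [
        "Check out this new post http://example.com/en",
        "Lisez cet article sur le corpus : http://example.org/fr.",
        "Neuer Beitrag https://example.net/de und http://example.org/fr",
    ]
    print(collect_non_english_urls(posts))
```

In a fuller pipeline, each retained URL would then be fetched and its target text passed to a language identifier before being added to the corpus; the word-list ratio above only decides which messages are worth following.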

Keywords

microtext; web corpus; social networks; web crawling; language identification

Domains

Linguistics; Computation and Language [cs.CL]; Web

Main file

Barbaresi_ACL-SRW_13_final_v2.pdf (97.93 KB)

Origin: Files produced by the author(s)

https://shs.hal.science/halshs-00840861

Submitted on: Tuesday, 5 August 2014, 15:28:04

Last modified on: Friday, 12 May 2023, 04:09:45

Long-term archiving on: Wednesday, 26 November 2014, 00:32:41

Dates and versions

halshs-00840861, version 1 (05-07-2013)

halshs-00840861, version 2 (05-08-2014)

Identifiers

HAL Id: halshs-00840861, version 2

Cite

Adrien Barbaresi. Crawling microblogging services to gather language-classified URLs. Workflow and case study. Annual Meeting of the Association for Computational Linguistics, Aug 2013, Sofia, Bulgaria. pp.9-15. ⟨halshs-00840861v2⟩

Collections

ENS-LYON CNRS UNIV-LYON2 ICAR UDL

423 views

389 downloads
