HAL-SHS - Sciences de l'Homme et de la Société
Conference paper. Year: 2013

Crawling microblogging services to gather language-classified URLs. Workflow and case study

Abstract

This paper presents a way to extract links from messages published on microblogging platforms and to classify them according to the language and possible relevance of their target, in order to build a text corpus. Three platforms are taken into consideration: FriendFeed, identi.ca and Reddit, as they provide a relative diversity of user profiles and, more importantly, of user languages. In order to explore them, I introduce a traversal algorithm based on user pages. As I target lesser-known languages, I try to focus on non-English posts by filtering out English text. Using mature open-source software from the NLP research field, namely a spell checker (aspell) and a language identification system (langid.py), my case study and benchmarks give an insight into the linguistic structure of the considered services.
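
The link extraction and English filtering mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract names aspell and langid.py, whereas this sketch substitutes a tiny hard-coded word list so that it runs with the Python standard library alone. The names `ENGLISH_WORDS`, `extract_urls`, `english_ratio` and `non_english_urls`, as well as the 0.5 threshold, are assumptions made for illustration.

```python
import re

# Assumption: a tiny word list stands in for a real spell checker such as
# aspell. A realistic setup would query an actual English dictionary.
ENGLISH_WORDS = {
    "the", "and", "is", "of", "to", "in", "a", "this",
    "that", "for", "on", "with", "you", "it",
}

URL_PATTERN = re.compile(r"https?://\S+")
TOKEN_PATTERN = re.compile(r"[^\W\d_]+")  # runs of Unicode letters


def extract_urls(message):
    """Return all http(s) URLs in a message, without trailing punctuation."""
    return [url.rstrip(".,;:!?)") for url in URL_PATTERN.findall(message)]


def english_ratio(message):
    """Share of word tokens (URLs removed) that appear in the word list."""
    text = URL_PATTERN.sub(" ", message)
    tokens = [token.lower() for token in TOKEN_PATTERN.findall(text)]
    if not tokens:
        # No words to judge: treated as not English in this sketch.
        return 0.0
    return sum(token in ENGLISH_WORDS for token in tokens) / len(tokens)


def non_english_urls(messages, threshold=0.5):
    """Collect URLs from messages whose English ratio is below the threshold."""
    urls = []
    for message in messages:
        if english_ratio(message) < threshold:
            urls.extend(extract_urls(message))
    return urls
```

Under these assumptions, a message made up mostly of listed English words is skipped, while the URLs of other messages are kept as candidates for a later language identification step.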
Main file
Barbaresi_ACL-SRW_13_final_v2.pdf (97.93 KB)
Origin: Files produced by the author(s)

Dates and versions

halshs-00840861, version 1 (05-07-2013)
halshs-00840861, version 2 (05-08-2014)

Identifiers

  • HAL Id: halshs-00840861, version 2

Cite

Adrien Barbaresi. Crawling microblogging services to gather language-classified URLs. Workflow and case study. Annual Meeting of the Association for Computational Linguistics, Aug 2013, Sofia, Bulgaria. pp.9-15. ⟨halshs-00840861v2⟩