Crawling microblogging services to gather language-classified URLs. Workflow and case study
Abstract
This paper presents a way to extract links from messages published on microblogging platforms and to classify them according to the language and possible relevance of their target, in order to build a text corpus. Three platforms are considered: FriendFeed, identi.ca and Reddit, as they offer a relative diversity of user profiles and, more importantly, of user languages. To explore them, I introduce a traversal algorithm based on user pages. As I target lesser-known languages, I focus on non-English posts by filtering out English text. Using mature open-source software from the NLP research field, namely a spell checker (aspell) and a language identification system (langid.py), my case study and benchmarks give an insight into the linguistic structure of the considered services.
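The filtering idea summarized above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the 0.5 threshold, and the tiny `ENGLISH_WORDS` set (a hypothetical stand-in for an aspell dictionary lookup) are all assumptions made for illustration.

```python
import re

# Simple URL pattern; an assumption, real messages may need a stricter one.
URL_RE = re.compile(r"https?://[^\s<>]+")

# Hypothetical stand-in for an aspell English dictionary lookup.
ENGLISH_WORDS = {
    "the", "a", "is", "this", "and", "of", "to", "in",
    "my", "new", "read", "post", "about",
}


def extract_links(message):
    """Return the URLs found in a message, without trailing punctuation."""
    return [url.rstrip(".,;:!?)") for url in URL_RE.findall(message)]


def english_ratio(message, known_words=ENGLISH_WORDS):
    """Share of alphabetic tokens (URLs excluded) found in the word list."""
    text = URL_RE.sub(" ", message)
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in known_words for token in tokens) / len(tokens)


def candidate_links(messages, threshold=0.5):
    """Collect links from messages that do not look predominantly English."""
    links = []
    for message in messages:
        if english_ratio(message) < threshold:
            links.extend(extract_links(message))
    return links


if __name__ == "__main__":
    sample = [
        "Read my new post about this: http://example.org/en",
        "Lue uusi kirjoitukseni: http://example.org/fi.",
    ]
    print(candidate_links(sample))
```

In a fuller setup, the word-list check would presumably be delegated to aspell, and the surviving text passed to a language identifier such as langid.py (for instance through its `langid.classify` function) before the links are stored.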
Origin: Files produced by the author(s)