Conference paper Year: 2013

Challenges in web corpus construction for low-resource languages in a post-BootCaT world

Abstract

State-of-the-art tools in the "web as corpus" framework rely heavily on URLs obtained from search engines. Recently, this querying process has become very slow or impossible to perform on a low budget. In order to find reliable data sources for Indonesian, I perform a case study of different kinds of URL sources and crawling strategies. First, I classify URLs extracted from the Open Directory Project and Wikipedia for Indonesian, Malay, Danish, and Swedish in order to enable comparisons. Then I perform web crawls focusing on Indonesian, using these sources as start URLs. My scouting approach, which uses open-source software, results in a URL database with metadata that can replace or at least complement the BootCaT approach.
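
The workflow summarized in the abstract (classify candidate URLs by language, keep metadata, select start URLs for a crawl) can be illustrated with a minimal sketch. It does not reproduce the paper's tooling: the stopword lists, the `UrlRecord` fields, the `classify` and `select_seeds` helpers, and the 0.2 threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

# Tiny illustrative function-word lists (hypothetical, far too small for real use).
STOPWORDS = {
    "id": {"yang", "dan", "di", "dengan", "untuk", "tidak", "ini", "itu"},
    "da": {"og", "det", "at", "en", "er", "ikke", "til", "af"},
}


@dataclass
class UrlRecord:
    """One entry of a toy URL database with metadata."""
    url: str
    host: str
    lang: str     # best-guess language code, or "unknown"
    score: float  # share of tokens found in the winning stopword list


def classify(url: str, text: str, min_score: float = 0.2) -> UrlRecord:
    """Guess the language of a text sample by stopword coverage (assumed heuristic)."""
    tokens = text.lower().split()
    best_lang, best_score = "unknown", 0.0
    for lang, words in STOPWORDS.items():
        score = sum(t in words for t in tokens) / len(tokens) if tokens else 0.0
        if score > best_score:
            best_lang, best_score = lang, score
    if best_score < min_score:
        best_lang = "unknown"
    return UrlRecord(url, urlsplit(url).netloc, best_lang, best_score)


def select_seeds(records, lang="id"):
    """Keep only URLs classified as the target language, to be used as crawl seeds."""
    return [r.url for r in records if r.lang == lang]
```

A call such as `select_seeds([classify(u, t) for u, t in samples])`, where `samples` is a hypothetical list of `(url, text)` pairs, would return the URLs whose sample text looks Indonesian under this heuristic.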
Main file
Barbaresi_LTC13_Challenges-LRL_paper_v2.pdf (77.8 KB)
Barbaresi_LTC13_Challenges-LRL_slides.pdf (263.86 KB)
Origin: Files produced by the author(s)
Format: Other

Dates and versions

halshs-00919410, version 1 (16-12-2013)
halshs-00919410, version 2 (05-08-2014)

Identifiers

  • HAL Id: halshs-00919410, version 2

Cite

Adrien Barbaresi. Challenges in web corpus construction for low-resource languages in a post-BootCaT world. 6th Language & Technology Conference, Less Resourced Languages special track, Dec 2013, Poznan, Poland. pp.69-73. ⟨halshs-00919410v2⟩
377 views
1073 downloads
Last updated on 20/04/2024
