Conference papers

ENGLAWI: From Human- to Machine-Readable Wiktionary

Abstract : This paper introduces ENGLAWI, a large, versatile, XML-encoded machine-readable dictionary extracted from Wiktionary. ENGLAWI contains 752,769 articles encoding the full body of information included in Wiktionary: simple words, compounds and multiword expressions, lemmas and inflectional paradigms, etymologies, phonemic transcriptions in IPA, definition glosses and usage examples, translations, semantic and morphological relations, spelling variants, etc. It is fully documented, released under a free license and supplied with G-PeTo, a series of scripts allowing easy information extraction from ENGLAWI. Additional resources extracted from ENGLAWI, such as an inflectional lexicon, a lexicon of diatopic variants and the inclusion dates of headwords in Wiktionary's nomenclature, are also provided. The paper describes the content of the resource and illustrates how it can be used, and how it has been used in previous studies. Finally, we introduce ongoing work that computes lexicographic word embeddings from ENGLAWI's definitions.
Document type :
Conference papers

Cited literature : 49 references

https://halshs.archives-ouvertes.fr/halshs-02928574
Contributor : Franck Sajous
Submitted on : Tuesday, September 8, 2020 - 7:38:30 PM
Last modification on : Tuesday, November 10, 2020 - 12:13:55 PM
Long-term archiving on : Wednesday, December 2, 2020 - 4:59:18 PM

File

SajousEtAl2020_LREC_ENGLAWI.pd...
Publisher files allowed on an open archive

Identifiers

  • HAL Id : halshs-02928574, version 1

Citation

Franck Sajous, Basilio Calderone, Nabil Hathout. ENGLAWI: From Human- to Machine-Readable Wiktionary. 12th Conference on Language Resources and Evaluation (LREC 2020), May 2020, Marseille, France. pp.3016-3026. ⟨halshs-02928574⟩

Metrics

Record views : 59
File downloads : 58