Improving Automatic Categorization of Technical vs. Laymen Medical Words using FastText Word Embeddings

Detecting words that are difficult to understand is a crucial task for ensuring that medical texts such as diagnoses and drug instructions are properly understood. In this paper, we study the use of recently developed word embeddings, which carry contextual information about words, together with other linguistic and non-linguistic features, to improve the detection of difficult medical words. We propose new cross-validation scenarios to test, from different perspectives, how well the detection of medical word difficulty generalizes, and we provide an experimental study of previously used feature extraction methods together with the recently proposed FastText embeddings. We found that, for known words and unknown users, FastText embeddings improve the detection of word understandability, reaching an F-score of 85.9 (an improvement of up to 2.9 F-score points).
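The "known words, unknown users" setting mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the exact protocol, so everything here is an assumption: the helper names `user_folds` and `f_score`, the toy annotations, and the majority-vote stand-in that takes the place of a real classifier trained on FastText embeddings and other features. The idea shown is only that each annotator's judgments appear in exactly one test fold, while every test word has already been seen in training.

```python
from collections import defaultdict


def user_folds(annotations, n_folds):
    """Split (user, word, label) rows so each user lands in exactly one test fold."""
    users = sorted({user for user, _, _ in annotations})
    folds = []
    for k in range(n_folds):
        test_users = set(users[k::n_folds])  # every n_folds-th user, starting at k
        train = [row for row in annotations if row[0] not in test_users]
        test = [row for row in annotations if row[0] in test_users]
        folds.append((train, test))
    return folds


def f_score(gold, predicted, positive="difficult"):
    """F1 for the positive class, on a 0-1 scale (the abstract reports 0-100)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy annotations: hypothetical users and labels, not data from the paper.
annotations = [
    ("u1", "aspirin", "easy"), ("u1", "appendectomy", "difficult"),
    ("u2", "aspirin", "easy"), ("u2", "appendectomy", "difficult"),
    ("u3", "aspirin", "easy"), ("u3", "appendectomy", "easy"),
]

for train, test in user_folds(annotations, n_folds=3):
    # Stand-in "classifier": majority label per word among the training users.
    # A real system would train on embedding and linguistic features instead.
    votes = defaultdict(list)
    for _, word, label in train:
        votes[word].append(label)
    # sorted() makes ties deterministic (alphabetically first label wins).
    majority = {w: max(sorted(set(ls)), key=ls.count) for w, ls in votes.items()}
    gold = [label for _, _, label in test]
    pred = [majority[word] for _, word, _ in test]
    print(sorted({u for u, _, _ in test}), round(f_score(gold, pred), 2))
```

Grouping folds by user rather than by row keeps one annotator's judgments from appearing on both sides of a split, which is what makes the test users "unknown".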

Keywords

text simplification, difficulty detection, word embeddings

Domains

Information and communication sciences; Artificial intelligence [cs.AI]

Full metadata list

Deposit format	File
Deposit type	Conference paper
Title	en Improving Automatic Categorization of Technical vs. Laymen Medical Words using FastText Word Embeddings
Author(s)	Hanna Pylieva, Artem Chernodub, Natalia Grabar ¹, Thierry Hamon ²,³
¹ STL - Savoirs, Textes, Langage (STL) - UMR 8163 (11909) - Domaine Universitaire du Pont de Bois - Batiment B4 rue du Barreau - BP 60149 - 59653 VILLENEUVE D'ASCQ CEDEX - France; Université de Lille (374570); Centre National de la Recherche Scientifique UMR8163 (441569)
² LIMSI - Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (247329) - Université Paris-Sud Bât. 507 - Rue du Belvédère - 91405 ORSAY CEDEX - France; Université Paris-Sud - Paris 11 (92966); Sorbonne Université - UFR d'Ingénierie (408837); Sorbonne Université (413221); Université Paris-Saclay (419361); Centre National de la Recherche Scientifique UPR3251 (441569); Université Paris Saclay (COmUE) (562112)
³ UP13 - Université Paris 13 (15786) - France
Document language	English
Popular science	No
Peer-reviewed	Yes
Invited	No
Audience	International
Proceedings	No
Conference title	1st International Workshop on Informatics & Data-Driven Medicine (IDDM 2018)
Conference start date	2018-11-28
City	Lviv
Country	Ukraine
Domain(s)	Humanities and Social Sciences/Information and communication sciences; Computer Science [cs]/Artificial Intelligence [cs.AI]
ANR project(s)	Communication, Literacy, Education, Accessibility, Readability CLEAR - ANR-17-CE19-0016 AAPG2017 - 2017
Keywords	en text simplification, difficulty detection, word embeddings

Main file

pylieva-IDDM2018.pdf (383.01 KB)

Origin: Files produced by the author(s)

Contributor: Natalia Grabar

https://shs.hal.science/halshs-01968357

Submitted on: Wednesday, 2 January 2019 at 15:47:38

Last modified on: Wednesday, 28 February 2024 at 14:37:14

Long-term archiving on: Wednesday, 3 April 2019 at 16:09:31

Dates and versions

halshs-01968357, version 1 (02-01-2019)

Identifiers

HAL Id: halshs-01968357, version 1

Cite

Hanna Pylieva, Artem Chernodub, Natalia Grabar, Thierry Hamon. Improving Automatic Categorization of Technical vs. Laymen Medical Words using FastText Word Embeddings. 1st International Workshop on Informatics & Data-Driven Medicine (IDDM 2018), Nov 2018, Lviv, Ukraine. ⟨halshs-01968357⟩


Collections

UNIV-PARIS13 CNRS LIMSI STL CAMPUS-AAR AAI USPC UNIV-PARIS-SACLAY UNIV-LILLE SORBONNE-UNIVERSITE SORBONNE-PARIS-NORD ANR LISN GS-ENGINEERING GS-COMPUTER-SCIENCE GS-SPORT-HUMAN-MOVEMENT ACT-R

834 views

344 downloads

Last updated on 20/04/2024