Journal articles

Mesures et savoirs : Quelles méthodes pour l’histoire culturelle à l’heure du big data ? [Measures and knowledge: What methods for cultural history in the age of big data?]

Abstract: The quantitative analysis of cultural history began with the appearance of massive open-source datasets such as Google Books and has come to be known as "cultural economics". It is now open to researchers and literary critics, giving them access to cultural facts and their evolution through textual traces in digitized data. These massive corpora cannot be analyzed blindly: they may lack substantial metadata or, in the worst cases, be very noisy. For corpora of this scale, that is, billions of words, common visualization tools such as Voyant Tools or TXM, and the methods this software uses to analyze data, cannot be relied upon. As part of a project on literary history conducted between the Labex OBVIL and the Stanford Literary Lab, which aims to define literature as a word, a concept, and a semantic field, and to draw an empirical history of literature, we analyzed 1,618 French books, a corpus of 140 million words, spanning the end of the Ancien Régime to the Second World War. To do so, we used several experimental text-mining techniques, combining distant and close reading. In this article, we explore different kinds of text mining, such as closed (frequency-based) measures, unsupervised machine analysis (topic modeling), and semi-open methods (collocations), pointing out the benefits and drawbacks of each. We then demonstrate the need for deeper and more precise text mining that draws on substantial metadata, such as lemmatized data, syntactic structure, and semantic analysis (such as word vectors). We conclude by showing that a substantial study of large literary corpora cannot separate distant and close reading, as each tends to confirm or contradict the other in a most effective way, producing evolving representations of the history of literature.
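The "semi-open" collocation method named in the abstract can be sketched as a pointwise mutual information (PMI) scorer over adjacent word pairs. This is a minimal illustration, not the authors' actual pipeline: the toy corpus, the bigram window, and the `min_count` cutoff are all hypothetical choices.

```python
import math
from collections import Counter

def collocations(tokens, min_count=2):
    """Score adjacent word pairs with pointwise mutual information.

    PMI(w1, w2) = log2( p(w1, w2) / (p(w1) * p(w2)) ): pairs that
    co-occur more often than chance alone predicts score higher.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:  # drop rare pairs: PMI is unstable on them
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Hypothetical mini-corpus; a real run would use the full 140M-word corpus.
corpus = ("la litterature moderne et la litterature classique "
          "definissent la litterature comme champ").split()
ranked = sorted(collocations(corpus).items(), key=lambda kv: -kv[1])
print(ranked[0][0])  # ('la', 'litterature') co-occurs above chance
```

On real corpora this frequency-based scoring is typically lemmatized and filtered by part of speech first, which is exactly the "substantial metadata" step the abstract argues for.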

Contributor: Alexandre Gefen
Submitted on: Monday, January 13, 2020 - 12:07:26 PM
Last modification on: Thursday, March 17, 2022 - 10:08:46 AM
Long-term archiving on: Tuesday, April 14, 2020 - 12:44:38 PM


Semiotica_GEFEN_REBOUL (1).pdf
Files produced by the author(s)




Marianne Reboul, Alexandre Gefen. Mesures et savoirs : Quelles méthodes pour l’histoire culturelle à l’heure du big data ?. Semiotica, De Gruyter, 2019, 2019 (230), pp.97-120. ⟨10.1515/sem-2018-0103⟩. ⟨halshs-02430078⟩


