Conference paper, Year: 2009

TEI in CLAPI Project, Corpus of Spoken Language in Interaction

Abstract

CLAPI Project

CLAPI is an online multimedia workbench dedicated to corpora of spoken language in interaction. It comprises a databank and a set of search tools, available at http://clapi.univ-lyon2.fr. The databank holds audio and video recordings of “naturally occurring” social interactions collected in various situations (ordinary private conversations and professional, institutional, commercial, medical, and school interactions, among others). The recordings have been collected since the early 1980s, with transcripts oriented towards interactional analysis. Today the database comprises 43 corpora, 300 records, 500 transcripts, 50 hours aligned in XML format, and 10 hours freely downloadable. They are fully described by 75 metadata items comprising fixed fields, free fields, additional files that give more information in free format, scans of documents, and images.
Corpora belong to the ICAR research lab but also to several other research labs and academic partners who are interested in archiving their data, in making them available to the research community and in using search tools for their analysis.
Search tools include typical functions (a concordance program, search for co-occurring words, automatic identification of repetitions, lexicon, frequencies, and summaries of the overall statistical features of a transcript) as well as specific functions that handle interactional phenomena such as overlap, pause, size of the turn, or position within the turn. The databank offers simple search tools as well as complex multi-criteria query tools combining metadata, lexical forms, and interactional phenomena. Hits and results are shown with the corpus metadata, the transcript extract, and the corresponding audio or video signal (by streaming).
The project team
ICOR working group: Michel Bert, Sylvie Bruxelles, Carole Etienne, Emilie Jouin, Lorenza Mondada, Christian Plantin, Sandra Teston-Bonnard, Véronique Traverso, Daniel Valero
TEI presentation in Lyon on April 1st: Carole Etienne
Scientific and editorial aims
The CLAPI workbench has several objectives:
* a heritage goal: to archive corpora,
* a dissemination goal: to make corpora available online and to exchange data with other databanks (TALKBANK, VALIBEL, IDS DATABANK, etc.),
* a main scientific goal: to make specialized tools available that help researchers carry out interactional and linguistic analysis on large amounts of data.
Project dates: the workbench has been online since 2005 and is regularly updated.
Web links
* CLAPI Workbench: http://clapi.univ-lyon2.fr
* Website dedicated to interactional corpus linguistics: http://icar.univ-lyon2.fr/projets/corinte/
TEI project
We have developed our own XML DTD in CLAPI, but we have delivered our metadata and the contents of the transcripts in TEI since 2006 (project P5). In December 2006, we had the opportunity to meet Lou Burnard, who gave us valuable information on how to manage some specific fields. Since then, we have collected new data, managed and exchanged them in dedicated virtual workspaces, and we plan to generalize the use of TEI in our ANR projects (e.g. CIEL, SPIM, etc.). We are also interested in using TEI for transcripts that include code switching and newly annotated phenomena such as gestures and multimodality.
Needs in TEI extension
We have identified two kinds of needs that specifically concern interactional data:
* metadata: We are interested in new elements in the teiHeader to describe the interactional situation, to reduce the number of links to our database descriptors, and to avoid generic tags like ‘extent’ or ‘p’. We are also interested in new elements for defining the participants (e.g. mother's/father's language, places where the speaker has lived, other linguistic biographical information, and the relationship between participants).
* transcription content: in order to preserve the transcriber's work, we decided to keep both the original transcript format and the generic transcribed phenomena. The first is useful for displaying the transcripts, and the second for identifying the various forms of a token so that the search tools can recognize them. As we manage corpora collected by different research teams, we handle several transcription conventions, so we need to keep the exact sign or set of signs used by each team. Currently, we add the ‘rend’ property to pause, vocal, or shift tags and use the ‘orig’ tag with a ‘reg’ property for the variable transcribed forms of a given token. In the same way, even though we use the ‘anchor’ tag to align overlaps, we need to keep the original overlap sign, which can differ according to the conventions used. For example:
  * (.) and (..) can appear in the same transcript; both mean a short pause, but not of the same length. In TEI we add the ‘rend’ property to the pause tag to keep each of these signs.
  * Another example concerns variations of the lengthening sign that express its duration. For example, euh: and euh::: are different forms of euh with different vowel lengthening: euh: differs from euh::
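The encoding described above can be illustrated with a minimal Python sketch. It is not the CLAPI implementation: the markup shown on the original page did not survive, so the element layout here is an assumption reconstructed from the description (a ‘rend’ property on pause tags, an ‘orig’ tag with a ‘reg’ property). The function name `encode_turn`, the speaker label `SPK1`, the `PAUSE_SIGNS` table, and the whitespace tokenization are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical set of pause signs found in the source transcripts.
# Both denote a short pause; the sign itself is kept in 'rend'.
PAUSE_SIGNS = {"(.)", "(..)"}

# A word followed by one or more colons marks vowel lengthening (euh:, euh:::).
LENGTHENED = re.compile(r"([^\W\d_]+):+")


def _append_text(parent, last, text):
    """Attach plain text after the last child, or at the start of the parent."""
    if last is None:
        parent.text = (parent.text or "") + text
    else:
        last.tail = (last.tail or "") + text


def encode_turn(speaker, text):
    """Encode one whitespace-tokenized turn as a TEI-like <u> element.

    Assumed layout: <pause rend="..."/> keeps the original pause sign, and
    <orig reg="...">...</orig> keeps the transcribed form next to a
    regularized form that search tools could match.
    """
    u = ET.Element("u", {"who": speaker})
    last = None  # most recently added child element
    for index, token in enumerate(text.split()):
        if index:
            _append_text(u, last, " ")
        lengthened = LENGTHENED.fullmatch(token)
        if token in PAUSE_SIGNS:
            last = ET.SubElement(u, "pause", {"rend": token})
        elif lengthened:
            last = ET.SubElement(u, "orig", {"reg": lengthened.group(1)})
            last.text = token
        else:
            _append_text(u, last, token)
    return u


if __name__ == "__main__":
    turn = encode_turn("SPK1", "euh: oui (.) euh::: (..) non")
    print(ET.tostring(turn, encoding="unicode"))
```

In this sketch, a search tool could match every lengthened variant through the shared `reg` value while a display tool reproduces the transcriber's original signs from the element text and the `rend` values.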
File not deposited

Dates and versions

halshs-00376951, version 1 (20-04-2009)

Identifiers

  • HAL Id: halshs-00376951, version 1

Cite

Carole Etienne. TEI in CLAPI Project, Corpus of Spoken Language in Interaction. TEI in CLAPI Workbench, Apr 2009, Lyon, France. ⟨halshs-00376951⟩