Coreference annotation with SACR, a new drag-and-drop based tool: Annotation of co-reference with SACR, a new “drag-and-drop” tool

Abstract: Statistical methods in automatic language processing lead to an increased need for manually annotated corpora. But carefully annotated resources are costly. This is especially the case for corpora annotated with coreference chains (sets of all the linguistic expressions that refer to the same referent). It is thus necessary to look for the annotation strategy that requires the least effort from the annotator. Furthermore, since annotation of large corpora is often done by students, interns or non-technical users, the tool must be ready to use and the interface must be intuitive, without requiring a long training time. SACR is a new coreference chain annotation tool that has been developed with this idea in mind. Its user interface has been specifically designed to facilitate and speed up the annotation process. Coreference chain annotation requires at least two stages: delimiting and marking referring expressions (linguistic expressions that refer to an entity in the extralinguistic world), and linking coreferential expressions to build the chains. The first stage is done in SACR simply by clicking on the first and last tokens (either words or characters, depending on the needs or on the language) of the expression. For the second stage, most existing tools (e.g. Glozz [1]) require the annotator to define a set in advance for each chain. A better strategy is to record the referent name for each referring expression; chains are then computed automatically, with expressions that share a referent name placed in the same chain. This is the method used in the "Democrat" project, with TXM [2] and Analec [3]. SACR implements the second approach but lets the user create coreference relations in the spirit of the first, so that the annotator is not required to type the referent name for each expression: a drag-and-drop operation is sufficient to copy the referent name to another expression.
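The second strategy can be illustrated with a minimal Python sketch. It is not SACR's actual code (SACR is written in JavaScript): the `Mention` class, the `copy_referent` and `build_chains` functions, and the sample referent names are all hypothetical, chosen only to show how chains can be derived from referent names and how a drag-and-drop amounts to copying a name.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Mention:
    """A referring expression, delimited by its first and last token indices."""
    start: int
    end: int
    referent: str  # referent name recorded for this expression (hypothetical field)


def copy_referent(source: Mention, target: Mention) -> None:
    """Mimic a drag-and-drop: the target takes over the source's referent name."""
    target.referent = source.referent


def build_chains(mentions):
    """Group mentions that share a referent name into one chain per name."""
    chains = defaultdict(list)
    for mention in mentions:
        chains[mention.referent].append(mention)
    return dict(chains)


# Three marked expressions; the second one starts with a placeholder name.
mentions = [
    Mention(0, 1, "mary"),
    Mention(5, 5, "unnamed-1"),
    Mention(8, 9, "john"),
]

# "Dragging" the first expression onto the second links them without typing.
copy_referent(mentions[0], mentions[1])
chains = build_chains(mentions)
```

In this sketch no chain has to be declared in advance: a chain exists as soon as two expressions carry the same name, and renaming one expression is enough to move it to another chain.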
Shortcuts allow features to be annotated for each expression: the user presses a key (e.g. "d" for a noun with a definite article) to set the feature, and the program then moves automatically to the next expression. A single keystroke is thus enough to annotate a feature. SACR has no module for automatic annotation, but its simple text data format makes it easy to exchange data with other tools such as chunkers or taggers, so that parts of speech, for example, can be added automatically outside SACR and then checked in SACR. Visualization is an important part of SACR: marked expressions are surrounded by colored frames (one color per chain), which makes it possible to view several levels of nested expressions. The user can search through a list of all the referents and expressions already annotated. This is helpful for linking expressions that are coreferential but distant. Helper scripts have been written to convert to and from other common formats like Glozz or CONLL2011. This is necessary since SACR is dedicated to annotation: the user is expected to use other tools to analyze the data. Written in HTML, CSS and JS, SACR is implemented as a simple web page. It is usable online and downloadable. It is open source and distributed under the terms of the MPL-2.0.

[1] Widlöcher, A., and Mathet, Y. (2012). The Glozz Platform. In Proceedings of the 2012 ACM symposium.
[2] Heiden, S. (2010). The TXM Platform. In 24th Pacific Asia Conference on Language, Information and Computation.
[3] Landragin, F., Poibeau, T., and Victorri, B. (2012). Analec. In Proceedings of LREC'12.
