Corpus Annotation Project

Department of Linguistics
Georgetown University
Washington, DC 20057
Phone: (202) 687-5949


The GME Corpus Annotation Project sought to find all instances of modal language in a predetermined corpus of English-language data, and subsequently to annotate these modals for a variety of linguistic variables.

  • The corpus chosen was the MPQA opinion corpus, a collection of English language news articles.
  • The annotation program used was MMAX2.
  • The adjudication program used was developed in-house by Dan Simonson.

Modal Criteria

There were two basic criteria that annotators used to determine if a token counted as a modal.

  1. A modal target was a predicate that had (i) a propositional argument, and (ii) a semantic relation to alternative ways the world could be.
  2. If a propositional attitude verb (or a noun derived from it) had a specified attitude holder in the text, it did not count as a target. If the attitude holder was implicit, the verb (or noun) counted as a target.
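
Read together, the two criteria amount to a small decision procedure. The Python sketch below is illustrative only: the `Candidate` fields and the `is_modal_target` function are hypothetical names, not part of the project's tooling, and the judgments they encode were made by human annotators. It also assumes that criterion 1 must hold for attitude predicates as well, which the criteria above do not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Annotator judgments about one candidate token (hypothetical fields)."""
    has_propositional_argument: bool  # criterion 1(i)
    relates_to_alternatives: bool     # criterion 1(ii)
    is_attitude_predicate: bool       # propositional attitude verb or derived noun
    holder_specified_in_text: bool    # only meaningful for attitude predicates


def is_modal_target(c: Candidate) -> bool:
    """Return True if the candidate counts as a modal target."""
    # Criterion 1: both conditions must hold.
    if not (c.has_propositional_argument and c.relates_to_alternatives):
        return False
    # Criterion 2: an attitude predicate with an explicit holder is excluded;
    # one with an implicit holder still counts.
    if c.is_attitude_predicate and c.holder_specified_in_text:
        return False
    return True
```

For instance, `is_modal_target(Candidate(True, True, True, False))` returns `True`, because the attitude holder is implicit.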

Modal Attributes

The modal tokens were annotated for the following attributes:

  • lemma: the modal's lemma form (e.g. the lemma for easiest would be easy)
  • modality type (i.e. modal flavor) and subtype (if relevant)
  • prejacent: the textual span (a sentence or clause) that indicates the proposition to which the modal applies
  • predication type: whether the modal is simple (positive), comparative/equative, or superlative
  • degree indicator: a textual span that indicates the degree of the prejacent on the modal scale
  • background: a sequence of one or more constituents that provide textual description of the circumstances and/or priorities that the modal claim is based on; important contextual information that doesn’t fall into one of the other categories
  • modified element: a noun or adjective that is modified by a target modal and is part of the modal's prejacent
  • environmental polarity: whether or not the modal is in the semantic scope of negation
  • source: the entity whose ability or knowledge is the basis for the modal claim
  • outscoping quantifier: a quantifier that falls in the prejacent of a modal but is interpreted as taking scope over the modal
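
One way to picture a single annotated token is as a record with one field per attribute. The sketch below is an assumption for illustration, not the MMAX2 schema the project used: the class name, field names, and the choice of character offsets for spans are all hypothetical, and the modality label in the example is a placeholder, since the project's label inventory is not listed here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Assumed span encoding: (start, end) character offsets, end exclusive.
Span = Tuple[int, int]

# The three predication types named in the attribute list.
PREDICATION_TYPES = {"simple", "comparative/equative", "superlative"}


@dataclass
class ModalAnnotation:
    """Hypothetical record holding the attributes of one modal token."""
    lemma: str
    modality_type: str
    prejacent: Span
    predication_type: str = "simple"
    subtype: Optional[str] = None
    degree_indicator: Optional[Span] = None
    background: List[Span] = field(default_factory=list)
    modified_element: Optional[Span] = None
    in_scope_of_negation: bool = False  # environmental polarity
    source: Optional[Span] = None
    outscoping_quantifier: Optional[Span] = None

    def __post_init__(self) -> None:
        # Reject predication types outside the three listed values.
        if self.predication_type not in PREDICATION_TYPES:
            raise ValueError(f"unknown predication type: {self.predication_type!r}")


def span_text(text: str, span: Span) -> str:
    """Return the substring of `text` covered by `span`."""
    return text[span[0]:span[1]]


# Invented example sentence; it does not come from the MPQA corpus.
text = "It is not possible that he left."
start = text.index("he left")
example = ModalAnnotation(
    lemma="possible",
    modality_type="epistemic",  # placeholder label
    prejacent=(start, start + len("he left")),
    in_scope_of_negation=True,
)
```

Optional attributes default to `None` or an empty list here because not every modal token has, say, a degree indicator or an outscoping quantifier.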

Initial Results
(As of May 2016)

Modal tokens were found in 96% (515/534) of texts in the corpus. The total number of modal tokens annotated was 7982, spread across 1290 different lemmas.
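
The reported counts can be turned into a few summary ratios. The short Python sketch below recomputes the coverage figure and derives two averages that are not reported above; they are simple ratios of the stated counts, and the variable names are chosen here for illustration.

```python
# Counts reported above (as of May 2016).
texts_with_modals = 515
total_texts = 534
modal_tokens = 7982
lemmas = 1290

# Share of corpus texts containing at least one modal token.
coverage = texts_with_modals / total_texts

# Average number of annotated tokens per distinct lemma.
tokens_per_lemma = modal_tokens / lemmas

# Average number of annotated tokens per text that contains any modal.
tokens_per_modal_text = modal_tokens / texts_with_modals

print(f"coverage: {coverage:.1%}")
print(f"tokens per lemma: {tokens_per_lemma:.2f}")
print(f"tokens per text with modals: {tokens_per_modal_text:.1f}")
```

This works out to roughly 96.4% coverage, about 6.2 tokens per lemma, and about 15.5 tokens per text that contains a modal.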

Next Steps

The corpus is currently being transitioned to a multi-layer format with more advanced search capabilities. When that transition is complete, the corpus will be made available to other researchers through Georgetown's corpus linguistics servers. Additionally, users will be able to report errors through GitHub, and the GME team will push out periodic updates.