The GME Corpus
Annotation Project sought to find all instances of modal
language in a pre-determined corpus of English language data, and
subsequently annotate these modals for a variety of linguistic variables.
The corpus chosen was the MPQA Opinion Corpus,
a collection of English language news articles.
The annotation program used was MMAX2.
The adjudication program used was developed in-house by Dan Simonson.
There were two basic
criteria that annotators used to determine whether a token counted as a
modal target:
- A modal target was
a predicate that had (i) a propositional argument, and (ii) a semantic
relation to alternative ways the world could be.
- If a propositional attitude verb (or a noun derived from it) had a specified
attitude holder in the text, it did not count as a target. If the
attitude holder was implicit, the verb (or noun) counted as a target.
The modal tokens were
annotated for the following attributes:
- lemma: the modal's lemma form (e.g.
the lemma for "easiest" would be "easy")
- modality type (i.e. modal flavor)
and subtype (if relevant)
- prejacent: the textual span (a
sentence or clause) that indicates the proposition to which the modal
applies
- predication type: whether the modal
is simple (positive), comparative/equative, or superlative
- degree indicator: a textual span
that indicates the degree of the prejacent on the modal scale
- background: a sequence of one or
more constituents that provide a textual description of the circumstances
and/or priorities that the modal claim is based on; important
contextual information that doesn't fall into one of the other
categories
- modified element: a noun or
adjective modified by a target modal (and is part of the modal's
- environmental polarity: whether or
not the modal is in the semantic scope of negation
- source: the entity whose ability
or knowledge is the basis for the modal claim
- outscoping quantifier: a quantifier
that falls within the prejacent of a modal but is interpreted as
taking scope over the modal
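
The attribute list above can be pictured as a single record per modal token. The following is only a minimal sketch, not the project's actual data format: the class and field names, the use of character-offset spans, and the example sentence and its modality label are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple

# Hypothetical representation: a span is a (start, end) pair of character offsets.
Span = Tuple[int, int]


class PredicationType(Enum):
    SIMPLE = "simple"
    COMPARATIVE_EQUATIVE = "comparative/equative"
    SUPERLATIVE = "superlative"


@dataclass
class ModalAnnotation:
    """One annotated modal token (illustrative field names, not the GME schema)."""
    lemma: str
    modality_type: str                      # modal flavor; label set is assumed
    predication_type: PredicationType
    modality_subtype: Optional[str] = None  # only if relevant
    prejacent: Optional[Span] = None
    degree_indicator: Optional[Span] = None
    background: List[Span] = field(default_factory=list)  # one or more constituents
    modified_element: Optional[Span] = None
    in_scope_of_negation: bool = False      # environmental polarity
    source: Optional[Span] = None
    outscoping_quantifier: Optional[Span] = None


def span_of(text: str, substring: str) -> Span:
    """Return the offsets of the first occurrence of substring in text."""
    start = text.index(substring)
    return (start, start + len(substring))


# Invented example sentence; the modality label is a placeholder.
text = "That is the easiest way to win."
annotation = ModalAnnotation(
    lemma="easy",
    modality_type="circumstantial",
    predication_type=PredicationType.SUPERLATIVE,
    prejacent=span_of(text, "to win"),
    modified_element=span_of(text, "way"),
)
```

Storing spans as offsets rather than strings keeps each attribute tied to a position in the source text, which matters when the same word occurs more than once in a sentence.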
(As of May
Modal tokens were found
in 96% (515/534) of texts in the corpus. The total number of modal
tokens annotated was 7982, spread across 1290 different lemmas.
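
As a quick check on these figures, the short sketch below recomputes the coverage percentage and derives two averages from the reported counts. The averages are not reported by the project; they simply follow from the numbers above, and the variable names are illustrative.

```python
# Counts as reported above.
texts_with_modals = 515
total_texts = 534
modal_tokens = 7982
distinct_lemmas = 1290

# Share of texts containing at least one modal token.
coverage = texts_with_modals / total_texts

# Derived averages (not reported in the source, computed from the counts).
tokens_per_lemma = modal_tokens / distinct_lemmas
tokens_per_modal_text = modal_tokens / texts_with_modals

print(f"coverage: {coverage:.1%}")
print(f"tokens per lemma: {tokens_per_lemma:.2f}")
print(f"tokens per text with modals: {tokens_per_modal_text:.1f}")
```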
The corpus is currently
being transitioned to a multi-layer format with more advanced
search capabilities. When that transition
is complete, the corpus will be made available for other researchers
through Georgetown's corpus linguistics servers. Additionally,
users will be able to report errors through GitHub, and the GME team
will push out periodic updates.