Ellogon shares its data model with the TIPSTER architecture. As a result, it has some basic features in common with other TIPSTER-based infrastructures, such as GATE. However, it also offers a large number of features that differentiate it from such infrastructures.

The central element for storing data in Ellogon is the Collection. A collection is a finite set of Documents. An Ellogon document consists of textual data as well as linguistic information about the textual data. This linguistic information is stored in the form of attributes and annotations.

An attribute associates a specific type of information with a typed value. An annotation associates arbitrary information (in the form of attributes) with portions of textual data. Each such portion, called a span, consists of two character offsets denoting the start and the end of the portion, measured from the first character of the textual data. Annotations typically consist of four elements:

  • A numeric identifier. This identifier is unique for every annotation within a document and can be used to unambiguously identify the annotation.
  • A type. Annotation types are textual values that are used to classify annotations into categories.
  • A set of spans that denote the range of the annotated textual data.
  • A set of attributes. These attributes usually encode the necessary linguistic information.
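
The data model described above can be illustrated with a minimal sketch in Python. This is not Ellogon's actual API: the class names (Span, Attribute, Annotation, Document, Collection), the annotate and covered_text helpers, and the decision to treat the end offset as exclusive are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Span:
    """Two character offsets measured from the first character of the text."""
    start: int
    end: int  # assumption: exclusive end offset, as in Python slicing


@dataclass
class Attribute:
    """Associates a named type of information with a typed value."""
    name: str
    value_type: str  # hypothetical type tag, e.g. "string"
    value: object


@dataclass
class Annotation:
    """The four elements listed above: identifier, type, spans, attributes."""
    id: int
    type: str
    spans: List[Span]
    attributes: Dict[str, Attribute] = field(default_factory=dict)


@dataclass
class Document:
    """Textual data plus linguistic information (attributes and annotations)."""
    text: str
    attributes: Dict[str, Attribute] = field(default_factory=dict)
    annotations: Dict[int, Annotation] = field(default_factory=dict)
    _next_id: int = field(default=1, repr=False)

    def annotate(self, type: str, spans: Iterable[Span],
                 attributes: Iterable[Attribute] = ()) -> Annotation:
        """Create an annotation whose identifier is unique within this document."""
        spans = list(spans)
        for span in spans:
            if not 0 <= span.start <= span.end <= len(self.text):
                raise ValueError(f"span {span} lies outside the document text")
        annotation = Annotation(
            id=self._next_id,
            type=type,
            spans=spans,
            attributes={a.name: a for a in attributes},
        )
        self.annotations[annotation.id] = annotation
        self._next_id += 1
        return annotation

    def covered_text(self, annotation: Annotation) -> List[str]:
        """Return the portion of text covered by each span of the annotation."""
        return [self.text[s.start:s.end] for s in annotation.spans]


@dataclass
class Collection:
    """A finite set of documents (kept as a list here for simplicity)."""
    documents: List[Document] = field(default_factory=list)


# Usage: annotate the first word of a document with a part-of-speech attribute.
doc = Document(text="Ellogon stores annotations.")
token = doc.annotate("token", [Span(0, 7)], [Attribute("pos", "string", "NNP")])
collection = Collection(documents=[doc])
print(doc.covered_text(token))  # prints ['Ellogon']
```

Keeping the annotations in a dictionary keyed by identifier reflects the requirement that an identifier unambiguously selects one annotation within a document; a real implementation may store and index annotations quite differently.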