Ellogon is a multi-lingual, cross-platform, general-purpose language engineering environment, developed to aid both researchers working in computational linguistics and companies that produce and deliver language engineering systems. As a language engineering platform, Ellogon offers an extensive set of facilities, including tools for processing and visualising textual/HTML/XML data and associated linguistic information, support for lexical resources (such as creating and embedding lexicons), and tools for creating annotated corpora, accessing databases, comparing annotated data, and transforming linguistic information into vectors for use with various machine learning algorithms.
During the last decade, a large number of software infrastructures aimed at facilitating R&D in the field of natural language processing have been presented. Some of these infrastructures, such as the LT-NSL/LT-XML tools or GATE, have become extremely popular, as they have been applied to a wide range of tasks by many institutions around the world.
Ellogon belongs to the category of referential, or annotation-based, platforms, where linguistic information is stored separately from the textual data, with references back to the original text. Based on the TIPSTER data model, Ellogon provides infrastructure for:
- Managing, storing and exchanging textual data as well as the associated linguistic information.
- Creating, embedding and managing linguistic processing components.
- Facilitating communication among different linguistic components by defining a suitable programming interface (API).
- Visualising textual data and associated linguistic information.
Ellogon can be used either as an NLP integrated development environment (IDE) or as a library that can be embedded in other applications. To achieve this, Ellogon proposes and implements a modular architecture with four independent subsystems:
- A highly efficient core developed in C, which implements an extended version of the TIPSTER data model. Its main responsibility is to manage the storage of the textual data and the associated linguistic information and to provide a well-defined programming interface (API) that can be used in order to retrieve/modify the stored information.
- An object oriented C++ API which increases the usability of the C core API. This object oriented API is exposed in a wide range of programming languages, including C++, Java, Tcl, Perl and Python.
- An extensive and easy to use graphical user interface (GUI). This interface can be easily tailored to the needs of the end user.
- A modular, pluggable component system. All linguistic processing within the platform is performed by external components loaded at runtime. These components can be implemented in a wide range of programming languages, including C, C++, Java, Tcl, Perl and Python.
Ellogon shares the same data model as the TIPSTER architecture. Due to this, it shares some basic features with other TIPSTER-based infrastructures, such as GATE. However, it also offers a large number of features that differentiate it from such infrastructures.
The central element for storing data in Ellogon is the Collection. A collection is a finite set of Documents. An Ellogon document consists of textual data as well as linguistic information about the textual data. This linguistic information is stored in the form of attributes and annotations.
An attribute associates a specific type of information with a typed value. An annotation associates arbitrary information (in the form of attributes) with portions of textual data. Each such portion, called a span, consists of two character offsets denoting the start and end characters of the portion, measured from the first character of the textual data. Annotations typically consist of four elements:
- A numeric identifier. This identifier is unique for every annotation within a document and can be used to unambiguously identify the annotation.
- A type. Annotation types are textual values that are used to classify annotations into categories.
- A set of spans that denote the range of the annotated textual data.
- A set of attributes. These attributes usually encode the necessary linguistic information.
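The data model described above can be sketched as a minimal, self-contained data structure. The class and method names below are illustrative only, not Ellogon's actual API:

```python
from dataclasses import dataclass, field

# A span is a pair of character offsets into the document text.
@dataclass(frozen=True)
class Span:
    start: int
    end: int

# An annotation: numeric identifier, type, spans and attributes.
@dataclass
class Annotation:
    id: int
    type: str
    spans: list
    attributes: dict = field(default_factory=dict)

@dataclass
class Document:
    text: str
    attributes: dict = field(default_factory=dict)
    annotations: list = field(default_factory=list)

    def annotate(self, type, spans, **attributes):
        # Identifiers are unique within a document.
        ann = Annotation(len(self.annotations) + 1, type, spans, attributes)
        self.annotations.append(ann)
        return ann

    def spanned_text(self, ann):
        # Recover the annotated portion(s) from the character offsets.
        return [self.text[s.start:s.end] for s in ann.spans]

# A collection would simply be a finite set of such documents.
doc = Document("Ellogon is a language engineering platform.")
tok = doc.annotate("token", [Span(0, 7)], pos="NNP")
print(doc.spanned_text(tok))  # ['Ellogon']
```

Note that the annotations are stand-off: the text itself is never modified, and the spans merely point back into it.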
The motivation behind the development of Ellogon (which started in 1998) was the inadequacy of existing platforms to support, at that time, some essential properties, such as the ability to
- support a wide range of languages through Unicode,
- function under all major operating systems,
- have hardware requirements as low as possible, in terms of processing speed and memory usage,
- be based on an embeddable, decomposable architecture that enables parts of it to be embedded in other systems, and
- provide an extensible, easy to use and powerful user interface.
Ellogon in its present form satisfies all of these requirements. As Ellogon is based on the TIPSTER architecture, it shares many basic properties with other TIPSTER-based infrastructures like GATE. However, Ellogon offers several important features that differentiate it from similar infrastructures:
- Easy Component Development
It is fairly easy to understand the process of developing new components and to develop them using the facilities provided by Ellogon. Additionally, a wide range of programming languages is supported for component development, including C, C++, Java, Tcl, Perl and Python.
- Integrated Development Environment
Ellogon operates as an integrated development environment, as it provides complete support for the development cycle of a component. Components can be created, edited, compiled and linked (where applicable) from inside Ellogon. Furthermore, C/C++/Java components can be unloaded, modified, compiled and reloaded without having to quit Ellogon. The ability to unload and reload components is essential, as it can significantly shorten the development cycle: component modifications can be evaluated immediately.
- A ready to use component "toolbox"
Ellogon is equipped with a large number of ready-to-use tools for performing tasks like annotated corpora creation, vector generation or data comparison. Additionally, several sample components are provided that can be adapted to various domains and languages, which perform some basic tasks like tokenization, part-of-speech tagging or gazetteer list lookup. Finally, Ellogon offers several data visualisation tools, ranging from simple viewers for the annotation database to viewers able to display hierarchical information, like syntax trees.
- Easy deployment
As Ellogon implements a decomposable architecture, it is extremely easy to build an end-user product from a set of components that perform a specific task. All the components, along with the needed Ellogon parts, can be packaged either as a single executable (which needs no installation) or as an application (which can be run unmodified under multiple operating systems). These specialised applications can be distributed and used on any system, even if Ellogon has not been installed on it.
- Facilitating Computational Linguists
Ellogon tries to facilitate many of the tasks computational linguists usually perform within the platform, especially annotated corpora creation, linguistic processing component adaptation and various evaluation tasks. Providing a wide range of highly customisable and easy-to-use annotation tools, Ellogon is an ideal environment for annotated corpora construction. The available annotators support flat markup (e.g. part-of-speech tagging or named entity annotation) as well as annotation of hierarchical information (e.g. syntactic relation annotation), on plain-text as well as HTML corpora. (Two annotation tools are shown here and here.)
Adapting linguistic processing components to a new domain is another frequent task. Usually it involves modifications to domain-specific resources used internally by the processing components. Ellogon facilitates the adaptation process, as the modified component can be applied immediately and the user can very easily identify the effect of his/her modifications through the comparison facilities offered by the platform. Ellogon provides significant infrastructure for comparing the linguistic information associated with textual data. The Collection Comparison tool (figure 1, figure 2) can be used for comparing the linguistic information stored in a set (or collection) of documents. Various constraints on the information to be compared can be specified through the graphical user interface of the comparison tool, and the comparison results are presented using standard measures, such as recall, precision and F-measure. Additionally, the comparison tool can present a comparison log. This log is a graphical representation of the differences found during the comparison process and can provide valuable help to the user in locating and possibly correcting errors.
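As an illustration of how such comparison results are computed, the sketch below scores a hypothetical automatic annotation against a reference, counting an annotation as correct only when its type and span match exactly. This is a simplification of what the Collection Comparison tool offers, not its actual implementation:

```python
def compare(reference, response):
    """Compute recall, precision and F-measure over two sets of
    (type, start, end) annotation tuples."""
    ref, res = set(reference), set(response)
    correct = len(ref & res)
    recall = correct / len(ref) if ref else 0.0
    precision = correct / len(res) if res else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return recall, precision, f

# Hypothetical reference (gold) and automatic annotations.
gold = {("person", 0, 5), ("org", 10, 17), ("loc", 20, 26)}
auto = {("person", 0, 5), ("org", 10, 17), ("org", 30, 35)}
recall, precision, f = compare(gold, auto)
print(round(recall, 3), round(precision, 3), round(f, 3))
# 0.667 0.667 0.667
```

A comparison log, in these terms, would simply enumerate the members of `gold - auto` (missed annotations) and `auto - gold` (spurious ones).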
- Facilitating Language Engineers
One of the most frequent tasks performed by language engineers inside Ellogon is of course the development of processing components. Significant infrastructure is provided to facilitate component development, from the very first step of writing the component to ensuring that the component works as expected. Operating as an integrated development environment (IDE), Ellogon allows the creation of components in a wide range of programming languages (C, C++, Tcl, Java, Perl, Python): all the code needed for the component structure is automatically generated during the initial construction of a component, while a component can be compiled, linked, loaded and tested from inside Ellogon. For some languages (all supported ones except Java) a component can even be unloaded, modified, compiled and reloaded, in order to quickly test the effect of the desired modifications.
Developing components for Ellogon is a fairly easy process, as a high-level API is provided both as a set of functions and, where the programming language allows it, as an object-oriented hierarchy of classes. Additionally, Ellogon is distributed with a small set of components whose source code can serve as an example of how to perform some commonly needed tasks.
The fact that almost everything in Ellogon is defined in terms of components offers a large degree of flexibility to component developers. Combined with its modular architecture, this allows Ellogon to be tailored to meet specific needs. For example, particular Ellogon parts can be wrapped together with specific processing components to form a stand-alone application that performs a specific processing task (possibly with a purpose-built graphical interface). Such an application will even run without requiring an installation of Ellogon.
- Facilitating end users
End users of Ellogon can be roughly divided into two categories: users of applications or services based on Ellogon, and users who use Ellogon as a "black box" in order to process corpora and collect the results.
Regarding the first category of end users, Ellogon provides many facilities for creating stand-alone applications with customised graphical interfaces that are extremely easy to use. Such an application is shown in this figure, where all the complexity of creating collections, applying the required processing components and exporting the processing results is hidden behind a simple graphical interface. In addition to creating specialised applications, Ellogon can be accessed through services such as ActiveX, DDE, HTTP or SOAP, which allow other applications to use Ellogon facilities in a way transparent to the end user.
The second category of end users comprises users who want to perform some sort of linguistic processing by simply applying the components available through Ellogon to a corpus. For this category of users, Ellogon is a toolbox of "black boxes": for example, users may want to apply a named-entity recognition system operating within Ellogon, or use more primitive components like a syntactic analyser. Ellogon tries to facilitate this category of users by providing an easy-to-use graphical interface that can be used to create collections from a wide variety of sources and easily apply any available processing component to them. Processing results can be examined through the large set of available viewers, or exported to widely used formats such as SGML or XML. Finally, Ellogon offers the ability to automate tasks through the definition of "macro" commands, which can be especially useful for tasks that must be repeated multiple times.
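As an illustration of what exporting stand-off annotations to an inline format involves, the sketch below renders non-overlapping, single-span annotations as XML elements wrapped around the annotated text. It is a simplification for illustration only, not Ellogon's actual exporter:

```python
from xml.sax.saxutils import escape

def to_inline_xml(text, annotations):
    """Render non-overlapping (type, start, end) annotations as
    inline XML elements around the annotated portions of text."""
    out, pos = [], 0
    for type_, start, end in sorted(annotations, key=lambda a: a[1]):
        out.append(escape(text[pos:start]))          # unannotated text
        out.append(f"<{type_}>{escape(text[start:end])}</{type_}>")
        pos = end
    out.append(escape(text[pos:]))                   # trailing text
    return "".join(out)

text = "Ellogon is a language engineering platform."
anns = [("term", 34, 42), ("product", 0, 7)]
print(to_inline_xml(text, anns))
# <product>Ellogon</product> is a language engineering <term>platform</term>.
```

Overlapping or multi-span annotations cannot be expressed this way, which is one reason the platform stores annotations in stand-off form internally.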
For most users of Ellogon, the central point of interest is the linguistic processing that can be carried out within it. Ellogon provides a generic framework where external components can be easily embedded. As Ellogon follows a modular paradigm, it utilises components of various types, with each type specialising in a specific processing task. A taxonomy of the currently defined component types is shown in the following figure:
The most important component type from the user’s point of view is of course the linguistic processing component, as natural language processors usually belong to this component type. These components (along with components of the machine-learning processing type) can be organised into Systems for performing some specific task. The tasks can range from basic linguistic tasks, such as part-of-speech tagging or parsing, to application level tasks, such as information extraction or machine translation.
A linguistic processing component consists mainly of two parts. The first part performs the desired linguistic processing, while the main responsibility of the second part is to interface the linguistic processing sub-component with Ellogon, through the provided API. Components can appear either as wrappers or as native components. Wrappers provide the code needed to interface an existing, independent implementation of a linguistic processing tool with the Ellogon platform. Native components, on the other hand, are processing tools specifically designed for use within the Ellogon platform; in such components the two parts usually cannot be easily identified or separated.
Each component is associated with metadata, which include, among other information, a set of pre-conditions and a set of post-conditions. Pre-conditions declare the linguistic information that must be present in a document before this specific component can be applied to it. Post-conditions describe the linguistic information that will be added to the document as a result of processing the document with this specific component. Ellogon uses these two sets to establish relations among the various components and to "undo" the results of applying a component to a corpus.
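As an illustration, the sketch below shows how such pre- and post-condition sets can be used to decide when a component is applicable and to order a set of components into a working pipeline. The component names and condition labels are hypothetical, and this is a sketch of the general idea rather than Ellogon's actual scheduling logic:

```python
def runnable(component, available):
    """A component may be applied once all of its pre-conditions
    are already present in the document."""
    return component["pre"] <= available

def schedule(components):
    """Order components so that each one's pre-conditions are
    satisfied by the post-conditions of those applied before it."""
    available, order, pending = set(), [], list(components)
    while pending:
        ready = [c for c in pending if runnable(c, available)]
        if not ready:
            raise ValueError("unsatisfiable pre-conditions")
        for c in ready:
            order.append(c["name"])
            available |= c["post"]   # post-conditions become available
            pending.remove(c)
    return order

pipeline = [
    {"name": "parser",    "pre": {"token", "pos"}, "post": {"syntax"}},
    {"name": "tokenizer", "pre": set(),            "post": {"token"}},
    {"name": "tagger",    "pre": {"token"},        "post": {"pos"}},
]
print(schedule(pipeline))  # ['tokenizer', 'tagger', 'parser']
```

"Undoing" a component application, in these terms, amounts to removing from the document exactly the information listed in that component's post-conditions.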
Each component can also specify a set of parameters, as well as a set of viewers (components of type "visualisation component"). Parameters represent various runtime-dependent options (such as the location of a file containing the grammar of a syntactic parser). They can be edited by the end user through the graphical interface and are given to the component every time it is executed. A component can also specify a set of predefined viewers, in order to present graphically the linguistic information produced during the component execution. Examples of available viewers can be seen here and here.
Creating components is easily done through the Ellogon GUI. Currently, Ellogon components can be developed in five languages: C/C++, Tcl, Java, Perl and Python. The Ellogon GUI offers a specialised dialog where the user can specify various parameters of the component he/she intends to create, including its pre/post-conditions. Ellogon then creates a skeleton for the new component that handles all the interaction with the Ellogon platform. If the language of the component is C/C++ or Java, suitable Makefiles for compiling the component under Unix and Windows are also created. Besides creating a skeleton, Ellogon facilitates the development of the component by allowing the developer to edit the source code and reload the component into Ellogon from its GUI.