This tutorial shows how to process text files in "batch" mode through an Ellogon script written in Tcl. We will create an application that processes all text files found in a directory and writes the processing results as XML files in the same directory.

The package ELEP::Macros::ComponentRunnerOnText

Ellogon's library includes a handy package that applies a set of components to a text document. The package can be loaded with the following command:

package require ELEP::Macros::ComponentRunnerOnText

The constructor of the class expects a list of component names. These components will be used to process text documents, in the order in which they were given to the constructor. For example, an object of this class which runs the Greek tokeniser, sentence splitter, and part-of-speech tagger can be created as follows:

set app [ELEP::Macros::ComponentRunnerOnText new HTokenizer HBrill]

How to process some text

This package provides a set of methods for processing text documents:

  • process "some text": This is the main method for processing text. It accepts some text, and runs all annotator components on this piece of text.
  • process_article "title" "some text" ?optional identifier? ?optional document source?: This method accepts a title and an article body, concatenates the two into a single piece of text, and calls process on the result.
  • process_file "filename": This method opens a file from disk, reads its entire content (assuming it contains text encoded as UTF-8 without a BOM), and processes it by calling process.
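
As a rough illustration of the file-reading step that process_file is described as performing, the following plain Tcl sketch reads a whole file as UTF-8 text. The procedure name read_utf8_file is hypothetical and is not part of Ellogon; the actual implementation inside the package may differ. It relies on try/finally, so it assumes Tcl 8.6.

```tcl
# Hypothetical helper (not part of Ellogon): read a whole file as UTF-8 text.
proc read_utf8_file {filename} {
    set fd [open $filename r]
    try {
        # Decode the bytes as UTF-8; a leading BOM, if present, is not stripped
        fconfigure $fd -encoding utf-8
        return [read $fd]
    } finally {
        # Close the channel even if reading fails
        close $fd
    }
}
```

Going by the descriptions above, passing the result of this helper to process should behave much like calling process_file on the same file name.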

The script

A simple Ellogon script that uses the ComponentRunnerOnText package to process all files in a directory is as follows:

# Load the ELEP::Macros::ComponentRunnerOnText package
package require ELEP::Macros::ComponentRunnerOnText

# Load a serialiser of annotations into XML
package require ELEP::WebApplication::ExportAnnotationsXML

# Define a list of components, that we want to apply on texts...
set components {
    HTokenizer
    HBrill
}

# Create a component runner object, and place it in variable "runner"
set runner [ELEP::Macros::ComponentRunnerOnText new {*}$components]
$runner initialiseComponents

# Process all text files from directory C:/tmp/texts
foreach file [glob -type f C:/tmp/texts/*.txt] {
    # Read the file content, and process it with our runner
    $runner process_file $file
    # Serialise part-of-speech tags into XML
    set xml [::ELEP::WebApplication::ExportAnnotationsXML::export_annotations \
        [$runner document] token]
    # Save the XML file
    set fd [open [file rootname $file].xml w]
    fconfigure $fd -encoding utf-8
    puts $fd "<doc>\n$xml\n</doc>"
    close $fd
}

# We have finished processing, destroy our runner object...
$runner destroy

# Exit Ellogon
exit
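
The output step of the loop above can also be factored into two small procedures, which makes the file naming explicit and ensures the output channel is closed even if writing fails. This is only a sketch in plain Tcl 8.6: the names xml_name_for and save_doc_xml are hypothetical helpers introduced here, not Ellogon commands.

```tcl
# Hypothetical helper: derive the output name the script uses for an input file,
# e.g. C:/tmp/texts/test.txt gives C:/tmp/texts/test.xml
proc xml_name_for {file} {
    return [file rootname $file].xml
}

# Hypothetical helper: write an XML fragment wrapped in <doc>...</doc>, as UTF-8
proc save_doc_xml {path xml} {
    set fd [open $path w]
    try {
        fconfigure $fd -encoding utf-8
        puts $fd "<doc>\n$xml\n</doc>"
    } finally {
        # Close the channel even if writing fails
        close $fd
    }
}
```

Inside the loop, the four lines that open, configure, write and close the file could then be replaced by a single call such as save_doc_xml [xml_name_for $file] $xml.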

How to run the script

Assuming that Tcl/Tk 8.6 is installed in C:\TclApps\Tcl, the script is saved in C:\tmp\test.tcl, and Ellogon is installed in C:\TclApps\Ellogon, the script can be executed as follows:

C:\TclApps\Tcl\bin\tclsh.exe C:\TclApps\Ellogon\ellogon C:\tmp\test.tcl

Sample test

Assume that a single file exists in C:\tmp\texts, named C:\tmp\texts\test.txt, with the following content:

Αυτό είναι ένα τέστ.

After the script is executed, a file named C:\tmp\texts\test.xml will be created, with the following content:

<doc>
<attributes>
</attributes>
<annotations>
<token type="GFW" pos="IP">Αυτό</token>
<token type="GLW" pos="VB">είναι</token>
<token type="GLW" pos="IDT">ένα</token>
<token type="GLW" pos="FW">τέστ</token>
<token type="PUNC" pos=".">.</token>
</annotations>
</doc>
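
To check the generated files quickly, the token elements can be read back with a regular expression. The sketch below is plain Tcl and assumes the exact layout shown in the sample above (a type attribute followed by a pos attribute, with no nested markup inside a token); it is not a general XML parser, and a real XML library such as tDOM would be the safer choice for anything beyond a quick inspection. The procedure name extract_tokens is hypothetical.

```tcl
# Hypothetical helper: return a list of {type pos text} triples, one per <token>.
# Assumes the attribute order type, pos as in the sample output above.
proc extract_tokens {xml} {
    set result {}
    set pattern {<token type="([^"]*)" pos="([^"]*)">([^<]*)</token>}
    # With -all -inline, each match yields the full match plus three groups
    foreach {whole type pos text} [regexp -all -inline $pattern $xml] {
        lappend result [list $type $pos $text]
    }
    return $result
}
```

Given the content of a generated file as a string, each element of the returned list holds the token type, the part-of-speech tag, and the token text, in that order.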