Velingua - terminology extraction

Terminology extraction in German and English

The terminology extraction takes existing terminology into account, provided it is available in UniTerm Pro, UniTerm Enterprise or TBX format.
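
As an illustration of what taking existing terminology into account could look like, the following Python sketch reads the terms of one language from a small TBX sample and removes them from a candidate list. This is not how Velingua works internally; the sample data, the function names known_terms and drop_known, and the idea of simply filtering out known terms are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written TBX sample (assumption: classic termEntry/langSet/tig layout).
TBX_SAMPLE = """<martif type="TBX" xml:lang="en">
  <text><body>
    <termEntry id="e1">
      <langSet xml:lang="en"><tig><term>torque wrench</term></tig></langSet>
      <langSet xml:lang="de"><tig><term>Drehmomentschlüssel</term></tig></langSet>
    </termEntry>
  </body></text>
</martif>"""

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix so namespaced TBX variants also match."""
    return tag.rsplit("}", 1)[-1]


def known_terms(tbx_xml: str, lang: str) -> set[str]:
    """Collect lower-cased terms from all langSet elements of one language."""
    root = ET.fromstring(tbx_xml)
    terms = set()
    for elem in root.iter():
        if local_name(elem.tag) != "langSet" or elem.get(XML_LANG) != lang:
            continue
        for child in elem.iter():
            if local_name(child.tag) == "term" and child.text:
                terms.add(child.text.strip().lower())
    return terms


def drop_known(candidates: list[str], known: set[str]) -> list[str]:
    """Keep only candidates that are not already in the termbase (hypothetical policy)."""
    return [c for c in candidates if c.lower() not in known]


if __name__ == "__main__":
    known = known_terms(TBX_SAMPLE, "en")
    print(drop_known(["Torque wrench", "flange bolt"], known))
```

A real termbase would also carry status information, which a filter like this ignores.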

The following formats are supported as text corpora:

Text (ANSI or Unicode)
XLIFF
PDF
XML
HTML

The suitable terms are output as a list in a CSV file. KWIC (keyword in context) information can optionally be generated as well.
Velingua terminology extraction can be integrated into other environments as a command line tool. The list of suitable terms can then be corrected and imported into any terminology database that supports CSV import.
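
To make the KWIC option more concrete, here is a minimal Python sketch that collects keyword-in-context rows for one term and serialises them as CSV. The column names, the semicolon delimiter and the context width are assumptions for illustration; the actual layout of the Velingua results file is not specified here.

```python
import csv
import io
import re


def kwic(text: str, term: str, width: int = 30) -> list[tuple[str, str, str]]:
    """Return (left context, match, right context) for every occurrence of term."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    rows = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        rows.append((left, match.group(0), right))
    return rows


def to_csv(rows: list[tuple[str, str, str]]) -> str:
    """Serialise KWIC rows as CSV text (hypothetical column names)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=";")
    writer.writerow(["left_context", "term", "right_context"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = "Tighten the flange bolt. Then check the flange bolt again."
    print(to_csv(kwic(sample, "flange bolt", width=10)))
```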

The basic methods for extraction are:

High relative frequency of lemmas (with lemmatisation)
Morpheme analysis (evaluation of frequent morphemes)
N-gram analysis (relatively frequent character sequences – purely statistical process)
Multi-word recognition using word class patterns (e.g. adjective/noun sequences in German)
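
Three of the methods above can be sketched in a few lines of Python. The sketch assumes that tokens are already lemmatised and tagged with word classes, and that relative frequency is measured against a general-language reference corpus; neither assumption is confirmed for Velingua, and all function names are hypothetical.

```python
from collections import Counter


def relative_frequency_ratio(domain_lemmas, reference_lemmas, min_count=2):
    """Score lemmas by how much more frequent they are in the domain corpus."""
    domain, reference = Counter(domain_lemmas), Counter(reference_lemmas)
    n_domain, n_reference = len(domain_lemmas), len(reference_lemmas)
    scores = {}
    for lemma, count in domain.items():
        if count < min_count:
            continue  # below the minimum occurrence threshold
        domain_rel = count / n_domain
        # Add-one smoothing so lemmas unseen in the reference do not divide by zero.
        reference_rel = (reference[lemma] + 1) / (n_reference + 1)
        scores[lemma] = domain_rel / reference_rel
    return scores


def char_ngrams(word, n=3):
    """Character n-grams of a word, the raw material for a purely statistical analysis."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def match_patterns(tagged_tokens, patterns):
    """Find token sequences whose word classes match one of the given patterns."""
    hits = []
    for pattern in patterns:
        size = len(pattern)
        for i in range(len(tagged_tokens) - size + 1):
            window = tagged_tokens[i:i + size]
            if tuple(tag for _, tag in window) == tuple(pattern):
                hits.append(" ".join(token for token, _ in window))
    return hits


if __name__ == "__main__":
    tagged = [("hydraulische", "ADJ"), ("Pumpe", "NOUN"), ("läuft", "VERB")]
    print(match_patterns(tagged, [("ADJ", "NOUN")]))
    print(char_ngrams("pumpe"))
```

Morpheme analysis is left out because it needs a morphological lexicon or segmenter that a short sketch cannot reasonably include.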

The terminology extraction is highly configurable. The following settings are available:

Threshold values for the minimum number of occurrences
Control of the result set depending on the size of the text corpus
Percentage share of each of the four basic methods
Word classes of the suitable terms
Word class patterns for multi-word recognition
Columns in the results file
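
The settings listed above could be modelled roughly as follows. This Python sketch is only an illustration: every field name, default value and validation rule is an assumption and does not reflect the actual Velingua configuration format.

```python
from dataclasses import dataclass, field


def _default_shares() -> dict[str, int]:
    # Hypothetical default weighting of the four basic methods, in percent.
    return {"lemma_frequency": 40, "morpheme": 20, "ngram": 20, "multi_word": 20}


@dataclass
class ExtractionConfig:
    """Hypothetical container for the configurable extraction settings."""
    min_occurrences: int = 3                      # threshold for minimum occurrences
    results_per_1000_tokens: float = 5.0          # ties result set size to corpus size
    method_shares: dict[str, int] = field(default_factory=_default_shares)
    word_classes: tuple[str, ...] = ("NOUN",)     # word classes allowed as terms
    multi_word_patterns: tuple[tuple[str, ...], ...] = (("ADJ", "NOUN"),)
    csv_columns: tuple[str, ...] = ("term", "frequency", "method")

    def validate(self) -> None:
        """Reject settings that cannot be interpreted consistently."""
        if self.min_occurrences < 1:
            raise ValueError("min_occurrences must be at least 1")
        if sum(self.method_shares.values()) != 100:
            raise ValueError("method shares must add up to 100 percent")

    def result_limit(self, corpus_tokens: int) -> int:
        """Maximum number of terms to report for a corpus of the given size."""
        return max(1, round(corpus_tokens / 1000 * self.results_per_1000_tokens))


if __name__ == "__main__":
    config = ExtractionConfig()
    config.validate()
    print(config.result_limit(20000))
```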

The terminology extraction can be configured conveniently in the Velingua Organizer. There, the suitable terms can also be transferred to the terminology database interactively, and each one can be classified as a preferred term, allowed term, disallowed term or stop word in the process.


Acolada GmbH
