Velingua – terminology extraction

Terminology extraction in German and English

Terminology extraction takes existing terminology into account, provided it is available in UniTerm Pro, UniTerm Enterprise or TBX format.

The following formats are supported as text corpora:

  • Text (ANSI or Unicode)
  • PDF
  • XML
  • HTML

The suitable terms are output as a list in a CSV file. KWIC (keyword in context) information can optionally be generated as well.

Velingua terminology extraction can be integrated into other environments as a command-line tool. The list of suitable terms can then be corrected and imported into any terminology database via CSV import.
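
To illustrate what a term list with KWIC information might look like, here is a minimal Python sketch. It is not Velingua's implementation or output format: the function names, the column names and the semicolon delimiter are assumptions made for this example only.

```python
import csv
import io
import re


def kwic_lines(text, term, width=30):
    """Return (left, keyword, right) context triples for each occurrence of term."""
    rows = []
    # Whole-word, case-insensitive match on the surface form.
    # A real extractor would typically match on lemmas instead.
    pattern = r"\b" + re.escape(term) + r"\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append((left.strip(), m.group(0), right.strip()))
    return rows


def write_candidates_csv(text, terms, out):
    """Write one CSV row per occurrence: term, left context, keyword, right context."""
    writer = csv.writer(out, delimiter=";")
    # Hypothetical column names; the actual results file columns are configurable.
    writer.writerow(["term", "left", "keyword", "right"])
    for term in terms:
        for left, keyword, right in kwic_lines(text, term):
            writer.writerow([term, left, keyword, right])


if __name__ == "__main__":
    sample = "Open the valve. The safety valve must be closed before the pump starts."
    buffer = io.StringIO()
    write_candidates_csv(sample, ["valve", "pump"], buffer)
    print(buffer.getvalue())
```

A file of this shape can be reviewed in a spreadsheet, corrected, and then passed to the CSV import of a terminology database.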

The basic methods for extraction are:

  • High relative frequency of lemmas (after lemmatisation)
  • Morpheme analysis (evaluation of frequent morphemes)
  • N-gram analysis (relatively frequent character sequences; a purely statistical method)
  • Multi-word recognition using word class patterns (e.g. adjective/noun sequences in German)
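
Two of the methods above can be sketched in a few lines of Python. This is an illustration of the general techniques only, not Velingua's algorithm: the comparison against a reference corpus, the add-one smoothing, the tag names "ADJ" and "NOUN" and all function names are assumptions. The input is assumed to be already lemmatised and tagged with word classes.

```python
from collections import Counter


def relative_frequency_ratio(domain_lemmas, reference_lemmas, min_count=2):
    """Score each lemma by its relative frequency in the domain corpus
    divided by its smoothed relative frequency in a reference corpus."""
    domain = Counter(domain_lemmas)
    reference = Counter(reference_lemmas)
    n_domain, n_reference = len(domain_lemmas), len(reference_lemmas)
    scores = {}
    for lemma, count in domain.items():
        if count < min_count:
            continue  # below the minimum number of occurrences
        domain_rel = count / n_domain
        # Add-one smoothing avoids division by zero for lemmas
        # that never occur in the reference corpus.
        reference_rel = (reference[lemma] + 1) / (n_reference + 1)
        scores[lemma] = domain_rel / reference_rel
    return scores


def match_pos_pattern(tagged_tokens, pattern=("ADJ", "NOUN")):
    """Return all token sequences whose word classes match the pattern exactly."""
    n = len(pattern)
    hits = []
    for i in range(len(tagged_tokens) - n + 1):
        window = tagged_tokens[i:i + n]
        if tuple(tag for _, tag in window) == tuple(pattern):
            hits.append(" ".join(word for word, _ in window))
    return hits


if __name__ == "__main__":
    domain = ["valve", "valve", "pump", "the", "the"]
    reference = ["the"] * 8 + ["pump", "house"]
    print(relative_frequency_ratio(domain, reference))

    tagged = [("hydraulische", "ADJ"), ("Pumpe", "NOUN"), ("läuft", "VERB")]
    print(match_pos_pattern(tagged))
```

In this toy data, "valve" scores higher than "the" because it is frequent in the domain text but absent from the reference text, while the adjective/noun pattern picks out "hydraulische Pumpe" as a multi-word candidate.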

Terminology extraction is highly configurable. The following can be set:

  • Threshold values for the minimum number of occurrences
  • Control of the result set depending on the size of the text corpus
  • Percentage share of each of the four basic methods
  • Word types of the suitable terms
  • Word class patterns for multi-word recognition
  • Columns in the results file
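
How such settings could interact is sketched below in Python. This is a hypothetical model, not Velingua's configuration format: the class, field names and default values are invented for illustration, and it assumes that the percentage of each method determines its share of the result set and that the result set grows linearly with corpus size.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractionConfig:
    """Hypothetical settings object; names and defaults are assumptions."""
    min_occurrences: int = 3                   # threshold for minimum occurrences
    results_per_1000_tokens: float = 5.0       # result set scales with corpus size
    method_shares: dict = field(default_factory=lambda: {
        "lemma_frequency": 40,                 # percentages of the four methods
        "morpheme": 20,
        "ngram": 20,
        "multiword": 20,
    })

    def validate(self):
        if sum(self.method_shares.values()) != 100:
            raise ValueError("method shares must add up to 100 percent")


def result_limit(config, corpus_tokens):
    """Total number of terms to output for a corpus of the given size."""
    return max(1, round(config.results_per_1000_tokens * corpus_tokens / 1000))


def allocate_slots(config, corpus_tokens):
    """Split the result limit among the methods according to their percentages."""
    limit = result_limit(config, corpus_tokens)
    return {name: limit * share // 100 for name, share in config.method_shares.items()}


if __name__ == "__main__":
    config = ExtractionConfig()
    config.validate()
    print(allocate_slots(config, corpus_tokens=20000))
```

With these assumed defaults, a corpus of 20,000 tokens yields a limit of 100 terms, of which 40 come from lemma frequency and 20 from each of the other three methods.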

Terminology extraction can be configured conveniently in the Velingua Organizer, where the suitable terms can also be transferred to the terminology database interactively. In the process, each term can be classified as a preferred term, allowed term, disallowed term or stop word.