Research project

From Parallel Corpus to Wordnet



The aim of this project is to further develop and test a method for automatically deriving wordnets (semantic nets or concept nets, i.e., semantically classified lexical databases) from translational corpora. The method was developed by Helge Dyvik.

Wordnets are a language technology resource of increasing importance, with several applications. Among other things, they allow content-based information retrieval, automatic logical inference, and improved machine translation. Parallel corpora are text collections consisting of originals and translations in two or more languages, where the originals and their translations have been aligned on the level of sentences, or more rarely also on the level of words.

The method takes translational correspondences from a parallel corpus as its starting point. From the network of these correspondences, word senses are distinguished and semantic relations are calculated automatically, e.g. hyperonyms vs. hyponyms (animal vs. dog, good vs. kind); the result is represented in a complex lattice structure. The aim of the project is to apply and test the method on a large scale against a Norwegian/English parallel corpus (ENPC). This involves, among other things, word alignment of the corpus, extraction and processing of data from the corpus, and evaluation of the algorithms and the derived lattices.
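
As a rough illustration of how word senses might be separated from a network of translational correspondences, the following Python sketch groups the translations of a word whenever they share a back-translation other than the word itself. This is a simplified toy under stated assumptions, not the project's actual algorithm: the function names (`build_images`, `sense_partitions`), the grouping criterion, and the word pairs in `TOY_PAIRS` are all hypothetical and are not taken from the ENPC. The lattice construction and the hyperonym/hyponym calculation are not covered.

```python
from collections import defaultdict

# Invented word-aligned pairs (Norwegian, English); NOT data from the ENPC.
TOY_PAIRS = [
    ("rett", "right"), ("rett", "correct"),
    ("rett", "dish"), ("rett", "course"),
    ("riktig", "right"), ("riktig", "correct"),
    ("matrett", "dish"), ("matrett", "course"),
]

def build_images(pairs):
    """Map each source word to its translations, and each target word back."""
    fwd = defaultdict(set)   # source word -> set of target words
    back = defaultdict(set)  # target word -> set of source words
    for src, tgt in pairs:
        fwd[src].add(tgt)
        back[tgt].add(src)
    return fwd, back

def sense_partitions(word, fwd, back):
    """Group the translations of `word` that share a back-translation
    other than `word` itself (assumed criterion, for illustration only)."""
    groups = []
    for t in sorted(fwd[word]):
        others = back[t] - {word}
        # Existing groups that share at least one back-translation with t.
        merged = [g for g in groups if g["back"] & others]
        new = {"members": {t}, "back": set(others)}
        for g in merged:
            new["members"] |= g["members"]
            new["back"] |= g["back"]
        groups = [g for g in groups if g not in merged]
        groups.append(new)
    return sorted(sorted(g["members"]) for g in groups)

fwd, back = build_images(TOY_PAIRS)
print(sense_partitions("rett", fwd, back))
# → [['correct', 'right'], ['course', 'dish']]
```

In this toy data, "right" and "correct" fall into one group because both also translate "riktig", while "dish" and "course" fall into another because both also translate "matrett", suggesting two candidate senses of "rett".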

A successful result would mean that parts of the work of developing wordnets can be automated.

Project description
(in Norwegian)

Project leader:
Helge Dyvik

Project participants:
Knut Hofland
Paul Meurer
Sindre Sørensen
Martha Thunes

Project period:
April 2001-March 2004

Financing:
2001-2002 financed by
L. Meltzers høyskolefond.
2002-2004 financed by
The Research Council of Norway

Papers

Web Demo
