An Information Extraction Core System for
Real World German Text Processing
Günter Neumann* Rolf Backofen† Judith Baur‡ Markus Becker§ Christian Braun¶
Abstract
This paper describes SMES, an information extraction core system for real world German text processing. The basic design criterion of the system is to provide a set of basic, powerful, robust, and efficient natural language components and generic linguistic knowledge sources which can easily be customized for processing different tasks in a flexible manner.
1 Introduction
There is no doubt that the amount of textual information electronically available today has passed its critical mass, leading to the emerging problem that the more electronic text data is available, the more difficult it is to find or extract relevant information. In order to overcome this problem, new technologies for future information management systems are being explored by various researchers. One new line of such research is the investigation and development of information extraction (IE) systems. The goal of IE is to build systems that find and link relevant information from text data while ignoring extraneous and irrelevant information (Cowie and Lehnert, 1996).
* DFKI GmbH, Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany, [email protected]
† LMU, Oettingenstrasse 67, 80538 München, Germany, bac[email protected]
‡ DFKI GmbH, [email protected]
§ DFKI GmbH, [email protected]
¶ DFKI GmbH, [email protected]
Current IE systems are quite successful at automatically processing large text collections with high speed and robustness (see (Sundheim, 1995), (Chinchor et al., 1993), and (Grishman and Sundheim, 1996)). This is because they provide a partial understanding of specific types of text with a certain degree of accuracy, using fast and robust shallow processing strategies (basically finite state technology). They have been made "sensitive" to certain key pieces of information and thereby provide an easy means to skip text without deep analysis.
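The idea of being "sensitive" to key pieces of information while skipping the rest can be illustrated with a minimal sketch. It uses regular expressions as a stand-in for finite state technology; the patterns, the function name `extract`, and the sample sentence are hypothetical and are not taken from smes.

```python
import re

# Hypothetical, deliberately simple patterns (not the grammars used in smes).
# A company is one capitalized token followed by a German legal form.
COMPANY = re.compile(r"\b([A-ZÄÖÜ][\w&-]*) (GmbH|AG|KG)\b")
# A date in the German day.month.year notation, e.g. 12.3.1996.
DATE = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b")


def extract(text):
    """Collect only the fragments matched by the patterns.

    All other text is skipped without any deeper analysis.
    """
    companies = [m.group(0) for m in COMPANY.finditer(text)]
    dates = [
        f"{year}-{int(month):02d}-{int(day):02d}"
        for day, month, year in DATE.findall(text)
    ]
    return {"companies": companies, "dates": dates}


sample = (
    "Die Siemens AG meldete am 12.3.1996 einen Umsatzanstieg. "
    "Das Wetter war schön."
)
result = extract(sample)
print(result)
```

The second sentence of the sample contributes nothing to the result, which mirrors how a shallow system passes over irrelevant text. A real system would need far richer patterns and lexical knowledge than this sketch assumes.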
The majority of existing information systems are applied to English text. A major drawback of previous systems was their restricted portability to new domains and tasks, which was also caused by the restricted re-usability of their knowledge sources. Consequently, the major goals identified during the sixth Message Understanding Conference (MUC-6) were, on the one hand, to demonstrate task-independent component technologies of information extraction and, on the other hand, to encourage work on increasing portability and "deeper understanding" (cf. (Grishman and Sundheim, 1996)).
In this paper we report on smes, an information extraction core system for real world German text processing. The main research topics we are concerned with include easy portability and adaptability of the core system to extraction tasks of different complexity and domains. Here we concentrate on the technical and implementational aspects of the IE core technology used for achieving the desired portability. We only briefly describe some of the current applications built on top of this core machinery (see section 7).
2 The overall architecture of smes
The basic design criterion of the smes system is to provide a set of basic, powerful, robust, and efficient natural language components and generic linguistic knowledge sources which can easily be customized for processing different tasks in a flexible manner. Hence, we view smes as a core information extraction system. Customization is achieved in the following directions: