
Morphological Disambiguation by Voting Constraints

Kemal Oflazer and Gökhan Tür

Department of Computer Engineering and Information Science

Bilkent University, Bilkent, Ankara, TR-06533, TURKEY

fko,[email protected]

Abstract

We present a constraint-based morphological disambiguation system in which individual constraints vote on matching morphological parses, and disambiguation of all the tokens in a sentence is performed at the very end by selecting the parses that receive the highest votes. This constraint application paradigm makes the outcome of the disambiguation independent of the rule sequence, and hence relieves the rule developer from worrying about the potentially conflicting rule sequencing found in other systems. The vote of each rule is determined by its complexity, measured by the kind and number of features used in the rule. We have applied our approach in a system for morphological disambiguation of Turkish, a language with complex agglutinative word structures that displays types of morphological ambiguity not found in languages like English. Our results indicate that, using about 500 constraint rules and some additional simple statistics, we can attain a recall of 95-96% and a precision of 94-95% with about 1.01 parses per token. Our system is implemented in Prolog, and we are currently investigating an efficient implementation based on the discrimination networks used in AI production systems.
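The voting paradigm summarized above can be illustrated with a minimal Python sketch. It is not the system described here (which is implemented in Prolog); the feature names, the `Constraint` class, and the `disambiguate` function are hypothetical, and the vote of a rule is simply assumed to be the number of features it tests, which only approximates "kind and number of features". Each token is assumed to have at least one candidate parse.

```python
from dataclasses import dataclass

# A parse is a dict of morphological features, e.g. {"cat": "noun", "case": "nom"}.
# Feature names and values here are illustrative assumptions only.

@dataclass
class Constraint:
    # One feature pattern per consecutive token the rule looks at.
    patterns: list

    @property
    def vote(self):
        # Assumption: a rule's vote is the number of features it tests,
        # so more specific rules carry more weight.
        return sum(len(p) for p in self.patterns)


def matches(pattern, parse):
    """A pattern matches a parse if every feature it names has the same value."""
    return all(parse.get(k) == v for k, v in pattern.items())


def disambiguate(sentence, constraints):
    """sentence: list of tokens, each a non-empty list of candidate parses."""
    votes = [[0] * len(parses) for parses in sentence]
    for c in constraints:
        n = len(c.patterns)
        for start in range(len(sentence) - n + 1):
            window = sentence[start:start + n]
            # Indices of the parses matched by each pattern in the window.
            hits = [[i for i, p in enumerate(parses) if matches(pat, p)]
                    for pat, parses in zip(c.patterns, window)]
            if all(hits):  # the rule applies only if every pattern matches
                for offset, idxs in enumerate(hits):
                    for i in idxs:
                        votes[start + offset][i] += c.vote
    # Selection happens only after every rule has voted.
    result = []
    for parses, v in zip(sentence, votes):
        best = max(v)
        result.append([p for p, s in zip(parses, v) if s == best])
    return result


sentence = [
    [{"cat": "noun", "case": "nom"}, {"cat": "verb"}],  # ambiguous token
    [{"cat": "verb"}],                                  # unambiguous token
]
rules = [
    Constraint([{"cat": "noun"}, {"cat": "verb"}]),  # vote 2
    Constraint([{"cat": "verb"}]),                   # vote 1
]
chosen = disambiguate(sentence, rules)
```

Because votes are only accumulated by addition and parses are selected after all rules have been tried, reordering `rules` cannot change `chosen` in this sketch.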

1 Introduction

Automatic morphological disambiguation is a crucial component in higher level analysis of natural language text corpora. Morphological disambiguation also facilitates parsing, essentially by performing a certain amount of ambiguity resolution using relatively cheaper methods. There has been a large number of studies in tagging and morphological disambiguation using various techniques. Part-of-speech tagging systems have used a statistical approach, in which a large corpus is used to train a probabilistic model that is then used to tag new text, assigning the most likely tag to a given word in a given context (e.g., Church (1988), Cutting et al. (1992), DeRose (1988)). Rule-based or constraint-based approaches, recently most prominently exemplified by the Constraint Grammar work (Karlsson et al., 1995; Voutilainen, 1995b; Voutilainen, Heikkilä, and Anttila, 1992; Voutilainen and Tapanainen, 1993), employ a large number of hand-crafted linguistic constraints to eliminate impossible tags or morphological parses for a given word in a given context. Brill (1992; 1994; 1995) has presented a transformation-based learning approach, which induces tagging rules from tagged corpora.

In contrast to languages like English, for which there is a very small number of possible word forms for a given root word and a small number of tags associated with a given lexical form, languages like Turkish or Finnish have very productive agglutinative morphology, in which thousands (or even millions (Hankamer, 1989)) of forms can be produced from a given root word; this poses a challenging problem for morphological disambiguation. Our prior attempts at developing constraint-based disambiguation systems for Turkish have been hampered to a certain extent by the idiosyncrasies of rule ordering, whereby minor changes to the structure and/or ordering of the rules caused massive breakdowns in performance.

This paper presents a novel approach to constraint-based morphological disambiguation which relieves the rule developer from worrying about conflicting rule ordering requirements and constraints. The approach depends on assigning weights to constraints according to their complexity and specificity, and then letting constraints vote on matching parses of a given lexical item. The outcome of matching constraints against the set of morphological parses is not applied immediately. Only after all applicable rules have been applied to a sentence are all tokens disambiguated in parallel. Thus, the outcome of the rule applications does not depend on the order in which the rules are applied. The rule ordering issue has been discussed by Voutilainen (1994), but he has re-