A Finite-State Kernel Architecture for Turkish Natural Language Processing

Zelal Güngördü and Kemal Oflazer
Department of Computer Engineering and Information Science
Bilkent University, Bilkent, Ankara, TR-06533, Turkey
fzelal,[email protected]

Abstract:

We present a finite-state kernel architecture for Turkish that performs certain presyntactic processing steps on a given sentence, such as morphological analysis, and recognition of lexicalized and nonlexicalized collocations, followed by morphological disambiguation by voting constraints [7]. The kernel has been implemented using the Xerox Finite State Tools. The approach to recognizing collocations presented here is of particular interest for languages with highly agglutinative morphology (of which Turkish is a very good example), since it only requires a single mechanism to deal with a potentially infinite number of variants of a single collocation in certain cases. Moreover, it may also help resolve morphological ambiguity to some degree.

1 Introduction

Turkish employs a rich collection of collocations, i.e., multi-word constructs that may be considered single syntactic/semantic entities. Recognizing such constructs at a presyntactic level of processing is of interest for both theoretical and practical reasons. Given the diversity of collocations in Turkish, dealing with such constructs in syntax would require additional syntactic rules, thereby rendering the grammar rather cumbersome. Handling those constructs at a presyntactic level would therefore considerably simplify the development of parsers for Turkish. In addition, such functionality may also help resolve morphological ambiguity in cases where one or more of the lexical forms in a collocation have morphological interpretations that are highly unlikely in the context of that particular collocation. Consider, for example, the idiomatic expression in (1), where the second word has three morphological parses, namely a dative noun, an optative verb, and an adjective. The nominal one is the only plausible parse in the present context, so the remaining two can safely be ignored.

(1) ip-e      sap-a            gel-me-yen
    cord-DAT  1. stem-DAT      come-NEG-PART
              2. diverge-OPT
              sapa
              3. secluded
    `implausible/unreasonable/illogical'
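
The pruning effect illustrated in (1) can be sketched in a few lines of Python. This is only an illustrative toy and not the finite-state implementation described in this paper: plain dictionary lookup stands in for the morphological analyzer, and the names `ANALYSES`, `COLLOCATION` and `prune_with_collocation`, as well as the tag strings, are hypothetical and chosen merely to mirror the glosses in (1).

```python
from typing import Dict, List, Tuple

# Hypothetical toy analyses; the tags mirror the glosses in example (1).
ANALYSES: Dict[str, List[str]] = {
    "ipe": ["ip+Noun+DAT"],
    "sapa": ["sap+Noun+DAT", "sap+Verb+OPT", "sapa+Adj"],
    "gelmeyen": ["gel+Verb+NEG+PART"],
}

# Hypothetical collocation entry: the parse required at each word position.
COLLOCATION: Tuple[str, ...] = (
    "ip+Noun+DAT",
    "sap+Noun+DAT",
    "gel+Verb+NEG+PART",
)


def prune_with_collocation(tokens: List[str]) -> List[List[str]]:
    """Return the candidate parses per token, pruned wherever the collocation matches."""
    parses = [list(ANALYSES.get(token, [])) for token in tokens]
    width = len(COLLOCATION)
    for start in range(len(tokens) - width + 1):
        window = parses[start:start + width]
        # The collocation matches if every required parse is among the candidates.
        if all(req in cands for req, cands in zip(COLLOCATION, window)):
            for offset, req in enumerate(COLLOCATION):
                # Keep only the parse licensed by the collocation.
                parses[start + offset] = [req]
    return parses


if __name__ == "__main__":
    for token, cands in zip(["ipe", "sapa", "gelmeyen"],
                            prune_with_collocation(["ipe", "sapa", "gelmeyen"])):
        print(token, cands)
```

Outside the collocation, the word sapa keeps all three of its parses; inside it, only the dative noun reading survives, which is the behaviour described for (1).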

We present here a finite-state kernel architecture for Turkish that recognizes such multi-word constructs at a presyntactic stage of processing, and also involves a finite-state implementation of the morphological disambiguation approach by voting constraints, proposed by Oflazer and Tür [7]. The kernel has been implemented using the Xerox Finite State Tools.

Apart from the large number of different collocation forms in Turkish, there is also the fact that certain collocations may, in theory, have an infinite number of variants, due to the nature of Turkish morphology. Let us take, for instance, the support verb construct yürürlüğe koymak `to put in force', and provide only a couple of examples for the possible forms in which this collocation may occur: