DESIGN AND IMPLEMENTATION
OF A
SPELLING CHECKER FOR TURKISH
Ayşın Solak and Kemal Oflazer
Department of Computer Engineering and Information Science
Bilkent University
Bilkent, Ankara 06533 TÜRKİYE
E-mail: [email protected], Fax: (90-4)266-4126
(To Appear in Literary and Linguistic Computing, Oxford Univ. Press, 1993)
Abstract: This paper presents the design and implementation of a spelling checker for Turkish. Turkish is an agglutinative language in which words are formed by affixing a sequence of morphemes to a root word. Parsing agglutinative word structures has attracted relatively little attention except as an application area for general-purpose morphological processors. Parsing words in such languages, even for spelling checking purposes, requires substantial morphological and morphophonemic analysis, and spelling correction (not addressed in this paper) is significantly more complicated. We present a root-driven morphological parser for Turkish word structures which has been incorporated into a spelling checking kernel for on-line Turkish text. The agglutinative nature of the language, complex word formations, various phonetic harmony rules, and subtle exceptions present difficulties not usually encountered in the spelling checking of languages like English, and make this a very challenging problem.
1 INTRODUCTION
Morphological classification of natural languages according to their word structures places languages like Turkish, Finnish, Hungarian, Quechua, and Swahili in a class called "agglutinative languages." In such languages, words are formed by combining root words and morphemes: several suffixes are attached to a root in order to modify and/or extend its meaning. What characterizes agglutinative languages is that stem formation by affixation to previously derived stems is extremely productive [6]. A given stem, even though it may itself be quite complex, can generally serve as the basis for even more complex words. Consequently, agglutinative languages contain words of considerable complexity, and parsing such word structures for correctness and structural analysis necessitates a thorough morphological and morphophonemic analysis.
Morphological parsing has attracted relatively little attention in computational linguistics. The reason is that nearly all parsing research has been concerned with English, or with languages morphologically similar to English. Since words in such languages contain only a small number of affixes, or none at all, almost all parsing models for them treat the recognition of those affixes as trivial, and thus do not perform morphological analysis. In agglutinative languages, words contain no direct indication of where the morpheme boundaries are, and furthermore morphemes take a shape dependent on the morphological and phonological context. A morphological parser requires [6]:
1. A morphophonological component which mediates between the surface form of a morpheme as encountered in the input text and the lexical form in which the morpheme is stored in the morpheme inventory, i.e., a means of recognizing variant forms of morphemes as the same, and
2. A morphotactic component which specifies which combinations of morphemes are permitted.
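To make the two components concrete, the following is a minimal illustrative sketch in Python, not the parser described in this paper. It assumes a hypothetical four-word root lexicon and only two noun suffixes, written in a lexical notation where `A` stands for a vowel realized as a/e by vowel harmony and `D` for a consonant realized as d/t by voicing assimilation. The names `ROOTS`, `SUFFIX_SLOTS`, `surface_form`, and `parse` are invented for this example. `surface_form` plays the role of the morphophonological component, and the fixed slot order in `SUFFIX_SLOTS` plays the role of the morphotactic component.

```python
# Toy sketch only: hypothetical mini-lexicon and two suffixes, not real coverage.
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")
VOICELESS = set("fstkçşhp")

ROOTS = {"ev", "kitap", "göz", "okul"}  # assumed root lexicon

# Morphotactics: suffixes may only appear in this order; each one is optional.
# "lAr" = plural, "DA" = locative (lexical forms, resolved against the stem).
SUFFIX_SLOTS = ["lAr", "DA"]


def last_vowel(stem):
    """Return the last vowel of the stem, or None if it has no vowel."""
    for ch in reversed(stem):
        if ch in BACK_VOWELS or ch in FRONT_VOWELS:
            return ch
    return None


def surface_form(lexical, stem):
    """Morphophonology: map a lexical suffix to its surface form after stem."""
    vowel = "a" if last_vowel(stem) in BACK_VOWELS else "e"
    cons = "t" if stem[-1] in VOICELESS else "d"
    return lexical.replace("A", vowel).replace("D", cons)


def match_suffixes(stem, rest, slot):
    """Consume the remaining characters with suffixes from slot onwards."""
    if not rest:
        return []
    for i in range(slot, len(SUFFIX_SLOTS)):
        surf = surface_form(SUFFIX_SLOTS[i], stem)
        if rest.startswith(surf):
            tail = match_suffixes(stem + surf, rest[len(surf):], i + 1)
            if tail is not None:
                return [surf] + tail
    return None


def parse(word):
    """Find a root at the start of the word, then match suffixes in order."""
    for root in ROOTS:
        if word.startswith(root):
            suffixes = match_suffixes(root, word[len(root):], 0)
            if suffixes is not None:
                return [root] + suffixes
    return None
```

Under these assumptions, `parse("evlerde")` segments the word as `ev` + `ler` + `de`, while `parse("evlarda")` fails because the suffix vowels violate harmony with the root, and `parse("evlerler")` fails because this toy grammar allows the plural slot only once. A realistic parser needs a far larger lexicon, many more suffix classes, and handling of exceptions.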
Morphological parsing algorithms may be divided into two classes as affix stripping and root-driven