A Tool for Tagging Turkish Text

Kemal Oflazer and İlker Kuruöz
Department of Computer Engineering and Information Science
Bilkent University
Bilkent, Ankara 06533 TURKEY
E-mail: fko,[email protected]
Fax: (0 312) 266-4126

Abstract

Automatic text tagging is an important component in higher-level analysis of text corpora, and its output can also be used in many natural language processing applications. This paper describes a part-of-speech (POS) tagger for Turkish text. It is based on a full-scale two-level morphological specification of Turkish implemented in the PC-KIMMO environment, augmented with the compilation and use of statistical information, multi-word construct recognition, and constraint- and heuristics-based resolution of morphological and POS ambiguity. The tagger also has additional functionality for fine-tuning the morphological analyzer, such as logging erroneous parses and commonly used roots. The output of the tagger can be used in further syntactic and semantic analysis.

1 Introduction

As part of a large-scale project on computational studies of the Turkish language, we have undertaken the development of a number of tools for analyzing Turkish text. This paper describes one such tool, a text tagger for Turkish. The tagger is based on a full-scale two-level morphological specification of Turkish implemented in the PC-KIMMO environment [1, 12] and represents a substantial improvement over our previous work [11]. In this paper we describe the functionality of our tagging application along with the techniques that we have employed to deal with various sources of ambiguity.

2 Tagging Text

Automatic text tagging is an important step in discovering the linguistic structure of large text corpora. Basic tagging involves annotating the words in a given text with various pieces of information, such as part-of-speech and other lexical features. Part-of-speech tagging facilitates higher-level analysis, such as syntactic parsing, essentially by performing some ambiguity resolution using relatively cheap methods.
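As a rough illustration of what basic tagging produces, the following Python sketch annotates each word with its candidate parts of speech and then applies one cheap contextual rule to resolve some of the ambiguity. It is not the tagger described in this paper: the lexicon `TOY_LEXICON`, the tag names, the `TaggedToken` structure and the single rule in `resolve` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical toy lexicon: word -> candidate part-of-speech tags.
TOY_LEXICON = {
    "the": ["DET"],
    "make": ["VERB", "NOUN"],
    "of": ["PREP"],
    "car": ["NOUN"],
}


@dataclass
class TaggedToken:
    word: str
    candidates: List[str]        # every reading the lexicon allows
    tag: Optional[str] = None    # the reading chosen after resolution, if any


def annotate(words):
    """Attach all candidate tags to each word (no disambiguation yet)."""
    return [
        TaggedToken(w, list(TOY_LEXICON.get(w.lower(), ["UNKNOWN"])))
        for w in words
    ]


def resolve(tokens):
    """Apply one illustrative rule: after a determiner, prefer NOUN."""
    prev_tag = None
    for tok in tokens:
        if len(tok.candidates) == 1:
            tok.tag = tok.candidates[0]
        elif prev_tag == "DET" and "NOUN" in tok.candidates:
            tok.tag = "NOUN"
        # Otherwise the token stays ambiguous (tag remains None).
        prev_tag = tok.tag
    return tokens


tokens = resolve(annotate("the make of the car".split()))
for tok in tokens:
    print(tok.word, tok.candidates, "->", tok.tag)
```

Tokens that the rule cannot resolve keep all of their candidates, so a later and more expensive stage such as a syntactic parser would only have to consider the readings that remain.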

The most important function of a tagger is resolving the parts of speech of the lexical items in the text. This, however, is not a trivial task, since many words are in general ambiguous in their part of speech, for various reasons. In English, for example, a word such as make can be a verb or a noun. In Turkish, even though there are ambiguities of this sort, the agglutinative nature