Unsupervised Learning in Constraint-based
Kemal Oflazer and Gökhan Tür
Department of Computer Engineering and Information Science
Bilkent University, 06533, Bilkent, Ankara, TURKEY
February 12, 1996
This paper presents a constraint-based morphological disambiguation approach that uses an unsupervised learning component to discover some of the constraints it uses. It is specifically applicable to languages with productive inflectional and derivational morphological processes, such as Turkish, where morphological ambiguity has a rather different nature than that found in languages like English. Our approach starts with a set of corpus-independent hand-crafted rules that reduce morphological ambiguity (hence improve precision) without sacrificing recall. It then uses an untagged training corpus, in which all lexical items have been annotated with all possible morphological analyses, to incrementally propose and evaluate additional (possibly corpus-dependent) constraints for disambiguating morphological parses, using the constraints imposed by unambiguous contexts. These rules choose parses with specified features. It then learns, in an unsupervised manner, additional rules for removing parses with certain features. In certain respects, our approach has been motivated by Brill's recent work, but with the observation that his transformational approach is not directly applicable to languages like Turkish. Our results indicate that, using hand-crafted rules and rules learned to choose, we can attain a recall of 99.08% and a precision of 88.08% with 1.119 parses per token on the training text. When rules learned to delete are used in addition to these, we can attain a recall of 96.76% and a precision of 92.05% with 1.051 parses per token on the training text. On previously unseen text, we can attain a recall of 98.04% and a precision of 86.23% with 1.137 parses per token using just the hand-crafted rules and rules learned to choose. When rules learned to delete are also used, we can attain a recall of 96.99% and a precision of 88.13% with 1.100 parses per token.
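The three figures reported above (recall, precision, and parses per token) can be related by a small computation. The following is a minimal sketch only. It assumes definitions that this passage does not itself state: recall is the fraction of tokens whose correct parse survives disambiguation, precision is the number of surviving correct parses divided by the total number of surviving parses, and parses per token is the total number of surviving parses divided by the number of tokens. The names `Token`, `evaluate`, and the parse labels are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Token:
    """One token after disambiguation (hypothetical structure)."""
    remaining: Set[str]  # parses still attached to the token
    correct: str         # the parse a human annotator would accept


def evaluate(tokens: List[Token]) -> Dict[str, float]:
    """Compute recall, precision, and parses per token.

    Assumed definitions (not taken from the text above):
      recall           = correct parses retained / number of tokens
      precision        = correct parses retained / total parses retained
      parses_per_token = total parses retained / number of tokens
    """
    n_tokens = len(tokens)
    n_parses = sum(len(t.remaining) for t in tokens)
    n_correct = sum(1 for t in tokens if t.correct in t.remaining)
    return {
        "recall": n_correct / n_tokens,
        "precision": n_correct / n_parses,
        "parses_per_token": n_parses / n_tokens,
    }


# Toy data: four tokens, one still ambiguous, one with its correct parse lost.
sample = [
    Token(remaining={"p1"}, correct="p1"),
    Token(remaining={"p2", "p3"}, correct="p2"),
    Token(remaining={"p4"}, correct="p4"),
    Token(remaining={"p5"}, correct="p6"),
]

metrics = evaluate(sample)
```

Under these assumed definitions, precision equals recall divided by parses per token, so removing more parses raises precision only as long as the correct parse is not among those removed. That is the trade-off visible in the figures above when the rules learned to delete are added.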
Automatic morphological disambiguation is a crucial component in higher-level analysis of natural language text corpora. Morphological disambiguation facilitates parsing, essentially by performing a certain amount of ambiguity resolution using relatively cheaper methods (e.g.,). There has been a