| ![]() |
Error-tolerant Finite State Recognition with Applications to Morphological Analysis and Spelling Correction
Kemal Oflazer
Department of Computer Engineering and Information Science
Bilkent University
Ankara, 06533, Turkey
[email protected]
ABSTRACT
This paper presents a notion of error-tolerant recognition with finite state recognizers. Error-tolerant recognition enables the recognition of strings that deviate mildly from any string in the regular set recognized by the underlying finite state recognizer. Such recognition has applications in error-tolerant morphological processing, spelling correction, and approximate string matching in information retrieval. After a description of the basic concepts and algorithms involved, we give examples from two applications: In the context of morphological analysis, error-tolerant recognition allows misspelled input word forms to be corrected, and morphologically analyzed concurrently. We present an application of this to error-tolerant analysis of agglutinative morphology of Turkish words. The algorithm can be applied to morphological analysis of any language whose morphology is fully described by two{level finite state transducers, regardless of the word formation processes. In the context of spelling correction, error-tolerant recognition can be used to enumerate correct candidate forms from a given misspelled string within a certain edit distance. Again it can be applied to any language with a word list comprising all inflected forms, or whose morphology is fully described by two{level finite state transducers. We present experimental results for spelling correction for a number of languages. These results indicate that such recognition works very efficiently for candidate generation in spelling correction for many European languages such as English, Dutch, French, German, Italian (and others) with very large word lists of root and inflected forms (some containing well over 200,000 forms), generating all candidate solutions within 10 to 45 milliseconds (with edit distance 1) on a SparcStation 10/41. For spelling correction in Turkish, errortolerant recognition operating with an (circular) recognizer of Turkish words (with about 29,000 states and 119,000 transitions) can generate all candidate words in less than 20 milliseconds, with edit distance 1.
1 Introduction
Error-tolerant finite state recognition enables the recognition of strings that deviate mildly from any string in the regular set recognized by the underlying finite state recognizer. For