Using language models for generic entity extraction
Witten, I. H., Bray, Z., Mahoui, M. and Teahan, W. J. (1999) "Using language models for generic entity extraction." Proc. ICML'99 Workshop on Machine Learning in Text Data Analysis, edited by D. Mladenic and M. Grobelnik, Bled, Slovenia, pp. 25-35.
This paper describes the use of statistical language modeling techniques, such as those commonly used for text compression, to extract meaningful low-level information about the location of semantic tokens, or "entities," in text. We begin by marking up several different token types in training documents: for example, people's names, dates and time periods, phone numbers, and sums of money. We form a language model for each token type and examine how accurately it identifies new tokens. We then apply a search algorithm that inserts token boundaries so as to maximize compression of the entire test document. The technique can be applied to hierarchically defined tokens, leading to a kind of "soft parsing" that will, we believe, be able to identify structured items such as references and tables in HTML or plain text, based on nothing more than a few marked-up examples in training documents.
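The core idea of compression-based token identification can be illustrated with a small sketch. The paper uses full text-compression models (PPM-style); the version below substitutes a much simpler character bigram model with add-one smoothing, and the class names, training strings, and token types are invented for illustration. The principle is the same: train one model per token type, estimate the number of bits each model would need to encode a candidate token, and assign the token to the type that compresses it best.

```python
import math
from collections import defaultdict

START, END = "\x02", "\x03"  # boundary markers that cannot occur in the text

class CharBigramModel:
    """Character-level bigram model with add-one smoothing.

    The cross-entropy of a string under the model approximates the
    length, in bits, a compressor based on the model would need for it.
    """
    def __init__(self):
        self.bigrams = defaultdict(int)
        self.unigrams = defaultdict(int)
        self.vocab = set()

    def train(self, examples):
        for text in examples:
            padded = START + text + END
            for a, b in zip(padded, padded[1:]):
                self.bigrams[(a, b)] += 1
                self.unigrams[a] += 1
                self.vocab.update((a, b))

    def bits(self, text):
        """Approximate code length of `text` in bits under this model."""
        padded = START + text + END
        v = len(self.vocab) + 1  # smoothing denominator term
        total = 0.0
        for a, b in zip(padded, padded[1:]):
            p = (self.bigrams[(a, b)] + 1) / (self.unigrams[a] + v)
            total += -math.log2(p)
        return total

def classify(token, models):
    """Assign `token` to the type whose model compresses it best."""
    return min(models, key=lambda t: models[t].bits(token))

# Toy training data for two token types (hypothetical examples)
models = {"date": CharBigramModel(), "money": CharBigramModel()}
models["date"].train(["12 May 1999", "3 June 1998", "25 Dec 1997"])
models["money"].train(["$25.00", "$1,300", "$9.99"])

print(classify("14 May 1998", models))  # a date-like string
print(classify("$42.50", models))       # a money-like string
```

The boundary-insertion step described above goes further: rather than classifying pre-segmented tokens, it searches over possible segmentations of the whole document and keeps the one whose total code length, summed over the per-type models, is smallest.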