Using language models for generic entity extraction

Witten, I. H., Bray, Z., Mahoui, M., and Teahan, W. J. (1999) Proc. ICML'99 Workshop on Machine Learning in Text Data Analysis, edited by D. Mladenic and M. Grobelnik, Bled, Slovenia, pp. 25-35.

This paper describes the use of statistical language modeling techniques, of the kind commonly used for text compression, to extract meaningful low-level information about the location of semantic tokens, or "entities," in text. We begin by marking up several different token types in training documents: for example, people's names, dates and time periods, phone numbers, and sums of money. We form a language model for each token type and examine how accurately it identifies new tokens. We then apply a search algorithm to insert token boundaries in a way that maximizes compression of the entire test document. The technique can be applied to hierarchically defined tokens, leading to a kind of "soft parsing" that will, we believe, be able to identify structured items such as references and tables in HTML or plain text, based on nothing more than a few marked-up examples in training documents.
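The core idea of compression-based entity extraction can be illustrated with a minimal sketch. The paper uses PPM-style compression models; the character bigram models below are a deliberately simplified stand-in, and the class names, token types, and training examples are all invented for illustration. A model trained on one token type assigns short codes (few bits) to strings that resemble that type, so a new token is classified by asking which type's model compresses it best:

```python
import math
from collections import defaultdict

START, END = "\x02", "\x03"   # padding symbols marking token boundaries
ALPHABET = 256                # fixed alphabet size shared by every model

class CharBigramModel:
    """Character-level bigram model with add-one smoothing.

    A stand-in for the PPM compression models used in the paper: a model
    trained on one token type needs few bits to encode strings that look
    like that type, and many bits for strings that do not.
    """

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def train(self, examples):
        for text in examples:
            padded = START + text + END
            for a, b in zip(padded, padded[1:]):
                self.counts[a][b] += 1
                self.totals[a] += 1

    def bits(self, text):
        """Bits needed to encode `text` under this model (its compressed size)."""
        padded = START + text + END
        total = 0.0
        for a, b in zip(padded, padded[1:]):
            p = (self.counts[a][b] + 1) / (self.totals[a] + ALPHABET)
            total += -math.log2(p)
        return total

def classify(token, models):
    """Assign `token` the type whose model compresses it into the fewest bits."""
    return min(models, key=lambda t: models[t].bits(token))

# One model per token type, trained on (invented) marked-up examples.
models = {"date": CharBigramModel(), "money": CharBigramModel()}
models["date"].train(["12 Jan 1999", "3 Mar 1998", "25 Dec 1997"])
models["money"].train(["$25.00", "$1,300", "$9.99"])

print(classify("14 Feb 1999", models))  # date
print(classify("$42.50", models))       # money
```

The paper goes further than this per-token classification: a search algorithm inserts token boundaries into unsegmented text so as to minimize the compressed size of the whole document, which this sketch does not attempt.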