Applications of lossless compression in adaptive text mining

Witten, I. H. (2000) Proc 2000 Conference on Information Sciences and Systems 2, Princeton, USA, March, pp. TP6:13-18.

Text mining is about looking for patterns in natural language text, and may be defined as the process of analyzing text to extract information for particular purposes. Compared with the kind of data stored in databases, text is unstructured, amorphous, and contains information at many different levels. Nevertheless, the motivation for trying to extract information from it is compelling, even if success is only partial. Although the problems are difficult to define clearly, interest in text mining is burgeoning because it is perceived to have enormous potential practical utility.

This paper argues that lossless compression, operating within the standard training/testing paradigm of machine learning, is a key technology for text mining. Research in compression has always taken the pragmatic view that files must be processed whatever they contain. This contrasts with the normative approach of classical language analysis, which generally assumes idealized input: a sequence of sentences, comprising words that all appear in the dictionary, delimited by single spaces, with punctuation and perhaps numbers but no other extraneous symbols. In practice, text is messy, particularly text gathered from the Web, the principal source of material used today, and many useful clues come from the messiness.
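To make the compression-as-learning idea concrete, here is a toy sketch (not taken from the paper): a classifier that assigns a new document to the class whose training text "explains" it best, measured by how little the compressed size grows when the document is appended to that class's corpus. The paper's compression models would be adaptive (e.g., PPM-style); zlib is used here purely as a convenient stand-in, and the training corpora and labels are invented for illustration.

```python
import zlib


def compressed_size(text: str) -> int:
    """Length in bytes of the zlib-compressed text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def classify(document: str, training: dict) -> str:
    """Assign document to the class whose corpus compresses it best.

    The score is the increase in compressed size when the document
    is appended to a class corpus: shared vocabulary and phrasing
    let the compressor exploit back-references, so a good model of
    the document yields a small increase.
    """
    def extra_bytes(corpus: str) -> int:
        return compressed_size(corpus + " " + document) - compressed_size(corpus)

    return min(training, key=lambda label: extra_bytes(training[label]))


# Invented two-class example for illustration only.
training = {
    "weather": "rain showers cloudy sunny forecast temperature wind",
    "finance": "stocks bonds market shares dividend interest rates",
}
print(classify("the forecast says rain and wind tomorrow", training))
```

The same train-then-score structure carries over directly to the machine-learning paradigm the paper invokes: the class corpora play the role of training data, and the compression cost of unseen text plays the role of a test-time score.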