Learning structure from sequences, with applications in a digital library
Witten, I. H. (2001) Machine Learning: Proceedings of the Eighteenth International Conference, San Francisco, California, pp 643.
The services that digital libraries provide to users can be greatly enhanced by automatically gleaning certain kinds of information from the full text of the documents they contain. This talk will review recent work that applies novel techniques of machine learning (broadly interpreted) to extract information from plain text. We describe three areas of research: hierarchical phrase browsing, including efficient methods for inferring a phrase hierarchy from a large corpus of text; text mining using adaptive compression techniques, giving a new approach to word segmentation, generic entity extraction, and acronym extraction; and keyphrase extraction and its application in a digital library.