View the PDF document Building a public digital library based on full-text retrieval

Witten, I. H., Nevill-Manning, C. G., Cunningham, S. J. (1998) Submitted to Digital Libraries book edited by B. Bharghava et al..

Digital libraries are expensive to create and maintain, and generally restricted to a particular corporation or group of paying subscribers. While many indexes to the World Wide Web are freely available, the quality of what is indexed is extremely uneven. The digital analog of a public library–a reliable, quality, community service–has yet to appear. This paper demonstrates the feasibility of a cost-effective collection of high-quality public-domain information, available free over the Internet. One obstacle to the creation of a digital library is the difficulty of providing formal cataloguing information. Without a title, author and subject database it seems hard to offer the searching facilities normally available in physical libraries. Full-text retrieval provides a way of approximating these services without a concomitant investment of resources. A second is the problem of finding a suitable corpus of material. Computer science research reports form the focus of our prototype implementation. These constitute a large body of high-quality public-domain documents. Given such a corpus, a third issue becomes the question of obtaining both plain text for indexing, and page images for readability. Typesetting formats such as PostScript provide some of the benefits of libraries scanned from paper documents–such as page-based indexing and viewing–without the physical demands and error-prone nature of scanning and optical character recognition. However, until recently the difficulty of extracting text from PostScript seems to have encouraged indexing on plain-text abstracts or bibliographic information provided by authors. We have developed a new technique that overcomes the problem. This chapter describes the architecture, the indexing, collection and maintenance processes, and the retrieval interface, to a prototype public digital library.