
Coping with very large digital collections
using Greenstone

Stefan Boddie,† John Thompson,† David Bainbridge‡ and Ian H. Witten‡

Abstract. The Greenstone digital library software is widely used for small to medium-sized digital library collections, but its reputation for handling very large collections is less well established. This paper describes how Greenstone is being used to produce large newspaper collections for the National Libraries of New Zealand and Singapore. It also describes current developments that integrate IBM’s DB2 database system into Greenstone as an optional search engine and metadata database, which allows the runtime server to be deployed in a federated configuration.

1 Introduction

Greenstone is a suite of software for building and distributing digital library collections. It is not a digital library but a tool for building digital libraries. Developed and distributed in cooperation with UNESCO, it runs on all popular operating systems (even the iPod!). For more details see [1] and the project website [2].

The Greenstone Digital Library Software has been distributed under the GNU General Public License for a dozen years, during which time it has developed almost beyond recognition. Today the user base hails from 70 countries and the reader’s interface has been translated into over 45 languages. The software has been downloaded more than 4,500 times a month for many years. It is fair to say that this software has materially helped spread the practical impact of digital library technology throughout the world. Training courses have been offered throughout both the developing and developed worlds [3].

Because of the emphasis on training, most Greenstone collections are rather small: educational exercises typically use anywhere from a few documents to a few hundred. Some trainees return to their home institutions to work on large-scale projects, but many never progress beyond toy collections. Unfortunately, this wide accessibility has earned Greenstone a reputation for being suitable for small collections only. This is incorrect.

It is probably unrealistic to expect a general-purpose piece of software, suitable for a wide variety of users and applications, to vie with a purpose-built system designed specifically for a very large information collection (like, say, a web search engine). However, Greenstone can be pushed reasonably far in the size of collections that can be built. This article reports on work by DL Consulting, a company that specializes in digital library solutions using Greenstone.

DL Consulting recently built a large Greenstone-based digital library for the National Library of New Zealand containing over 1 million pages of historic newspapers. Over half (60%) of these have been OCR’d, yielding nearly 20 GB of full-text searchable raw text: 2 billion words, with 60 million unique terms. When the project is complete, all page images will be fully searchable. The collection consists of over 6.5 million newspaper articles, each with its own metadata (much of it automatically generated), and the total volume of metadata is 50 GB, roughly two and a half times as much as the raw text. Before being built into a digital library collection the metadata is stored in XML format, which occupies around 600 GB, slightly less than 1 MB per newspaper page. A similar system is under development for Singapore’s National Library Board, and this is projected to grow to four times the size of the fully realized NZ collection.
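The size figures quoted above can be cross-checked with some back-of-envelope arithmetic. The sketch below is illustrative only: it treats the quoted values as exact round numbers, assumes decimal gigabytes (the text does not say which unit is meant), and uses constant and function names that are our own.

```python
GB = 10**9  # assumption: decimal gigabytes

# Round figures quoted in the text; several are lower bounds or approximations.
PAGES_TOTAL = 1_000_000        # "over 1 million pages"
OCR_PERCENT = 60               # share of pages that have been OCR'd
RAW_TEXT_BYTES = 20 * GB       # "nearly 20 GB"
WORDS_TOTAL = 2_000_000_000
UNIQUE_TERMS = 60_000_000
ARTICLES = 6_500_000           # "over 6.5 million"
METADATA_BYTES = 50 * GB
XML_BYTES = 600 * GB           # "around 600 GB"


def derived_figures() -> dict:
    """Derive ratios and per-page averages from the quoted totals."""
    pages_ocr = PAGES_TOTAL * OCR_PERCENT // 100
    return {
        "ocr_pages": pages_ocr,
        "metadata_to_text_ratio": METADATA_BYTES / RAW_TEXT_BYTES,
        "bytes_per_word": RAW_TEXT_BYTES / WORDS_TOTAL,
        "words_per_unique_term": WORDS_TOTAL / UNIQUE_TERMS,
        "words_per_ocr_page": WORDS_TOTAL / pages_ocr,
        "articles_per_page": ARTICLES / PAGES_TOTAL,
        # Two readings of "per page": every page, or OCR'd pages only.
        "xml_bytes_per_page_all": XML_BYTES / PAGES_TOTAL,
        "xml_bytes_per_page_ocr": XML_BYTES / pages_ocr,
    }


if __name__ == "__main__":
    for name, value in derived_figures().items():
        print(f"{name}: {value:,.1f}")
```

Under these assumptions the metadata is 2.5 times the size of the raw text, the text averages 10 bytes per word, and the XML works out at about 0.6 MB per page if spread over all pages, or 1 MB per page if attributed only to the OCR’d pages. Which of the two per-page readings is intended is not stated.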

Currently the National Library of Singapore implementation runs on a single machine. For full scalability it will be necessary to distribute the system over multiple servers. This is already possible to a certain extent by structuring the text and image capabilities as separate servers, but further growth will require distributing the search index and metadata database. Greenstone already allows different indexers to be incorporated: two (MG and MGPP) are included as standard, and a third (Lucene) was previously available only as an experimental option. Through this work Lucene has now been upgraded to a standard option. In addition, we are evaluating the performance of a fourth, IBM’s DB2, which can also serve as a metadata database and can be distributed over multiple servers.
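The idea of interchangeable indexers, one of which may be spread over several servers, can be illustrated with a small sketch. Everything in it is hypothetical: the class names, the method signatures, and the toy placement rule are ours and do not reflect Greenstone's actual plug-in interface or the way DB2 partitions data. An in-memory inverted index stands in for a real engine.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence, Tuple

Hit = Tuple[str, int]  # (document id, term frequency)


class Indexer(ABC):
    """Hypothetical indexer interface; not Greenstone's real API."""

    @abstractmethod
    def add(self, doc_id: str, text: str) -> None:
        """Index one document."""

    @abstractmethod
    def search(self, term: str) -> List[Hit]:
        """Return hits for documents that contain the term."""


class InMemoryIndexer(Indexer):
    """Toy inverted index standing in for a real search engine."""

    def __init__(self) -> None:
        self._postings: Dict[str, Dict[str, int]] = {}

    def add(self, doc_id: str, text: str) -> None:
        for word in text.lower().split():
            counts = self._postings.setdefault(word, {})
            counts[doc_id] = counts.get(doc_id, 0) + 1

    def search(self, term: str) -> List[Hit]:
        return list(self._postings.get(term.lower(), {}).items())


class ShardedIndexer(Indexer):
    """Spreads documents over several indexers and merges their results."""

    def __init__(self, shards: Sequence[Indexer]) -> None:
        self._shards = list(shards)

    def add(self, doc_id: str, text: str) -> None:
        # Deterministic toy placement: sum of code points modulo shard count.
        shard = self._shards[sum(map(ord, doc_id)) % len(self._shards)]
        shard.add(doc_id, text)

    def search(self, term: str) -> List[Hit]:
        # Query every shard, then order by frequency (ties broken by id).
        hits = [hit for shard in self._shards for hit in shard.search(term)]
        return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


if __name__ == "__main__":
    index = ShardedIndexer([InMemoryIndexer(), InMemoryIndexer()])
    index.add("a1", "gold rush gold")
    index.add("a2", "gold fields")
    print(index.search("gold"))
```

Because the sharded variant exposes the same two methods as a single indexer, calling code need not know whether the index lives on one machine or several. In a real deployment each shard would sit behind a network connection rather than in the same process.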

This article reports on these two developments. We begin by describing the historic newspaper project. Then we discuss the issues involved in incorporating DB2 into Greenstone as an optional indexer and metadata database, and present the results of preliminary experiments.