Coping with very large digital collections using Greenstone
Stefan Boddie, John Thompson, David Bainbridge, Ian H. Witten
†DL Consulting Ltd
Innovation Park
Hamilton, New Zealand
{Stefan, john}@dlconsulting.com
‡Department of Computer Science
University of Waikato
Hamilton, New Zealand
{davidb, ihw}@cs.waikato.ac.nz
Abstract. The Greenstone digital library software is widely used for small to medium digital library collections, but its reputation for creating very large collections is less well established. This paper describes how Greenstone is being used to produce large newspaper collections for the National Libraries of New Zealand and Singapore.
Greenstone is a suite of software for building and distributing digital library collections. It is not a digital library but a tool for building digital libraries. Developed and distributed in cooperation with UNESCO, it runs on all popular operating systems (even the iPod!). For more details see [1] and the project website [2].
The Greenstone Digital Library Software has been distributed under the GNU General Public License for a dozen years, during which time it has developed almost beyond recognition. Today the user base hails from 70 countries and the reader’s interface has been translated into over 45 languages. Downloads have exceeded 4,500 a month for many years. It is fair to say that this software has materially helped spread the practical impact of digital library technology throughout the world. Training courses have been offered throughout both the developing and developed worlds [3].
Because of the emphasis on training, most Greenstone collections are rather small: educational exercises use anywhere from a few documents to a few hundred. Some trainees return to their home institutions to work on large-scale projects, but many never progress beyond toy collections. Unfortunately this wide accessibility has earned Greenstone a reputation for being suitable only for small collections. This is incorrect.
It is probably unrealistic to expect a general-purpose piece of software, suitable for a wide variety of users and applications, to vie with a purpose-built system designed specifically for a very large information collection (like, say, a web search engine). However, Greenstone can be pushed reasonably far in the size of collections that can be built. This article reports on work by DL Consulting, a company that specializes in digital library solutions using Greenstone.
DL Consulting recently built a large Greenstone-based digital library for the National Library of New Zealand containing over 1 million pages of historic newspapers. Over half (60%) of these have been OCR’d, yielding nearly 20 GB of raw text: 2 billion words, with 60 million unique terms, all of it full-text searchable. When the project is complete, all images will be fully searchable. The collection consists of over 6.5 million newspaper articles, each with its own metadata (much of it automatically generated), and the total volume of metadata is 50 GB, roughly three times as much as the raw text! Before being built into a digital library collection the metadata is stored in XML format, which occupies around 600 GB, slightly less than 1 MB per newspaper page. A similar system is underway for the National Library of Singapore.
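The figures in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below is only a back-of-envelope check: it plugs in the rounded values quoted in the text ("over", "nearly" and "around" are ignored), assumes decimal units (1 GB = 1000 MB), and the function and constant names are ours. Because the inputs are rounded, the outputs are rough estimates and will not match the prose exactly; for example, 50 GB against a rounded 20 GB gives a ratio of 2.5 rather than 3.

```python
# Rounded figures taken from the text; all values are approximate.
PAGES = 1_000_000        # "over 1 million pages"
OCR_PERCENT = 60         # share of pages that have been OCR'd
ARTICLES = 6_500_000     # "over 6.5 million newspaper articles"
RAW_TEXT_GB = 20         # "nearly 20 GB of raw text"
WORDS = 2_000_000_000    # "2 billion words"
METADATA_GB = 50         # metadata in the built collection
XML_GB = 600             # source XML before building

MB_PER_GB = 1000         # assumption: decimal units
BYTES_PER_GB = 10**9     # assumption: decimal units


def summarise():
    """Derive per-page and per-word estimates from the rounded figures."""
    ocr_pages = PAGES * OCR_PERCENT // 100
    return {
        "ocr_pages": ocr_pages,
        "xml_mb_per_page": XML_GB * MB_PER_GB / PAGES,
        "metadata_to_text_ratio": METADATA_GB / RAW_TEXT_GB,
        "words_per_ocr_page": WORDS / ocr_pages,
        "bytes_per_word": RAW_TEXT_GB * BYTES_PER_GB / WORDS,
        "articles_per_page": ARTICLES / PAGES,
    }


if __name__ == "__main__":
    for name, value in summarise().items():
        print(f"{name}: {value:,.2f}")
```

On these rounded inputs the source XML works out to well under 1 MB per page, and the raw text averages about ten bytes per word, which is plausible for English text including whitespace and punctuation.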
Currently the National Library of Singapore implementation runs on a single machine. For full scalability it will be necessary to distribute the system over multiple servers. This is already possible to a certain extent by running the text and image services as separate servers, but further growth will require distributing the search index and metadata database. Greenstone already allows different indexers to be plugged in: two (MG and MGPP) are included as standard, and a third (Lucene) has been available but experimental. Through this work Lucene has now been upgraded to become a standard option. Additionally, we are evaluating the performance of a fourth, IBM’s DB2, which can also serve as a metadata database and can be distributed over multiple servers.
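The two ideas in this paragraph, interchangeable indexers and a search index spread over several servers, can be pictured with a small sketch. It is illustrative only: the `Indexer`, `InMemoryIndexer` and `ShardedIndexer` classes are hypothetical names invented here, the toy inverted index merely stands in for a real backend, and nothing below reflects Greenstone's actual code or how any of the indexers named above is integrated.

```python
import heapq
import zlib
from abc import ABC, abstractmethod
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Hit = Tuple[int, str]  # (term frequency, document id)


class Indexer(ABC):
    """Hypothetical plug-in interface; not Greenstone's real API."""

    @abstractmethod
    def add(self, doc_id: str, text: str) -> None: ...

    @abstractmethod
    def search(self, term: str, limit: int = 10) -> List[Hit]: ...


class InMemoryIndexer(Indexer):
    """Toy inverted index standing in for a real indexing backend."""

    def __init__(self) -> None:
        # term -> {doc_id: number of occurrences}
        self._postings: Dict[str, Dict[str, int]] = defaultdict(dict)

    def add(self, doc_id: str, text: str) -> None:
        for word in text.lower().split():
            counts = self._postings[word]
            counts[doc_id] = counts.get(doc_id, 0) + 1

    def search(self, term: str, limit: int = 10) -> List[Hit]:
        hits = self._postings.get(term.lower(), {})
        return heapq.nlargest(limit, ((n, d) for d, n in hits.items()))


class ShardedIndexer(Indexer):
    """Spreads documents over several indexers and merges their results."""

    def __init__(self, shards: Sequence[Indexer]) -> None:
        self._shards = list(shards)

    def add(self, doc_id: str, text: str) -> None:
        # A stable hash keeps each document on the same shard across runs.
        slot = zlib.crc32(doc_id.encode("utf-8")) % len(self._shards)
        self._shards[slot].add(doc_id, text)

    def search(self, term: str, limit: int = 10) -> List[Hit]:
        # Ask every shard for its best hits, then keep the overall best.
        partial = [h for s in self._shards for h in s.search(term, limit)]
        return heapq.nlargest(limit, partial)


if __name__ == "__main__":
    index = ShardedIndexer([InMemoryIndexer(), InMemoryIndexer()])
    index.add("page-1", "Hamilton news and Hamilton weather")
    index.add("page-2", "shipping news")
    index.add("page-3", "Hamilton council")
    print(index.search("hamilton"))
```

Because every backend exposes the same two methods, the front end does not need to know whether it is talking to a single index or to several. In a real deployment each shard would be a separate server and the merge step would also have to reconcile ranking scores, which this sketch sidesteps by using raw term counts.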
This article reports on these two developments. We begin by describing the historic newspaper project. Then we discuss the issues involved in incorporating DB2 into Greenstone as an optional indexer and metadata database, and present the results of preliminary experiments.