This collection may be useful for finding solutions to common problems, or simply for tracking the progress of the Greenstone software.
To subscribe to the Greenstone mailing list, please click here.
The Greenstone Archives collection uses the Email plugin, which parses files in email formats. There is one file for each year, and each file contains many email messages. The Email plugin splits these into individual documents, and produces Title, Subject, Headers, From, FromName, FromAddr, Date, and DateText metadata.
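To make the splitting step concrete, here is a minimal Python sketch of the idea. It is not the Email plugin's code. It assumes the yearly files are in mbox style, where each message begins with a "From " separator line, and it uses Python's standard email module to pull out a few of the fields listed above. The function names and the sample text are invented for illustration, and the sketch does not attempt to reproduce how the plugin derives Title, Headers, or DateText.

```python
from email import message_from_string
from email.utils import parseaddr


def split_messages(mbox_text):
    """Split mbox-style text into one string per message.

    Assumption: every message starts with a "From " separator line.
    """
    messages, current = [], []
    for line in mbox_text.splitlines():
        if line.startswith("From "):
            if current:
                messages.append("\n".join(current))
            current = []  # the separator line itself is dropped
        else:
            current.append(line)
    if current:
        messages.append("\n".join(current))
    return messages


def extract_metadata(raw_message):
    """Return a few metadata fields for a single message."""
    msg = message_from_string(raw_message)
    name, addr = parseaddr(msg.get("From", ""))
    return {
        "Subject": msg.get("Subject", ""),
        "From": msg.get("From", ""),
        "FromName": name,
        "FromAddr": addr,
        "Date": msg.get("Date", ""),
    }


# Invented sample data: two messages in one file.
SAMPLE = """From alice@example.org Mon Jan  3 10:00:00 2005
From: Alice Example <alice@example.org>
Subject: Building a collection
Date: Mon, 3 Jan 2005 10:00:00 +0000

How do I rebuild?
From bob@example.org Tue Jan  4 11:00:00 2005
From: Bob Example <bob@example.org>
Subject: Re: Building a collection
Date: Tue, 4 Jan 2005 11:00:00 +0000

Run the build script.
"""

records = [extract_metadata(m) for m in split_messages(SAMPLE)]
```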
The collection configuration file begins with the specification groupsize 200. This puts documents together into groups of 200. Email collections typically have many small documents, and grouping them prevents Greenstone's internal file structures from becoming bloated and occupying more disk space than necessary. Notice that the Email plugin first splits the input files into individual emails, and groupsize then groups them together again. This gives the collection designer control over how the documents are grouped, independently of how the original files were organized.
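The effect of groupsize can be pictured as simple batching. The Python sketch below only illustrates the arithmetic; it says nothing about how Greenstone stores the groups internally, and the function name group_documents is hypothetical.

```python
def group_documents(documents, groupsize=200):
    """Split a list of documents into consecutive groups of at most groupsize."""
    return [documents[i:i + groupsize]
            for i in range(0, len(documents), groupsize)]


# 450 small documents end up in three groups.
docs = [f"message-{n}" for n in range(450)]
groups = group_documents(docs)
print([len(g) for g in groups])  # [200, 200, 50]
```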
The indexes line specifies four searchable indexes, which can be seen by clicking beside the word "Messages" on the search page to reveal a drop-down menu. The first (called Messages) is created from the document text, while the others are formed from From, Subject, and Headers metadata.
There are three classifiers, based on Subject, FromName, and Date metadata. The AZCompactList classifier used for the first two is like AZList but generates a bookshelf for duplicate items, as illustrated here. The result is a tree structure whose nodes are either leaf nodes, representing documents, or internal nodes. A metadata item called numleafdocs gives the total number of documents below an internal node. The format statement for the first classifier, called CL1Vlist, checks whether this item exists. If so, the node must be an internal one, in which case it is labeled by its Title. Otherwise the node's label starts with the Subject, then gives From metadata (both name and email address, suitably hyperlinked), followed by the DateText.
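The bookshelf structure can be sketched in a few lines of Python. This is an illustration only, not the AZCompactList implementation: it assumes that any metadata value shared by two or more documents gets a bookshelf, whereas the real classifier may apply different thresholds. The dictionary layout and the function name build_compact_list are hypothetical.

```python
from collections import defaultdict


def build_compact_list(documents, key):
    """Group documents by one metadata key.

    A value that occurs once yields a leaf node; a value shared by several
    documents yields an internal "bookshelf" node carrying numleafdocs.
    """
    by_value = defaultdict(list)
    for doc in documents:
        by_value[doc[key]].append(doc)

    nodes = []
    for value in sorted(by_value):
        docs = by_value[value]
        if len(docs) == 1:
            nodes.append({"kind": "leaf", "doc": docs[0]})
        else:
            nodes.append({
                "kind": "bookshelf",
                "Title": value,
                "numleafdocs": len(docs),  # documents below this node
                "children": [{"kind": "leaf", "doc": d} for d in docs],
            })
    return nodes


# Invented sample data.
messages = [
    {"Subject": "Installation", "FromName": "Alice Example"},
    {"Subject": "Searching", "FromName": "Bob Example"},
    {"Subject": "Installation", "FromName": "Carol Example"},
]
tree = build_compact_list(messages, "Subject")
```

Here the two "Installation" messages share one bookshelf with numleafdocs equal to 2, while "Searching" remains an ordinary leaf.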
The second classifier (CL2Vlist) is similar but shows slightly different information; the result can be seen here. For internal nodes, the number of leaf documents (numleafdocs) is given in parentheses after the Title; for document nodes, the From, Subject, and Date metadata are shown.
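The decision made by the two format statements can be paraphrased in Python as follows. This is a sketch of the logic described above, not Greenstone's format-statement syntax: nodes are modelled as plain dictionaries, the " | " separator and the sample values are invented, and hyperlinking is left out.

```python
def label_by_subject(node):
    """Label in the style described for CL1Vlist."""
    if "numleafdocs" in node:  # only internal nodes carry this item
        return node["Title"]
    sender = f'{node["FromName"]} <{node["FromAddr"]}>'
    return " | ".join([node["Subject"], sender, node["DateText"]])


def label_by_sender(node):
    """Label in the style described for CL2Vlist."""
    if "numleafdocs" in node:
        return f'{node["Title"]} ({node["numleafdocs"]})'
    return " | ".join([node["From"], node["Subject"], node["Date"]])


# Invented sample nodes; the metadata values are placeholders.
bookshelf = {"Title": "Re: Installation", "numleafdocs": 3}
message = {
    "Subject": "Installation",
    "From": "Alice Example <alice@example.org>",
    "FromName": "Alice Example",
    "FromAddr": "alice@example.org",
    "Date": "20050103",
    "DateText": "3 January 2005",
}

print(label_by_subject(bookshelf))  # Re: Installation
print(label_by_sender(bookshelf))   # Re: Installation (3)
```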
The third classifier is a DateList, which allows selection by month and year.
Finally, the document text is formatted to show the header fields followed by the message text (written as [Text] in the format statement). However, there is a subtle twist, and to see what it is you should look at a document in the collection. At the end of the document is a "show all headers" hyperlink, which, when clicked, shows a long list of email headers and changes the hyperlink at the end of the document to "hide headers". The faint of heart should skip the following explanation! The If in the format statement tests cgiargheaders, which indicates whether the URL contains a CGI argument called "headers". If it does, the Headers metadata is displayed; otherwise it is not. After the message text has been shown (by [Text]), the cgiargheaders variable is tested again to decide whether to display the "hide headers" or the "show all headers" hyperlink.
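For readers who did not skip the explanation, the same toggle can be sketched in Python. This is an analogy, not Greenstone's format-statement syntax: the function render_document is hypothetical, the choice of Subject and From as the fields that are always shown is an assumption, and the URLs and sample document are invented.

```python
from urllib.parse import urlparse, parse_qs


def render_document(url, doc):
    """Return the page text for one message, honouring a "headers" CGI argument."""
    # keep_blank_values=True so that a bare "headers=" still counts as present.
    args = parse_qs(urlparse(url).query, keep_blank_values=True)
    show_all = "headers" in args

    # Assumption: these two fields are always displayed.
    parts = [f'Subject: {doc["Subject"]}', f'From: {doc["From"]}']
    if show_all:
        parts.append(doc["Headers"])  # the full header list
    parts.append(doc["Text"])
    # The same test is made a second time to choose the closing link.
    parts.append("[hide headers]" if show_all else "[show all headers]")
    return "\n".join(parts)


# Invented sample document.
doc = {
    "Subject": "Installation",
    "From": "Alice Example <alice@example.org>",
    "Headers": "Received: from mail.example.org\nMessage-ID: <1@example.org>",
    "Text": "How do I rebuild?",
}

short_view = render_document("http://example.org/library?d=D1", doc)
full_view = render_document("http://example.org/library?d=D1&headers=1", doc)
```

The test is made twice, once before the message text and once after it, which is why the link at the end always offers the opposite of the current view.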
There are 4 ways to find information in this collection: