Greenstone, an Implementation in Afghanistan
A Case Study of the AREU Collection,
by Graeme Foster

1 Abstract

This article summarises the process of implementing a Greenstone digital library in Afghanistan. The work was initiated in the summer of 2006 and completed early in 2007. The article looks at some of the decisions that were made and aims to provide some assistance to anyone who is evaluating Greenstone or looking to configure their own Greenstone collection.

2 Introduction

The Afghanistan Research and Evaluation Unit (AREU) is an independent research organisation based in Kabul. AREU's mission is to conduct high-quality research that informs and influences policy and practice. Since 2002, AREU has maintained a library of Afghanistan-specific materials to support its own research. The collection currently has more than 6,500 catalogued items, primarily in English, Dari and Pashto, as well as in other languages (French, German, Italian and Russian). The AREU library's main aim is not to compete with international collections, but rather to gather and preserve relevant materials available both within Afghanistan and abroad for use by national and international researchers.

The AREU digital catalogue is not a typical Greenstone collection. It is not being used as a true digital library, in that the items of the collection that have been digitised are not integrated into the collection. Rather, the collection is a comprehensive bibliographic database of the library's holdings with links to the digitised data; where possible, these links point to the original material when it is available on the Internet (i.e. by citing the URL). However, despite the fact that the collection was not implementing the digital library features of
Greenstone, there were still compelling reasons for using Greenstone.

2.1 Why was Greenstone chosen?

The support and use of open standards is a contemporary battle cry, and it is easy to get drawn into such a furore, but it did have some pertinence to the AREU collection. Much of the information already existed in a database; however, there was no support for that database, it was fixed to a single client, and the character set for the Arabic scripts was non-standard. These were all surmountable issues, but it was clear that there was a danger of being locked into a proprietary standard.

One of the open standards that was identified early on as a must-support was Unicode. It has become the standard for the support of international scripts, and its adoption at the core level by operating systems and programming languages only confirms its dominance in this arena. The AREU collection supports three languages: English and the two main Afghan languages, Pushto and Dari.

A second requirement was to be able to put the collection on the web so that scholars from around the globe can access it and, where available, immediately get their
hands on the soft copy.

A third requirement was that the software must be easy to manage. The technical support staff were already overworked, so managing the new service should not add significantly to their workload. Obviously, this is a difficult parameter to assess until the product is up and running and people are using it. On a test machine it looked as if it would be easy to manage and, thankfully, when the product went live it ran smoothly and with very little demand on the technical staff.

A final requirement of the organisation was that backups were easy to create. The situation in Afghanistan is uncertain: it is experiencing a fine balance of normality after many years of instability, but this could deteriorate. Having a backup of all the data was considered a non-negotiable requirement. This was addressed in several ways. First, the whole server is mirrored and the library is physically backed up on a monthly basis. Secondly, the whole digital library is hosted on a server outside of Afghanistan. The real feature that got people excited (and, let's face it, when talking about backups it is hard to get people excited) was Greenstone's ability to create a live CD of the collection. This would not only act as a backup but could also be made available to organisations that had no, or limited, access to the Internet, an all too common situation in Afghanistan.

From a technical perspective it was good that Greenstone was an open source product. This meant
that, if necessary, the code could be changed to meet the specific needs of the collection. Such power is of course a double-edged sword. Any change to the code made to meet specific requirements breaks the upgrade path. So it is a feature that is great to have, but you hope that you never need to use it. As it turned out it was needed, but only in one small area requiring a change to a single file. This sort of change is acceptable, and the fact that the collection was a huge deviation from a typical digital library is testament to the configurability of Greenstone.

The fact that Greenstone is highly configurable meant that we were able to make the collection look the way we wanted it to look. However, it would only be fair to say that it was a feature that was not fully appreciated until after it had been configured!

3 The Structure of the Collection

3.1 Look of the collection

The general requirement was for the collection to reflect the professional image of the organisation that the web site has helped to enhance. For individual data items it was thought that the familiar library catalogue card would be a model that users would be comfortable with. From a representation perspective, only data that exists should be displayed; that is, if a publication doesn't have an author then it shouldn't display an Author title with a blank value.
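The rule that a heading is shown only when its data exists can be sketched in a few lines. The following Python is a minimal illustration, not the code of the AREU Data Entry system; the function name render_card, the field headings and the sample values are all hypothetical.

```python
from html import escape


def render_card(fields):
    """Return an HTML table with one row per field that has a value.

    `fields` is a list of (heading, value) pairs. The headings and values
    used below are made-up examples, not AREU's actual card layout.
    """
    rows = []
    for heading, value in fields:
        if value is None or not str(value).strip():
            continue  # skip empty fields so no heading appears with a blank value
        rows.append(
            "<tr><th>{}</th><td>{}</td></tr>".format(
                escape(heading), escape(str(value).strip())
            )
        )
    return "<table>\n" + "\n".join(rows) + "\n</table>"


card = render_card([
    ("Title", "A sample report"),
    ("Author", ""),          # no author recorded, so this row is omitted
    ("Language", "English"),
])
print(card)
```

Filtering at the point where the card is generated keeps the rule in one place, so the HTML handed to Greenstone never contains an empty heading.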
Illustration 1: An example of the catalogue card

Illustration 1 shows the headings, in English, and the data of a sample catalogue card. It also shows the link to the soft copy of the document, in this case a Word document of the text.

3.2 Indices and searching

There was a strong feeling that the elements that comprised the search indices should only come from the metadata of the collection. The collection consists of several thousand items, with a small proportion being available electronically. If the contents of those for which soft copies existed were indexed then the search results would be skewed in favour of this subset of documents.

The browse feature had great potential in helping researchers find documents that they might not otherwise find, and so browse categories were added for the author, title, corporate body, subject headings and language.

As it turned out, the browse categories proved to be an excellent way of validating the uniformity of the data. Inconsistencies in spelling, punctuation or letter case of the metadata were easily identified because they would generate an extra entry in the browse category. This turned out to be a quicker and more reliable method than checking each entry manually.
As shown in Illustration 2, the browse categories can be used to identify inconsistencies in the data. The illustration shows three subject headings with almost identical values which, because of slight variations in data entry (shown with a red underline), become separate entities. By browsing through all the possible values it is easy to spot such issues with the existing data.

Illustration 2: A selection of subject headings
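The same check that was done by eye in the browse list can be approximated with a script. The sketch below is an assumption about how such a check might look, not something that was part of the AREU work: it reduces each heading to a comparison key that ignores letter case, punctuation and spacing, and reports keys that were entered in more than one form. The sample headings are invented.

```python
import re
import unicodedata
from collections import defaultdict


def normalise(heading):
    """Reduce a heading to a comparison key: case, punctuation and spacing are ignored."""
    text = unicodedata.normalize("NFKC", heading).casefold()
    text = re.sub(r"[^\w\s]", " ", text)   # replace punctuation with a space
    return " ".join(text.split())           # collapse runs of whitespace


def find_variants(headings):
    """Group headings that differ only in case, punctuation or spacing."""
    groups = defaultdict(set)
    for heading in headings:
        groups[normalise(heading)].add(heading)
    return {key: sorted(values) for key, values in groups.items() if len(values) > 1}


sample = [
    "Women -- Afghanistan",
    "Women--Afghanistan.",
    "women -- afghanistan",
    "Agriculture -- Afghanistan",
]
variants = find_variants(sample)
for key, values in variants.items():
    print(key, values)
```

Spelling differences are not caught by this key, and headings in Arabic script that carry diacritics would need extra care, so a pass through the browse list remains a useful complement.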
3.3 File structure

The catalogue cards are represented by a simple HTML file. It is this file that is displayed by Greenstone. Each HTML file is associated with an XML file that contains the metadata that Greenstone will use in its indexing. The details of the HTML file of Illustration 1 are given in the appendix, and it can be seen that the HTML is laid out with extensive use of HTML tables. The style sheet referred to in this HTML is not used by Greenstone but by the Data Entry system described later.

Each HTML and XML file pair is placed in its own directory. Whilst this helps to manage the data, it was mainly done because Greenstone requires the metadata file to have a specific file name, and for ease of managing this file it was decided to have one entry per metadata file.

The directories of the catalogue cards were broken down into four language groups: Dari, Pushto, English and Misc. Technically English and Misc are interchangeable; the others are distinct in that the titles for the catalogue card will be translated into the appropriate language. The translation of each catalogue card is managed by the Data Entry program, and a catalogue card is presented in the primary language of the text.

4 Data Entry

A small system was written to ease the data entry of each catalogue card and the editing of existing entries. This system automatically generates the files for each system and includes some validation of the data. It also allows for the simultaneous creation and editing of the catalogue cards. The system runs independently of Greenstone but honours the file system that had been designed. This means that it doesn't need to be hosted on a computer that supports Greenstone, and it is easy to move it to a computer that has the Greenstone system, such as a backup server.

This system is still being developed and enhanced, and it is planned to release it to the community under the GPL.

5 Configuring the collection

There are two broad areas of collection configuration; these have been divided into collection description and collection style.

5.1 Collection description

The file that describes the collection is the collect.cfg file, which is located in the etc directory of the collection folder. This file is included in the Appendix.

For the purpose of this collection the important elements of this file are the metadata that is used, the browse categories and what should be displayed in the search and browse results. Additionally this file includes the plugins that are used and the location of the icons that are to be used for this collection.
5.1.1 Metadata, selecting, translating and using

Whilst the collection was being designed different metadata sets were used; in the end we decided to use the rfc1807 metadata set, because it was the best fit for our requirements. It still meant that some fields were not quite right, and so we wanted to change the names of some metadata items. Because this collection was introducing two new languages to Greenstone it was also necessary to add the translations for the metadata elements being used.

Adding a translation to a metadata element turns out to be a trivial exercise. The following snippet shows the translations that were required for the author metadata.
# meta data translation
collectionmeta .document:rfc1807.author [l=en] "Author"
collectionmeta .document:rfc1807.author [l=prs] "مؤلف"
collectionmeta .document:rfc1807.author [l=ps] "ليكوال"
For each metadata element there are three translations, one for each of the three languages supported by the collection: [l=en] is English, [l=prs] is Dari, Persian Southern, and [l=ps] is Pushto. A translation was provided for each of the metadata elements being used and put into this file.

The main problem that was encountered with this process stems from the fact that we started with a different metadata set. In fact we selected elements from several metadata sets, picking and choosing the element that appeared to be the most appropriate. As the development process progressed it was clear that this happy-go-lucky approach was going to be confusing at a later date, and so we decided to select a single metadata set and modify the elements when there wasn't an exact match. For example, we use rfc1807.id for the Accession Number, and rfc1807.keywords is really Subject Headings. A script was written to implement the agreed-upon change in all the records and the collection was rebuilt.

When the collection was viewed there was sadness all around, because the old metadata elements were still being used. The problem was obscured by the fact that the English translations were consistent, so at first it appeared as if Greenstone had picked up the change but not the new translations. However, once on the right track it was clear what was happening from the output generated by the build: some of the records still had the old metadata. The conversion script had failed on a few records, which was why they were still using the old metadata. Once these records were fixed the new metadata was picked up and the browse buttons displayed their translated values; happiness restored.

5.1.2 Browse categories

When the browse categories were added it was clear that it was important to select the best way to display the data. This is still an area for debate, and browsing hasn't proved to be as popular as the search which, with the ascendancy of search engines on the Internet, has become a pervasive technique in the researcher's toolbox. The problem lies with the fact that much of the metadata is unique and so the results do not necessarily cluster well. The clear exception is the language metadata, which has a limited set of values and is an ideal candidate for a browse category; unfortunately for the researchers it is not as useful as the other categories.
One of the difficulties that we faced was that some of the specific browse classifiers didn't work properly (or at all) with the mixed scripts that we were using. To overcome that we decided to classify them using the GenericList and then partition the result in blocks of 50. This provides a manageable browse list, although it does lead to an excessively long list of classifiers along the top. This can be seen in Illustration 3 below.

Illustration 3: The (rather long) list of Subject Heading Classifiers
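To see why partitioning in blocks of 50 produces a long row of classifiers, it helps to look at the arithmetic. The Python below is only an illustration of fixed-size partitioning; it is not how Greenstone's classifier is implemented, and the function name partition and the generated headings are made up.

```python
def partition(values, block_size=50):
    """Split sorted values into consecutive blocks, each labelled by its first and last value."""
    ordered = sorted(values)
    blocks = []
    for start in range(0, len(ordered), block_size):
        block = ordered[start:start + block_size]
        label = "{} - {}".format(block[0], block[-1])
        blocks.append((label, block))
    return blocks


# 120 made-up headings stand in for the real subject headings.
headings = ["Heading {:03d}".format(n) for n in range(1, 121)]
blocks = partition(headings)
for label, block in blocks:
    print(label, len(block))
```

With a fixed block size the number of partitions grows in step with the number of distinct values, which is why a category made up of mostly unique values ends up with so many entries along the top.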
Another problem is that when the script switches from the Roman to the Arabic script the direction of reading changes, so the (left to right) Z is associated with the rightmost Arabic (right to left) character on the same line. That is, in Illustration 3 the Z and the ﺍ are associated and not the Z and the ﻝﺏ characters. This abrupt change in presentation can be confusing, and it would be better if the metadata in the Roman script and the Arabic script were further divided by a new line.

One feature that helped was to show the number of documents under each category. However, since a large number of categories had a single document, the value of including a number was diminished. So the format string was modified slightly so that the number of documents is displayed only if there is more than one document. This was done as follows:
{if}{[numleafdocs] gt 1, ([numleafdocs])}
This worked well for most browse categories, but there were some formatting problems with the title. A long title sometimes meant that the cell containing this value wrapped to the next line; the browser would then maximise the space available for the title and wrap the bookshelf icon and the number of books onto two lines. The solution to ensure that these appeared on a single line came in the form of the non-standard HTML tag <nobr>.
format VList "<td valign=top><nobr>[link][icon][/link]
{if}{[numleafdocs] gt 1, ([numleafdocs])}</nobr></td>
The rest of the format string was the standard entry for the title.

5.1.3 Search results

The search feature was identified as being the main interaction with the collection, and it was expected that researchers would feel comfortable with this so long as it mirrored the plain features of a typical Internet search site. This required the indexes to be set up and the search results to be comprehensive and clear.

5.1.3.1 Indexes

The indexes are set up within the collect.cfg file as follows:
indexes document:text document:rfc1807.author document:rfc1807.title document:rfc1807.keyword document:rfc1807.series document:rfc1807.id
This creates six indices that are automatically incorporated into the search feature.

5.1.3.2 Results

By using the defaults the search results are acceptable and provide a comfortable representation of the matches found.
The search box has been modified slightly to meet the specific needs that we had. We wanted to simplify the search process by removing the ability to change the search features in a separate preferences page and placing the most common search preferences on the search dialogue. The changes that were required are described later.

Illustration 4: Search Results
The results are displayed in a format that we were happy with, and so this didn't require any changes.

5.2 Collection style

The primary requirement of the collection style was for it to mirror the organisation's web site as closely as possible. This required extensive changes to the macros that define how a Greenstone collection is displayed and the addition of collection-specific style sheets.

The most important constraint was that we didn't want to change the core Greenstone macros. Any changes to core macro files would lead to complications when upgrading, and we had reached agreement that the collection would be hosted on the examples page of the Greenstone Digital Library, www.greenstone.org/examples. Thankfully, whilst tempting, it is not necessary to change the core macros. Instead a collection-specific macro file can be created. The collection-specific macro file should be called extra.dm and will reside in the macros directory of the collection.

5.2.1 Language Selection

The organisation's web site allows the user to select the language that they would like to view the web site in. This facility is available on any page that the user might be visiting. The languages are limited to English, Dari and Pushto. Selecting the language in Greenstone, by contrast, requires navigating to the Preferences page, where all the languages that are available to Greenstone are displayed. This didn't make sense in our context and so the extra.dm macro file was changed; the complete changes can be seen in the appendix and appear at the start of this file.

The first important point with this piece of code is that we created some new macros. We decided that all new macros would be prefixed with the collection name, hence _AREULang_ is a new macro which holds the dropdown list of available languages. The macro itself is quite simple. It is repeated for each language, thus _AREULang_ [l=prs] {} is the macro defined for Dari. The macro defines the options of an HTML selection box and, depending upon the language, the appropriate option is selected within the HTML. Hence when the Dari language has been selected the macro is as follows:
_AREULang_ [l=prs] {
<option value="prs" selected>
دری
</option>
<option value="en">
English
</option>
<option value="ps">
پښتو
</option>
}
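Because the macro is repeated once per language with only the selected option changing, the three definitions could be generated rather than typed by hand. The Python below is a hypothetical sketch of that idea and was not part of the AREU work. It assumes that the option order is the same in every definition and that the labels are spelled as shown; both are assumptions.

```python
LANGUAGES = [          # (Greenstone language code, label shown in the dropdown)
    ("prs", "دری"),
    ("en", "English"),
    ("ps", "پښتو"),
]


def lang_macro(selected):
    """Return the _AREULang_ definition for the interface language `selected`."""
    lines = ["_AREULang_ [l={}] {{".format(selected)]
    for code, label in LANGUAGES:
        flag = " selected" if code == selected else ""
        lines.append('<option value="{}"{}>'.format(code, flag))
        lines.append(label)
        lines.append("</option>")
    lines.append("}")
    return "\n".join(lines)


# One definition per interface language, keyed by language code.
macros = {code: lang_macro(code) for code, _ in LANGUAGES}
print(macros["prs"])
```

Generating the definitions keeps the three copies in step if a language is ever added or a label is corrected.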
Next come two more collection-specific macros. These are used to define cascading style sheet classes that ensure that the dropdown box appears correctly in the rendered HTML page. These are _AREULangCSS_ and _AREUBarCSS_, and they just set up the name of the CSS class that needs to be used.

The dropdown list is created as follows, again in a collection-specific macro.
_AREULangOptions_ {
<div class=_AREULangCSS_>
<form name="PrefForm" method="get" action="/gsdl/cgi-bin/library">
<select name="l" onchange="updatel();">
_AREULang_
</select>
</form>
</div>
}
The dropdown list is placed in a form; this is an existing form used by the preferences page, and a JavaScript macro is triggered by the onchange event if a new element is selected. The form is then wrapped in a division tag, which allows the CSS class to be applied to the whole list. In the case of Dari, the style class will be as follows:
span.langrtl
{
background-color: #b0cab0;
float: left;
margin-top: 22px;
margin-left: 5px;
}
This only differs from the style class used for the English dropdown in the float directive: here it is left, whereas on the English version the float is to the right.

Finally the language selection box needs to be placed on the page. That is done by using the
existing macro
{
<div class="navbar">
<div class="_AREUBarCSS_"> SearchTitlesFilenames </div>
_AREULangOptions_
</div>
}
This places the SearchTitlesFilenames macro and our new _AREULangOptions_ macro into a couple of HTML division tags which are formatted using CSS classes: navbar and the language-specific class, which for Dari will be barrtl. The final rendering of the navigation bar can be seen in Illustration 6.

There is just one last area to cover. In Greenstone the macros are broken down into packages and, since this uses a Greenstone macro, it is important to define it in the correct package. This can be done by finding where the macro is originally defined; unfortunately this is a task of scouring the macro files. The nav_css.dm file defines this macro, and from this file it is clear that the macro belongs to the Global package. This means that in the extra.dm file the line package global should appear before these definitions.

5.2.2 Query

The default simple query that Greenstone generates has two dropdown lists that allow the user to select what part of the document gets searched, and how the words are incorporated into the search, which is either all of the words or some of the words. Finally, of course, there is the text entry area and a search button. This is very simple and efficient; however, because we are blocking access to the preferences page the query area requires a little more, as shown in Illustration 4 (Search Results). We have added two dropdown menus from the preferences page. These allow the user to select how many items are returned in the entire search and how many are returned on each page. This is done with the following code added to the extra.dm file.
_smallquerybox_
{ <br /><input type="text" name="q" value="" size="50"><br />
Return up to
hits with
hits per page. <input type="submit" value="_textbeginsearch_">
}
The original _smallquerybox_ is defined in the query.dm file and belongs to the query package. It has the text box and the button on the same line. This is slightly modified by having the text box and then, on a new line, the two new dropdown lists followed by the search button. The two dropdown lists are defined in the _textprefop_ macro; however, this macro is defined in a different package, the preferences package (it is used in pref.dm), and so it can be inserted using the syntax _package:macro_ as follows:

Return up to
hits with
hits per page.
We wanted to set the default values so that over the LAN the maximum number of results was returned, whilst over the Internet the number of results returned per page was limited. This is done by initially passing the values in through the URL. The number of items returned is held in the variable called m, whilst the number returned per page is held in the variable called o, so the URL to the collection is modified with the following appended to it: &m=1&o=50.

5.2.3 About Page

We are using the about page as the front page to the collection from the web site, and it needed to be significantly modified so that it closely mirrored the look of the organisation's web site. The organisation's web page is shown in Illustration 5 and the about page is shown in Illustration 6.
Illustration 5: The Organisation's web page

Illustration 6: The About Page

As can be seen there is a clear distinction between the two, but there is also a familiar theme running through the two sites. The difference is just to emphasise that the user has moved to a different site, but it is important to keep the look so that the organisational image is maintained and emphasised.

The Greenstone page is built up in parts, using macros to define each part of the page. To achieve our aims for the page layout we rewrote the header and footer sections.

5.2.3.1 Header

The header sections before and after the rewrite can be seen in the following table. Both versions have a banner section whilst the original version has an additional bannerextra section. In the revised version the bannerextra section is incorporated within the banner section; this is because we don't need the flexibility of the original version.
Illustration 7: The Heading section of each Greenstone page
As can be seen this macro uses three other macros,
</div> <! end of page banner >
<! end of page banner >
}
}
The details of these three macros are shown later. This leaves the navigation bar, which is defined as follows:
Original version in nav_css.dm:

{
<div class="navbar">
<p class="navbar">
SearchTitlesFilenames
</p>
</div>
}

Modified version in extra.dm:

{
<div class="navbar">
<div class="_AREUBarCSS_"> SearchTitlesFilenames </div>
_AREULangOptions_
</div>
}
The original version wrapped the navigation bar in a paragraph, which seemed to complicate the layout of the navigation bar and the language options, so the revised version uses HTML divisions to separate them. This is covered in greater detail in the earlier section on Language Selection.

5.2.3.2 Footer

The footer of each page is as follows:
This is set up by redefining the footer macro as shown below: