About this collection

This collection demonstrates how to combine Word and PDF versions of the same document in Greenstone. As representative documents, two articles about Greenstone are used: one published at the Joint ACM/IEEE Conference on Digital Libraries, and the other at the International Conference on Asian Digital Libraries. Both articles were written in Microsoft Word, with the "save as" option used to generate PDF versions.

In the built collection, the Word version of each document forms the basis for what Greenstone indexes and for the structure of the resulting documents. This structure is based on the style information used in Word to demarcate sections (e.g., Heading 1) and subsections (e.g., Heading 2); the information is used to break the Greenstone version of the document into sections and subsections, as well as to form a table of contents.

When it comes to providing access to the document, the collection designer has decided it is better to serve the PDF version instead of the Word version. In a digital library context, PDF has several advantages over Word: first and foremost, the PDF version provides a higher degree of provenance as it can be, for example, digitally signed and made non-editable (using, for instance, the PDF/A form of the format, specifically designed for digital preservation and archiving); secondly, Adobe makes the software application for displaying PDF files freely available on all the major operating systems, as well as on a wide range of mobile platforms, such as Android and iOS. This differs from Microsoft Word, which is sold commercially and is restricted in the operating systems it runs on.

The key idea

The key to how this collection is set up is that the Word and PDF versions of a document deliberately have the same filename; only the file extension is different. This is simple to achieve, as it reflects common practice when a document is published in PDF form. This convention is then exploited at build time by the associate_ext plugin option in Greenstone, which allows variants of a document to be grouped together and treated as a single document, based on similarity of filename.

In this example collection we set this option in the WordPlugin to pdf. This setting makes the Word version of the document the dominant form in the built collection: the text that Greenstone extracts for indexing purposes comes from the Word document, as does any document structure, as mentioned above. Any PDF version of the document with the same filename is bound to it as an associated file.
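
The grouping behaviour described here can be pictured with a short Python sketch. It is not Greenstone's implementation (Greenstone plugins are written in Perl); the function name group_by_associate_ext, its parameters, and the sample filenames are all hypothetical, chosen only to illustrate how files that share a filename root end up bound to one primary document.

```python
from collections import defaultdict
from pathlib import PurePath


def group_by_associate_ext(filenames, primary_ext, associate_ext):
    """Bind files to the primary document that shares their filename root.

    associate_ext is a comma-separated string such as "pdf" or "avi,png".
    This is an illustrative model only, not Greenstone's own code.
    """
    wanted = {ext.strip().lower() for ext in associate_ext.split(",") if ext.strip()}
    primaries = {}                  # filename root -> primary filename
    candidates = defaultdict(list)  # filename root -> associated filenames

    for name in filenames:
        path = PurePath(name)
        ext = path.suffix.lstrip(".").lower()
        root = str(path.with_suffix(""))
        if ext == primary_ext.lower():
            primaries[root] = name
        elif ext in wanted:
            candidates[root].append(name)

    # Only primary files become documents; matching files ride along with them.
    return {primary: sorted(candidates.get(root, [])) for root, primary in primaries.items()}


# Hypothetical filenames standing in for the two sample articles.
files = ["jcdl-paper.doc", "jcdl-paper.pdf", "icadl-paper.doc", "icadl-paper.pdf", "notes.txt"]
grouped = group_by_associate_ext(files, primary_ext="doc", associate_ext="pdf")
print(grouped)
```

In this model the Word file is the key of each entry and the PDF is listed beneath it, mirroring the way the Word document is the dominant form and the PDF is an associated file. A primary file with no matching partner simply gets an empty list.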

Building the collection at this point will have the effect that internally Greenstone will have captured this relationship between the different file versions of the same documents; however, until we make some adjustments to the format statements none of this will be visible to the end-user. The collection built at this point (with default settings) will allow a user to search the text from the Word documents, browse by title metadata and so on, but when it comes to the point of viewing a document there will only be the choice of viewing the Word version of the document, or the HTML version that Greenstone automatically generates by processing the Word document.

To go beyond this, the key change is to alter the part of the default VList format statement that says:

<td valign="top">[ex.srclink]{Or}{[ex.thumbicon],[ex.srcicon]}[ex./srclink]</td>

to:

<td valign="top">[ex.equivDocLink][ex.equivDocIcon][ex./equivDocLink]</td>

Two things occur in this edit. The main change is the switch away from ex.srclink and ex.srcicon, which link to the primary source document (the Word document), to a hyperlinked icon for the document that Greenstone has associated as an equivalent document (the PDF version). The icon Greenstone chooses to show is based on the filename extension of the matching file it has found; in this case it is the PDF icon, labelled "View the PDF document".

The second (more minor) change in this edit is to simplify the statement a bit. The original uses an {Or} statement to show a thumbnail version of the document if Greenstone has one, in preference over the source icon. Since in this collection we have no thumbnails generated, it has been simplified by eliminating the {Or} combination and going straight to the ex.equivDocIcon metadata item.
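
The fallback behaviour of {Or} described above can be modelled in a few lines of Python. This is a sketch of the idea only, not Greenstone's format-statement engine: the function name format_or is hypothetical, and the metadata values are placeholder strings rather than real icon markup.

```python
def format_or(metadata, *names):
    """Return the first metadata value that is present and non-empty.

    Models the behaviour of {Or}{[a],[b]} as described in the text;
    this is an illustration, not Greenstone's implementation.
    """
    for name in names:
        value = metadata.get(name)
        if value:
            return value
    return ""


# Placeholder values: one document with a thumbnail, one without.
with_thumbnail = {"ex.thumbicon": "THUMBNAIL", "ex.srcicon": "SOURCE-ICON"}
without_thumbnail = {"ex.srcicon": "SOURCE-ICON"}

print(format_or(with_thumbnail, "ex.thumbicon", "ex.srcicon"))
print(format_or(without_thumbnail, "ex.thumbicon", "ex.srcicon"))
```

Because no document in this collection has a thumbnail, the first alternative would never be chosen, which is why the statement can go straight to a single metadata item.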

This is the essence of the collection design. For the interested reader, we now give a full account of the changes made to the default starting collection to produce the collection seen here.

The gory details

To develop your own version of this collection you can either download its collect.cfg for self-study, or work through the following steps.

  1. Start a new collection by selecting File → New.
  2. Enter an appropriate description for your collection.
  3. In the Gather tab, copy in your sample documents (which we assume are paired as Word/PDF documents).
  4. In the Design tab, delete the index for ex.Source and the Browser Classifier for ex.Source, as we will not be making use of these.
  5. Now select the WordPlugin, and press the Configure Options ... button.
  6. In the resulting popup, set the associate_ext option to pdf. You will need to scroll most of the way down to the bottom of the plugin options listed to find this.

    Note 1: As this option is categorized under the BasePlugin heading, it is available across all the plugins provided by Greenstone. In our example we happen to be binding PDF documents to Word documents, but it could equally be binding MP3 versions of files to PNG artwork of album covers.

    Note 2: More than one filename extension can be provided as part of this option, separated by a comma. For example, setting the value in TextPlugin to avi,png would allow both an AVI video file (say an oral history interview) and a PNG image (say a picture of the interviewee taken at the time of the recording) to bind to a text version of the document (say representing a transcript of the interview). Both AVI and PNG versions of the file can be present at the same time, or alternatively only one of the two file types, or neither, and Greenstone will process the situation accordingly.

    Note 3: The option associate_ext is in fact a simplified version of a more general option associate_tail_re. Using regular expression syntax, the latter provides a more powerful way of manipulating filenames. Rather than focus on just the filename extension, with associate_tail_re, one is able to group files together that share a similar filename root, but might start to differ in characters before the filename extension. For instance the Word version of the document might be my-article.doc but the PDF version might be my-article-ver13.pdf reflecting the fact that the PDF file is saved in version 1.3 of this format. Using associate_tail_re (and a little bit of regular expression know-how!), such differences can be surmounted, and the two files still processed automatically as different versions of the same document.

  7. To improve the presentation of the HTML version that Greenstone generates from the Word document, optionally set the WordPlugin windows_scripting option if building on Windows, or the open_office_scripting option if this extension has been added to your Greenstone installation and either OpenOffice or LibreOffice is available on your system.
  8. Optionally set level1_heading to heading\s*1, or whatever is appropriate for your documents if they use style information for headings that deviates from the norm for Word. Repeat as needed for level2_heading and so forth. For more details on how to control sections within a Word document, see the Enhanced Word document handling tutorial.
  9. In GLI, or otherwise, assign appropriate dc.Title and dc.Creator metadata to your documents.
  10. Now switch to the Format tab and edit the format statement for VList (All).

    Change:

    <td valign="top">[link][icon][/link]</td>
    <td valign="top">[ex.srclink]{Or}{[ex.thumbicon],[ex.srcicon]}[ex./srclink]</td>
    <td valign="top">[highlight]
    {Or}{[dc.Title],[exp.Title],[ex.Title],Untitled}
    [/highlight]{If}{[ex.Source],
    <i>([ex.Source])</i>}</td>

    To:

    <td valign="top">[link][icon][/link]</td>
    <td valign="top">[ex.equivDocLink][ex.equivDocIcon][ex./equivDocLink]</td>
    <td valign="top">[highlight]
    {Or}{[dc.Title],[exp.Title],[ex.Title],Untitled}
    [/highlight]{If}{[dc.Creator],: <i>[sibling(All'\, '):dc.Creator]</i>}</td>

    Note: When Greenstone encounters a file that matches the provided associate_ext value (pdf in our case), it sets the metadata value ex.equivDocIcon for that document to the macro _iconXXX_, where XXX is the filename extension (so _iconpdf_ in our case). As long as a macro is defined for that combination of the word icon and the filename extension, a suitable icon will be displayed when the document appears in a VList. For pdf, the displayed icon is the PDF icon, labelled "View the PDF document".

  11. Finally, build the collection and preview the result.
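
Two details from the notes in the steps above lend themselves to a small Python sketch: the filename splitting that associate_tail_re makes possible (Note 3 of step 6) and the _iconXXX_ macro naming (the note to step 10). Everything here is an assumption made for illustration: the function names, the sample pattern, and the way the pattern is anchored at the end of the filename. Greenstone evaluates its regular expressions in Perl, and its exact matching rules may differ.

```python
import re


def split_by_tail_re(filename, associate_tail_re):
    """Split a filename into (root, tail) if the tail pattern matches at its end.

    The non-greedy prefix keeps the root as short as possible, so an optional
    suffix such as "-ver13" ends up in the tail rather than in the root.
    Returns None when the pattern does not match.
    """
    match = re.match(rf"^(.*?)({associate_tail_re})$", filename)
    if match is None:
        return None
    return match.group(1), match.group(2)


def icon_macro(extension):
    """Build the _iconXXX_ macro name from a filename extension."""
    return f"_icon{extension}_"


# Hypothetical pattern: an optional "-verNN" suffix followed by ".pdf".
TAIL_RE = r"(?:-ver\d+)?\.pdf"

root, tail = split_by_tail_re("my-article-ver13.pdf", TAIL_RE)
print(root, tail)
print(icon_macro("pdf"))
```

With this pattern, my-article-ver13.pdf and my-article.pdf both reduce to the root my-article, which is the same root as my-article.doc, so the files can be treated as versions of one document. A filename such as my-article.doc does not match the pattern at all and is left for its own plugin to handle.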

How to find information in the Associated Files collection

There are two ways to find information in this collection:

  • search for particular words that appear in the text by clicking the Search button
  • browse documents by Title by clicking the Titles button