MG Introduction


Introduction

The MG (Managing Gigabytes) system is a collection of programs which comprise a full-text retrieval system. A full-text retrieval system allows one to create a database out of some given documents and then do queries upon it to retrieve any relevant documents. It is "full-text" in the sense that every word in the text is indexed and the query operates only on this index to do the searching.

For example, one could build a database from the book "Alice in Wonderland", with each paragraph of the book treated as a document. Having built the "Alice" database, one could issue queries such as "cat alice grin" and retrieve the paragraphs that match. The matching can be boolean, in which case the retrieved paragraphs satisfy a boolean expression over the query terms, e.g. "cat alice grin"; or it can be ranked, in which case the documents most relevant to the query are returned in order of relevance, as judged by some standard heuristic measure.
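
The idea of indexing every word and answering queries from the index alone can be illustrated with a minimal Python sketch. This is not MG's implementation: the function names, the made-up sample paragraphs, and the simple TF-IDF style score used for ranking are all assumptions chosen for illustration, and MG's own index is compressed and far more elaborate.

```python
import math
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(documents):
    """Map each word to {document number: occurrences in that document}."""
    index = defaultdict(dict)
    for doc_num, text in enumerate(documents):
        for word in tokenize(text):
            index[word][doc_num] = index[word].get(doc_num, 0) + 1
    return index

def boolean_and(index, query):
    """Return the documents that contain every query term."""
    postings = [set(index.get(term, {})) for term in tokenize(query)]
    return sorted(set.intersection(*postings)) if postings else []

def ranked(index, query, num_docs):
    """Return matching documents ordered by a simple TF-IDF score."""
    scores = defaultdict(float)
    for term in tokenize(query):
        docs = index.get(term, {})
        if not docs:
            continue
        idf = math.log(1 + num_docs / len(docs))  # rarer terms weigh more
        for doc_num, count in docs.items():
            scores[doc_num] += count * idf
    # Best score first; ties are broken by document number.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Made-up paragraphs standing in for documents of the "Alice" database.
paragraphs = [
    "Alice saw the Cat sitting on a bough of a tree.",
    "The Cat only grinned when it saw Alice.",
    "Well! I have often seen a cat without a grin, thought Alice.",
    "The Queen shouted at the top of her voice.",
]
index = build_index(paragraphs)
print(boolean_and(index, "cat alice grin"))
print(ranked(index, "cat alice grin", len(paragraphs)))
```

In this sketch the boolean query treats the terms as a conjunction, so only the paragraph containing all three words is returned, whereas the ranked query also returns paragraphs that match only some of the terms, placed lower in the ordering.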

Motivation

If one wants to find some particular information stored in a computer text file, there are a few alternative courses of action. One can operate directly on the text files with utilities such as grep, or one can process the text files into some form of database. Grep is generally limited to identifying lines that match a regular expression. Its usage is simple, as no auxiliary files need to be created, but as the collection of files grows, the complete pass over the entire text that every query requires becomes expensive.

A database consists of some data and indexes into that data. With indexes, one can query a large database quickly. Standard databases divide the data up into records of fields, which means that the granularity of search is a field. In a full-text system such as MG there are no fields (or, equivalently, each document is an arbitrarily long list of word fields); instead, every word is indexed. Using this method, we can accept free-form information and yet be fast on searches. The next question is how much overhead the database carries. In MG, most of the files produced are in compressed form; the two notable ones are the given data and the index, called an "inverted file". By compressing the files it is possible to make the database smaller than the source data.

Typical Usage

The most common use for MG has been as a search database on unix mail files. However, any set of text data can be used; one just needs to determine what constitutes a document. MG has also been used on large collections such as Comact (Commonwealth Acts of Australia), which is around 132 megabytes, and on collections of up to around 3 gigabytes for TREC (a mixture of collections such as the Wall Street Journal and Associated Press).

Getting Started with MG

The first thing to do is install the package; please follow the INSTALL instructions. Having done this, it is necessary to set a couple of environment variables. MGDATA should be set to a directory which is to hold subdirectories for each database that you build. For example:
          mkdir ~/mgdata; setenv MGDATA ~/mgdata
If you want to try out building some sample databases then there is some sample data such as the "Alice In Wonderland" book. To make sure this is accessible you should set the environment variable MGSAMPLE. For example:
          setenv MGSAMPLE ~/mg/SampleData
Here, "~/mg/SampleData" should contain alice13a.txt.Z . To build the Alice database (to be contained in $MGDATA/alice subdirectory), type the command
          mgbuild alice
Assuming all went well and some status messages were printed indicating the build was completed, then type
          mgquery alice
to query the database. You can type a few words at the prompt, hit return and some relevant documents, Alice paragraphs, should be retrieved. Type ".set query ranked" to do ranking queries. Please refer to the mgquery(1) man-page for more information on the commands and options of mgquery(1).

The next thing to do is to use MG on a more personal database. If you keep all your mail in ~/mbox or ~/sentmail, then type

          mgbuild mailfiles

If you have your mail stored in subdirectories of ~/Mail, such as is done if you use the typical set up of elm(1), then change the ~/.mg_getrc line from:

          mailfiles       MAIL    ~/mbox ~/sentmail
to:
          mailfiles       MAIL    ~/Mail/*
and then run:
          mgbuild mailfiles

Creating Different Databases

If a user wants to build databases other than the predefined ones ("alice", "davinci", "mailfiles" and "allfiles"), there are a few choices. Ultimately, the user must produce text in which a control-B terminates each document. To do this one can produce one or more such files, write a "get" command (typically a script or C program), or, if the database is one of the standard types of documents (e.g. formed of paragraphs), modify the ~/.mg_getrc file.

Using Input Files for mgbuild

If you don't want to write a "get" script and just want to use one or more text files as input, then you must first generate files in which control-Bs terminate the documents. For a simple example, you could take any text files, such as "test1.txt" and "test2.txt", and use vi(1) to insert control-Bs by typing "control-V control-B" in insert mode. Next you should create a file with "set" statements in the following form:

          set pipe = 0 # do not use pipe - use file instead
          set input_files = 'test1.txt test2.txt'

Let's call this file "build_options". Now issue the command:

          mgbuild -s build_options test

This should build a database called "test" in the $MGDATA directory, based on the source data of "test1.txt" and "test2.txt". The build_options file is simply sourced by mgbuild(1) after it has set up its variables. Therefore, any settings one makes in the build_options file will override the standard settings. See mgbuild(1) for more information.
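
For files with many documents, inserting the control-Bs by hand in an editor becomes tedious. The following is a minimal Python sketch that treats each blank-line-separated paragraph as a document and terminates it with a control-B. The function name and the decision to put the control-B directly after the paragraph's final newline are assumptions for illustration; check mgbuild(1) for the exact layout it expects.

```python
import re

DOC_TERMINATOR = "\x02"  # control-B

def paragraphs_to_documents(text):
    """Split text on blank lines and end each paragraph with control-B."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return "".join(p + "\n" + DOC_TERMINATOR for p in paragraphs)

sample = (
    "First paragraph,\nsecond line.\n\n"
    "Second paragraph.\n\n\n"
    "Third paragraph.\n"
)
marked = paragraphs_to_documents(sample)
print(marked.count(DOC_TERMINATOR))  # number of documents produced
```

The returned string could then be written to a file such as "test1.txt" and named in the input_files setting.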

Writing A Get Program

Instead of using files as input, it is often more convenient to write a "get" program. This program is called by mgbuild(1) to get the text data with control-Bs as document terminators. It should take three options:
  1. -init;
  2. -text;
  3. -cleanup.
The get program is called with "-init" first and with "-cleanup" at the end. In between, mgbuild(1) calls it with "-text" when it wants the text, which the program should write to stdout. See mg_get(1) for more details.
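
The overall shape of such a get program can be sketched in Python as follows. This is only an outline under stated assumptions: the DOCUMENTS list is a hypothetical stand-in for real source data, any arguments other than the three options are ignored, and the exact way mgbuild(1) invokes the program should be checked against mg_get(1).

```python
import io
import sys

DOC_TERMINATOR = "\x02"  # control-B ends each document
OPTIONS = ("-init", "-text", "-cleanup")

# Hypothetical collection; a real get program would read files or mail.
DOCUMENTS = ["first document\n", "second document\n"]

def main(argv, out=None):
    """Handle one invocation and return a process exit status."""
    out = sys.stdout if out is None else out
    chosen = [arg for arg in argv if arg in OPTIONS]
    if len(chosen) != 1:
        sys.stderr.write("usage: get -init | -text | -cleanup\n")
        return 1
    if chosen[0] == "-text":
        # Write every document, each followed by its terminator.
        for doc in DOCUMENTS:
            out.write(doc + DOC_TERMINATOR)
    # -init and -cleanup have nothing to do in this sketch; a real
    # program might uncompress data in -init and remove it in -cleanup.
    return 0

# Demonstration: capture what a "-text" call would write to stdout.
captured = io.StringIO()
status = main(["-text"], captured)
print(status, captured.getvalue().count(DOC_TERMINATOR))
# A standalone script would instead end with: sys.exit(main(sys.argv[1:]))
```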

Modifying the ~/.mg_getrc file

Mg_get(1) has been extended recently (by B.McKenzie) to read from a .mg_getrc file which maps the particular collection onto the type of collection; for example, it maps the Alice collection onto the PARA (paragraph) type. If a ~/.mg_getrc does not exist then a default one is created. The default ~/.mg_getrc is:
alice   PARA    $SampleData/alice13a.txt.Z
davinci TXTIMG  $SampleData/davinci
mailfiles       MAIL    ~/mbox ~/.sentmail
allfiles        DIR     ~/Mail
Note that tabs must separate the fields.
An example of a modification of this file was given above for mail. See mg_get(1) for more details.
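
To make the layout of the file concrete, here is a minimal Python sketch that reads lines in the format shown above. It is not how mg_get(1) itself parses the file; the function name is hypothetical, and it assumes three tab-separated fields per line, with the third field holding space-separated paths.

```python
def parse_getrc(text):
    """Return {collection: (type, [sources])} from tab-separated lines."""
    entries = {}
    for line in text.splitlines():
        # Runs of tabs count as one separator, so drop empty fields.
        fields = [f.strip() for f in line.split("\t") if f.strip()]
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, kind, sources = fields[0], fields[1], fields[2]
        entries[name] = (kind, sources.split())
    return entries

sample = (
    "alice\tPARA\t$SampleData/alice13a.txt.Z\n"
    "mailfiles\tMAIL\t~/mbox ~/.sentmail\n"
    "this line uses spaces and is skipped\n"
)
entries = parse_getrc(sample)
print(entries["mailfiles"])
```

A line whose fields are separated by spaces instead of tabs yields fewer than three fields here and is skipped, which is why the separator matters.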

Regular Builds

The MG system provides a static database; there are no update commands. To keep a database reasonably up-to-date, one can have it rebuilt automatically on a regular basis by cron(1). A crontab file can be created using:

          crontab -e

A crontab file contains lines of the form:

          minute hour day-of-month month day-of-week shell-command

See crontab(1) for more information. An example crontab entry is:

          15 02 * * * mgbuild -d /users/jane/mgdata mailfiles >/users/jane/mgdata/mailfiles/mailfiles.log 2>&1

This rebuilds the MG database "mailfiles" (the mail in your folders) every morning at 2:15am.

Command Structure

There are (at least) 22 commands that make up the MG system.
However, a user may only need to be aware of a few:
  • mgbuild(1),
  • mgquery(1), and perhaps
  • mg_get(1).
Many of the commands are called by mgbuild(1). mgbuild(1) calls the following commands:
  • mg_passes(1),
  • mg_compression_dict(1),
  • mg_perf_hash_build(1),
  • mg_invf_dict(1),
  • mg_invf_rebuild(1),
  • mg_weights_build(1).
The commands can be broken up into a hierarchy.

     --------------------------------------
     MG--+--image compression
         |  |
         |  +--mgbilevel
         |  |
         |  +--mgfelics
         |  |
         |  +--mgtic
         |  |
         |  +--mgticbuild
         |  |
         |  +--mgticdump
         |  |
         |  +--mgticprune
         |  |
         |  +--mgticstat
         |
         +--text
            |
            +--compression
            |  |
            |  +--mg_passes -T1
            |  |
            |  +--mg_passes -T2
            |  |
            |  +--mg_compression_dict
            |  |
            |  +--mg_fast_comp_dict
            |
            +--indexing
            |  |
            |  +--mg_passes -N1
            |  |
            |  +--mg_passes -N2
            |  |
            |  +--mg_perf_hash_build
            |  |
            |  +--mg_invf_dict
            |  |
            |  +--mg_invf_rebuild
            |
            +--weights
            |  |
            |  +--mg_weights_build
            |
            +--query
            |  |
            |  +--mgquery
            |
            +--tools
               |
               +--mg_invf_dump
               |
               +--mg_text_estimate
               |
               +--mgdictlist
               |
               +--mgstat
     --------------------------------------

Availability

The MG software is available at: http://www.cs.mu.oz.au/mg/ .

See Also

mgbuild(1), mgquery(1),
"Guide To The MG System", in Appendix A of the MG book:
Ian H. Witten, Alistair Moffat, and Timothy C. Bell,
Managing Gigabytes: Compressing and Indexing Documents and Images
second edition, Morgan Kaufmann Publishing,
1999,
xxxi + 519 pages,
US$54.95,
ISBN 1-55860-570-3.
The errata for this book are available at http://www.cs.mu.oz.au/mg/errata.html.

Credits

The MG development is largely the result of research collaboration between:
 

The bulk of the programming work has been carried out by:

  • Stuart Inglis (Waikato)
  • Craig Nevill-Manning (Waikato)
  • Neil Sharman (Melbourne and RMIT)
  • Tim Shimmin (RMIT)

In addition to these, the following people have contributed to the development of the MG software:

  • Lachlan Andrew (RMIT)
  • Tim A.H. Bell (Melbourne)
  • Owen de Kretser (Melbourne)
  • Gary Eddy (Melbourne)
  • Hugh Emberson (Canterbury)
  • Kerry Guise (Waikato)
  • Shane Hudson (Canterbury)
  • Linh Huynh (Melbourne and RMIT)
  • Bohdan S. Majewski (Queensland)
  • Bruce McKenzie (Canterbury)
  • William Weber (RMIT)

The following people have submitted bug reports, suggestions/fixes or ports:

  • Rex Barzee
  • Nelson Beebe
  • Tim A.H. Bell
  • Tim C. Bell
  • Rok Sosic
  • Carl Staelin

Tim Shimmin,
1996

Now kind of maintained by
Alistair Moffat,
August 1999.