For example, one could build a database from the book "Alice in Wonderland", treating each paragraph of the book as a document. Having built the "Alice" database, one could issue a query such as "cat alice grin" and retrieve the paragraphs that match it. The matching can be either boolean, in which case the retrieved paragraphs satisfy a boolean expression of the query terms, e.g. "cat alice grin"; or ranked, in which case the documents most relevant to the query are returned in order of relevance, as judged by some standard heuristic measure.
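To make the difference between the two kinds of matching concrete, here is a minimal Python sketch. It is not how MG is implemented: the paragraph texts, the function names, and the occurrence-count score are all assumptions made for illustration, and the boolean query is taken to be a conjunction (AND) of the terms.

```python
import re
from collections import Counter

# Hypothetical stand-ins for paragraphs of the book (not actual text from it).
PARAGRAPHS = [
    "alice saw the cat and the cat began to grin",
    "the cat vanished and only the grin remained",
    "alice went down the rabbit hole",
]

def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def boolean_and(query, docs):
    """Return indexes of the documents that contain every query term."""
    terms = set(tokenize(query))
    return [i for i, doc in enumerate(docs) if terms <= set(tokenize(doc))]

def ranked(query, docs):
    """Return indexes of documents with at least one query term, best first.

    The score is simply the number of query-term occurrences; this is an
    assumed stand-in for a real relevance heuristic.
    """
    terms = set(tokenize(query))
    scores = []
    for i, doc in enumerate(docs):
        counts = Counter(tokenize(doc))
        score = sum(counts[t] for t in terms)
        if score > 0:
            scores.append((score, i))
    # Sort by descending score, breaking ties by document order.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scores]
```

With these sample paragraphs, the boolean query "cat alice grin" selects only the first paragraph, whereas the ranked query returns all three, ordered by how many query-term occurrences each contains.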
A database consists of some data and indexes into that data. Indexes make it possible to query a large database quickly. Standard databases divide the data into records made up of fields, so the granularity of search is a field. In a full-text system such as MG there are no fields (or, equivalently, each document has an arbitrarily long list of word fields); instead, every word is indexed. This way the system can accept free-form information and still be fast on searches. The next question is how much overhead such a database carries. In MG, most of the files produced are compressed. The two notable compressed files are the given data and the index, called an "inverted file". Because the files are compressed, the database can be smaller than the size of the source data.
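As a rough illustration of what an inverted file holds, the following Python sketch maps every word to the list of document numbers containing it. It also stores each list as gaps between successive document numbers, a common trick that yields small integers which compress well. The function names are hypothetical, and nothing here reflects MG's actual file formats or coding methods.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each word to the sorted list of document numbers containing it."""
    index = defaultdict(set)
    for doc_num, text in enumerate(docs, start=1):
        for word in text.lower().split():
            index[word].add(doc_num)
    return {word: sorted(nums) for word, nums in index.items()}

def to_gaps(doc_nums):
    """Replace sorted document numbers by differences between neighbours."""
    gaps = []
    previous = 0
    for num in doc_nums:
        gaps.append(num - previous)
        previous = num
    return gaps

def from_gaps(gaps):
    """Invert to_gaps: running sums recover the document numbers."""
    doc_nums = []
    total = 0
    for gap in gaps:
        total += gap
        doc_nums.append(total)
    return doc_nums
```

A query for a word then only needs to read that word's list rather than scan the whole text, which is why indexing every word keeps searches fast.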
mkdir ~/mgdata; setenv MGDATA ~/mgdata

If you want to try building some sample databases, sample data is provided, such as the "Alice In Wonderland" book. To make sure it is accessible, set the environment variable MGSAMPLE. For example:
setenv MGSAMPLE ~/mg/SampleData

Here, "~/mg/SampleData" should contain alice13a.txt.Z. To build the Alice database (which will be placed in the $MGDATA/alice subdirectory), type the command
mgbuild alice

Assuming all went well and status messages were printed indicating that the build completed, type
mgquery alice

to query the database. Type a few words at the prompt and hit return, and some relevant documents (paragraphs of Alice) should be retrieved. Type ".set query ranked" to do ranked queries. Please refer to the mgquery(1) man page for more information on its commands and options.
The next thing to do is to use MG on a more personal database. If you keep all your mail in ~/mbox or ~/sentmail, then type
mgbuild mailfiles
If you have your mail stored in subdirectories of ~/Mail, such as is done if you use the typical set up of elm(1), then change the ~/.mg_getrc line from:
mailfiles	MAIL	~/mbox ~/sentmail

to:

mailfiles	MAIL	~/Mail/*

and now you may go:
mgbuild mailfiles
set pipe = 0  # do not use pipe - use file instead
set input_files = 'test1.txt test2.txt'

Let's call this file "build_options". Now issue the command:
mgbuild -s build_options test
This should build a database called "test" in the $MGDATA directory, based on the source data of "test1.txt" and "test2.txt". The build_options file is simply sourced by mgbuild(1) after it has set up its variables. Therefore, any settings one makes in the build_options file will override the standard settings. See mgbuild(1) for more information.
alice	PARA	$SampleData/alice13a.txt.Z
davinci	TXTIMG	$SampleData/davinci
mailfiles	MAIL	~/mbox ~/.sentmail
allfiles	DIR	~/Mail

Note that tabs must separate the fields.
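To show why the tabs matter, here is a small Python sketch that splits one such entry into its fields. It assumes, based only on the entries shown here, that each line has exactly three tab-separated fields (a database name, a document type, and a space-separated list of files or directories). The function name is hypothetical, and this is not the parser MG itself uses.

```python
def parse_getrc_line(line):
    """Split one entry into (name, doc_type, files).

    Assumes exactly three tab-separated fields, the last being a
    space-separated list of files or directories.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError("expected 3 tab-separated fields, got %d" % len(fields))
    name, doc_type, file_list = fields
    return name, doc_type, file_list.split()
```

Because the file list may itself contain spaces, a line whose fields are separated by spaces instead of tabs cannot be split unambiguously, and the sketch rejects it.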
minute hour day-of-month month day-of-week shell-command.
See crontab(1) for more information. An example crontab entry is:
15 02 * * * mgbuild -d /users/jane/mgdata mailfiles >/users/jane/mgdata/mailfiles/mailfiles.log 2>&1
This rebuilds the MG database "mailfiles", i.e. the mail in your folders, every morning at 2:15am, and writes the build output to a log file.
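For readers unfamiliar with the crontab layout, the following Python sketch splits the example entry into its five time fields and the command. The field names and the function name are labels chosen for this illustration only.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week", "command")

def split_crontab_entry(entry):
    """Split a crontab line into its five time fields and the command."""
    parts = entry.split(None, 5)  # the command keeps its internal spaces
    if len(parts) != 6:
        raise ValueError("expected five time fields followed by a command")
    return dict(zip(CRON_FIELDS, parts))

entry = ("15 02 * * * mgbuild -d /users/jane/mgdata mailfiles "
         ">/users/jane/mgdata/mailfiles/mailfiles.log 2>&1")
fields = split_crontab_entry(entry)
```

Here the minute field is "15" and the hour field is "02", which is where the 2:15am schedule comes from; the three "*" fields mean every day of the month, every month, and every day of the week.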
--------------------------------------
MG--+--image compression
    |   |
    |   +--mgbilevel
    |   |
    |   +--mgfelics
    |   |
    |   +--mgtic
    |   |
    |   +--mgticbuild
    |   |
    |   +--mgticdump
    |   |
    |   +--mgticprune
    |   |
    |   +--mgticstat
    |
    +--text
        |
        +--compression
        |   |
        |   +--mg_passes -T1
        |   |
        |   +--mg_passes -T2
        |   |
        |   +--mg_compression_dict
        |   |
        |   +--mg_fast_comp_dict
        |
        +--indexing
        |   |
        |   +--mg_passes -N1
        |   |
        |   +--mg_passes -N2
        |   |
        |   +--mg_perf_hash_build
        |   |
        |   +--mg_invf_dict
        |   |
        |   +--mg_invf_rebuild
        |
        +--weights
        |   |
        |   +--mg_weights_build
        |
        +--query
        |   |
        |   +--mgquery
        |
        +--tools
            |
            +--mg_invf_dump
            |
            +--mg_text_estimate
            |
            +--mgdictlist
            |
            +--mgstat
--------------------------------------
Ian H. Witten, Alistair Moffat, and Timothy C. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images, second edition, Morgan Kaufmann Publishing, 1999, xxxi + 519 pages, US$54.95, ISBN 1-55860-570-3.

The errata for this book are available at http://www.cs.mu.oz.au/mg/errata.html.
The bulk of the programming work has been carried out by:
In addition to these, the following people have contributed to the development of the MG software:
The following people have submitted bug reports, suggestions/fixes or ports:
Now kind of maintained by
Alistair Moffat,
[email protected],
August 1999.