NAME
mgquery - query program for the mg system
SYNOPSIS
mgquery [ -h ] [ -D ] [ -f name ] [ -d directory ] [ collection-name ]
DESCRIPTION
mgquery enables users to make Boolean or ranked queries against a database generated by the mg(1) system. It accepts queries from stdin and sends the retrieved documents to stdout. Information on the resource usage of mgquery as it processes queries can be obtained interactively.
OPTIONS
Options may appear in any order, but the collection-name, if specified, must be last.
- -h
- This displays a usage line on stderr.
- -D
- This option causes the entire text to be decompressed and sent to stdout.
- -f name
- This specifies the base name of the document collection that will be used. If a collection with the specified base name does not exist, an error message will be displayed and mgquery will exit.
- -d directory
- This specifies the directory where the document collection can be found.
USAGE
Prior to processing the command line arguments, the mgquery program attempts to read in a startup script called ./.mgrc. If that fails, it attempts to read in the file $HOME/.mgrc. The startup file can only contain commands; no queries are permitted in the .mgrc file. Lines starting with "#" in the file are comments. The most common use for the .mgrc file is to personalise the initial values of the predefined parameters with .set commands.
The input to mgquery consists of a series of input lines. The backslash character (\) is used at the end of lines to indicate that input continues on the next line.
Input lines on which the first character is a dot (.) are commands to the mgquery program. Input lines that do not start with a dot are queries.
A query consists of two parts. One part is a Boolean or ranked query that identifies documents. The second part is a post-processing pattern matching operation. Any text between the first speech mark (") and the last speech mark (") is considered to be a post-processing pattern.
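As an illustration only, the input rules above can be sketched in Python. This is not taken from the mgquery source: the function names join_continuations and classify are hypothetical, and the sketch assumes that whatever lies outside the speech marks forms the Boolean or ranked part of the query.

```python
def join_continuations(lines):
    """Merge each line ending in a backslash with the line that follows it."""
    joined, pending = [], ""
    for line in lines:
        if line.endswith("\\"):
            pending += line[:-1]          # drop the backslash, keep reading
        else:
            joined.append(pending + line)
            pending = ""
    if pending:                           # input ended on a continuation
        joined.append(pending)
    return joined


def classify(line):
    """Return ("command", text) or ("query", query_part, pattern_or_None)."""
    if line.startswith("."):
        return ("command", line)
    first, last = line.find('"'), line.rfind('"')
    if first != -1 and last > first:
        # Text between the first and last speech marks is the pattern.
        pattern = line[first + 1:last]
        query = (line[:first] + line[last + 1:]).strip()
        return ("query", query, pattern)
    return ("query", line.strip(), None)
```

For example, classify('light & darkness "and.*the"') separates the query part light & darkness from the post-processing pattern and.*the.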
COMMANDS
The mgquery program can accept the following commands.
- .help
- Display several pages of help text.
- .quit
- Quit the program.
- .warranty
- Display the mg(1) warranty.
- .conditions
- Display the conditions of use and distribution of mg(1).
- .set name value
- Set the parameter name to the specified value. If the parameter is a Boolean value and the value is omitted, the parameter will be inverted (i.e., if it was true, then it will change to false; if it was false, then it will change to true).
- .unset name
- Delete the parameter name from the currently-defined parameters.
- .reset
- Reset the parameters to the state that they had after the processing of the mgquery command line.
- .display
- Display the values of all the currently-defined parameters.
- .push
- Push the currently-defined parameters onto a stack.
- .pop
- Pop a set of parameters off the stack, replacing the currently-defined ones.
- .output arg
- This is used to specify where to send the text of the documents. Once the .output command is specified, all subsequent output will be sent to the place specified by arg. If arg is not specified, subsequent output will be directed to stdout. Arg may be any of the following.
- > filename
- Send output to the specified file.
- >> filename
- Append output to the specified file.
- | command
- Pipe the output to command, which is executed by sh.
- .input arg
- This is used to specify where input (queries and commands) comes from. Once the .input command is specified, all subsequent input will come from the place specified by arg. If arg is not specified, subsequent input will come from stdin. Arg may be any of the following.
- < filename
- Get input from the specified file.
- | command
- The input comes from the standard output of command, which is executed by sh.
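The interaction between .set, .unset, .reset, .push and .pop can be modelled with a small Python class. This is a sketch of the behaviour described above, not mgquery's implementation: the class name Parameters is hypothetical, and it assumes that .push leaves the current values in place, that .pop on an empty stack does nothing, and that omitting the value for a non-Boolean parameter is an error.

```python
class Parameters:
    """Toy model of the mgquery parameter commands."""

    def __init__(self, initial):
        # State after command-line processing, restored by .reset
        self._startup = dict(initial)
        self.current = dict(initial)
        self._stack = []

    def set(self, name, value=None):
        if value is None:
            # Omitted value: invert a Boolean parameter
            if not isinstance(self.current.get(name), bool):
                raise ValueError(f"{name} is not a Boolean parameter")
            self.current[name] = not self.current[name]
        else:
            self.current[name] = value

    def unset(self, name):
        self.current.pop(name, None)

    def reset(self):
        self.current = dict(self._startup)

    def push(self):
        # Save a copy; the current values stay in effect (assumption)
        self._stack.append(dict(self.current))

    def pop(self):
        if self._stack:
            self.current = self._stack.pop()
```

A typical use is to call push(), change a few parameters for one experiment, and then call pop() to restore the earlier values.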
PARAMETERS
The following parameters are predefined and have special significance. Each parameter will be followed by its default value. Parameters are initialised before the .mgrc file is read or the command line arguments are processed.
accumulator_method `array'
This parameter is used during ranking, and specifies how the weight for each document should be accumulated. The following methods are available: array, splay_tree, hash_table, and list.
briefstats `off'
This is a Boolean parameter that determines whether the totals for disk, memory and time usage statistics will be displayed at the end of each query. Note: this takes precedence over the parameters diskstats, memstats and timestats. This parameter may take the values yes, no, true, false, on or off.
buffer `1048576'
When the documents are being read in, they are read into a buffer of this size and then displayed from this buffer. If the documents are larger than this buffer, the buffer is expanded automatically. Having a large buffer gives a very slight performance improvement, because it allows the order of disk operations to be optimised. The buffer size is measured in bytes.
diskstats `off'
This is a Boolean parameter that determines whether the disk usage statistics for the preceding query will be displayed after each query. This parameter may take the values yes, no, true, false, on or off.
doc_sepstr `---------------------------------- %n\n'
This specifies the string that will be used to separate documents when they are displayed for `Boolean' or `docnums' queries. The standard C escape character sequences may be used to place special characters in the string. For example, a newline would be `\n'. To include a `%', use the sequence `%%'. To include the mg(1) document number, use the sequence `%n'. The following escape character sequences are available:
- Sequence
- Meaning
- `\\'
- backslash
- `\b'
- backspace
- `\f'
- formfeed
- `\n'
- newline
- `\r'
- carriage return
- `\t'
- tab
- `\"'
- speech marks
- `\''
- quote mark
- `\xhh'
- ASCII code in hexadecimal
- `\nnn'
- ASCII code in octal
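To make the substitution rules concrete, here is a minimal Python sketch that expands a separator string in the way the text describes. It is an approximation rather than the code mgquery uses: the name expand_sepstr is hypothetical, and the handling of unrecognised sequences (left unchanged here) is an assumption.

```python
import re

# Single-character escapes listed in the table above
_SIMPLE = {"\\": "\\", "b": "\b", "f": "\f", "n": "\n",
           "r": "\r", "t": "\t", '"': '"', "'": "'"}

# \xhh, \nnn (octal), any other backslash escape, or a % sequence
_TOKEN = re.compile(r"\\x([0-9A-Fa-f]{1,2})|\\([0-7]{1,3})|\\(.)|%(.)", re.DOTALL)


def expand_sepstr(template, docnum):
    """Expand C-style escapes plus %n (document number) and %% (percent)."""
    def repl(match):
        hex_code, oct_code, simple, percent = match.groups()
        if hex_code is not None:
            return chr(int(hex_code, 16))
        if oct_code is not None:
            return chr(int(oct_code, 8))
        if simple is not None:
            # Unknown escapes are kept as written (assumption)
            return _SIMPLE.get(simple, "\\" + simple)
        if percent == "n":
            return str(docnum)
        if percent == "%":
            return "%"
        return "%" + percent              # unknown % sequence kept as written
    return _TOKEN.sub(repl, template)
```

With this sketch, expanding the raw string `--- %n\n` for document 7 gives the text "--- 7" followed by a newline.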
expert `false'
If this is true, then much of the dialogue output is suppressed. This parameter may take the values yes, no, true, false, on or off.
hash_tbl_size `1000'
One of the options during ranking queries is to use a hash table to accumulate the weights for each document. The hash table is a simple chained type. This parameter specifies the size of the hash table and may take any value between 8 and 268435456 (2^28).
heads_length `50'
When the mode is heads, this specifies the number of characters that will be output for each document.
- maxdocs `all'
- The maximum number of documents to display in response to a query. This parameter may take on a numeric value between 1 and 4294967295 (2^32 - 1) or the word all.
maxparas `1000'
The maximum number of paragraphs to identify during a ranked query with paragraph indexing. After the paragraphs have been identified, the paragraphs are converted into documents, and because some of the paragraphs may refer to the same documents, the final number of answers may be less than maxparas. The maxdocs parameter will then be applied. This parameter may take on a numeric value between 1 and 4294967295 (2^32 - 1).
max_accumulators `50000'
This parameter limits the number of different paragraph and document numbers to be accumulated during ranked queries when the parameter accumulator_method is set to splay_tree, hash_table, or list. This parameter may take any value between 8 and 268435456 (2^28).
max_terms `all'
This parameter limits the number of terms that will actually be used during a ranked query. If more terms than the number specified by max_terms are entered, then the extra terms will be discarded. If sorted_terms is on, then the limiting will be done after the terms have been sorted. This parameter may take any value between 1 and 4294967295 (2^32 - 1), or the word all.
- memstats `off'
- This is a Boolean parameter that determines whether the memory usage statistics for the preceding query will be displayed after each query. This parameter may take the values yes, no, true, false, on or off.
- mgdir `.'
- This is set to the directory where the mg(1) data files may be found. If the environment variable MGDATA exists, then this is instead initialised to the value of MGDATA. The value of this parameter may be changed, either in the .mgrc file with a .set mgdir directory command, or from the command line using the -d directory option. Once the ">" prompt appears, changing this parameter will have no effect.
mgname `bible'
This is set to the name of the mg(1) collection that is to be used for the session. The value of this parameter may be changed, either in the .mgrc file with a .set mgname name command, or from the command line using the -f name option. Once the ">" prompt appears, changing this parameter will have no effect.
- mode `text'
- This specifies how documents should be displayed when they are retrieved. It may take six different values: text, hilite, docnums, heads, silent, or count. text displays the contents of the document. hilite displays the contents of the document and highlights any of the stemmed query terms. docnums displays only the document numbers. heads is used to print out the head of each document. silent retrieves all the documents but displays nothing except how many documents were retrieved. This mode is intended to be used in timing experiments. count does the minimum amount of work required to determine how many documents would be retrieved, but does not retrieve them.
optimise_type `1'
There are three types of Boolean query optimisation (parse tree rearrangement). Type 0 leaves the parse tree unaltered. Type 1 optimises for AND of terms and AND of OR of terms. Type 2 converts the tree into DNF (an experiment :-).
- pager `more'
- This is the name of the program that will be used to display the help and the retrieved documents. If the environment variable PAGER is defined, then pager takes on that value.
hilite_style `bold'
This specifies the type of highlighting method. It may take one of two different values: bold or underline.
para_sepstr `\n######## PARAGRAPH %n ########\n'
This specifies the string that will be used to separate paragraphs. The standard C escape character sequences may be used to place special characters in the string. For example, a newline would be written as `\n'. To include a `%', use the sequence `%%'. To include the paragraph number within the document, use the sequence `%n'.
para_start `***** Weight = %w *****\n'
This specifies the string that will be used at the head of paragraphs for a paragraph-level index following a ranked query. The standard C-language escape character sequences may be used to place special characters in the string. For example, a newline would be written as `\n'. To include a `%', use the sequence `%%'. To include the paragraph weight, use the sequence `%w'.
- qfreq `true'
- This determines whether ranked queries will take into account the number of times each query term is specified. When this is true, the number of times a term appears in the query is used in the ranking. When this is false, all query terms are assumed to occur only once. This parameter may take the values yes, no, true, false, on or off.
query `Boolean'
This specifies the type of queries that are to be specified. It can take four different values: Boolean, ranked, docnums or approx-ranked.
Boolean is for Boolean queries. The yacc(1) grammar for Boolean queries is as follows.
- query
- : or;
- or
- : or `|' and | and ;
- and
- : and `&' not | and not | not ;
- not
- : term | `!' not ;
- term
- : TERM | `(' or `)' ;
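The grammar above can be read as a precedence ladder: `!' binds tightest, then AND (written with `&' or by simply placing two operands side by side), then `|'. The following recursive-descent parser in Python is a sketch of that reading, not the yacc parser shipped with mg. The names tokenize and parse are hypothetical, and the rule that a TERM is any run of characters other than whitespace and the operator symbols is an assumption.

```python
import re


def tokenize(text):
    # Operators and parentheses are single tokens; anything else is a TERM
    return re.findall(r"[|&!()]|[^\s|&!()]+", text)


def parse(text):
    """Parse a Boolean query into nested tuples."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def parse_or():                       # or : or '|' and | and
        node = parse_and()
        while peek() == "|":
            advance()
            node = ("or", node, parse_and())
        return node

    def parse_and():                      # and : and '&' not | and not | not
        node = parse_not()
        while peek() is not None and peek() not in ("|", ")"):
            if peek() == "&":
                advance()                 # explicit AND; otherwise implicit
            node = ("and", node, parse_not())
        return node

    def parse_not():                      # not : term | '!' not
        if peek() == "!":
            advance()
            return ("not", parse_not())
        return parse_term()

    def parse_term():                     # term : TERM | '(' or ')'
        tok = peek()
        if tok is None or tok in ("|", "&", ")"):
            raise SyntaxError(f"expected a term, found {tok!r}")
        advance()
        if tok == "(":
            node = parse_or()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        return ("term", tok)

    tree = parse_or()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected {tokens[pos]!r}")
    return tree
```

Under these assumptions, light & !darkness | (sea land) parses as an OR whose left side is light AND NOT darkness and whose right side is sea AND land.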
ranked and approx-ranked are for queries ranked by the cosine measure. approx-ranked uses only the low-precision document lengths, and therefore only produces an approximation to full cosine ranking. The grammar for both is as follows.
- query
- : TERM | query TERM ;
docnums allows the entry of document numbers. Multiple numbers separated by spaces may be specified, or ranges separated by hyphens.
- query
- : range | query range ;
- range
- : num | num `-' num ;
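A docnums query can be expanded into individual document numbers along the lines of the grammar above. This Python sketch is illustrative only: the name parse_docnums is hypothetical, and it assumes that ranges include both end points and that whitespace around the hyphen is accepted.

```python
import re


def parse_docnums(query):
    """Expand a docnums query such as "3 10-12 7" into a list of numbers."""
    tokens = re.findall(r"\d+|-", query)
    docs, i = [], 0
    while i < len(tokens):
        if tokens[i] == "-":
            raise SyntaxError("range is missing its first number")
        first = last = int(tokens[i])
        i += 1
        if i < len(tokens) and tokens[i] == "-":      # range : num '-' num
            if i + 1 >= len(tokens) or tokens[i + 1] == "-":
                raise SyntaxError("range is missing its second number")
            last = int(tokens[i + 1])
            i += 2
        docs.extend(range(first, last + 1))           # inclusive (assumption)
    return docs
```

The numbers are returned in the order entered, so the query 3 10-12 7 yields documents 3, 10, 11, 12 and 7.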
ranked_doc_sepstr `-------------------------------- %n %w\n'
This specifies the string that will be used to separate documents when they are displayed for `ranked' or `approx-ranked' queries. The standard C escape character sequences may be used to place special characters in the string. For example, a newline would be written as `\n'. To include a `%', use the sequence `%%'. To include the mg(1) document number, use the sequence `%n'. To include the document weight, use the sequence `%w'.
sizestats `false'
If this is true, then various numbers are output at the end of each query indicating what went on during the query. This parameter may take the values yes, no, true, false, on or off.
skip_dump `skips.%d'
If this parameter is set, then a file will be produced in the current directory during ranked queries on skipped inverted files when accumulator_method is set to splay_tree, hash_table, or list. The name of the file is the value of this parameter. A `%d' in the file name will be replaced with the process id of mgquery. This file will contain information about the usage of skips during the query processing. This option is expensive; use .unset skip_dump to obtain optimal performance.
sorted_terms `on'
This specifies whether or not the terms should be sorted into decreasing occurrence in documents so that the least-often occurring terms are processed first when ranked queries are being done. When this is true, the terms are sorted. When this is false, the terms are not sorted, and are instead processed in order of occurrence. This parameter may take the values yes, no, true, false, on or off.
stop_at_max_accum `on'
This specifies what should happen when the maximum number of accumulators set by max_accumulators is reached. When this is true, the processing of terms is stopped at the completion of the current term. When this is false, processing continues but no new accumulators are created. This parameter may take the values yes, no, true, false, on or off.
- terminator `'
- This specifies the string that will be output after the last document from the previous query has been output. The standard C escape character sequences may be used to place special characters in the string. For example, a newline would be written as `\n'. To include a `%', use the sequence `%%'.
timestats `false'
If this is true, then the time to process a query is displayed in both real time and CPU time. This parameter may take the values yes, no, true, false, on or off.
- verbatim `off'
- This is a Boolean parameter that determines whether the post-processing string is matched literally or as a regular expression. If a post-processing string is specified with the query, it will be searched for in the documents just before they are displayed. If the string is found, the document will be displayed; if not, the document will not be displayed. If verbatim is on, the post-processing string is searched for exactly as written. If verbatim is off, the post-processing string will be considered a regular expression as in egrep(1) or vi(1). E.g., if verbatim is on, "and.*the" will look for the 8-character sequence "and.*the". If verbatim is off, "and.*the" will look for the sequence "and" followed somewhere later in the document by the sequence "the". This parameter may take the values yes, no, true, false, on or off.
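The difference between the two settings of verbatim can be shown with a short Python filter. This is a sketch under stated assumptions: the function name passes_postprocessing is hypothetical, Python's re syntax stands in for the egrep(1)-style expressions that mgquery accepts, and re.DOTALL is used so that `.*' can span line breaks, matching the "somewhere later in the document" wording.

```python
import re


def passes_postprocessing(document, pattern, verbatim):
    """Return True if the document should be displayed."""
    if verbatim:
        # verbatim on: look for the pattern as a literal string
        return pattern in document
    # verbatim off: treat the pattern as a regular expression
    return re.search(pattern, document, re.DOTALL) is not None
```

With the document text "land and sea,\nthen the sky", the pattern and.*the matches when verbatim is off but not when it is on, because the literal characters ".*" never appear in the text.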
ENVIRONMENT
MGDATA If this environment variable exists, then its value is used as the default directory where the mg(1) collection files are. If this variable does not exist, then the directory "." is used by default. The command line option -d directory overrides the directory in MGDATA.
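Putting together the order in which this page says things happen (parameters are initialised, then the .mgrc file is read, then the command line is processed), the effective collection directory can be sketched as follows. The function resolve_mgdir and its argument names are hypothetical and serve only to illustrate the precedence.

```python
def resolve_mgdir(environ, mgrc_value=None, cli_value=None):
    """Work out the directory mgquery would search, lowest priority first."""
    mgdir = environ.get("MGDATA", ".")    # default ".", or MGDATA if it exists
    if mgrc_value is not None:            # .set mgdir directory in .mgrc
        mgdir = mgrc_value
    if cli_value is not None:             # -d directory on the command line
        mgdir = cli_value
    return mgdir
```

Passing a plain dictionary rather than os.environ keeps the sketch easy to try out with different settings.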
FILES
- .mgrc
- mgquery startup file
- help.mg
- Help file for mgquery. The contents of this file are displayed with the .help command.
- *.invf
- Inverted file.
- *.invf.dict
- The `on-disk' stemmed dictionary.
- *.text
- Compressed documents.
- *.text.dict
- Compression dictionary.
- *.text.idx
- Index into the compressed documents.
- *.text.idx.wgt
- Interleaved index into the compressed documents and document weights.
- *.weight.approx
- Approximate document weights.
SEE ALSO
egrep(1),
mg(1),
mg_compression_dict(1),
mg_fast_comp_dict(1),
mg_get(1),
mg_invf_dict(1),
mg_invf_dump(1),
mg_invf_rebuild(1),
mg_passes(1),
mg_perf_hash_build(1),
mg_text_estimate(1),
mg_weights_build(1),
mgbilevel(1),
mgbuild(1),
mgdictlist(1),
mgfelics(1),
mgstat(1),
mgtic(1),
mgticbuild(1),
mgticdump(1),
mgticprune(1),
mgticstat(1),
vi(1),
yacc(1).