page 1  (20 pages)
2to next section

Fast Information Retrieval Using Compressed
Multi-Fragmented Signature Files

Seyit KOCBERBER1 Fazli CAN2*

1 Department of Computer Engineering and Information Science, Bilkent University
Bilkent, 06533 Ankara, Turkey

2 Department of Systems Analysis, Miami University, Oxford, OH 45056, USA
e-mails: [email protected], [email protected]

BU-CEIS-9625

November 21, 1996

Abstract: A new file method, called Compressed Multi-Fragmented Signature File (C-MFSF), that uses a partial query evaluation strategy with compressed signature bit slices is presented. The experiments show that for queries with more than one term, C-MFSF obtains the query results with fewer disk accesses than the inverted file approach. For single term queries, which is rare for very large databases, it requires more disk accesses than the inverted file method while its response time is still satisfactory. The number of disk accesses is the same as the number of query terms for queries with more than two terms. The experiments with a real database of 152,850 records validate the simulation results and are used to project the response time for very large databases. For a database of one million records, the projected response times for single term and multiterm queries are less than 2 and 1.14 seconds, respectively.

1. INTRODUCTION

Signature file approach is a well-known indexing technique for information retrieval [FAL92]. In this approach, the content of a record (an instance of any kind of data will be referred to as a record) is encoded in a bit string called record signature. During the generation of signatures each term (an attribute of a record, without loss of generality, will be referred to as a term) is hashed into a bit string of size F by setting S bits to ?1? (on-bit) where F S
> . The result is called a term signature. Record signatures are obtained either by concatenating or superimposing the signatures of the record terms. In this paper we consider only superimposed signatures and conjunctive queries.

Several signature generation and signature file methods have been proposed to obtain a desirable response time and space overhead. A survey of the proposed methods can be found in [AKT93, FAL92]. In the superimposed signature approach the length of the record signature (F) and term signatures are the same and F >> S. Similar to records, query signatures are obtained by superimposing the query term signatures.

* To whom all correspondence should be addressed voice: (513) 529-5950, fax: (513) 529-1524