page 1  (21 pages)
2to next section

Generalized Vertical Partitioning of Signature Files

Seyit KOCBERBER1 Fazli CAN2*

1 Department of Computer Engineering and Information Science, Bilkent University
Bilkent, 06533 Ankara, Turkey

2 Department of Systems Analysis, Miami University, Oxford, OH 45056, USA
e-mails: [email protected], [email protected]

BU-CEIS-9513

September 18, 1995

Abstract : A new signature file method, called Generalized Multi-Fragmented Signature File (GMFSF), is presented. The new method provides a unified framework for other vertical partitioning signature files. The performance of GMFSF is measured with a prototype information retrieval system using a database of 152,850 MARC records. The experimental results agree with the theoretical analysis. The response time of GMFSF decreases with an increasing number of query terms and is independent of the number of hits to the query. These features make the method competitive with inverted files, and the experiments further show that GMFSF performs better than inverted files for the non-zero hit conjunctive queries with more than three terms.

1. INTRODUCTION
Recent developments in the data storage technology, e.g., optical disks, enable the storage of formatted and unformatted data, such as text, voice and image in the same database. The growing size of the databases necessitates the development of efficient file structures and search techniques for such multimedia environments [1,9].

Signature files provide a space efficient fast search structure by eliminating most of the irrelevant records by comparing the record signatures and the query signature without retrieving the actual records. In this paper, an instance of any kind of data will be referred to as a record. An attribute of a record, without loss of generality, will be referred to as a term. In signature approach, record terms are encoded in a bit string called a record signature. During the generation of signatures each term is hashed into a bit string of size F by setting m bits to 1 (on-bit) where F m
> . The result is called a term
signature. Record signatures are obtained either by concatenating or superimposing the signatures of the record terms.

In superimposed signature files, the length of the record signature (F) and term signatures are the same and F >> m. For a database of N records, the signature file can be

* To whom all correspondence should be addressed voice: (513) 529-5950, fax: (513) 529-1524