
On an algorithm for finding all interesting sentences
Extended abstract
Heikki Mannila?
MPI Informatik
Im Stadtwaldt
D66123 Saarbr?ucken, Germany
Hannu Toivoneny
University of Helsinki
Department of Computer Science
FIN00014 Helsinki, Finland
Abstract
Knowledge discovery in databases (KDD), also called data mining, has recently received wide attention from practitioners and researchers. One of the basic problems in KDD is the following: given a data set r, a class L of sentences defining subgroups or properties of r, and an interestingness predicate, find all sentences of L deemed interesting by the interestingness predicate. In this paper we analyze a simple and wellknown levelwise algorithm for finding all such descriptions. We give bounds for the number of database accesses that the algorithm makes. We also consider the verification problem of a KDD process: given r and a set of sentences T ? L, determine whether T is exactly the set of interesting statements about r. We show strong connections between the verification problem and the hypergraph transversal problem. The verification problem arises in a natural way when using sampling to speed up the pattern discovery step in KDD.
1 Introduction
Knowledge discovery in databases (KDD), also called data mining, has recently received wide attention from practitioners and researchers. There are several attractive application areas for KDD, and it seems that techniques from machine learning, statistics, and databases can be profitably combined to obtain useful methods and systems for KDD. See, e.g., [9; 27] for general descriptions of the area.
The KDD area is and should be largely guided by (succesful) applications. n this paper we take some steps towards theoretical KDD. Namely, we present a simple framework for KDD in which the task of knowledge discovery is defined to be finding all interesting ?On leave from the Department of Computer Science, University of Helsinki. Work supported by the Academy of Finland and the Alexander von Humboldt Stiftung. yWork supported by the Academy of Finland.
statements from a set of sentences. We study a simple breathfirst or levelwise algorithm for this task that has been used in various forms in machine learning and in several KDD appications, and we analyze the properties, especially the computational complexity, of this algorithm. Our results are not technically difficult, but they show some interesting connections between KDD algorithms for various tasks.
The model of knowledge discovery that we consider is the following. Given a database r, a language L for expressing properties or defining subgroups of the data, and an interestingness predicate q for evaluating whether a sentence ' 2 L defines an interesting subclass of r. The task is to find the theory of r with respect to L and q, i.e., the set Th(L; r; q) = f' 2 L j q(r; ') is trueg:
Note that we are not specifying any satisfaction relation for the sentences of L in r: this task is taken care of by the interestingness predicate q. For some applications, q(r; ') could mean that ' is true or almost true in r, or that ' defines (in some way) an interesting subgroup of r. The roots of this approach are in the use of diagrams of models in model theory (see, e.g., [5]). The approach has been used in various forms for example in [2; 6; 7; 13; 15; 18] One should note that in contrast with, e.g., [6] our emphasis is on very simple representation languages.
Obviously, if L is infinite and q(r; ') is satisfied for infinitely many sentences, (an explicit representation of) Th(L; r; q) cannot be computed. For the above formulation to make sense, the language L has to be defined carefully.
Example 1 Given a relation r with n rows over binaryvalued attributes R, an association rule [1] is an expression of the form X ) A, where X ? R and A 2 R. Denoting t(X) = 1 iff row t 2 r has a 1 in each column A 2 X, we define the support s(X) of X to be jft 2 r j t(X) = 1gj=n: The support of the rule X ) A is s(X [ fAg), and the confidence of the rule is s(X [ fAg)=s(X):
All rules with support higher than a given threshold can be effectively found by using a simple algorithm for finding frequent sets. A set X ? R is frequent, if