page 1  (12 pages) 2

Appeared in AAAI Workshop on Knowledge Discovery in Databases, Eds. Usama M. Fayyad and Ramasamy Uthurusamy, pages 181 { 192, Seattle, Washington, July 1994.

Efficient Algorithms for Discovering

Association Rules

Heikki Mannila Hannu Toivonen? A. Inkeri Verkamo

University of Helsinki, Department of Computer Science

P.O. Box 26 (Teollisuuskatu 23), FIN-00014 Helsinki, Finland e-mail: fmannila, htoivone, [email protected]

Abstract

Association rules are statements of the form for 90 % of the rows of the relation, if the row has value 1 in the columns in set W , then it has 1 also in column B". Agrawal, Imielinski, and Swami introduced the problem of mining association rules from large collections of data, and gave a method based on successive passes over the database. We give an improved algorithm for the problem. The method is based on careful combinatorial analysis of the information obtained in previous passes; this makes it possible to eliminate unnecessary candidate rules. Experiments on a university course enrollment database indicate that the method outperforms the previous one by a factor of 5. We also show that sampling is in general a very efficient way of finding such rules.
Keywords: association rules, covering sets, algorithms, sampling.

1 Introduction

Data mining (database mining, knowledge discovery in databases) has recently been recognized as a promising new field in the intersection of databases, artificial intelligence, and machine learning (see, e.g., [11]). The area can be loosely defined as finding interesting rules or exceptions from large collections of data.

Recently, Agrawal, Imielinski, and Swami introduced a class of regularities, association rules, and gave an algorithm for finding such rules from a database with binary data [1]. An association rule is an expression W ) B, where W is a set of attributes and B a single attribute. The intuitive meaning of such a rule is that in the rows of the database where the attributes in W have value true, also the attribute B tends to have value true. For instance, a rule might state that students taking courses CS101 and CS120, often also take the course CS130. This sort of information can be used, e.g., in assigning classrooms for the courses. Applications of association rules include customer behavior

?On leave from Nokia Research Center.

181