
Discovering generalized episodes using minimal occurrences
Heikki Mannila and Hannu Toivonen
University of Helsinki
Department of Computer Science
P.O. Box 26, FIN00014 Helsinki, Finland
[email protected], [email protected]
Abstract
Sequences of events are an important special form of data that arises in several contexts, including telecommunications, user interface studies, and epidemiology. We present a general and flexible framework of specifying classes of generalized episodes. These are recurrent combinations of events satisfying certain conditions. The framework can be instantiated to a wide variety of applications by selecting suitable primitive conditions. We present algorithms for discovering frequently occurring episodes and episode rules. The algorithms are based on the use of minimal occurrences of episodes; this makes it possible to evaluate confidences of a wide variety of rules using only a single analysis pass. We present empirical results on the behavior of the algorithms on events stemming from a WWW log.
Introduction
Sequences of events are a common form of data that can contain important knowledge to be discovered. Examples of such data are telecommunications network alarms, user interface actions, crimes committed by a person, occurrences of recurrent illnesses, etc. Recently, interest in knowledge discovery from sequences of events has increased: see, e.g., (Dousson, Gaborit, & Ghallab 1993; Laird 1993; Wang et al. 1994; Morris, Khatib, & Ligozat 1995; Bettini, Wang, & Jajodia 1996).
In a previous paper (Mannila, Toivonen, & Verkamo 1995) we showed how sequences of events can be analyzed by locating frequently occurring episodes from them. An episode is a combination of events with a partially specified order; it occurs in a sequence, if there are occurrences of the events in an order consistent with the given order, within a given time bound.
In a telecommunication application, the rules found using the methods of (Mannila, Toivonen, & Verkamo 1995) have proven to be useful and they have been integrated in alarm handling software (H?at?onen et al.
1996). However, application studies both in the telecommunications domain and elsewhere have shown that there is a need for extensions to the methods.
In this paper we study what types of episodes can be efficiently discovered from long sequences of events. We present a simple and powerful framework for building episodes out of predefined predicates, and describe how such episodes can be discovered efficiently. The framework allows one to express arbitrary unary conditions on the individual events, and to pose binary conditions on the pairs of events, giving one the possibility to exactly target the types of event combinations that are interesting in the applications.
The algorithms described in the paper are based on minimal occurrences of episodes. In addition to being simple and efficient, this formulation has the advantage that the confidences and frequencies of rules with different time bounds can be obtained quickly, i.e., there is no need to rerun the analysis if one only wants to modify the time bounds. In case of complicated episodes, the time needed for recognizing the occurrence of an episode can be significant; the use of stored minimal occurrences of episodes eliminates unnecessary repetition of the recognition effort.
Episodes: patterns in event sequences
We use a fairly standard way of modeling events in
time. Given the set R = fA1; : : : ; Amg of event attributes
with domains DA1 ; : : : ; DAm , an event e over R is
a (m + 1)tuple (a1; : : : ; am; t), where ai 2 DAi and t
is a real number, the time of e. We refer to the time of
e by e:T , and to an attribute A 2 R of e by e:A. An
event sequence S is a collection of events over R, i.e.,
a relation over R[ fTg, where the domain of attribute
T is the set of real numbers.
An episode P on variables fx1; : : : ; xkg, denoted
P (x1; : : : ; xk), is a conjunction
k^
i=1
'i(yi; zi);