page 1  (6 pages)
2to next section

Proceedings of The First International Conference on Knowledge Discovery and Data Mining (KDD-95), pages 210 { 215, Montreal, Canada, August 20{21, 1995.

Discovering frequent episodes in sequences

Extended abstract

Heikki Mannila and Hannu Toivonen and A. Inkeri Verkamo Department of Computer Science
P.O.Box 26, FIN{00014 University of Helsinki, Finland fmannila, htoivone, [email protected]

Abstract

Sequences of events describing the behavior and actions of users or systems can be collected in several domains. In this paper we consider the problem of recognizing frequent episodes in such sequences of events. An episode is defined to be a collection of events that occur within time intervals of a given size in a given partial order. Once such episodes are known, one can produce rules for describing or predicting the behavior of the sequence. We describe an efficient algorithm for the discovery of all frequent episodes from a given class of episodes, and present experimental results.

Introduction

Most machine learning and data mining techniques are adapted towards the analysis of unordered collections of data. However, there are important application areas where the data to be analyzed has an inherent sequential structure.

For instance, in telecommunications network monitoring or empirical user interface studies it is easy to log a lot of information about the behavior and actions of the user and the system. Abstractly such a log can be viewed as a sequence of events, where each event has an associated time of occurrence. An example of an event sequence is

(A; 123); (B; 125); (D; 140); (A; 150); (C; 151); (B; 155); (D; 201); (A; 220); (D; 222); (B; 225):

Here A, B, and C are event types (e.g., different types of notifications in the network or different types of user actions), and the numbers indicate the times when the events occurred.

One basic problem in analyzing such a sequence is to find frequent episodes, i.e., collections of events occurring frequently close to each other. For example, in the above sequence the episode A followed by B" occurs several times, even when the sequence is viewed through a window of (say) 6 time units.

Episodes, in general, are partially ordered sets of events. They can also be described as directed acyclic

graphs in the obvious way. Consider, for instance, the episodes (a), (b), and (c):

A B C

(a)

A

B

C

(b)

B

A

D

C

(c)

Episode (a) is a serial episode: it occurs in a sequence only if there are events A, B, and C that occur in this order in the sequence relatively close together. Note that in the sequence there can be other events occurring between these three. Episode (b) is a parallel episode: no constraints on the relative order of A, B, and C are given. Episode (c) occurs in a sequence if there are occurrences of A and B and these precede the occurrences of C and D; no constraints on the relative order of A and B (or C and D) are given. All four occurrences must be close to each other in time.

The user defines how close is close enough by giving the width of the time window within which the episode must occur. The user also specifies how often an episode has to occur to be considered frequent.

In the analysis of sequences the user is interested in finding all frequent episodes from a class of episodes. Such a class can be, e.g., all serial episodes, all parallel episodes, or all episodes where the partial order matches some part of the network topology; one can even consider the class of all partial orders.

Once the episodes are known, they can be used to obtain rules for prediction. For example, if we know that the serial episode (a) occurs in 4.0 % of the windows and that the serial subepisode consisting only of A and B occurs in 4.2 % of the windows, we can estimate that after seeing a window where A and B occur in this order, there is a chance of 0.95 that C will occur in the same window. One can compute such strengths of rules using the information obtained by our algorithm; it is also easy to obtain confidence intervals for the rule strengths.

210