page 1  (18 pages)
2to next section

Decision Trellis Models for Tuple Categorization

in Databases

Paolo Frasconi1, Marco Gori2, and Giovanni Soda1

1 Dipartimento di Sistemi e Informatica. Universit?a di Firenze
Via di Santa Marta 3 { 50139 Firenze (Italy)
fpaolo|[email protected]
2 Dipartimento di Ingegneria dell'Informazione. Universit?a di Siena
Via Roma 56 { 53100 Siena (Italy)
[email protected]

Abstract. We introduce a probabilistic graphical model for supervised learning on databases with categorical attributes. The proposed graph contains hidden variables that play a role similar to nodes in decision trees and each of their states either corresponds to a class label or to a single attribute test. As a major difference with respect to decision trees, the selection of the attribute to be tested is probabilistic. Thus, the architecture can be used to assess the probability that a tuple belongs to some class, given the predictive attributes. The training algorithm can be easily derived in the general framework of graphical models, using expectation-maximization (EM) for finding the optimal parameters. We propose decision trellises as an alternative to decision trees in the context of tuple categorization in databases, which is an important step for building data mining systems. Preliminary experiments on some standard databases are reported, comparing the classification accuracy of decision trellises and decision trees induced by C4.5.

1 Introduction

Knowledge discovery from databases (KDD) is a relatively recent area of research aiming at the exploration of large amounts of relatively raw data to extract useful, valid, and understandable patterns [10]. As KDD relies upon data mining tools developed within many (seemingly distant) disciplines, it has recently received increased interest from different research communities, such as databases, artificial intelligence, knowledge engineering, statistics, and machine learning. Database technology, for example, could beneficiate of KDD thanks to the inductive (learned from data) inference paradigm, as opposite to the more traditional deductive (rule-based) inference paradigm that characterizes most of the DBMS designed so far [15]. Expert systems engineers hope to overcome the knowledge acquisition bottleneck by exploring automatic or semiautomatic techniques. Statisticians and machine learning researchers look at data mining as to a challenging task providing interesting testbeds for advanced algorithms and learning architectures.