
Diffusion of Credit in Markovian Models
DRAFT - To appear in NIPS 7
Yoshua Bengio
Dept. I.R.O., Université de Montréal,
Montréal, Qc, Canada H3C 3J7
[email protected]
Paolo Frasconi
Dipartimento di Sistemi e Informatica
Università di Firenze, Italy
[email protected]
http://wwwdsi.ing.unifi.it/~paolo/
Abstract
This paper studies the problem of diffusion in Markovian models (such as hidden Markov models) and shows how it makes the task of learning long-term dependencies in sequences very difficult.
1 Introduction
This paper is part of our research on the problem of learning long-term dependencies in sequences. In our previous work (Bengio, Simard & Frasconi, 1994) we found theoretical reasons for the difficulty in training recurrent networks (or, more
generally, parametric dynamical systems) to learn long-term dependencies. The main result stated that either long-term storing or gradient propagation would be harmed, depending on whether the norm of the Jacobian of the state-to-state function was greater or less than 1. In this paper we consider a special case in which the norm of the Jacobian of the state-to-state function is constrained to be exactly 1, because this matrix is a stochastic matrix. This paper thus deals with learning long-term dependencies in systems that have this property, i.e., Markovian models such as hidden Markov models (HMMs).
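To make the diffusion phenomenon concrete, the following sketch (a hypothetical 3-state example, not taken from the paper) shows how repeated multiplication by a stochastic transition matrix washes out information about the initial state: two different initial distributions converge to the same stationary distribution, so little credit can be propagated back over long spans.

```python
import numpy as np

# Hypothetical row-stochastic 3-state transition matrix (each row sums to 1);
# the specific values are an assumption chosen for illustration.
A = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])

# Two different initial state distributions.
p = np.array([1.0, 0.0, 0.0])
q = np.array([0.0, 1.0, 0.0])

# Evolve both distributions under the same Markov chain.
for t in range(50):
    p = p @ A
    q = q @ A

# After many steps the two distributions are nearly identical:
# information about the initial state has diffused away.
print(np.abs(p - q).max())
```

For this matrix the second-largest eigenvalue is 0.7, so the gap between the two distributions shrinks by a factor of about 0.7 at each step; after 50 steps it is numerically negligible. Gradients propagated back through the chain shrink at the same rate, which is the diffusion-of-credit problem studied here.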
Consider a stationary or nonstationary Markov process with $n$ states and transition matrices $A_t$ (constant in the stationary case), with entries
$$A_{ij}(u_t) = P(q_t = j \mid q_{t-1} = i; u_t, \theta)$$
where $u_t$ is an external input (constant in the stationary case) and $\theta$ is a set of parameters. By associating an output distribution with each state, such models can learn the distribution of output sequences. In the nonstationary case, the transition and output distributions are conditional on the input sequence, making it possible to model relationships between input and output sequences (Bengio & Frasconi, 1994b), e.g., to do regression or classification. We thus call this kind of nonstationary HMM an Input/Output HMM (IOHMM). The EM algorithm can be applied to maximize the data likelihood for both HMMs and IOHMMs. In both cases, training requires