
Diffusion of Credit in Markovian Models

DRAFT - To appear in NIPS 7

Yoshua Bengio
Dept. I.R.O., Université de Montréal,
Montreal, Qc, Canada H3C-3J7
[email protected]

Paolo Frasconi
Dipartimento di Sistemi e Informatica
Università di Firenze, Italy
[email protected]


This paper studies the problem of diffusion in Markovian models (such as hidden Markov models) and shows how it makes the task of learning long-term dependencies in sequences very difficult.

1 Introduction

This paper is part of our research on the problem of learning long-term dependencies in sequences. In our previous work (Bengio, Simard & Frasconi, 1994) we found theoretical reasons for the difficulty of training recurrent networks (or, more generally, parametric dynamical systems) to learn long-term dependencies. The main result stated that either long-term storage or gradient propagation would be harmed, depending on whether the norm of the Jacobian of the state-to-state function was greater or less than 1. In this paper we consider a special case in which the norm of the Jacobian of the state-to-state function is constrained to be exactly 1, because this matrix is a stochastic matrix. This paper thus deals with learning long-term dependencies in systems that have this property, i.e., Markovian models, such as hidden Markov models (HMMs).
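The constraint can be illustrated numerically. The following sketch (an illustration we add here, not code from the paper) builds a random row-stochastic matrix, checks that its induced infinity-norm is exactly 1, and shows that powers of such a matrix nevertheless converge toward a rank-one matrix, so the influence of the initial state is forgotten over long horizons:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_stochastic(n):
    """Random row-stochastic matrix: nonnegative entries, each row sums to 1."""
    A = rng.random((n, n))
    return A / A.sum(axis=1, keepdims=True)

A = random_stochastic(4)

# Rows sum to 1, so the induced infinity-norm is exactly 1.
assert np.allclose(A.sum(axis=1), 1.0)

# Products of stochastic matrices remain stochastic: the norm never exceeds 1.
P = np.linalg.matrix_power(A, 50)
assert np.allclose(P.sum(axis=1), 1.0)

# Yet for a generic (ergodic) chain, every row of A^t converges to the same
# stationary distribution: A^t approaches a rank-one matrix, and the
# contribution of the initial state washes out -- this is diffusion.
print(np.max(np.abs(P - P[0])))  # rows of P are nearly identical for large t
```

The same phenomenon underlies the vanishing of credit assignment discussed below: multiplying many stochastic Jacobians spreads gradient information uniformly across states instead of preserving it.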

Consider a stationary or non-stationary Markov process with n states and transition matrices A_t (constant in the stationary case), with entries A_ij(u_t) = P(q_t = j | q_{t-1} = i; u_t, θ),

where u_t is an external input (constant in the stationary case) and θ is a set of parameters. By associating an output distribution to each state, such models can learn the distribution of output sequences. In the non-stationary case, transition and output distributions are conditional on the input sequences, allowing the model to capture relationships between input and output sequences (Bengio & Frasconi, 1994b), e.g., to do regression or classification. We thus call this kind of non-stationary HMM an Input/Output HMM (IOHMM). The EM algorithm can be applied to maximize the data likelihood for both HMMs and IOHMMs. In both cases, training requires