An EM Approach to Grammatical Inference: Input/Output HMMs
Paolo Frasconi Yoshua Bengio
Dipartimento di Sistemi Dept. Informatique et
e Informatica Recherche Op?erationnelle
Universit?a di Firenze Universit?e de Montr?eal
50139 Firenze (Italy) Montreal, Qc H3C-3J7
We propose a modular recurrent connectionist architecture for adaptive temporal processing. The model is given a probabilistic interpretation and is trained using the EM algorithm. This model can also be seen as an Input/Output Hidden Markov Model. The focus of this paper is on sequence classification tasks. We demonstrate that EM supervised learning is well suited for solving grammatical inference problems. Experimental benchmark results are presented for the seven Tomita grammars, showing that these adaptive models can attain excellent generalization.
Challenging learning tasks, such as those related to language, can in principle be approached with recurrent neural networks. Recurrent networks can store and retrieve information in a flexible way. In particular, dynamical attractors can be used to implement reliable long-term memories. However, practical difficulties have been reported in training recurrent networks to perform tasks involving long-term dependencies. A mathematical analysis of the problem shows that either one of two conditions arises in such systems . In the first situation, the dynamics of the network allow it to reliably store bits of information, but gradients vanish exponentially fast as one propagates them backward in time. In the second situation, gradients can flow backward but the system is locally unstable and cannot reliably store bits of information in the presence of input noise. In practice, even if the task does not exhibit very long durations in the dependencies to be captured, the solution found by backpropagation tends to reflect mostly short-term dependencies, thus leading to suboptimal training and generalization.
In this paper we focus on grammatical inference. In this task the learner is presented a set of labeled strings and is requested to infer a set of rules that define a formal language. It can be considered as a prototype for more complex language processing problems. However, even in the simplest" case, i.e. regular grammars, the task can be proved to be NP- complete . Many researchers [3, 4, 5] have approached grammatical inference with recurrent networks. These studies demonstrate that second-order networks can be trained to approximate the behavior of finite state automata (FSA). However, memories learned in this way appear to lack robustness and noisy dynamics become dominant for long input strings. This has motivated research to extract automata rules from the trained network [3, 5]. In many cases, it has been shown that the extracted automaton outperforms the trained network. Although FSA extraction procedures are relatively easy to devise for symbolic inputs, they may be more difficult to apply in tasks involving a subsymbolic or continuous input space, such as in speech recognition. Moreover, the complexity of the discrete state space produced by the FSA extraction procedure may grow intolerably if the continuous network has learned a representation involving chaotic attractors. Other researchers have attempted to encourage a finite-state representation via regularization  or by integrating clustering techniques in the training procedure .
Following previous work , in this paper we propose a training method that propagates a discrete distribution of targets rather than differential error information. We assume a discrete distribution of states and we introduce a modular recurrent architecture that associates subnetworks to discrete states. This new model can also be seen as a particular type of hidden Markov model with inputs as well as outputs, hence the name IOHMM (Input/Output HMM). With this probabilistic interpretation a training algorithm