
where each k_i is a positive constant. When used in a network, the rule becomes recursive, that is, Δu_t is computed from Δx_{t+1}, x_t and u_t. The learning rates for each Δu_t are weighted by the probability of contribution. Let us assume, for the sake of illustration, that the flip-flop has ended in state +1 at time T, that the target state was −1, and that |u_t| < 1 for T − n < t ≤ T while u_{T−n} > 1. There are two ways of updating u_t to get the correct answer: either u_t < −1 for some t in T − n < t ≤ T, or u_{T−n} < 1. The learning rate should therefore be weighted by 0.5/n for T − n < t ≤ T and by 0.5 for t = T − n.
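The weighting scheme above can be illustrated with a minimal sketch (not the authors' code; the function name is hypothetical). It assumes the situation described in the text: the flip-flop ended in state +1 at time T while the target was −1, with |u_t| < 1 for T − n < t ≤ T and u_{T−n} > 1.

```python
def contribution_weights(T, n):
    """Weight each candidate correction time step by its probability of
    contribution: 0.5 for the single fix u_{T-n} < 1, and 0.5/n for each
    of the n equally likely fixes u_t < -1 with T-n < t <= T."""
    weights = {T - n: 0.5}            # one way: push u_{T-n} below 1
    for t in range(T - n + 1, T + 1):  # n ways: push some u_t below -1
        weights[t] = 0.5 / n
    return weights
```

Note that the two correction modes each receive total probability mass 0.5, so the weights over all candidate time steps sum to one.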
In order to adapt the threshold to the external levels of noise and signal, the input to the unit is scaled by a weight w that can also be learnt using the target on u. Experiments were performed with this system on the minimal task described in Section 2. Sequence lengths were varied from 10 to 100 in increments of 10, and zero-mean uniform noise with levels from 0.2 to 1.0 in increments of 0.2 was added. We found the algorithm to converge in more than 99% of these cases (within a maximum of 100 epochs), independently of the noise level and the sequence length, on average within 4.6 epochs. This is in contrast to the recurrent sigmoid network trained with back-propagation, which was sensitive to the noise level and whose probability of convergence decreased rapidly with increasing sequence length (as shown in Figure 2).
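The experimental grid just described, ten sequence lengths crossed with five noise levels for fifty configurations in total, can be sketched as follows (a hypothetical enumeration only; the training run on each configuration is omitted):

```python
def experiment_grid():
    """Enumerate (sequence_length, noise_level) pairs: lengths 10..100 in
    steps of 10, zero-mean uniform noise levels 0.2..1.0 in steps of 0.2.
    Integer arithmetic avoids floating-point accumulation error."""
    return [(length, tenths / 10.0)
            for length in range(10, 101, 10)
            for tenths in range(2, 11, 2)]
```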
5 Conclusion
Recurrent networks are very powerful in their ability to represent context, often outperforming static networks [1]. However, we have presented theoretical and experimental evidence indicating that, in many circumstances, simple gradient-based optimization of an error criterion may be inadequate to train them. There remain theoretical questions to be considered, such as whether the problem with simple gradient descent discussed in this paper would be observed with chaotic attractors that are not hyperbolic. A better understanding of the loss of information in the gradient may help us design more efficient algorithms and/or architectures for recurrent neural networks. In fact, based on the specifications such algorithms must meet to solve a minimal problem, we have introduced a new type of unit, with an associated learning algorithm, that can learn to store one bit of information for an indefinite duration while being resistant to input noise. In any case, the results of this paper strongly suggest that alternative training algorithms should be considered for recurrent networks. Solutions to the challenge presented here, that of learning long-term dependencies with dynamical systems such as recurrent networks, may have implications for many applications of learning systems, e.g., language-related problems, in which long-term dependencies are essential for making correct decisions.
References
[1] Y. Bengio, "Artificial Neural Networks and their Application to Sequence Recognition," Ph.D. Thesis, McGill University (Computer Science), Montreal, Qc., Canada, 1991.
[2] Y. Bengio, R. De Mori, G. Flammia, and R. Kompe, "Global Optimization of a Neural Network – Hidden Markov Model Hybrid," IEEE Transactions on Neural Networks, vol. 3, no. 2, 1992, pp. 252–259.
[3] P. Frasconi, M. Gori, and G. Soda, "Local Feedback Multilayered Networks," Neural Computation, vol. 3, 1992, pp. 120–130.
[4] P. Frasconi, M. Gori, M. Maggini, and G. Soda, "Unified Integration of Explicit Rules and Learning by Example in Recurrent Networks," IEEE Trans. on Knowledge and Data Engineering, in press.
[5] C.L. Giles and C.W. Omlin, "Inserting Rules into Recurrent Neural Networks," to appear in IEEE Transactions on Neural Networks & Signal Processing.
[6] C.M. Marcus, F.R. Waugh, and R.M. Westervelt, "Nonlinear Dynamics and Stability of Analog Neural Networks," Physica D, vol. 51 (special issue), 1991, pp. 234–247.
[7] J.M. Ortega and W.C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables, Academic Press, New York, 1970.