Multilayered networks and the
C-G uncertainty principle
Paolo Frasconi and Marco Gori
Dipartimento di Sistemi e Informatica Universit?a di Firenze
Via di Santa Marta 3 - 50139 Firenze - Italy
The experience gained in many experiments with neural networks has shown that many challenging
problems are still hard to solve, since the learning process becomes very slow, often leading to suboptimal
solutions. In this paper we analyze this problem for the case of two-layered networks by
discussing on the joint behavior of the algorithm convergence and the generalization to new data. We
suggest two scores for generalization and optimal convergence that behave like conjugate variable in
Quantum Mechanics. As a result, the requirement of increasing the generalization is likely to affect
the optimal convergence. This suggests that difficult" problems are better faced with biased-models,
somewhat tuned on the task to be solved.
Index Terms- Backpropagation, generalization, learning by examples, multilayered networks, pattern recognition.
Multilayered networks trained by Backpropagation12 have been widely used for applications to several
different subjects and particularly to pattern recognition4;10;14;17. Although remarkable results have
been obtained, there are still many difficult" problems where the learning process is very likely to get
stuck in sub-optimal configurations with an excessive cost. One may wonder if the scaling up of the
network size can help solving those problems. The interpolation capabilities of multilayered networks
can be improved by increasing the number of hidden units8. Also the number of external inputs may
be increased. For example, in pattern recognition, one may choose the input representation by using a
preprocessing algorithm yielding a different number of features.
In order to analyze the scaling up of the networks, some researchers have used the quite natural framework of Computational Complexity9;3. These studies lead to establish that the weight loading problem is NP-complete. This formalizes the experimental feeling reported in many accurate practical evaluations. As pointed out by Baum2, however, we must give these conclusions a correct interpretation. It would be unfair to conclude the intractability of the learning problem (referred to as loading problem by Judd) without the specification of the basic hypotheses of these results. They rely on a somewhat artificial protocol in which, for a given problem, an oracle is allowed to choose the architecture2. In this way the researcher cannot choose himself the architecture of the network. In any case, these complexity analyses provide a first significant warning to the effectiveness of learning algorithms for multilayered networks.
Another way of studying the learning is that of looking into shape of the cost surface, in the attempt of understanding how the stationary points are related to the network and the learning environment. There are basically spurious and structural stationary points11. Spurious stationary points are due to the joint assumption of the euron squashing function" and the cost. Typically, we can get rid of this kind of stationary points with proper choices of the squash and the cost. For example, the kind of local minima proposed by Brady et al.5 can be avoided by choosing a cost based on asymptotic