
Multilayered networks and the
CG uncertainty principle
Paolo Frasconi and Marco Gori
Dipartimento di Sistemi e Informatica, Università di Firenze
Via di Santa Marta 3, 50139 Firenze, Italy
[email protected]
ABSTRACT
The experience gained in many experiments with neural networks has shown that many challenging
problems are still hard to solve, since the learning process becomes very slow, often leading to suboptimal
solutions. In this paper we analyze this problem for the case of two-layered networks by
discussing the joint behavior of the algorithm's convergence and its generalization to new data. We
suggest two scores for generalization and optimal convergence that behave like conjugate variables in
Quantum Mechanics. As a result, the requirement of increasing the generalization is likely to affect
the optimal convergence. This suggests that "difficult" problems are better faced with biased models,
somewhat tuned on the task to be solved.
Index Terms: Backpropagation, generalization, learning by examples, multilayered networks, pattern
recognition.
1. INTRODUCTION
Multilayered networks trained by Backpropagation [12] have been widely used for applications in several
different areas, and particularly in pattern recognition [4, 10, 14, 17]. Although remarkable results have
been obtained, there are still many "difficult" problems where the learning process is very likely to get
stuck in suboptimal configurations with an excessive cost. One may wonder whether scaling up the
network size can help solve those problems. The interpolation capabilities of multilayered networks
can be improved by increasing the number of hidden units [8]. The number of external inputs may
also be increased; for example, in pattern recognition, one may choose the input representation by using a
preprocessing algorithm yielding a different number of features.
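As an illustration of the point above (not taken from the paper), the following sketch shows how a two-layered network's interpolation capability grows with the number of hidden units. The hidden weights are fixed at random and only the linear output layer is fitted by least squares; the function, seed, and hidden-unit counts are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch: training error of a two-layered network on a fixed
# sample set decreases as the number of hidden units grows. Hidden-layer
# weights are drawn at random; only the output layer is fitted exactly,
# by least squares, so convergence issues do not obscure the comparison.
rng = np.random.default_rng(0)

# Target: 20 samples of a nonlinear function on [-1, 1].
x = np.linspace(-1.0, 1.0, 20).reshape(-1, 1)
y = np.sin(3.0 * x).ravel()

def fit_error(n_hidden):
    """Mean squared training error with n_hidden sigmoidal hidden units."""
    W = rng.normal(size=(1, n_hidden)) * 4.0   # input-to-hidden weights
    b = rng.normal(size=n_hidden)              # hidden biases
    H = 1.0 / (1.0 + np.exp(-(x @ W + b)))     # sigmoidal activations
    v, *_ = np.linalg.lstsq(H, y, rcond=None)  # hidden-to-output weights
    return float(np.mean((H @ v - y) ** 2))

errors = {h: fit_error(h) for h in (2, 5, 40)}
print(errors)
```

With enough hidden units the random features span the 20 target values, so the training error approaches zero; with very few units the network cannot interpolate the sample. Note that this says nothing about generalization, which is precisely the tension the paper examines.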
In order to analyze the scaling up of networks, some researchers have used the quite natural framework
of Computational Complexity [9, 3]. These studies establish that the weight loading problem
is NP-complete, which formalizes the experimental feeling reported in many accurate practical evaluations.
As pointed out by Baum [2], however, we must give these conclusions a correct interpretation. It
would be unfair to conclude that the learning problem (referred to as the loading problem
by Judd) is intractable without specifying the basic hypotheses of these results. They rely on a somewhat
artificial protocol in which, for a given problem, an oracle is allowed to choose the architecture [2]; in this
way, the researcher cannot choose the architecture of the network himself. In any case, these complexity
analyses provide a first significant warning about the effectiveness of learning algorithms for multilayered
networks.
Another way of studying learning is to look into the shape of the cost surface, in an attempt
to understand how the stationary points are related to the network and the learning environment.
There are basically two kinds: spurious and structural stationary points [11]. Spurious stationary points are due to
the joint choice of the neuron squashing function and the cost. Typically, we can get rid of
this kind of stationary points with proper choices of the squashing function and the cost. For example, the kind
of local minima proposed by Brady et al. [5] can be avoided by choosing a cost based on asymptotic