Learning without Local Minima in Radial Basis
Monica Bianchini, Paolo Frasconi, Student Member, IEEE, and Marco Gori, Member, IEEE
Abstract| Learning from examples plays a central role in artificial neural networks (ANN). However, the success of many learning schemes is not guaranteed, since algorithms like Backpropagation (BP) may get stuck in local minima, thus providing sub-optimal solutions. For feedforward networks, the theoretical results reported in [5,6,15,20] show that optimal learning can be achieved provided that certain conditions on the network and the learning environment are met. A similar investigation is put forward in this paper for the case of networks using radial basis functions (RBF) [10,14]. The analysis proposed in  is extended naturally under the assumption that the patterns of the learning environment are separable by hyperspheres. In that case, we prove that the attached cost function is local minima free with respect to all the weights. This provides us with some theoretical foundations for a massive application of RBF in pattern recognition.
Keywords| Backpropagation, multilayered networks, pattern recognition, radial basis functions.
In the last few years there has been renewed interest in
artificial neural networks which have been used for several
different applications. For example, in pattern recognition
the remarkable experience gained by numerous experiments
suggests that ANNs are very promising tools.
However, in order to exploit the potential learning capabilities
of these techniques, a more thorough knowledge of the
learning algorithms seems really necessary. Many of these
algorithms essentially deal with the optimization of functions
that, in pattern recognition, may be based on several
thousand variables. The rich literature on optimization algorithms
still has to be exploited in a suitable way in order
to deal effectively with such functions. There is, however,
no doubt that the shape of those cost functions deeply affects
the success of any candidate algorithm. The presence
of stationary points, and particularly of local minima, may
severely limit the effectiveness of any neural network learning
scheme. Although many suggestions have been given
for avoiding local minima (see e.g.: [18,21]), their presence
represents a serious problem from the computational point
of view. For local minima free cost functions just simple
gradient descent algorithms allow us to discover optimal
solutions with a limited computational burden. This motivates
the research of conditions for guaranteeing the absence
of local minima. To some extent, such conditions give
us an idea of the difficulties of the problem we are dealing
In the case of feedforward networks, analyses on optimal
The authors are with Dipartimento di Sistemi e Informatica, Universit?a di Firenze (Italy).
learning have been carried out by numerous researchers in the attempt to find examples of local minima and conditions which imply local minima free error surfaces. A general analysis on that problem can be found in  which also ensures that the error surfaces are local minima free in the case of pyramidal networks.
There have also been attempts to establish straightforward sufficient conditions for guaranteeing the absence of local minima. Roughly speaking BP convergence is guaranteed for many input and many hidden networks. In , the local minima analysis is specialized for the case of linearly separable patterns. Linear separability is likely to hold for patterns represented by many coordinates." This hypothesis does not require constraints on the network architecture and particularly on the number of hidden units. As a consequence, a good generalization can be attained on the test set. The limitation, of course, is with the condition itself, which can be met in some problems of pattern recognition, but certainly not in general. Moreover, the use of hidden units for dealing with linearly separable patterns may seem unnecessary, since a one-layer network suffices for separating the learning set1.
Poston et al.  have shown that the absence of local minima is guaranteed for networks having one hidden layer and as many hidden units as patterns. Yu  independently obtained an interesting result that contains Poston's result as a particular case. In practice, the conditions they assume obviously lead to nets with many hiddens." In the case of many hidden networks," the absence of local minima can be ensured without making assumptions about geometrical or statistical properties of data. Unfortunately, the resulting architectures are not likely to generalize very well to new examples because of the large number of trainable parameters. It is worth mentioning that the condition assumed in  is slightly less restrictive than the one proposed in , since it establishes that the number of hidden units required for ensuring local minima free error surface is equal to the number of different inputs. Although it seems unlikely to have coincident inputs in practical applications, Yu's analyses may also be useful when considering clustering" of inputs, at least when those clusters are nearly reduced to single points.
Recently, many researchers have used RBF for a wide range of applications and have also proposed comparisons to feedforward nets (see e.g.: ). To the best of our
1As pointed out in , however, the generalization to new examples is better for networks with a hidden layer.