
Learning without Local Minima in Radial Basis

Function Networks

Monica Bianchini, Paolo Frasconi, Student Member, IEEE, and Marco Gori, Member, IEEE

The authors are with Dipartimento di Sistemi e Informatica, Università di Firenze (Italy).

Abstract -- Learning from examples plays a central role in artificial neural networks (ANN). However, the success of many learning schemes is not guaranteed, since algorithms like Backpropagation (BP) may get stuck in local minima, thus providing sub-optimal solutions. For feedforward networks, the theoretical results reported in [5,6,15,20] show that optimal learning can be achieved provided that certain conditions on the network and the learning environment are met. A similar investigation is put forward in this paper for the case of networks using radial basis functions (RBF) [10,14]. The analysis proposed in [6] is extended naturally under the assumption that the patterns of the learning environment are separable by hyperspheres. In that case, we prove that the attached cost function is local minima free with respect to all the weights. This provides us with some theoretical foundations for a massive application of RBF in pattern recognition.

Keywords -- Backpropagation, multilayered networks, pattern recognition, radial basis functions.

I. Introduction

In the last few years there has been renewed interest in artificial neural networks, which have been used in several different applications. For example, in pattern recognition, the remarkable experience gained in numerous experiments suggests that ANNs are very promising tools. However, in order to exploit the potential learning capabilities of these techniques, a more thorough knowledge of the learning algorithms is really necessary. Many of these algorithms essentially deal with the optimization of functions that, in pattern recognition, may involve several thousand variables. The rich literature on optimization algorithms still has to be exploited in a suitable way in order to deal effectively with such functions. There is, however, no doubt that the shape of these cost functions deeply affects the success of any candidate algorithm. The presence of stationary points, and particularly of local minima, may severely limit the effectiveness of any neural network learning scheme. Although many suggestions have been given for avoiding local minima (see e.g.: [18,21]), their presence represents a serious problem from the computational point of view. For local minima free cost functions, even simple gradient descent algorithms allow us to discover optimal solutions with a limited computational burden. This motivates the search for conditions guaranteeing the absence of local minima. To some extent, such conditions give us an idea of the difficulty of the problem we are dealing with.
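
As a simple illustration of this point (a toy sketch of ours, not taken from the paper; the cost functions and constants below are arbitrary choices), plain gradient descent on a one-dimensional cost with two minima stops at either minimum depending on the starting weight, whereas on a local minima free (here convex) cost it reaches the optimal solution from any initialization.

```python
# Toy sketch: sensitivity of plain gradient descent to local minima.

def gradient_descent(grad, w0, lr=0.05, steps=2000):
    # Plain gradient descent from the initial weight w0.
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def grad_multi(w):
    # Gradient of E(w) = w^4 - 2 w^2 + 0.3 w, which has a global minimum
    # near w = -1.0 and a spurious local minimum near w = +1.0.
    return 4 * w**3 - 4 * w + 0.3

def grad_convex(w):
    # Gradient of the local minima free cost E(w) = (w - 2)^2.
    return 2 * (w - 2.0)

for w0 in (-2.0, 2.0):
    # The two-minima cost traps the descent in a sub-optimal solution when
    # started from w0 = 2.0; the convex cost is solved from both starts.
    print("two-minima cost, start at", w0, "->", gradient_descent(grad_multi, w0))
    print("convex cost,     start at", w0, "->", gradient_descent(grad_convex, w0))
```
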
In the case of feedforward networks, analyses of optimal learning have been carried out by numerous researchers in the attempt to find examples of local minima and conditions which imply local minima free error surfaces. A general analysis of that problem can be found in [6], which also proves that the error surfaces are local minima free in the case of pyramidal networks.

There have also been attempts to establish straightforward sufficient conditions for guaranteeing the absence of local minima. Roughly speaking, BP convergence is guaranteed for "many input" and "many hidden" networks. In [6], the local minima analysis is specialized for the case of linearly separable patterns. Linear separability is likely to hold for patterns represented by "many coordinates." This hypothesis does not impose constraints on the network architecture, and particularly on the number of hidden units. As a consequence, good generalization can be attained on the test set. The limitation, of course, lies with the condition itself, which can be met in some problems of pattern recognition, but certainly not in general. Moreover, the use of hidden units for dealing with linearly separable patterns may seem unnecessary, since a one-layer network suffices for separating the learning set.¹
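
To make the last remark concrete, the following toy sketch (our illustration on arbitrary synthetic data, not part of the paper) trains a single-layer network with the classical perceptron rule on a linearly separable two-class set: no hidden units are needed to separate the learning set, although, as the footnote recalls, a hidden layer may still help generalization.

```python
# Toy sketch: a one-layer network separates a linearly separable learning set.
import numpy as np

rng = np.random.default_rng(0)

# Two linearly separable classes in the plane, clustered around (+2,+2) and (-2,-2).
X = np.vstack([rng.normal(+2.0, 0.5, (20, 2)),   # class +1
               rng.normal(-2.0, 0.5, (20, 2))])  # class -1
t = np.array([+1] * 20 + [-1] * 20)

w = np.zeros(2)
b = 0.0
for epoch in range(100):
    errors = 0
    for x, target in zip(X, t):
        if target * (w @ x + b) <= 0:   # misclassified pattern
            w += target * x             # perceptron update
            b += target
            errors += 1
    if errors == 0:                     # learning set fully separated
        break

print("epochs:", epoch + 1, "weights:", w, "bias:", b)
print("all patterns separated:", bool(np.all(t * (X @ w + b) > 0)))
```
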

Poston et al. [15] have shown that the absence of local minima is guaranteed for networks having one hidden layer and as many hidden units as patterns. Yu [20] independently obtained an interesting result that contains Poston's result as a particular case. In practice, the conditions they assume obviously lead to nets with "many hidden units." In the case of "many hidden" networks, the absence of local minima can be ensured without making assumptions about geometrical or statistical properties of the data. Unfortunately, the resulting architectures are not likely to generalize very well to new examples because of the large number of trainable parameters. It is worth mentioning that the condition assumed in [20] is slightly less restrictive than the one proposed in [15], since it establishes that the number of hidden units required for ensuring a local minima free error surface is equal to the number of different inputs. Although it seems unlikely to have coincident inputs in practical applications, Yu's analysis may also be useful when considering "clustering" of inputs, at least when those clusters are nearly reduced to single points.
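
The intuition behind such "one hidden unit per pattern" conditions can be conveyed with RBF networks themselves (a toy sketch of ours; the cited results actually concern sigmoidal networks, so this is an analogy rather than their construction): when Gaussian hidden units are centered on the distinct input patterns, the hidden-layer interpolation matrix is square and nonsingular, so output weights fitting the targets exactly always exist, at the price of one trainable unit per pattern.

```python
# Toy sketch: as many Gaussian units as distinct patterns gives exact interpolation.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))     # 10 distinct patterns in R^3 (arbitrary toy data)
t = rng.normal(size=10)          # arbitrary real targets
sigma = 1.0

# Interpolation matrix: Phi[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)),
# one Gaussian hidden unit centered on each pattern.
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Phi = np.exp(-d2 / (2 * sigma ** 2))

# Phi is nonsingular for distinct centers, so the output weights solve the
# interpolation problem exactly and the training cost is driven to zero.
w = np.linalg.solve(Phi, t)
print("max training error:", np.abs(Phi @ w - t).max())  # essentially zero
```
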

Recently, many researchers have used RBF for a wide range of applications and have also proposed comparisons to feedforward nets (see e.g.: [10]). To the best of our

¹ As pointed out in [9], however, the generalization to new examples is better for networks with a hidden layer.