
Backpropagation for linearlyseparable patterns:
a detailed analysis ?
Paolo Frasconi Marco Gori Alberto Tesi
Dipartimento di Sistemi e Informatica
Via di Santa Marta 3  50139 Firenze  Italy
email : [email protected]
Abstract In this paper we propose a sufficient condition for learning without local minima in multilayered networks. In particular we remove a fundamental assumption on the network architecture. We prove that the conclusions drawn in [10] also hold provided that the weight matrix associated with the hidden and output layer is pyramidal and has full rank. Moreover, the analysis is carried out by using LMSthreshold cost functions, which allow us to identify spurious and structural local minima. Index Terms  Backpropagation, local minima, multilayered networks, linearly separable patterns.
I. Introduction
Backpropagation is probably the most widely applied neural network learning algorithm. Backprop's popularity is related with its ability of dealing with complex multidimensional mappings. During the last few years, some research efforts have been devolved in explaining, from a theoretical point of view, the success of multilayered networks (MLN ) in solving practical problems. One important aspect concerns the multilayered networks' capability to learn arbitrary input/output mappings. This problem has been tackled in a number of ways, and steps have been done towards the definition of design criteria to be applied in order to meet the specifications. In particular, in [4, 5, 6] the universal interpolation capability of a large enough" MLN has been proven with just one hidden layer of sigmoidal units. Moreover, in [7], some bounds on the required number of hidden neurons are given, depending on the type of mapping to be realized. Another important research topic is that of investigating on the generalization to new examples (e.g. see [1]). In order to understand the effectiveness of learning algorithms in finding optimal solutions, the analysis of local minima of the cost function is also very important, no matter what kind of nu
?This research was partially supportedby MURST 40% and CNR Grant 90.01530.CT01.
merical method is used. A wide variety of cases exists in which Backprop is known to fail because of local minima ([3, 9, 10, 8, 15, 12]). The difficulty with Backprop is that, once the architecture is chosen, no general method exists that guarantees gradient descent to converge to the global minimum. Therefore, the investigation of the shape of the cost function is a very important issue for understanding the effectiveness of multilayered networks in solving practical problems. Some novel results have been proposed in [10]. In particular, sufficient conditions were given which guarantee Backprop to discover the optimal solution. These conditions concern both the training set and the network architecture. The former requires a linearly separable training set. The latter assumes a twolayered network, in which each output unit receives inputs only from a subset of hidden units, being such subsets disjoint.
In this paper we focus the analysis on stationary points of gradient (i.e. on the values of weights for which the gradient is null). We prove that, if the matrix of weights connecting hidden to output neurons is pyramidal and full rank, the result given in [10] can be extended to arbitrarily connected twolayer networks. Moreover, the analysis of the cost function is based on LMSThreshold cost functions [16] instead of the ordinary quadratic index. This study makes it possible to understand better the joint role of squashing and cost functions for stationary points. In [10], the assumption of asymptotic targets avoids the kind of local minima shown in [3]. The use of LMSThreshold functions represents a more general way to circumvent these kind of stationary points, referred to as spurious local minima in this paper.
The paper is organized as follows: in section 2 we propose a vectorial formulation of Backprop equations which turns out to be very useful for investigating the problem of local minima. In section 3 we prove that, under the hypothesis that the patterns are linearlyseparable, no local minima exist in the cost surface.