
Copyright © 1996 by Leonard G. C. Hamey. Appears in Proc. Seventh Australian Conf. Artificial Neural Networks, pages 173-178, 1996.
Results on Weight Configurations that are not Local Minima
in Feed-Forward Neural Networks
Leonard G. C. Hamey
School of MPCE
Macquarie University
NSW 2109 Australia
ABSTRACT
Local minima in the error surfaces of feed-forward neural networks are significant because they may entrap gradient-based training algorithms. Recent results have identified conditions under which local minima do not occur. The present paper considers three distinct definitions of local minimum, concluding that a new definition, called regional minimum, corresponds most closely to intuition. Using this definition, we analyse weight configurations in which a hidden node is ignored or redundant and show that these are not local minima. The practical implications of this result for gradient-based learning are discussed.
1. Introduction
Local minima in the error surfaces of feed-forward neural networks are of interest because they may entrap training algorithms based upon gradient descent (Rumelhart, Hinton and Williams, 1986). Global search techniques such as simulated annealing (Kirkpatrick, Gelatt and Vecchi, 1983) are often too computationally expensive for real-world problems.
Practical experience suggests that local minima are uncommon (Rumelhart et al., 1986). Theoretical results are now being obtained showing that local minima do not occur under suitably restricted conditions (Hamey, 1994; Gori and Tesi, 1992; Poston, Lee, Choie and Kwon, 1991; Sontag and Sussmann, 1991; Yu, 1992).
Recent theoretical analysis of the exclusive-or problem by Hamey (1995a; 1996; 1995b) and Sprinkhuizen-Kuyper and Boers (1994a; 1994b; 1994c) has employed detailed case analysis to show that, contrary to long-held belief, this problem does not exhibit local minima. In particular, Blum's (1989) analytical error has been corrected (Hamey, 1995b; Sprinkhuizen-Kuyper and Boers, 1994b).
In the present work, general results are obtained showing that weight configurations are not local minima if a hidden node is linearly redundant with the other hidden nodes (including the bias node) or has a zero connecting weight to the output nodes. This means that 'stray' and 'ignored' hidden nodes (Wessels, Barnard and van Rooyen, 1991) are not evidence of local minima, and that the possibility exists for a gradient descent learning algorithm to recover such hidden nodes and employ them to improve the network error. At worst, such situations are plateaus in the error surface.
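The zero-output-weight case can be illustrated numerically. The following sketch (not from the paper; the network size, weights and data set are illustrative assumptions) builds a 2-2-1 sigmoid network for exclusive-or in which the second hidden node has a zero connecting weight to the output. Perturbing only that node's incoming weights leaves the error exactly unchanged, showing a flat direction in the error surface rather than a strict minimum:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Exclusive-or training set; the third input column is a constant bias.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
T = np.array([0.0, 1.0, 1.0, 0.0])

def error(W_in, w_out):
    """Sum-of-squares error of a 2-2-1 network (hidden bias appended)."""
    H = sigmoid(X @ W_in)                 # hidden activations, shape (4, 2)
    H = np.hstack([H, np.ones((4, 1))])   # bias node in the hidden layer
    y = sigmoid(H @ w_out)
    return float(np.sum((y - T) ** 2))

rng = np.random.default_rng(0)
W_in = rng.normal(size=(3, 2))
w_out = np.array([1.5, 0.0, -0.5])   # second hidden node is 'ignored'

e0 = error(W_in, w_out)

# Perturb only the ignored node's incoming weights.
W_pert = W_in.copy()
W_pert[:, 1] += rng.normal(size=3)
e1 = error(W_pert, w_out)

# Because the node's output weight is zero, its activation never reaches
# the output, so the error is identical: a plateau, not a strict minimum.
print(e0 == e1)
```

Note that this only exhibits the flatness; the paper's result is the stronger claim that such configurations are also not regional minima.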
2. Definitions of Local Minimum
The analysis of local minima clearly hinges upon the definition of local minimum. In this section, we consider three possible definitions: relative, strict relative and regional minima. Informally, a point on the error surface is a relative minimum if no nearby points have lower error than the point itself, while it is a strict relative minimum if all nearby points have greater error. The third definition says that a point is a regional minimum if every path from that point to a global minimum increases the error at some point along the way.
Formally, let g : D → ℝ be a scalar differentiable function on domain D. In the case of feed-forward neural networks, g is the error surface and D is the weight space.
A point w0 is a relative minimum if g(w0) ≤ g(w) for all w in a neighbourhood of w0, i.e. for all w with |w − w0| < ε for some positive ε. If g(w0) < g(w) for all w ≠ w0 in the neighbourhood then w0 is a strict relative minimum. These definitions are well-known and commonly used (Luenberger, 1984).
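The distinction between the two definitions can be made concrete with a one-dimensional sketch (the functions below are illustrative, not from the paper). On a flat region every point satisfies the relative-minimum condition but fails the strict condition, since neighbouring points attain equal error:

```python
def g_plateau(w):
    # Flat (zero) on [-1, 1]: every point there is a relative minimum,
    # but none is strict, because neighbouring points tie.
    return max(0.0, abs(w) - 1.0) ** 2

def g_strict(w):
    # w = 0 is a strict relative minimum: all nearby points are greater.
    return w * w

eps = 1e-3

# w0 = 0.5 lies on the plateau: no neighbour has lower error...
assert all(g_plateau(0.5 + d) >= g_plateau(0.5) for d in (-eps, eps))
# ...but the minimum is not strict, since a neighbour attains equal error.
assert g_plateau(0.5 + eps) == g_plateau(0.5)

# By contrast, w0 = 0 is a strict relative minimum of g_strict.
assert all(g_strict(0.0 + d) > g_strict(0.0) for d in (-eps, eps))
```

The plateau case is exactly the situation the next paragraph raises: under the relative-minimum definition, flat regions count as minima even though no descent barrier exists.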
Typically, neural network researchers use the term "local minimum" to refer to suboptimal relative minima. However, this definition is counterintuitive in two respects. Firstly, any point where the error surface is locally flat is a relative minimum; i.e. the definition makes no distinction between a plateau and a true minimum. Such a dis