A Hybrid Model for Learning Sequential Navigation
Ron Sun, Todd Peterson
University of Alabama, Tuscaloosa, AL 35487
To deal with reactive sequential decision tasks, we present a learning model Clarion, which is a hybrid connectionist model consisting of both localist and distributed representations, based on the two-level approach proposed in Sun (1995). The model learns and utilizes procedural and declarative knowledge, tapping into the synergy of the two types of processes. It unifies neural, reinforcement, and symbolic methods to perform on-line, bottom-up learning. Experiments in various situations are reported that shed light on the workings of the model.
This paper presents a hybrid model that unifies neural, symbolic, and reinforcement learning into an integrated architecture. It addresses the following three issues: (1) It deals with concurrent on-line learning: It allows a situated agent to learn continuously from on-going experience in the world, without the use of preconstructed data sets or preconceived concepts. (2) The model learns not only low-level specific skills but also high-level generic (declarative) knowledge (which is beyond traditional reinforcement learning algorithms, as will be discussed later). (3) The learning is bottom-up: generic knowledge is acquired from an agent's experience interacting with the world through the mediation of low-level skills. This differs from top-down learning, in which low-level knowledge is acquired through "compiling" mostly externally given high-level knowledge (Anderson 1983). Although the essential motivation for this model is cognitive modeling, this paper will focus only on computational experiments.
Reactive sequential decision tasks (Sun and Peterson 1995) involve selecting and performing a sequence of actions step by step on the basis of the current state, or the moment-to-moment perceptual information (hence the term "reactive"). At certain points, the agent may receive payoffs or reinforcements for its actions performed at or prior to the current state. The agent may
Figure 1: Navigating Through A Minefield
want to maximize the total payoffs. Thus, the agent may need to perform credit assignment, to attribute the payoffs/reinforcements to actions at various points in time (the temporal credit assignment problem), in accordance with various aspects of a state (the structural credit assignment problem). There is in general no teacher input. The agent starts with little or no a priori knowledge. One example involves learning to navigate through mines (see Figure 1).
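The setting above can be sketched in code. The following is a minimal, hypothetical illustration (not the paper's minefield simulator): the agent perceives only its current position (reactive), acts step by step, and receives a payoff only when an episode ends, so credit for that payoff must later be assigned to the earlier actions. The corridor length, mine position, and action set are illustrative assumptions.

```python
def run_episode(policy, length=6, mine=3, max_steps=20):
    """Run one episode of a toy reactive decision task.

    The agent starts at position 0 and at each step sees only its
    current position (the "state"). It may step (+1) or jump (+2).
    Landing on the mine yields payoff -1; reaching the end yields +1.
    No payoff arrives until the episode ends, which is what creates
    the temporal credit assignment problem.
    """
    pos, trajectory = 0, []
    for _ in range(max_steps):
        action = policy(pos)              # 1 = step, 2 = jump
        trajectory.append((pos, action))
        pos += action
        if pos == mine:
            return trajectory, -1.0       # hit a mine: negative reinforcement
        if pos >= length:
            return trajectory, +1.0       # reached the target: positive payoff
    return trajectory, 0.0                # ran out of time: no payoff

# A policy that always steps walks straight into the mine; one that
# always jumps clears it and reaches the target.
traj_bad, payoff_bad = run_episode(lambda s: 1)
traj_good, payoff_good = run_episode(lambda s: 2)
```

The point of the sketch is that the payoff at the end says nothing, by itself, about which of the earlier (state, action) pairs in the trajectory deserve the credit or blame; a learning mechanism must distribute it.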
To acquire low-level specific skills in these tasks, there are some existing methods available. Chief among them is the temporal difference method (Sutton 1988), a type of reinforcement learning that learns by exploiting the difference in evaluating actions at successive steps, and thus handles sequences in an incremental manner. This approach has been applied to learning in mazes, navigation tasks, and robot control (Sutton 1990, Lin 1992, Mahadevan and Connell 1992). But these methods do not learn generic knowledge (generic rules). This approach can be extended to full-fledged dynamic programming and partially observable Markov decision process models; however, these models often require a domain model to begin with.
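The temporal-difference idea can be illustrated with a short sketch, here in its Q-learning form (one common instantiation, not necessarily the exact variant used in the model): the evaluation of a state-action pair is nudged toward the received payoff plus the discounted evaluation of the best action in the successor state. The states, actions, and parameter values below are illustrative assumptions.

```python
from collections import defaultdict

def td_update(Q, state, action, reward, next_state, actions,
              alpha=0.5, gamma=0.9):
    """One temporal-difference (Q-learning) backup.

    The difference between the current evaluation Q[(state, action)]
    and the bootstrapped target (reward + discounted best successor
    value) is the TD error; the evaluation moves a fraction alpha of
    the way toward the target.
    """
    best_next = max(Q[(next_state, a)] for a in actions)
    error = reward + gamma * best_next - Q[(state, action)]   # TD error
    Q[(state, action)] += alpha * error
    return Q

Q = defaultdict(float)   # all evaluations start at 0
# Propagate a terminal payoff of 1.0 backward through a two-step sequence.
td_update(Q, 1, "go", 1.0, 2, ["go"])   # final step: sees the payoff directly
td_update(Q, 0, "go", 0.0, 1, ["go"])   # earlier step: learns via its successor
```

After the first update the final step's evaluation rises toward the payoff; after the second, part of that value flows back to the preceding step. Repeated over many episodes, this incremental backward flow is what performs temporal credit assignment without a teacher.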
In terms of learning generic knowledge or generic rules for such tasks, however, the characteristics of the task render most existing rule learning algorithms inapplicable. This is because they require either preconstructed exemplar sets (thus learning is not online; Michalski 1983, Quinlan 1986), incrementally given consistent instances (Mitchell 1982, Fisher 1986, Utgoff 1989), or complex manipulations of learned structures