
Copyright © IEEE 1995.

Appears in Proc. 1995 IEEE Int. Conf. Neural Networks, pp. 1338-1341.

Minimisation of Data Collection by Active Learning

Tirthankar RayChaudhuri and Leonard G.C. Hamey

[email protected]

School of MPCE, Macquarie University, New South Wales 2109, Australia

ABSTRACT

We use the `query-by-committee' approach for building an active scheme for data collection. In this method data gathering is reduced to a minimum, yet modelling accuracy is uncompromised. Our active querying criterion is determined by whether or not several models agree when they are fitted to random subsamples of a small amount of collected data. Experiments with neural network models to establish the feasibility of our algorithm have produced encouraging results.

1. Introduction: Active Learning

The traditional approach to studying generalisation has been through random examples. It has been found, however, that random examples contain progressively less information as learning proceeds [7, 11]. The quest for more reliable learning techniques has led researchers to examine statistical active querying as a means of obtaining training data that will produce improved generalisation [6]. Such a query-based training process is often called `active learning'.

1.1. Motivation

The driving forces behind research in active learning algorithms are minimisation of both generalisation error and data sampling. These twin goals are apparently contradictory. Data sampling involves both collection and measurement of data. This is expensive and therefore needs to be reduced as much as possible. Our proposed algorithm [8] suggests a means of achieving both the above objectives simultaneously.

2. An Algorithm to Minimise Data Collection

We have used the `query-by-committee' approach [3, 9]. If several models are fitted to random subsamples of a small amount of initially collected data, they will probably disagree in the first instance. If we add minimal additional data to our sample according to some defined criterion and repeat the process of having several models examine random subsamples, then after several iterations of this process the models must agree closely at some stage. Our ideas are based upon active learning concepts introduced by Cohn et al. [1, 2] and Krogh and Vedelsby [5]. The emphasis of our algorithm

[Flowchart]
1. Randomly collect a small data sample S.
2. Randomly subsample several equal sets from S.
3. Fit several models with the subsampled data sets.
4. Do the models agree sufficiently?
   Yes: Stop. S contains enough information for learning.
   No: Collect another point from the system and add it to S, then return to step 2.

Fig. 1: Active Learning Algorithm to Minimise Data Collection
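
The loop in Fig. 1 can be sketched in a few lines of Python. This is only an illustrative sketch under several assumptions that are not taken from the paper: straight-line least-squares fits stand in for the neural network models, agreement is measured as the largest spread of committee predictions over a grid of candidate inputs, and the next point is collected at the input where that spread is largest (the figure says only that another point is collected). The names `fit_line`, `committee_spread` and `collect_actively`, and the parameters `tolerance`, `subsample_fraction` and `max_points`, are hypothetical.

```python
import math
import random


def fit_line(points):
    """Least-squares straight line through (x, y) pairs; returns a predictor.

    Assumption: a straight line stands in for the paper's neural networks.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0.0:
        # Degenerate subsample (all inputs equal): fall back to a constant.
        return lambda x: mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / sxx
    return lambda x: mean_y + slope * (x - mean_x)


def committee_spread(models, x):
    """Disagreement at x: range of the committee's predictions (assumed measure)."""
    predictions = [model(x) for model in models]
    return max(predictions) - min(predictions)


def collect_actively(system, candidates, n_initial=4, n_models=5,
                     subsample_fraction=0.8, tolerance=0.1,
                     max_points=50, seed=0):
    """Grow a sample S until models fitted to random subsamples of S agree.

    `system(x)` performs one (expensive) measurement at input x.
    Returns the collected sample and the final largest disagreement.
    """
    rng = random.Random(seed)
    # Randomly collect a small data sample S.
    sample = [(x, system(x)) for x in rng.sample(candidates, n_initial)]
    while True:
        # Randomly subsample several equal-sized sets from S and fit a model to each.
        size = max(2, int(subsample_fraction * len(sample)))
        models = [fit_line(rng.sample(sample, size)) for _ in range(n_models)]
        # Do the models agree sufficiently?
        spreads = [committee_spread(models, x) for x in candidates]
        worst = max(range(len(candidates)), key=spreads.__getitem__)
        if spreads[worst] <= tolerance or len(sample) >= max_points:
            return sample, spreads[worst]
        # No: collect another point from the system and add it to S.
        # Assumption: query where the committee disagrees most.
        x_new = candidates[worst]
        sample.append((x_new, system(x_new)))


if __name__ == "__main__":
    grid = [i / 20 for i in range(21)]
    sample, spread = collect_actively(lambda x: math.sin(3.0 * x), grid)
    print(len(sample), "points collected; final disagreement", spread)
```

The `max_points` cap is a safeguard added for the sketch: when the model class cannot represent the system well, the committee may never fall below `tolerance`, and the loop would otherwise keep collecting data.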