
Copyright © IEEE 1995.

Appears in Proc. 1995 IEEE Int. Conf. Neural Networks, pp. 1338-1341.

Minimisation of Data Collection by Active Learning

Tirthankar RayChaudhuri and Leonard G.C. Hamey

[email protected]

School of MPCE, Macquarie University, New South Wales 2109, Australia


We use the `query-by-committee' approach to build an active scheme for data collection. With this method, data gathering is reduced to a minimum while modelling accuracy is uncompromised. Our active querying criterion is whether or not several models agree when fitted to random subsamples of a small amount of collected data. Experiments with neural network models to establish the feasibility of our algorithm have produced encouraging results.

1. Introduction: Active Learning

The traditional approach to studying generalisation has been through random examples. It has been found, however, that random examples contain progressively less information as learning proceeds [7, 11]. The quest for more reliable learning techniques has led researchers to examine statistical active querying as a means of obtaining training data that will produce improved generalisation [6]. Such a query-based training process is often called `active learning'.

1.1. Motivation

The driving forces behind research in active learning algorithms are the minimisation of both generalisation error and data sampling. These twin goals are apparently contradictory. Data sampling involves both the collection and measurement of data, which is expensive and therefore needs to be reduced as much as possible. Our proposed algorithm [8] suggests a means of achieving both of the above objectives simultaneously.

2. An Algorithm to Minimise Data


We have used the `query-by-committee' approach [3, 9]. If several models are fitted to random subsamples of a small amount of initially collected data, they will probably disagree at first. If we add minimal additional data to our sample according to some defined criterion and repeat the process of having several models examine random subsamples, then after several iterations the models must eventually agree closely. Our ideas are based upon active learning concepts introduced by Cohn et al. [1, 2] and Krogh and Vedelsby [5]. The emphasis of our algorithm
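The committee-disagreement criterion described above can be sketched as follows. This is a minimal illustration only: it uses low-order polynomial committee members fitted with NumPy in place of the neural network models used in the paper, and the function name and parameters are hypothetical.

```python
import numpy as np

def committee_disagreement(x, y, x_candidates, n_models=5,
                           subsample_frac=0.7, seed=None):
    """Fit several models to random, equal-sized subsamples of the
    collected data (x, y) and return the variance of their predictions
    at each candidate input, as a proxy for committee disagreement."""
    rng = np.random.default_rng(seed)
    n = len(x)
    k = max(3, int(subsample_frac * n))          # size of each subsample
    preds = []
    for _ in range(n_models):
        idx = rng.choice(n, size=k, replace=False)    # random subsample of S
        coeffs = np.polyfit(x[idx], y[idx], deg=2)    # one committee member
        preds.append(np.polyval(coeffs, x_candidates))
    return np.var(preds, axis=0)   # high variance = strong disagreement
```

Where the committee members agree, this variance is small; a large variance at some candidate input indicates a region where the current sample is uninformative.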

Fig. 1: Active Learning Algorithm to Minimise Data Collection. The flowchart proceeds as follows: (1) randomly collect a small data sample S; (2) randomly subsample several equal sets from S; (3) fit several models with the subsampled data sets; (4) if the models agree, stop, since S contains enough information for learning; otherwise, collect another point from the system, add it to S, and return to step (2).
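The loop of Fig. 1 can be sketched end to end as below. This is a minimal, self-contained illustration under stated assumptions: polynomial committee members stand in for the paper's neural networks, candidate queries are drawn from a finite pool `x_pool`, the point of maximum committee variance is queried next, and all names, tolerances, and parameters are hypothetical.

```python
import numpy as np

def active_learn(target, x_pool, n_init=5, n_models=5, tol=1e-3,
                 max_queries=50, seed=0):
    """Grow the sample S only while committee members fitted to random
    subsamples of S still disagree; stop once they agree closely."""
    rng = np.random.default_rng(seed)
    xs = list(rng.choice(x_pool, size=n_init, replace=False))  # initial S
    ys = [target(x) for x in xs]
    for _ in range(max_queries):
        x_arr, y_arr = np.array(xs), np.array(ys)
        k = max(3, int(0.7 * len(xs)))           # subsample size
        preds = []
        for _ in range(n_models):
            idx = rng.choice(len(xs), size=k, replace=False)
            c = np.polyfit(x_arr[idx], y_arr[idx], deg=2)  # committee member
            preds.append(np.polyval(c, x_pool))
        disagreement = np.var(preds, axis=0)
        if disagreement.max() < tol:             # models agree: stop
            break
        # Otherwise query the system where the committee disagrees most
        x_new = x_pool[int(np.argmax(disagreement))]
        xs.append(x_new)
        ys.append(target(x_new))
    return np.array(xs), np.array(ys)
```

A usage example: `active_learn(lambda x: np.sin(3 * x), np.linspace(-1, 1, 101))` returns the queried inputs and their measured outputs, with the final sample concentrated where the committee was least certain.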