Copyright © IEEE 1995.
Appears in Proc. 1995 IEEE Int. Conf. Neural Networks, pp. 1338-1341.
Minimisation of Data Collection by Active Learning
Tirthankar RayChaudhuri and Leonard G.C. Hamey
School of MPCE, Macquarie University, New South Wales 2109, Australia
We use the `query-by-committee' approach to build an active scheme for data collection. In this method, data gathering is reduced to a minimum while modelling accuracy is uncompromised. Our active querying criterion is whether or not several models agree when they are fitted to random subsamples of a small amount of collected data. Experiments with neural network models to establish the feasibility of our algorithm have produced encouraging results.
1. Introduction: Active Learning
The traditional approach to studying generalisation has been through random examples. It has been found, however, that random examples contain progressively less information as learning proceeds [7, 11]. The quest for more reliable learning techniques has led researchers to examine statistical active querying as a means of obtaining training data that will produce improved generalisation. Such a query-based training process is often called `active learning'.
The driving forces behind research in active learning algorithms are minimisation of both generalisation error and data sampling. These twin goals are apparently contradictory. Data sampling involves both collection and measurement of data. This is expensive and therefore needs to be reduced as much as possible. Our proposed algorithm suggests a means of achieving both the above objectives simultaneously.
2. An Algorithm to Minimise Data
We have used the `query-by-committee' approach [3, 9]. If several models are fitted to random subsamples of a small amount of initially collected data, they will probably disagree in the first instance. If we add minimal additional data to our sample according to some defined criterion and repeat the process of having several models examine random subsamples, then after several iterations the models will agree closely. Our ideas are based upon active learning concepts introduced by Cohn et al. [1, 2] and Krogh and Vedelsby. The emphasis of our algorithm
[Figure: flowchart of the algorithm. Randomly collect a small data sample S; subsample several equal sets from S; fit several models with the subsampled data sets; do the models agree? If not, collect another point from the system, add it to S, and repeat; if so, stop (S contains the collected data).]
Fig. 1: Active Learning Algorithm to Minimise Data Collection
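The loop of Fig. 1 can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: the committee members here are polynomial fits rather than neural networks, and the disagreement threshold, committee size, subsample fraction, candidate grid, and the `system` function standing in for the real data source are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def system(x):
    # Hypothetical stand-in for the real system being sampled.
    return np.sin(x)

def fit_model(xs, ys, degree=3):
    # One committee member: a polynomial fitted to a random subsample.
    # (The paper uses neural network models; a polynomial keeps the sketch short.)
    return np.polynomial.Polynomial.fit(xs, ys, degree)

# Step 1: randomly collect a small initial data sample S.
X = rng.uniform(0.0, np.pi, size=6)
Y = system(X)

grid = np.linspace(0.0, np.pi, 200)   # candidate query points (assumed)
threshold = 0.05                      # agreement tolerance (assumed)
committee_size = 5
subsample_frac = 0.7

for _ in range(100):
    # Steps 2-3: fit several models to equal-sized random subsamples of S.
    k = max(4, int(subsample_frac * len(X)))
    models = []
    for _ in range(committee_size):
        idx = rng.choice(len(X), size=k, replace=False)
        models.append(fit_model(X[idx], Y[idx]))
    # Pointwise spread of committee predictions measures disagreement.
    preds = np.stack([m(grid) for m in models])
    spread = preds.std(axis=0)
    # Step 4: if the models agree everywhere, stop; S is the final sample.
    if spread.max() < threshold:
        break
    # Step 5: otherwise query the system where disagreement is largest
    # and add the new point to S.
    x_new = grid[spread.argmax()]
    X = np.append(X, x_new)
    Y = np.append(Y, system(x_new))
```

After the loop, `X` holds the minimal sample the procedure settled on; the choice of where to query (maximum committee spread) is one plausible reading of "some defined criterion" in the text.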