In: Proceedings of Language Engineering Convention - Paris 6-7 July 1994, pp. 17-22, ELSNET, Edinburgh, 1994
Test Suites for Natural Language Processing
Lorna Balkan*, Klaus Netter‡, Doug Arnold*, Siety Meijer*
* University of Essex, United Kingdom
‡ DFKI GmbH, Saarbrücken, Germany
This paper describes the LRE project TSNLP (Test Suites for Natural Language Processing), which is concerned with some central issues in the design and use of test suites. The project combines theoretical research with practical implementations, aiming to provide generally usable tools and test data together with reports discussing the theoretical background. The paper begins by setting out the motivation, aims, and present state of the project, then examines the methodological issues behind it.¹
In a Natural Language Processing context, a test suite is a more or less systematic collection of specially constructed linguistic expressions (test items, e.g. sentences), perhaps with associated annotations and descriptions. Test suites have long been accepted in the NLP community as a useful tool for diagnostic evaluation. However, most existing test suites were written for specific systems or simply contain a number of "interesting examples". This does not meet the demand for large, systematic, well-documented and annotated collections of linguistic material, which a growing number of NLP applications require. Collections of this type are useful not only as diagnostic tools but can also support other kinds of evaluation. Last but not least, large data collections could also serve as repositories of linguistic phenomena for developers.
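As a rough illustration of this definition, a test item can be pictured as a string paired with annotations that make it retrievable. The following Python sketch is not the TSNLP annotation scheme: the field names (`text`, `well_formed`, `phenomena`), the phenomenon labels and the example sentences are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TestItem:
    """One annotated test item; every field name here is hypothetical."""
    text: str
    well_formed: bool
    phenomena: frozenset = field(default_factory=frozenset)


def select(suite, phenomenon, well_formed=None):
    """Return items annotated with `phenomenon`.

    If `well_formed` is given, keep only items with that grammaticality flag.
    """
    return [
        item for item in suite
        if phenomenon in item.phenomena
        and (well_formed is None or item.well_formed == well_formed)
    ]


# A tiny invented suite: a well-formed item, an ill-formed variant of it,
# and an item that combines two phenomena.
suite = [
    TestItem("The manager works.", True,
             frozenset({"subject-verb agreement"})),
    TestItem("The manager work.", False,
             frozenset({"subject-verb agreement"})),
    TestItem("The manager who works sleeps.", True,
             frozenset({"subject-verb agreement", "relative clause"})),
]

ill_formed = select(suite, "subject-verb agreement", well_formed=False)
```

With annotations of this kind, a user can extract exactly the items relevant to one diagnostic question (here, ill-formed agreement data) instead of reading through an unstructured list of examples.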
The TSNLP project addresses a range of issues related to the construction and use of test suites. The main goals of the project are to:
1. Provide guidelines for test suite construction
Guidelines will help the test suite constructor to write coherent and systematic test data. They address issues such as the identification of the linguistic phenomena and sub-phenomena to be tested, the choice of vocabulary used in the test data, how to derive ill-formed data, and to what extent the interaction between different phenomena should be tested.
2. Construct substantial test fragments
¹ We would like to thank our partners in the TSNLP project: Dominique Estival, Kirsten Falkedal and Sabine Lehmann at ISSCO, Geneva, and Sylvie Regnier-Prost and Eva Dauphin at Aérospatiale, Paris.
Test items will be written for parsers, grammar checkers and controlled language checkers. They will cover a number of pre-selected phenomena and will be made available in three languages: English, French and German. Crucially, the test data will be annotated in a way that considerably increases their informativeness.
3. Develop a database
The test items will be stored in a database, where the annotations on the test items allow easy access to and manipulation of the data. Work done at DFKI Saarbrücken will serve as the starting point.
4. Investigate test suite construction tools
The project aims to develop methods for automating some of the processes involved in test suite construction. The tools considered here are a test suite generation tool (from grammars and corpora), and an automatic lexicon replacement tool.
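To make the idea of automatic lexicon replacement more concrete, here is a minimal Python sketch. It is not the tool planned by the project: it assumes whitespace-separated tokens and a hand-written one-to-one substitution table (the function name `replace_lexicon` and the vocabulary are invented), and it ignores morphology, capitalisation and agreement, all of which a real tool would have to handle.

```python
def replace_lexicon(sentence, substitutions):
    """Swap every token that has an entry in `substitutions`.

    Trailing punctuation is split off before the lookup and re-attached
    afterwards, so "report." is matched by the entry for "report".
    """
    out = []
    for token in sentence.split():
        core = token.rstrip(".,;:!?")
        tail = token[len(core):]
        out.append(substitutions.get(core, core) + tail)
    return " ".join(out)


# Hypothetical mapping from general vocabulary to a domain vocabulary.
substitutions = {"manager": "engineer", "report": "checklist"}

items = ["The manager reads the report.", "The manager sleeps."]
adapted = [replace_lexicon(s, substitutions) for s in items]
```

The point of such a tool is that the syntactic skeleton of each test item, and therefore the phenomenon it tests, stays fixed while the vocabulary is adapted to a new domain or system lexicon.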
The guidelines and test data will be validated by a testing phase.
The results of the project will be placed in the public domain. It is expected that the availability of validated test suites for a range of applications, together with the tools for their manipulation and use, will be of great value to the NLP community.
Research to Date
During the first phase of the project, a study of existing, publicly available test suites was carried out. The study revealed that test suites vary in evaluation purpose, intended application (parsers, MT systems, etc.), and depth and breadth of coverage.
Existing test suites display a degree of systematicity, in that each test item generally contains only one linguistic phenomenon more than those previously tested. They are primarily concerned with the coverage of syntactic phenomena, often include (some) ill-formed data, and have restricted lexical coverage. Some general shortcomings of existing test suites are:
• Lack of morphological, semantic and extragrammatical phenomena.
• Lack of systematicity in testing ill-formed constructions and the co-occurrence of different phenomena.
• Lack of documentation and annotation. This is