published in:
Proceedings of the Seventh International
Working Conference on Scientific and Statistical
Database Management (September
28-30, 1994, Charlottesville, Virginia)
Abstract
Interest in database support for scientific information management has grown considerably in recent years. In this paper, we look into one specific scientific area, namely molecular biology. An analysis of the current molecular biology working environment, which consists of various independent software components, reveals several shortcomings and thus leads to requirements for an integrated working environment. We propose an architecture for an integrated system, the centre of which is a federated database system, the data repository. Its main benefits are central data administration, a uniform data representation, the possibility of data reuse, and ad hoc query facilities. In addition, new application tools can be implemented on top of the repository easily and thus much faster than today.
1 Introduction
Due to advances in technology, instruments used in scientific laboratories are becoming more complex and more automated. Scientists such as physicists, chemists, and biologists who use these instruments produce more and more data in ever shorter periods of time, and their requirements concerning the quality, correctness, and accuracy of data are increasing. New techniques have made it possible to process and store data in a way and at a cost never before achievable. Today, however, data produced in experiments and used by analysis software is stored in databases that are only rarely controlled by a database management system (DBMS). Software dealing with such scientific data can be obtained as public domain or commercial tools. In most cases, however, it is written from scratch for one specific application, and sometimes it even has to be rewritten for every new experiment. The reason is that similar data is not stored in any uniform structure, but has to be read from files that change from experiment to experiment. Obviously, the use of database management systems could provide considerable improvement, and thus researchers have started to investigate the area of scientific database management. Some of them look at the nature of scientific data and split it into categories according to the degree of analysis it has undergone, e.g., raw data (data that has not yet been processed), calibrated data (data that has passed some preprocessing), and more ([9], [22], [23]). Others take concrete scientific applications like ecology [6], medicine [12], chemistry [7], geosciences [11], or physics [19], and try to solve their specific problems.
Data from different scientific areas has very little in common. Even the above-mentioned categorization is not general enough to be valid for all kinds of scientific data. From the point of view of database systems, general facilities like persistence, concurrency, or recovery are needed for any application, while the required data models, query languages, and storage strategies are rather specific to each scientific area. Compare, for example, economics and geosciences. While the latter needs concepts for the representation of spatial data, the former requires appropriate constructs for modelling time series. Given such diversity, it does not seem very meaningful to investigate database support for scientific applications in general. Rather, we think it is much more promising to look at one particular area at a time; in our case, this is molecular biology (MB).
In this paper, after a short introduction to the application field, we analyse the current work flow in an MB laboratory in order to identify its shortcomings: a vast number of independent software tools and databases (and thus no global overview of existing data), high redundancy, no reuse of result data, and no book-keeping of performed experiments. Starting out from these weaknesses and the characteristics of MB data, we derive requirements for an integrated working environment. Further, we propose an architecture for an integrated system, called Moby Dick¹, the centre of which, and thus the most important part, is a federated database system, the data repository, that is responsible for the management of data arising in a biological research environment. What we achieve in particular
1 Molecular biology federated DBMS-based integrated computer-supported working environment
A Federated DBMS-Based Integrated Environment for Molecular Biology
Barbara Rieche, Klaus R. Dittrich
Database Technology Research Group
Institut für Informatik, Universität Zürich
Email: {rieche, dittrich}@ifi.unizh.ch