DataFoundry: Information Management for Scientific Data
Terence Critchlow, Krzysztof Fidelis, Ron Musick, Tom Slezak
Center for Applied Scientific Computing
Lawrence Livermore National Laboratory
P.O. Box 808, L-561, Livermore, CA 94551
{critchlow, fidelis, rmusick, slezak}@llnl.gov
and
Madhavan Ganesh
Gene Logic
ganesh@cs.umn.edu
Abstract
Data warehouses and data marts have been successfully applied to a
multitude of commercial business applications. They have proven to be
invaluable tools by integrating information from distributed,
heterogeneous sources and summarizing this data for use throughout the
enterprise. Although the need for information dissemination is as
vital in science as in business, working warehouses in this community
are scarce because traditional warehousing techniques do not
transfer to scientific environments. There are two primary reasons for
this difficulty. First, schema integration is more difficult for
scientific databases than for business sources, because of the
complexity of the concepts and the associated relationships. While
this difference has not yet been fully explored, it is an important
consideration when determining how to integrate autonomous
sources. Second, scientific data sources have highly dynamic data
representations (schemata). When a data source participating in a
warehouse changes its schema, both the mediator transferring data to
the warehouse and the warehouse itself need to be updated to reflect
these modifications. The cost of repeatedly performing these updates
in a traditional warehouse, as is required in a dynamic environment,
is prohibitive. This paper discusses these issues within the context
of the DataFoundry project, an ongoing research effort at LLNL.
DataFoundry uses a unique integration strategy that identifies
corresponding instances while preserving the differences between data
from different sources, together with a novel architecture and an
extensive meta-data infrastructure that reduce the cost of maintaining
a warehouse.
To appear in
IEEE Transactions on Information Technology in Biomedicine, Volume 4, Number 1, March 2000.