Detecting Data and Schema Changes in Scientific Documents
Nabil Adam, Igg Adiwijaya
CIMIC - Rutgers University
180 Univ. Ave, Newark, NJ 07102
{adam, gusadi}@cimic.rutgers.edu
Terence Critchlow, Ron Musick
Center for Applied Scientific Computing
Lawrence Livermore National Laboratory
P.O. Box 808, L-561, Livermore, CA 94551
{critchlow,rmusick}@llnl.gov
Abstract
Data stored in a data warehouse must be kept consistent and up-to-date
with respect to the underlying information sources. By providing the
capability to identify, categorize and detect changes in these
sources, only the modified data needs to be transferred and entered
into the warehouse. The alternative, periodically reloading from
scratch, is obviously inefficient. When the schema of an information
source changes, all components that interact with, or make use of,
data originating from that source must be updated to conform. The
change detection problem is that of detecting data and schema
changes by comparing two versions of the same semi-structured
document. In this paper, we present an approach to detecting data
and schema changes for scientific documents. Scientific data is of
particular interest because it is normally stored as a semi-structured
document, and undergoes frequent schema updates. This paper
demonstrates the use of a graph to represent semi-structured
documents in general, and scientific documents in particular,
together with their schemas. It also demonstrates an approach to
efficiently detecting data and schema changes by merging the
detection with the parsing of the document.
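
To make the distinction between data changes and schema changes
concrete, the following is a minimal sketch and not the algorithm
presented in this paper. It assumes that each version of a document
has already been parsed into nested Python dictionaries, a simple
stand-in for the graph representation. The function name
diff_documents, the field names, and the rule that an added or
removed field counts as a schema change are all illustrative
assumptions.

```python
def diff_documents(old, new, path=""):
    """Compare two versions of a document held as nested dicts.

    Returns (data_changes, schema_changes).
    Assumption: a field present in only one version, or a field that
    switches between a leaf value and a nested structure, is treated
    as a schema change; a differing leaf value is a data change.
    """
    data_changes, schema_changes = [], []
    for key in sorted(set(old) | set(new)):
        here = f"{path}/{key}"
        if key not in new:
            schema_changes.append(("removed", here))
        elif key not in old:
            schema_changes.append(("added", here))
        elif isinstance(old[key], dict) and isinstance(new[key], dict):
            # Both versions have a nested structure here: recurse.
            d, s = diff_documents(old[key], new[key], here)
            data_changes.extend(d)
            schema_changes.extend(s)
        elif isinstance(old[key], dict) != isinstance(new[key], dict):
            schema_changes.append(("restructured", here))
        elif old[key] != new[key]:
            data_changes.append((here, old[key], new[key]))
    return data_changes, schema_changes


# Hypothetical record: two versions of the same entry.
old_version = {"entry": {"id": "P1", "length": 120, "source": "human"}}
new_version = {"entry": {"id": "P1", "length": 125, "organism": "human"}}

data, schema = diff_documents(old_version, new_version)
print(data)    # [('/entry/length', 120, 125)]
print(schema)  # [('added', '/entry/organism'), ('removed', '/entry/source')]
```

In this sketch only the changed length value would need to be
transferred to the warehouse, while the added and removed fields
signal that components depending on the source schema may need to
be updated.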
Appeared in
IEEE Advances in Digital Libraries ADL-2000. May 2000. Bethesda MD.