Detecting Data and Schema Changes in Scientific Documents

Nabil Adam, Igg Adiwijaya
CIMIC - Rutgers University
180 Univ. Ave, Newark, NJ 07102
{adam, gusadi}@cimic.rutgers.edu

Terence Critchlow, Ron Musick
Center for Applied Scientific Computing
Lawrence Livermore National Laboratory
P.O. Box 808, L-561, Livermore, CA 94551
{critchlow,rmusick}@llnl.gov

Abstract

Data stored in a data warehouse must be kept consistent and up to date with respect to the underlying information sources. If changes in these sources can be identified, categorized, and detected, only the modified data needs to be transferred and entered into the warehouse. The alternative, periodically reloading from scratch, is inefficient. When the schema of an information source changes, all components that interact with, or make use of, data originating from that source must be updated to conform. The change detection problem is the problem of detecting data and schema changes by comparing two versions of the same semi-structured document. In this paper, we present an approach to detecting data and schema changes in scientific documents. Scientific data is of particular interest because it is normally stored as semi-structured documents and undergoes frequent schema updates. This paper demonstrates the use of a graph to represent scientific documents in particular, and semi-structured documents in general, as well as their schemas. It also demonstrates an approach to detecting data and schema changes efficiently by merging the detection with the parsing of the document.
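
To make the distinction between data changes and schema changes concrete, the following is a minimal illustrative sketch and not the algorithm presented in the paper. It assumes a document version has already been parsed into nested Python dictionaries (a tree, which is a simple special case of a graph), and it treats a field that appears or disappears between versions as a schema change. The function name `diff`, the change labels, and the sample records `v1` and `v2` are all hypothetical.

```python
from typing import Any, List, Tuple

# (kind, path, old value, new value); the labels are illustrative assumptions.
Change = Tuple[str, str, Any, Any]


def diff(old: Any, new: Any, path: str = "") -> List[Change]:
    """Compare two versions of a document held as nested dicts."""
    changes: List[Change] = []
    if isinstance(old, dict) and isinstance(new, dict):
        # Walk the union of field names so additions and removals are both seen.
        for key in sorted(old.keys() | new.keys()):
            child = f"{path}/{key}"
            if key not in new:
                changes.append(("schema:removed", child, old[key], None))
            elif key not in old:
                changes.append(("schema:added", child, None, new[key]))
            else:
                changes.extend(diff(old[key], new[key], child))
    elif isinstance(old, dict) != isinstance(new, dict):
        # A leaf became a subtree or vice versa: the structure itself changed.
        changes.append(("schema:retyped", path, old, new))
    elif old != new:
        # Same structure, different leaf value: a plain data change.
        changes.append(("data:changed", path, old, new))
    return changes


# Hypothetical sample versions of one record.
v1 = {"entry": {"id": "P1", "length": 120, "organism": "E. coli"}}
v2 = {"entry": {"id": "P1", "length": 125, "taxonomy": {"organism": "E. coli"}}}

for change in diff(v1, v2):
    print(change)
```

In this sketch the two versions are parsed first and compared afterwards. The abstract describes merging detection with parsing, which this example does not attempt to reproduce.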

Appeared in

IEEE Advances in Digital Libraries ADL-2000. May 2000. Bethesda MD.
