Tolkin talking -- Software and Speculations

My patent

"Using Indexes to Retrieve Stored Information" U.S. Patent 6,466,942 Issued Oct. 15, 2002. This patent describes a way to add columns to an index to improve the performance of certain queries that use index only access. This has been implemented in IBM's DB/2 dbms using the INCLUDE clause of the CREATE INDEX statement. It has also been implemented in Microsoft's SQL Server dbms. The INCLUDE clause adds extra columns that are not part of the uniqueness constraint.

My academic paper

Steven Tolkin, "Aggregation Everywhere: Data reduction and transformation in the Phoenix data warehouse", Proceedings of DOLAP '99, ACM Second International Workshop on Data Warehousing and OLAP, November 6, 1999, Kansas City, Missouri. Because I work in industry it is not that easy to get an academic paper published. (In fact this is the only one I have written.) So I put as much content as possible into this paper, explaining in detail the transformations, aggregations, and other processing needed to create a data warehouse from a record keeping system. This paper is cited by several patents.

Historically first paper on bitwise storage in databases

The TAXIR system was probably the first database to use bitwise storage. The bitwise storage technique is similar to bit-map indexes, in that each item is represented using bits, rather than a sequence of contiguous bytes, as in most conventional databases. They also share the common characteristic that the data is organized by columns (attributes), rather than by rows (records). The key difference is that in bitwise storage there is only one copy of the data, whereas a bit-map index is another physically stored access structure (the "index") in addition to the storage used for the "base table".

Several modern systems including Sybase IQ and Nucleus use this technique, which can dramatically improve performance by reducing both space (memory) and time. In fact both Sybase IQ and Nucleus have been granted patents in this area. But the basic ideas date back more than thirty years. Another very early system that used bit-map indexes (although not bitwise storage) is Model 204 from CCA, which is still being used today.

"The Theory of the TAXIR Accessioner" by George F. Estabrook and Robert C. Brill was published in Mathematical Biosciences 5 (1969) pp. 327-340. At the time of the publication the system had already been operational for several years, at the University of Colorado. Development was later based at the University of Michigan Computing Center. TAXIR was deployed at the several dozen sites running the MTS operating system, and became defunct along with it in the early 1990's. Note that TAXIR predates the relational model of databases. It was a "flat-file" system that (in modern terms) could do select and project, but not join.

Because this paper was published in a journal read primarily by biologists it was largely overlooked by the computer science community. It is not included in the large DBLP Computer Science Bibliography. Nor is this paper included in Gio Wiederhold's database bibliography "from the dark ages to 1991". The paper is still of interest today in part because of the recent activity in systems using bitwise storage. The paper differs from many contemporary publications in the computer science literature in that it both gives the details of the data structures and algorithms needed to implement the system, and also proves its claims for correctness, completeness, and (nearly) optimal space utilization.

I am making the paper available in several formats, due to the varying capabilities of browsers, etc. (Thanks to Elsevier for granting permission.) The first format represents certain special characters using the HTML 4.0 standard entity names. However not all the browsers fully support this standard yet, including Netscape 4 and Microsoft Internet Explorer 5.

To decide which format to use look here: ⊕

If you see a small plus sign in a circle then you can use format 1. If instead you see the literal string + or a small box, or anything else, then you should use format 2 when in a browser. Format 3, Postscript, will look the best, provided you have a way to print and/or view a PostScript file. In particular the "overscore" character is correctly displayed as a little bar directly over the character it applies to, as in conventional mathematics, rather than before it, as in the browser. However this format needs to be printed to a Postscript printer, or viewed using a tool such as Ghostscript or Adobe Acrobat. (The PostScript file was generated using the Perl script html2ps. In all three formats the rendering of the equations in the Appendix is not perfect, because there is not yet a standard way to markup math. Maybe later this could be done using MathML.)

It is available in the following formats:
  1. HTML using entities
  2. HTML using symbol font
  3. Postscript
  4. Word for Windows 95 (but not up to date!)

Here is my writeup that describes the bitwise storage technique, as used by TAXIR. This does not describe the exact techniques used by other systems.


Useful Software and Links


Speculations


Humor


Last changed: 2008-01-13. See changes

Please send any comments, questions, or suggestions to:

tolkin.family@comcast.net

Copyright 1999-2008 by Steven E. Tolkin