The bitwise storage technique: an introduction

Steven Tolkin

The bitwise storage technique is an unconventional and high-performance way to represent data, e.g. in a database table.

It has the following properties:

The representation of a field in memory has two components: a two-way dictionary and some bitstrings.

The two-way dictionary is simple.

Consider a "marital status" field. Its dictionary might look like:

In conventional data storage the bits comprising a single value are stored in contiguous memory locations. However the bitwise technique uses a "transposed" representation. Suppose that B bits are needed to represent the distinct values for a field, and there are R records in the table.

Note that the bitstring for bit 0 need not be near the bitstring for bit 1. This approach allows records to be added, or for a domain to grow bits, without the table needing any physical reorganization. An image may help the reader: The values are upside-down, like bats in a cave, hanging from their low order bits.

For example, suppose that the first few records describe people who are single, married, widowed, divorced, unknown, i.e. the integers 1,2,3,4,0. Assume the machine has 32 bits per word. Then part of the representation in memory would be:

Record: 1 2 3 4 5 ... 32 33 ... 64 ...
bit: 0  1 0 1 0 0 ... .. .. ... .. ...     <<< words 0, 1, ...
bit: 1  0 1 1 0 0 ... .. .. ... .. ...     <<< words m, m+1, ...
bit: 2  0 0 0 1 0 ... .. .. ... .. ...     <<< words n, n+1, ...

Suppose the query is to find all the single people. The basic operation on one record is to evaluate the boolean expression: bit 0 & ~ bit 1 & ~ bit 2. Note that this is actually evaluated a word at a time, e.g. on machines having 32 bits per word the processor would evaluate 32 records at a time. Furthermore, any number of processors can work in parallel with no interference.

The above example shows that a selection might need to examine all B bits of a value and perform B*2 operations. Often the boolean expression can be optimized. When a set of values is being searched it may be that certain bits only need to be examined once. For example to find people who are single or divorced the expression is: ~ bit 1 & (bit 0 | bit 2)

The optimization of boolean functions has been an active area of research in computer science. This is sometimes called "Minimization of Boolean functions". If operations other than AND, OR, and NOT are supported by the hardware, e.g. XOR, EQUIV, IMPLY, NAND, etc., then the optimizer has more opportunities to reduce the evaluation cost.

The original query might have included selections on several fields. The operations on each field are combined until a single result bitstring is obtained. Each on bit represents a record that satisfies the query. The number of each on bit directly corresponds to the record number in the table. To determine the values of the projected fields a direct arithmetic calculation is done to determine the integer values, which are then looked up in the dictionary and displayed.

Notice the space savings in the example above. Because the field only has 5 values it needs only 3 bits per record. A conventional system would use at least 8 bits.

A huge space savings is possible in a database where certain strings occur in many records. E.g. suppose the table has a field for first name and "Dave" occurs frequently, but there are only 256 distinct first names in the company. In a conventional rdbms each "Dave" uses 32 bits, but in the bitstring approach each one uses only 8 bits (plus the small overhead for the dictionary). A more realistic example would be the list of states and provinces, which can be represented in 7 bits rather than 2 bytes.

The most dramatic savings occur with the yes/no fields often found on questionnaires and in market research databases. For example the respondent is presented a list of checkboxes for his interests: golf, sailing, tennis, etc. Each interest is a separate field. Using bitwise storage each uses only one bit. In conventional databases these would each consume 8 bits.

A dictionary is not needed for numeric fields. Instead we just remember the "from", "to", and "by" values and map to the integers. For example a field defined in SQL as number(5,2), can take on the values from -999.99 to 999.99 by .01. Suppose this field does allow the null value. In Oracle the storage size depends on the actual number, but most numbers in this range would use 4 bytes. Using the bitwise method these will use only 18 bits, plus one bit for the null value, i.e. a little more than half the space.

Additional notes: The bitstrings need not be contiguous. They can even span several disks. All that is needed is a "directory" which maps from bit numbers to the location of the appropriate substring. Each substring should begin at a disk block size boundary. When each processor decides which piece to do it will round to this boundary.

The project operation can be slow. One approach is for fields that are often projected but rarely selected to be placed in a conventional table, whose key is the record number.

Each field can allow nulls or not. If it allows null this can be implemented either by using a particular value (typically all zero bits) or by an extra bit. The extra bit technique wastes space, but can be sometimes useful when doing inter-field compares or joins, or for numeric fields.

A field's dictionary can be shared with other fields in the same or even other tables. This leads to further space reduction and improved performance. (An inter-field compare is fastest if both fields have the same domain. Otherwise they must be mapped to their external values before being compared.) A somewhat more elaborate dictionary can handle one field's domain being a subset of another's. The dictionary can be used to transform or validate data on the way in or out. For example various forms for state names could be accepted, (e.g. MA, Mass, etc.) but on output the official two letter code would always be used.

It is possible for the result bitstring (or any of the intermediate bitstrings) to be compressed. This is particularly useful if a secondary index is built.

There is one extra bitstring in the table used for record deletions. Normally a delete involves just turning on that record's delete bit. An insertion always goes at the end (or at the end of some part for parallel insertions). This implies a small amount of wasted space after deletions, at least until the subsequent physical reorganization. An alternative strategy is for each insert to locate a deleted record to fill in.

Besides the general optimization of the boolean expression other specialized optimizations are also possible. In our example a query that searches for divorced people only needs to check if bit 2 is on. This uses the fact that the highest value in the domain is a power of 2.

Some extra information is needed for record locking, e.g. an extra bitstring or a lock table, depending on the exact strategy used. The exact technique to be used varies, i.e. certain techniques may be appropriate for tables that have inserts but not updates.

(end of document)

Last Changed: January 3, 1999

Version: 1.2

Copyright © 1999 Steven E. Tolkin