Table of Contents
Hash functions are functions that map a bit vector to another bit vector, usually shorter than the original vector and usually of fixed length for a particular function.
There are three primary uses for hash functions:
- Fast table lookup
- Message digests
Fast Table Lookup
Fast table lookup can be implemented using a hash function and a hash table. Elements are found in the hash table by calculating the hash of the element's key and using the hash value as the index into the table. This is clearly faster than other methods, such as examining each element of the table sequentially to find a match.
Message digests allow you to compare two large bit vectors and quickly determine if they are equal. Instead of comparing the vectors bit-by-bit, if the hash values of each bit vector are available you can compare the hash values. If the hash values are different, the original vectors must be different. If the hash values are the same then the original vectors are very likely to be the same if the hash function is good.
Message digests can use either cryptographic or non-cryptographic hash functions. If the purpose of the message digest is to determine if the original message has been tampered with, you would need to use a cryptographic hash function. If you just want to quickly tell if it's the same as another file with a different name (assuming the hash values have already been computed), you can use a non-cryptographic hash function.
Encryption is the transformation of data into a form unreadable by anyone without a secret decryption key. Hash functions play an important role in encryption because it is their properties that cause the encrypted data to be unreadable and the original data to be unrecoverable from the encrypted data without the decryption key.
Hash functions in this context are sometimes given other names such as mixing functions.
Properties of Hash Functions
For a function to be useful as a hash function, it must exhibit the property of uniform distribution. Some hash functions also have the one-way property. If a hash function is to be used for cryptography or for fast table lookup where the nature of the keys is unknown, the one-way property is a requirement.
All good hash functions share the property of collision avoidance, which means that the desired behavior is for unique inputs to provide unique outputs. If the length of the input vector is greater than the length of the output vector it's impossible to avoid all collisions because the set of input vectors will be of greater order than the set of output vectors. The hash function partitions the input set into subsets of input vectors that all produce the same output.
Since collisions cannot be avoided, the goal is to minimize their likelihood given two arbitrary input vectors. Minimization occurs when the size of the largest partition is minimized, that is, when the output vectors are evenly distributed among the corresponding input vectors. When this happens we say the output vectors are uniformly distributed.
Another important goal is that "similar" input vectors produce very different output vectors. For the case of table lookup, table keys are usually either human language strings, index numbers, or other data that exhibit non-randomness. For the case of message digest functions, the goal is to make it as difficult as possible to find two different messages that result in the same hash. A hash function with this property is called a one-way hash function.
Not all applications require hash functions to be one-way, but all cryptographic applications do. For example, a hash value for fast table lookup where the keys are telephone numbers could simply be the last four digits of the phone number. This hash function is suitable because it's likely to be uniformly distributed, but it's clearly not one-way because it is easy to devise a phone number that has a specific hash value.