The Verdict
Now we're ready to find out how well Bob Jenkins did with this mixing function
with regards to achieving avalanche.
Figure 1 shows the two lines of code we need to create an instance of the
Jenkins32 mixing function and call the AvalanceMatrix
method.
Here are the results, shown as percentages from 0 to 100, rounded to 1%:
Each row represents an input bit and each column represents an output bit.
So to determine the probability that bit 6 affects bit 1, look at the row
numbered "6" (the 7th row) and the column numbered "1" and you'll see that
the result is 51%.
The technical term for these results is sweet. We can be pretty confident
that this mixing function will do a good job at collision-avoidance
with 32-bit integers for the purposes of hash table lookup.
But if we examine the data very closely, we do find
that this function doesn't behave quite the same as a function
that satisfies the strict avalanche criterion (SAC) would. This does not
suggest that this is a bad mixing function. It's very good. I'm just using
it as an example to show how to analyze the avalanche behavior.
The numbers in red show the places where the probability differs by at least
4% from the expected value. For example, the 0th bit affects the 31st bit
54% of the time. To tell whether this is significant or whether it's just
a random variation due to the fact that we didn't actually check every
combination of input bits and instead just did a sample, we need to compare
this to a mixing function that satisfies SAC exactly.
Such a function would have exactly a 50% probability in every position, but
because we're sampling there would be variations. Since we did one million
samples it would behave like flipping a coin one million times, and we would
see a binomial distribution. This would look essentially the same as a
normal distribution with a mean of 50 and a standard deviation of 0.05.
But some positions would be worse than others. With 1024 elements (32 × 32), we can
expect that the worst positive deviation is going to be about 3.5 standard
deviations above the mean, or about 0.175. Because this is well under 0.5,
we shouldn't see any non-50's in the table if we round to the nearest 1%.
Since we don't (yet) know of a mixing function that obeys SAC exactly,
we can instead create one that actually does a 50% coin flip (simulated).
This isn't a real mixing function because the outputs don't depend on the
inputs, but it will fool the avalanche calculator into thinking it's real.
Figure 2 shows this mixing function.
I won't show the results of this function here because it would be a
complete waste of bandwidth, but you can run it yourself and see that
it shows 50's in every position, and the raw data shows that the worst
positions probably deviate from this by about 0.175 percentage points
(it varies each time you run it, of course, so your results may differ).
How Bad Can It Get?
If you compare the Jenkins32 function to a more primitive multiplicative
mixer that people have used in real hash functions, you'll see that we
could do much, much worse. Figure 3 shows such a function and a sub-section
of the results matrix. It clearly shows that this hash function is bad
at propagating high-order bits into low-order bit positions. Whether that's
acceptible or not depends on the application. And it's not really
as bad as it seems, because the intended use of this function requires
taking a prime modulus of the final value, which does scramble the lower
bits somewhat.
Can We Do Better?
Is it possible to do better than Jenkins32? The answer is yes.
To illustrate this, let's just run Jenkins32 twice. The first step
already does a great job at mixing, so if we do it twice it should be
even better, right? We can easily do this by calling the AvalancheMatrix
method with a repetitions parameter of 2. And we find that,
in fact, the results are near-perfect. They are practically indistinguishable
from a SAC-perfect mixing function.
Note that this isn't true of every mixing function. With
KnuthMultiplicative, those 0's and 100's never go away because
of where they're positioned.
You might wonder if a mixing function that exactly satisfies SAC even exists.
Once again the answer is yes. Figure 4 shows a 4-bit mixing function with
this property. The truth table of this function is shown so you can confirm
that SAC is actually satisfied.
We know that we can get arbitrarily close just by
concatenating a good-enough function two or more times. The challenge is to
get closer without adding additional computation time.
Mixing Function Search
Warning! I can think of no practical reason why you would need "perfect"
avalanche instead of the excellent avalanche achieved by the Jenkins32
integer hash. The rest of this page is motivated purely by intellectual curiosity.
To create the mixing function of Figure 4 I did a random search of different
permutation tables until I came across one that satisfied the SAC. That works
for a 4-bit function, but a lookup table isn't going to be feasible for a
32-bit or larger function. So we can choose to limit our search to functions
that are a combination of the reversible operations from the previous page.
That's where those "magic numbers" come in. We choose them at random! Some
values will be good, some will be bad, but we have a way to measure to see
which are the best.
But a completely random search will be too slow. Even if we restrict our search
to functions of the same form as Jenkins32, that's still 318
combinations, or almost a million million. And most of those will be
complete rubbish.
Instead what we'll do is start with a known-good function and try modifying each
of the constants one at a time to see if we can improve on the previous avalanche
matrix. In order to compare two functions to determine which is better, we'll need
a single number that summarizes the avalance matrix results. I chose to use the
"sum of squared errors", which just means I take the difference between each matrix
entry and the expected value of 0.5, square it, and then sum all 1024 values
together.
I used 100 thousand trials for the matrix calculation. If you
use too few trials there is too much "noise" in the results and you can't reliably
compare values. The minimum result we can expect to get this way is 0.00256,
which is what you would get if the each matrix value were distributed binomially,
which is what you'd expect from a function that satisfies SAC. Jenkins32
gets about 0.0257, about 10 times the expected squared error.
Figure 5 shows the trajectory of the best "search path" that I've been able to
find using Jenkins32 as the starting point. It shows that we're
able to reduce the squared error essentially equal to the theoretical minimum,
minus some sampling noise.
First page,
Next page