Pluto Scarab — Strict Avalanche Criterion

# Hash Functions (continued)

## The Verdict

Now we're ready to find out how well Bob Jenkins did at achieving avalanche with this mixing function. Figure 1 shows the two lines of code we need to create an instance of the `Jenkins32` mixing function and call the `AvalancheMatrix` method.

Here are the results, shown as percentages from 0 to 100, rounded to 1%:

```
     0  1  2  3  4  5  6  7   8  9 10 11 12 13 14 15  16 17 18 19 20 21 22 23  24 25 26 27 28 29 30 31

 0  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 52 50 50 52 50 54
 1  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 52 50 50 52 50
 2  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 52 50 50 50
 3  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 52 49 50
 4  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 51 50
 5  51 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 51 54
 6  50 51 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50
 7  49 50 51 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50

 8  50 49 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50
 9  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50
10  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 51
11  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50
12  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50
13  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50
14  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50
15  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50

16  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  49 50 50 50 50 50 50 50
17  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 49 50 50 50 50 50 50
18  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 49 50 50 50 50 50
19  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 49 50 50 50 50
20  50 55 50 54 50 50 48 50  51 50 50 50 50 50 50 50  50 50 50 50 51 50 50 50  50 50 50 50 50 50 50 50
21  53 50 54 50 52 50 50 48  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 47 50
22  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50
23  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50

24  50 50 50 50 50 49 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50
25  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 51 50
26  49 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 51 50
27  49 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 51 54
28  50 48 49 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 51  50 50 50 50 50 50 51 46
29  50 50 48 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  52 50 50 50 50 50 49 51
30  48 50 50 48 50 50 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 52 50 50 50 50 50 50
31  50 48 50 50 50 51 50 50  50 50 50 50 50 50 50 50  50 50 50 50 50 50 50 50  50 50 53 50 50 50 49 51
```

Each row represents an input bit and each column represents an output bit. So to determine the probability that bit 6 affects bit 1, look at the row numbered "6" (the 7th row) and the column numbered "1" and you'll see that the result is 51%.

The technical term for these results is sweet. We can be pretty confident that this mixing function will do a good job of collision avoidance with 32-bit integers for the purposes of hash table lookup.

But if we examine the data very closely, we find that this function doesn't behave quite the same as a function satisfying the strict avalanche criterion (SAC) would. That doesn't mean it's a bad mixing function; it's very good. I'm just using it as an example to show how to analyze avalanche behavior.

The outlying entries in the table are the places where the probability differs by at least 4% from the expected value. For example, the 0th bit affects the 31st bit 54% of the time. To tell whether this is significant, or just random variation due to the fact that we sampled combinations of input values rather than checking them all, we need to compare this to a mixing function that satisfies SAC exactly.

Such a function would have exactly a 50% probability in every position, but because we're sampling there would be variations. Since we did one million samples, each entry would behave like flipping a coin one million times, and we would see a binomial distribution. This would look essentially the same as a normal distribution with a mean of 50 and a standard deviation of 0.05 percentage points. But some positions would be worse than others. With 1024 entries (32 × 32), we can expect the worst positive deviation to be about 3.5 standard deviations above the mean, or about 0.175. Because this is well under 0.5, we shouldn't see any non-50s in the table if we round to the nearest 1%.
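Those figures are easy to sanity-check. The following back-of-the-envelope simulation is my own, using the normal approximation to the binomial; it reproduces both the 0.05 standard deviation and a worst-case deviation in the neighborhood of 0.175 percentage points:

```python
import math
import random

trials = 1_000_000                   # samples per matrix entry
sd = 100 * 0.5 / math.sqrt(trials)   # std dev of one entry, in percentage points
print(sd)                            # 0.05

# The expected maximum of n standard-normal draws grows like sqrt(2 ln n);
# for n = 1024 that's about 3.72, in the same ballpark as the 3.5 above.
print(round(math.sqrt(2 * math.log(1024)), 2))

# Simulate the worst of 1024 independent cell deviations.
random.seed(1)
worst = max(abs(random.gauss(0.0, sd)) for _ in range(1024))
print(round(worst, 3))               # typically somewhere around 0.17-0.19
```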

Since we don't (yet) know of a mixing function that obeys SAC exactly, we can instead create one that actually does a 50% coin flip (simulated). This isn't a real mixing function because the outputs don't depend on the inputs, but it will fool the avalanche calculator into thinking it's real. Figure 2 shows this mixing function.
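In Python, such a fake mixer might look like the following sketch; this is a stand-in for the function in Figure 2, not the article's actual code:

```python
import random

def coin_flip_mixer(x):
    # Not a real mixing function: the output ignores the input entirely.
    # To the avalanche calculator, every output bit appears to flip with
    # probability 1/2 whenever any input bit is flipped, which is exactly
    # the behavior of a SAC-perfect function.
    return random.getrandbits(32)
```

Because the outputs are independent of the inputs, the measured matrix converges to 50% everywhere, with only sampling noise remaining.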

I won't show the results of this function here because it would be a complete waste of bandwidth, but you can run it yourself and see that it shows 50's in every position, and the raw data shows that the worst positions probably deviate from this by about 0.175 percentage points (it varies each time you run it, of course, so your results may differ).

## How Bad Can It Get?

If you compare the `Jenkins32` function to a more primitive multiplicative mixer of the sort people have used in real hash functions, you'll see that we could do much, much worse. Figure 3 shows such a function and a sub-section of the results matrix. It clearly shows that this hash function is bad at propagating high-order bits into low-order bit positions. Whether that's acceptable or not depends on the application. And it's not really as bad as it seems, because the intended use of this function requires taking a prime modulus of the final value, which does scramble the lower bits somewhat.
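To make the failure mode concrete, here is a sketch of a Knuth-style multiplicative mixer, using the multiplier commonly attributed to Knuth (the article's constant may differ). Multiplication only carries information upward, so a change to the highest input bit can never reach the lower output bits:

```python
MASK = 0xFFFFFFFF
KNUTH = 2654435761  # ~2^32 / golden ratio, the multiplier commonly attributed to Knuth

def knuth_multiplicative(x):
    return (x * KNUTH) & MASK

# Carries in multiplication propagate strictly upward, so flipping input
# bit 31 changes the product by exactly 2^31 (mod 2^32) and therefore
# flips output bit 31 and nothing else.
x = 123456789
d = knuth_multiplicative(x) ^ knuth_multiplicative(x ^ 0x80000000)
print(hex(d))  # 0x80000000, for every choice of x
```

In avalanche terms, input bit 31 affects output bit 31 100% of the time and every other output bit 0% of the time, which is as far from SAC as it gets.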

## Can We Do Better?

Is it possible to do better than `Jenkins32`? The answer is yes. To illustrate this, let's just run `Jenkins32` twice. The first pass already does a great job of mixing, so doing it twice should be even better, right? We can easily do this by calling the `AvalancheMatrix` method with a `repetitions` parameter of 2. And we find that, in fact, the results are near-perfect: practically indistinguishable from a SAC-perfect mixing function.

Note that this isn't true of every mixing function. With `KnuthMultiplicative`, those 0's and 100's never go away because of where they're positioned.

You might wonder if a mixing function that exactly satisfies SAC even exists. Once again the answer is yes. Figure 4 shows a 4-bit mixing function with this property. The truth table of this function is shown so you can confirm that SAC is actually satisfied.

We know that we can get arbitrarily close just by applying a good-enough function two or more times. The challenge is to get closer without adding additional computation time.

## Mixing Function Search

Warning! I can think of no practical reason why you would need "perfect" avalanche instead of the excellent avalanche achieved by the `Jenkins32` integer hash. The rest of this page is motivated purely by intellectual curiosity.

To create the mixing function of Figure 4 I did a random search of different permutation tables until I came across one that satisfied the SAC. That works for a 4-bit function, but a lookup table isn't going to be feasible for a 32-bit or larger function. So we can choose to limit our search to functions that are a combination of the reversible operations from the previous page. That's where those "magic numbers" come in. We choose them at random! Some values will be good, some will be bad, but we have a way to measure to see which are the best.
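The search can be expressed as a simple predicate over 4-bit lookup tables. The following is my own formulation of the SAC test, not the article's code: flipping any input bit must change each output bit for exactly half of the 16 possible inputs:

```python
def satisfies_sac(table, bits=4):
    """True if flipping input bit i changes output bit j for exactly
    half of all inputs, for every (i, j) pair."""
    n = 1 << bits
    for i in range(bits):
        for j in range(bits):
            changed = sum(
                1 for x in range(n)
                if (table[x] ^ table[x ^ (1 << i)]) & (1 << j)
            )
            if changed != n // 2:
                return False
    return True

# The identity permutation fails badly: input bit i only ever affects
# output bit i, so each (i, i) pair changes 100% of the time.
print(satisfies_sac(list(range(16))))  # False
```

A random search then just shuffles a candidate table until `satisfies_sac` returns True.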

But a completely random search would be too slow. Even if we restrict our search to functions of the same form as `Jenkins32`, that's still 31^8 combinations, or almost a million million. And most of those will be complete rubbish.

Instead, what we'll do is start with a known-good function and try modifying each of the constants, one at a time, to see if we can improve on the previous avalanche matrix. To compare two functions and determine which is better, we need a single number that summarizes the avalanche matrix results. I chose the "sum of squared errors": take the difference between each matrix entry and the expected value of 0.5, square it, and sum all 1024 values together.
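The metric itself is a one-liner; here's a sketch, with probabilities expressed as fractions rather than percentages:

```python
def sum_squared_error(matrix):
    # Sum over all 1024 cells of (observed probability - 0.5)^2; lower is better.
    return sum((p - 0.5) ** 2 for row in matrix for p in row)

perfect = [[0.5] * 32 for _ in range(32)]
print(sum_squared_error(perfect))  # 0.0
```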

I used 100 thousand trials for the matrix calculation. If you use too few trials there is too much "noise" in the results and you can't reliably compare values. The minimum result we can expect to get this way is 0.00256, which is what you would get if each matrix value were distributed binomially, as it would be for a function that satisfies SAC. `Jenkins32` gets about 0.0257, about 10 times the expected squared error.
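The 0.00256 floor follows directly from the binomial variance of each cell; a quick check:

```python
cells = 32 * 32      # 1024 matrix entries
trials = 100_000
p = 0.5

# Each cell's sampled proportion has variance p*(1-p)/trials, and the sum
# of squared errors for a SAC-perfect function is just this variance
# totaled across all cells.
noise_floor = cells * p * (1 - p) / trials
print(noise_floor)   # 0.00256

# Jenkins32's reported score of ~0.0257 is about ten times this floor.
print(round(0.0257 / noise_floor, 1))  # 10.0
```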

Figure 5 shows the trajectory of the best "search path" I've been able to find using `Jenkins32` as the starting point. It shows that we can reduce the squared error until it's essentially equal to the theoretical minimum, apart from sampling noise.