0. Introduction
For a long time I have been looking for "good" permutations on $[0, 2^{32})$ and $[0, 2^{64})$. The definition of "good" in this context is a permutation that looks like it had been chosen at random from any of the $2^{32}!$ or $2^{64}!$ such possible functions and is reasonably fast. These design criteria are somewhat vague; the identity function is just as random as any other function on the domain. If speed wasn't an issue, any half decent block cipher would fit the bill.A straightforward (but time consuming) test of "goodness" is to feed a low-entropy source such as a simple counter into the mixing function and passing the output to a test battery for randomness.
Perhaps the most well established set of tools for testing random number generation is L'Ecuyer and Simard's TestU01. It contains three predefined sets of tests ("batteries") that any stream of random words should expect to pass without being flagged as anomalous. For a 32-bit bijection, TestU01's batteries Crush and BigCrush should detect anomalies due to the short period, $2^{32}$, regardless of how well the permutation has been chosen.
A different test-suite that has a much more hackish flavour to it is PractRand. It has a tendency to fail some generators very quickly and takes a quite different approach to testing than TestU01 does; it runs until it fails spectacularly or $2^{46}$ bytes have been inspected.
A randomly chosen permutation $f: 2^b \rightarrow 2^b$ could be expected to have (at least) the following properties ($\oplus$ denoting exclusive or):
- For any $0 \le x < 2^b, 0 \le i < b, \Delta_1 = f(x) \oplus f(x \oplus 2^i)$ should have Hamming weights ("number of set bits", a.k.a. population count) that are binomially distributed, $B(b, \frac{1}{2})$.
- For any $0 \le x < 2^b$ and tuples $a$ of length $t, 0 \le t < b$ and $0 \le a_k < b, a_k < a_{k + 1}, 1 \le k \le t$, let $s = \sum_{i=1}^t 2^{a_i}$ and let $\Delta_t = f(x) \oplus f(x \oplus s)$.
$\Delta_t$ should now also have Hamming weights with a $B(b, \frac{1}{2})$ distribution. - When being fed a low entropy sequence, say $x_n = n$ or something of similar simplicty, $f(x_i) \Vert f(x_{i + 1}) \Vert \ldots$ should look like a random bit sequence when the output words are concatenated.
The first property I think follows from the strict avalanche criterion, i.e. a function fulfilling the SAC also has binomially distributed Hamming weight-deltas.
The second property should follow from a function fulfilling a $t$-order SAC.
The other day I came across "Fast Splittable Pseudorandom Number Generators" [1] by Guy L. Steele Jr., Doug Lea and Christine H. Flood. Obviously, I am not Mr. Current Affairs since the paper is 4 years old by now. Anyway, I noticed that the mixing operation looked very similar to a family of functions I've been tinkering with, trying to find something that is reasonably fast and that satisfies the properties above. The mixing functions in [1] were in turn borrowed from Austin Appleby's MurmurHash3 and David Stafford's improved variants [2] on MurmurHash3.
MurmurHash3 uses a 64-bit mixer in its final step, looking like this:
"Appleby used this criterion [SAC, possibly first order?] as the “energy function” in a simulated annealing process to explore a specific space of potential 8-operation bit-mixing functions (both 32-bit and 64-bit versions."
Stafford chose a different search strategy than Appleby: trying low entropy data (instead of random numbers) in his inputs to the mixing function, assuming that a function that behaves well on low entropy inputs also would be well behaved for high entropy inputs. According to his search, that assumption was correct. He does, however, not specify what the low entropy inputs were or if he measured $f(x)$ or $f(x) \oplus f(x \oplus c)$ for low entropy (single bit?) constants $c$.
The authors of [1] give a short snippet in Fig. 18, that gives a sum-of-squares [first order] avalanche statistic with randomized inputs, flipping each bit to compute $\Delta_1$ as in property 1 above. They stick to the constants from MurmurHash3 and includes Variant13 from Stafford since they conclude that they may be close to optimal with regards to $\Delta_1$.
MurmurHash3 uses a 64-bit mixer in its final step, looking like this:
uint64_t mix(uint64_t v) {
v ^= (v >> S1);
v *= M1;
v ^= (v >> S2);
v *= M2;
return v ^ (v >> S3);
}
v ^= (v >> S1);
v *= M1;
v ^= (v >> S2);
v *= M2;
return v ^ (v >> S3);
}
Where S1 = S2 = S3 = 33, M1 = 0xFF51AFD7ED558CCD and M2 = 0xC4CEB9FE1A85EC53.
I can't find Appleby's description on how the search for constants was performed but in section 4 of [1]:"Appleby used this criterion [SAC, possibly first order?] as the “energy function” in a simulated annealing process to explore a specific space of potential 8-operation bit-mixing functions (both 32-bit and 64-bit versions."
Stafford chose a different search strategy than Appleby: trying low entropy data (instead of random numbers) in his inputs to the mixing function, assuming that a function that behaves well on low entropy inputs also would be well behaved for high entropy inputs. According to his search, that assumption was correct. He does, however, not specify what the low entropy inputs were or if he measured $f(x)$ or $f(x) \oplus f(x \oplus c)$ for low entropy (single bit?) constants $c$.
The authors of [1] give a short snippet in Fig. 18, that gives a sum-of-squares [first order] avalanche statistic with randomized inputs, flipping each bit to compute $\Delta_1$ as in property 1 above. They stick to the constants from MurmurHash3 and includes Variant13 from Stafford since they conclude that they may be close to optimal with regards to $\Delta_1$.
Fig. 18 of [1] , contains a glaring bug by the way, maybe that's not the code they used when doing their search but a simple error that got introduced when formatting the code for the paper:
What is not so nice, in the context of creating mixers, is that low input Hamming weights tends to stay low, thus not mixing things as fast as we would like.
Multiplication is a really nice operation in this context; the topmost bits are influence by all bits of lower significance. As long as the constant is odd and the multiplication is done modulo $2^b$, an inverse exists and the operation is thus bijective. What's not so nice is that high input bits can't influence low order bits (when the modulus is on the form $2^b$).
Doing an interleaving of multiplying and xorshifting (to the right) is a really brilliant idea; multiplying makes all low order bits influence the top most bits. Xorshifting folds the topmost bits down to the right. And the defect that xorshifting has w.r.t. Hamming weights is mostly remedied by multiplication.
I first saw a very similar construction in SHA-512, but as y = ror(x, a) ^ ror(x, b) ^ x >> c. The snippet with two rors above is the special case where the last term is omitted.
long x = w ^ m.mix(v ^ (1 << i));
should of course be:
long x = w ^ m.mix(v ^ (1L << i));)
Anyhow, what is nice about xor-shifting is that it's guaranteed to be bijective on $2^b$ as long as the amount shifted, $s$, is $0 < s < b$. (This can be verified by checking that the binary matrices of $(I + L^s)$ and $(I + R^s)$ have inverses.)
What is not so nice, in the context of creating mixers, is that low input Hamming weights tends to stay low, thus not mixing things as fast as we would like.
Multiplication is a really nice operation in this context; the topmost bits are influence by all bits of lower significance. As long as the constant is odd and the multiplication is done modulo $2^b$, an inverse exists and the operation is thus bijective. What's not so nice is that high input bits can't influence low order bits (when the modulus is on the form $2^b$).
Doing an interleaving of multiplying and xorshifting (to the right) is a really brilliant idea; multiplying makes all low order bits influence the top most bits. Xorshifting folds the topmost bits down to the right. And the defect that xorshifting has w.r.t. Hamming weights is mostly remedied by multiplication.
Anyway, I was looking for a better way to increase the avalanche rate but without having more than 2 multiplication/xorshift steps and figured that
x ^= ror(x, a) ^ ror(x, b)
is bijective when $0 \le x < 2^{64}, 0 < a < 64, 0 < b < 64, a \ne b$, which looked promising as a preliminary step.I first saw a very similar construction in SHA-512, but as y = ror(x, a) ^ ror(x, b) ^ x >> c. The snippet with two rors above is the special case where the last term is omitted.
1. A better mixer
Most recently I have come up with this:
M = 0x9fb21c651e98df25L, S = 28, R1 = 49, R2 = 24
uint64_t rrmxmx(uint64_t v) {
v ^= ror64(v, R1) ^ ror64(v, R2);
v *= M;
v ^= v >> S;
v *= M;
return v ^ v >> S;
}
v ^= ror64(v, R1) ^ ror64(v, R2);
v *= M;
v ^= v >> S;
v *= M;
return v ^ v >> S;
}
Before I get to the advantages of using this over MurmurHash3's finalizer or Stafford's variants, I'll describe how I found the constants.
As I was looking for constants I was using a different energy criteria based on properties 1 and 2 above:
Create a series of low entropy base constants such as $1, 3, 5, 7, 11_{16}, 55_{16}, 5555_{16}$ and so on. Create all 64 bit rotations of these constants along with their bitwise complements, bit order reversed and bit order reversed and complemented. Each base constant gives rise to 4 * 64 possible constants.
Also create all 64 bit values with Hamming weights 2 and 3, and their complements. Add all the constants to a set, $C$, making sure that each constant occurs exactly once. $\#C$ is the number of constants.
1.1 Search strategy
I'm kind of happy that I was unaware of how MurmurHash3 worked as I have pretty much always just whipped out a hash function when I've needed one based on the observations that "multiplication is good for mangling bits upwards and xorshift right works alright for mangling downwards". For me, this has been working well in practice.As I was looking for constants I was using a different energy criteria based on properties 1 and 2 above:
Create a series of low entropy base constants such as $1, 3, 5, 7, 11_{16}, 55_{16}, 5555_{16}$ and so on. Create all 64 bit rotations of these constants along with their bitwise complements, bit order reversed and bit order reversed and complemented. Each base constant gives rise to 4 * 64 possible constants.
Also create all 64 bit values with Hamming weights 2 and 3, and their complements. Add all the constants to a set, $C$, making sure that each constant occurs exactly once. $\#C$ is the number of constants.
In pseudocode:
When this is done, I compute a goodness-of-fit for each $h_c$. If $f(x)$ is a good approximation of a random permutation, $h_c$ should be binomially distributed with $B(n, \frac{1}{2})$.
The goodness of fit-values are stored in an array with $\#C$ values. From this array I compute the mean, $\bar{x}$ and sample standard deviation, $s$. My energy function is simply $\bar{x} + s$. No, I don't have a good theoretical justification for this...
A sounder metric would be to truncate each histogram (bin bits $0, \ldots, k$ and $64 - k, \ldots, 64$ into a low-bin and high-bin) so that the lowest expected frequency would be around 5, then do a chi-square test and note the p-values. The p-values should now look as they were drawn from a $U(0, 1)$ distribution. One problem with this approach is that it would be insensitive to errors where the Hamming distance is very low or very high.
As can be seen from the code above, the compilation of the histogram will take time that is linearly proportional to $n\#C$. If $C$ is large, this is really slow.
To speed things up a bit, for any M, S, R1, R2, I start by using a preliminary value for $n$ as well as a "small" set for $C$. Let's call this preliminary set of constants $C_0$ (currently of size 24384)
and the extended set $C_1$ (currently of size 621408). I now test with the preliminary values. If the energy from the preliminary test is at most 1.1 times the lowest energy seen so far, I run an extensive test with a larger $n$ and $C_1$ instead of $C_0$.
When a new set of constants shows up that are better or close to as good as the old set, the search program tries the M-value, mutated first by single bit flips, then by all possible 2-bit flips and so on. For each candidate M, shifts (S) in the range 24-52 are tried. During the initial search for good constants I noticed that the error landscape isn't smooth at all. In particular decreasing or increasing S by one might lead to a worse energy even if there are shifts that are better. Hence, for each M do a complete (preliminary) test for all S in the range.
Note that both MurmurHash3 and Variant13 have a second order sum-of-squares avalanche statistic that is very far from what should be expected from a randomly chosen permutation. Variant13 fares better in almost all tests than the original MurmurHash3 but much worse with regard to third order avalanche.
I ran TestU01's BigCrush on my own function, called rrmxmx, MurmurHash3 and Variant13 with a simple counter, similar to how the state update is done in SplittableRandom. p-values outside $[10^{−3} , 1 − 10^{−3}]$ were flagged as "suspect" and p-values outside $[10^{−10} , 1 − 10^{−10}]$ were flagged as "failed".
The data in Table 2 below are from using rrmxmx, MurmurHash3 and Variant13 to generate a pseudo random stream in this fashion:
TestU01 works either on 32 bit words, using 30 to 31 bits, or on 64-bit doubles. In the tests below, I split each 64-bit word in two 32-bit halves and use the halves sequentially.
The procedure used in [1] is described in Section 5 of [1]:
Floating in [1] is done by
For a linear congruential generator, right shifting makes sense due to the most significant bits having a longer period than the least significant bits. For a properly designed generator, all output bits should be equally useful.
I have chosen the same number of tests as in [1], 3 runs of BigCrush for each {mixer, gamma and word reversal}-tuple. The runs for a particular {mixer, gamma, reversal}-tuple differ in the starting point chosen; $runIdx \times \gamma \times 2^{40}$, the first run having $runIdx = 0$. By chosing the starting points this way, the runs will be free from overlap since BigCrush uses about $2^{38}$ values.
In [1], the authors give a slightly low number for the expected number of "suspect" tests;
As can be seen in Table 2, both MurmurHash3 and Variant13 fails horribly for small gamma. This is the code for getting a new gamma when splitting in SplittableRandom:
Let H be a histogram with #C * 65 bins.
for i in 1, ..., n // Do n experiments
x = rnd() // Get random number on [0, 2^64)
h = f(x) // f is our mixer
for c in C
y = h XOR f(x XOR c)
p = hammingWeight(y)
cIdx = indexOf(c in C)
h[cIdx][p]' = h[cIdx][p] + 1
end for
end for
for i in 1, ..., n // Do n experiments
x = rnd() // Get random number on [0, 2^64)
h = f(x) // f is our mixer
for c in C
y = h XOR f(x XOR c)
p = hammingWeight(y)
cIdx = indexOf(c in C)
h[cIdx][p]' = h[cIdx][p] + 1
end for
end for
When this is done, I compute a goodness-of-fit for each $h_c$. If $f(x)$ is a good approximation of a random permutation, $h_c$ should be binomially distributed with $B(n, \frac{1}{2})$.
The goodness of fit-values are stored in an array with $\#C$ values. From this array I compute the mean, $\bar{x}$ and sample standard deviation, $s$. My energy function is simply $\bar{x} + s$. No, I don't have a good theoretical justification for this...
A sounder metric would be to truncate each histogram (bin bits $0, \ldots, k$ and $64 - k, \ldots, 64$ into a low-bin and high-bin) so that the lowest expected frequency would be around 5, then do a chi-square test and note the p-values. The p-values should now look as they were drawn from a $U(0, 1)$ distribution. One problem with this approach is that it would be insensitive to errors where the Hamming distance is very low or very high.
As can be seen from the code above, the compilation of the histogram will take time that is linearly proportional to $n\#C$. If $C$ is large, this is really slow.
To speed things up a bit, for any M, S, R1, R2, I start by using a preliminary value for $n$ as well as a "small" set for $C$. Let's call this preliminary set of constants $C_0$ (currently of size 24384)
and the extended set $C_1$ (currently of size 621408). I now test with the preliminary values. If the energy from the preliminary test is at most 1.1 times the lowest energy seen so far, I run an extensive test with a larger $n$ and $C_1$ instead of $C_0$.
When a new set of constants shows up that are better or close to as good as the old set, the search program tries the M-value, mutated first by single bit flips, then by all possible 2-bit flips and so on. For each candidate M, shifts (S) in the range 24-52 are tried. During the initial search for good constants I noticed that the error landscape isn't smooth at all. In particular decreasing or increasing S by one might lead to a worse energy even if there are shifts that are better. Hence, for each M do a complete (preliminary) test for all S in the range.
2. Results
I plugged MurmurHash3 and Stafford's Variant 13 into my search program and both fared badly. The sample means weren't so bad but the sample standard deviations were terrible. Clearly, for some constants $c$ as described above, the Hamming weights of $f(x) \oplus f(x \oplus c)$ were far from binomially distributed. When isolating the tests to only 2-bit mutations, measuring the second order avalanche, both resulted in values far from the expected 1.0.
2.1 An avalanche metric
In [1], Figure 18, this snippet (in Java) is given to compute a first order sum-of-square avalanche statistic, with the bug with 1 vs. 1L as described above corrected:
long[][] A = new long[64][64];
for (long n = 0; n < N; n++) {
long v = rng.nextLong();
long w = mix64(v);
for (int i = 0; i < 64; i++) {
long x = w ^ m.mix(v ^ (1L << i));
for (int j = 0; j < 64; j++) {
if (((x >>> j) & 1) != 0) {
A[i][j] += 1;
}
}
}
}
double sumsq = 0.0;
for (int i = 0; i < 64; i++) {
for (int j = 0; j < 64; j++) {
double v = a[j][k] - N / 2.0;
sumsq += v*v;
}
}
double result = sumsq / ((N / 4.0) * 4096.0);
The code above is extended to compute higher order avalanche statistics like this (second order example in C):
for (long n = 0; n < N; n++) {
long v = rng.nextLong();
long w = mix64(v);
for (int i = 0; i < 64; i++) {
long x = w ^ m.mix(v ^ (1L << i));
for (int j = 0; j < 64; j++) {
if (((x >>> j) & 1) != 0) {
A[i][j] += 1;
}
}
}
}
double sumsq = 0.0;
for (int i = 0; i < 64; i++) {
for (int j = 0; j < 64; j++) {
double v = a[j][k] - N / 2.0;
sumsq += v*v;
}
}
double result = sumsq / ((N / 4.0) * 4096.0);
double measureAvalanche2_x(uint64_t (*h) (uint64_t), int runExp,
uint64_t inc, int complemented, int bins) {
assert(2016 % bins != 0); // choose(64, 2) == 2016
uint64_t* A = allocA(bins);
uint64_t C = (complemented ? 0xFFFFFFFFFFFFFFFFUL : 0);
uint64_t N = 1L << runExp;
for (uint64_t n = 0; n < N; n++) {
uint64_t v = n * inc;
uint64_t w = h(v);
int p = 0;
for (int i = 0; i < 64; i++) {
for(int j = i + 1; j < 64; j++) {
uint64_t x = w ^ h(v ^ (1L << i) ^ (1L << j) ^ C);
for (int k = 0; k < 64; k++) {
if (((x >> k) & 1) != 0) {
A[p * 64 + k]++;
}
}
p = (p + 1) % bins;
}
}
}
double result = sumsq(A, bins, N * (2016 / bins));
freeA(A);
return result;
}
uint64_t inc, int complemented, int bins) {
assert(2016 % bins != 0); // choose(64, 2) == 2016
uint64_t* A = allocA(bins);
uint64_t C = (complemented ? 0xFFFFFFFFFFFFFFFFUL : 0);
uint64_t N = 1L << runExp;
for (uint64_t n = 0; n < N; n++) {
uint64_t v = n * inc;
uint64_t w = h(v);
int p = 0;
for (int i = 0; i < 64; i++) {
for(int j = i + 1; j < 64; j++) {
uint64_t x = w ^ h(v ^ (1L << i) ^ (1L << j) ^ C);
for (int k = 0; k < 64; k++) {
if (((x >> k) & 1) != 0) {
A[p * 64 + k]++;
}
}
p = (p + 1) % bins;
}
}
}
double result = sumsq(A, bins, N * (2016 / bins));
freeA(A);
return result;
}
2.2 Measurements
The number of possible $k$-bit differences in a 64-bit word is ${64\choose k}$. To be able to run the tests for many instances I wrote a CUDA-program. To create less memory contention, each thread works on a small number of bins instead of the full number. Using the extensions to 2nd, 3rd and 4th order avalanche, with $bins_{1, \ldots, 4} = \{64, 288, 217, 217\}$, here's the avalanche sum-of-squares statistic of order 1, 2, 3, 4. Inputs are the products of $a \times [0, 2^{30})$ for $\Delta_1$, $a \times [0, 2^{25})$ for $\Delta_2$, $a \times [0, 2^{20})$ for $\Delta_3$ and $a \times [0, 2^{20})$ for $\Delta_4$:
Mixer | a = 0x40EAD42CA1CD0131 | |||
---|---|---|---|---|
$\Delta_1$ | $\Delta_2$ | $\Delta_3$ | $\Delta_4$ | |
rrmxmx | 0.975 | 0.992 | 1.039 | 1.005 |
Murmur3 | 1.423 | 11049.99 | 1.003 | 3.004 |
Variant13 | 1.008 | 2131.30 | 25.46 | 1.271 |
Note that both MurmurHash3 and Variant13 have a second order sum-of-squares avalanche statistic that is very far from what should be expected from a randomly chosen permutation. Variant13 fares better in almost all tests than the original MurmurHash3 but much worse with regard to third order avalanche.
2.3 Testing as a PRNG with TestU01
Neither MurmurHash3 or Variant13 had property 3 in the introduction above ("feeding low entropy data should render a random-looking output sequence") as a design goal. Since Variant13 and partly, MurmurHash3, are being used as the mixing functions in Java 8's SplittableRandom, they deserve more scrutiny with regard to that property.I ran TestU01's BigCrush on my own function, called rrmxmx, MurmurHash3 and Variant13 with a simple counter, similar to how the state update is done in SplittableRandom. p-values outside $[10^{−3} , 1 − 10^{−3}]$ were flagged as "suspect" and p-values outside $[10^{−10} , 1 − 10^{−10}]$ were flagged as "failed".
TestU01 is not geared towards testing random bit streams as such but is (mostly) word oriented with 30- and 31-bit words. If one assumes that the output to be tested is a stream of bits rather than a stream of words, it's prudent to at least also feed TestU01 with all output words bit-reversed.
The data in Table 2 below are from using rrmxmx, MurmurHash3 and Variant13 to generate a pseudo random stream in this fashion:
uint64_t next(uint64_t *state, uint64_t gamma) {
uint64_t s = *state;
*state += gamma;
return mix(s);
}
This is fed to TestU01 byuint64_t s = *state;
*state += gamma;
return mix(s);
}
static void runTestsInt(unsigned int (*prng)(void), char name[]) {
unif01_Gen* gen = unif01_CreateExternGenBits(name, prng);
bbattery_BigCrush(gen);
unif01_DeleteExternGenBits(gen);
}
Note that I use all 64 bits of output, without floating, and that this is stricter/more revealing.unif01_Gen* gen = unif01_CreateExternGenBits(name, prng);
bbattery_BigCrush(gen);
unif01_DeleteExternGenBits(gen);
}
TestU01 works either on 32 bit words, using 30 to 31 bits, or on 64-bit doubles. In the tests below, I split each 64-bit word in two 32-bit halves and use the halves sequentially.
"TestU01 expects to test 64-bit double values, so we use 53 out of every 64 bits generated by our algorithm to generate a double value as for our method nextDouble."This masks some serious defects in the 11 least significant bits of MurmurHash3 and Variant13 when $\gamma$ is small. Except for large $\gamma$, BigCrush generates many more suspect/failure-values than expected for MurmurHash3 and Variant13. Note that rrmxmx seems to be unaffected by small $\gamma$.
Floating in [1] is done by
(nextLong() >>> 11) * DOUBLE_ULP
thus discarding the 11 least significant bits. This practice in itself is not any better than using the least significant 53 bits, as in
(nextLong() & ((1L << 53) - 1)) * DOUBLE_ULP
but seems to be common practice anyway.For a linear congruential generator, right shifting makes sense due to the most significant bits having a longer period than the least significant bits. For a properly designed generator, all output bits should be equally useful.
I have chosen the same number of tests as in [1], 3 runs of BigCrush for each {mixer, gamma and word reversal}-tuple. The runs for a particular {mixer, gamma, reversal}-tuple differ in the starting point chosen; $runIdx \times \gamma \times 2^{40}$, the first run having $runIdx = 0$. By chosing the starting points this way, the runs will be free from overlap since BigCrush uses about $2^{38}$ values.
In [1], the authors give a slightly low number for the expected number of "suspect" tests;
“TestU01 regards any p-value outside the range $[10^{-10}, 1 - 10^{-10}]$ as indicating “clear failure” of the tested generator. The TestU01 software flags all p-values outside the range $[10^{-3}, 1 - 10^{-3}]$; such values, if they are not clear failures, are regarded as “suspect.” For each generator variant we ran BigCrush 3 times and we report the total number of suspect p-values and clear failures over all 3 runs; each run of BigCrush performs 160 distinct tests, so 3 runs collectively perform 480 tests. With 480 tests at two-sided p = 0.001, we expect even a perfect generator to exhibit 0.96 “suspect” values on average.”I suspect "160" and "480" are typos. Pages 148--152 of The "Compact user's guide" [3] for TestU01 list the individual tests and parameters, giving a count of 106. According to page 147 of [3], several tests give more than one p-value:
"Some of the tests compute more than one statistic using the same stream of random numbers and these statistics are thus not independent. That is why the number of statistics in the summary reports is larger than the number of tests in the description of the batteries."Assuming the p-values are independent, with 3 runs of with 254 values each and two-sided p = 0.001, we expect a perfect generator to exhibit 1.524 “suspect” values on average.
Mixer | $\gamma$ = 1 | $\gamma$ = 0x55555555 | $\gamma$ = 0xC45A11730CC8FFE3 | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Forward | Reverse | Forward | Reverse | Forward | Reverse | |||||||
suspect | failed | suspect | failed | suspect | failed | suspect | failed | suspect | failed | suspect | failed | |
rrmxmx | 3 | 0 | 2 | 0 | 2 | 0 | 1 | 0 | 1 | 0 | 2 | 0 |
Murmur3 | 6 | 47 | 9 | 140 | 10 | 4 | 14 | 37 | 1 | 0 | 1 | 0 |
Varian13 | 9 | 24 | 14 | 38 | 9 | 3 | 10 | 1 | 0 | 0 | 1 | 0 |
As can be seen in Table 2, both MurmurHash3 and Variant13 fails horribly for small gamma. This is the code for getting a new gamma when splitting in SplittableRandom:
private static long mixGamma(long z) {
// MurmurHash3 mix constants z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
z = (z ^ (z >>> 33)) | 1L; // force to be odd
int n = Long.bitCount(z ^ (z >>> 1)); // ensure enough transitions
return (n < 24) ? z ^ 0xaaaaaaaaaaaaaaaaL : z;
}
Compared to [1], MurmurHash3 and Variant13 have switched purposes, MurmurHash3 is used for new gammas and Variant13 is used for the prng-stream.
Alas, the mixGamma() function above does not prevent bad gammas from occuring. 0x55555555 is, as can be seen by testing, a gamma that fares very badly but is still entirely possible. I do not yet know how large a proportion of the values from mixGamma() are problematic and give rise to a prng-instance that have highly undesirable properties. Melissa O'Neill made the same reflection on bad gammas before me in her blog post "Bugs in SplitMix(es)". They are trivial to construct, the question is at what frequency they are showing up in practice?
// MurmurHash3 mix constants z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
z = (z ^ (z >>> 33)) | 1L; // force to be odd
int n = Long.bitCount(z ^ (z >>> 1)); // ensure enough transitions
return (n < 24) ? z ^ 0xaaaaaaaaaaaaaaaaL : z;
}
2.4 Testing as a PRNG with PractRand
I also decided to test with PractRand, V0.93, with options "-tf 2 -te 1 -tlmin 1KB". All values of $\gamma$ in Table 1 of [1] marked "poor" and "okay" fail PractRand after a very small amount of output from MurmurHash3 and Variant13. The gammas marked as "random" in Table 1 of [1] fail with MurmurHash3 as the mixing function after at most $2^{41}$ bytes.
It's clear that Variant13 is more well behaved than MurmurHash3, but Variant3 fails PractRand quite rapidly for many gammas. Are the gammas marked as "random" a fluke or the norm with regards to how well behaved Variant13 is? I'll try some experiments in a blog post to come, trying to estimate the likelihood that SplittableRandom gets a bad gamma.
$\gamma$ | rrmxmx | Murmur3 | Variant13 |
---|---|---|---|
0x0000000000000001
| $2^{44}$ | $2^{16}$ | $2^{21}$ |
0x0000000000000003
| $2^{47.71}$ | $2^{18}$ | $2^{22}$ |
0x0000000000000005
| $> 2^{48}$ | $2^{19}$ | $2^{21}$ |
0x0000000000000009
| $> 2^{48}$ | $2^{19}$ | $2^{22}$ |
0x0000010000000001
| $2^{46.41}$ | $2^{18}$ | $2^{20}$ |
0xffffffffffffffff
| $2^{45.62}$ | $2^{16}$ | $2^{21}$ |
0x0000000000ffffff
| $2^{45.39}$ | $2^{17}$ | $2^{22}$ |
0xffffff0000000001
| $2^{46.55}$ | $2^{18}$ | $2^{22}$ |
0x0000000000555555
| $2^{47.42}$ | $2^{25}$ | $2^{30}$ |
0x1111111111110001
| $> 2^{48}$ | $2^{27}$ | $2^{32}$ |
0x7777777777770001
| $> 2^{48}$ | $2^{28}$ | $2^{34}$ |
0x7f7f7f7f33333333
| $> 2^{48}$ | $2^{31}$ | $2^{33}$ |
0x5555550000000001
| $> 2^{48}$ | $2^{24}$ | $2^{30}$ |
0xc45a11730cc8ffe3
| ? | $2^{39}$ | $> 2^{42}$ |
0x2b13b77d0b289bbd
| ? | $2^{39}$ | $> 2^{42}$ |
0x40ead42ca1cd0131
| ? | $2^{40}$ | $> 2^{42}$ |
After $2^{43}$ bytes, PractRand has yet not complained when running Variant13 with $\gamma = 40EAD42CA1CD0131_{16}$.
rrmxmx fails at the $2^{44}$ level with $\gamma = 1$. With $\gamma = 3$, it passes PractRand up to $2^{45}$ bytes but with one test marked as suspicious at that level;
rrmxmx fails at the $2^{44}$ level with $\gamma = 1$. With $\gamma = 3$, it passes PractRand up to $2^{45}$ bytes but with one test marked as suspicious at that level;
rng=RNG_stdin, seed=0xca839ae9
length= 32 terabytes (2^45 bytes), time= 506449 seconds
Test Name Raw Processed Evaluation
BCFN_FF(2+2):freq R= +8.3 p~= 1e-7 suspicious
...and 1900 test result(s) without anomalies
length= 32 terabytes (2^45 bytes), time= 506449 seconds
Test Name Raw Processed Evaluation
BCFN_FF(2+2):freq R= +8.3 p~= 1e-7 suspicious
...and 1900 test result(s) without anomalies
Considering that the test at the $2^{45}$ level took 140 hours I won't run a test at $2^{46}$ level but some caution is recommended if running rrmxmx as a PRNG with very sparse (1 or 2-bit) gammas. Currently I am running TestU01 for gammas of the form $2^n$ for $0 < n < 25$. Assuming rrmxmx behaves as a randomly chosen permutation, these gammas should pass too. One would of course never use them for a PRNG considering that they shorten the period to $2^{64 - n}$.
3. Speed
The operations in rrmxmx are identical to MurmurHash3 except for the first step; instead of a right shift and xor there are two rotates and 2 xors. With my machine (i7-5820), ILP makes the speed of rrmxmx, MurmurHash3 and Variant13 identical as far as I can measure.
4. Further work
Assuming
x ^= ror64(x, a) ^ ror64(x, b)
is just as fast as
x ^= x >> a
(On my Intel i7-5820 and on AMD Ryzen 7 1700 it is, for many other processors, it isn't.)
A search should be made to see if there could be a better version looking like this:
uint64_t rrmxmx2(uint64_t v) {
uint64_t v2 = v;
v2 ^= ror64(v2, R1) ^ ror64(v2, R2);
v2 *= M1;
v2 ^= ror64(v2, R3) ^ ror64(v2, R4);
v2 *= M2;
return v2 ^ ror64(v2, R5) ^ ror64(v2, R6);
}
uint64_t v2 = v;
v2 ^= ror64(v2, R1) ^ ror64(v2, R2);
v2 *= M1;
v2 ^= ror64(v2, R3) ^ ror64(v2, R4);
v2 *= M2;
return v2 ^ ror64(v2, R5) ^ ror64(v2, R6);
}
The indication that rrmxmx with the constants M = 0x9FB21C651E98DF25, S = 28, R1 = 49, R2 = 24 can be improved is that there is a clear difference when looking at the goodness-of-fit for different constants c as described in section 1.1 above.
5. Conclusion (TL;DR)
If you have the need for a reasonably fast permutation on 64-bit words that looks quite random, this function might be useful. MurmurHash3 and Variant13 are very easy to distinguish from random permutations as shown above.
There may be better constants for MurmurHash13 than David Stafford's improvements; I have not been searching using my energy function.
There may be better constants for MurmurHash13 than David Stafford's improvements; I have not been searching using my energy function.
A. Source code
A.1 C:
#include <stdint.h>
static inline uint64_t ror64(uint64_t v, int r) {
return (v >> r) | (v << (64 - r));
}
uint64_t rrmxmx(uint64_t v) {
v ^= ror64(v, 49) ^ ror64(v, 24);
v *= 0x9FB21C651E98DF25L;
v ^= v >> 28;
v *= 0x9FB21C651E98DF25L;
return v ^ v >> 28;
}
// Code below only necessary if one wants to compute the inverse rrmxmx.
uint64_t xorRor24_49Inv(uint64_t v) {
const int rors[] = { 4, 8, 9, 11, 15, 16, 18, 20, 24,
25, 26, 29, 30, 32, 40, 41, 43, 44,
45, 48, 50, 54, 56, 57, 58, 60 };
uint64_t acc = 0;
for (int i = 0; i < sizeof(rors) / sizeof(int); i++) {
acc ^= ror64(v, rors[i]);
}
return acc ^ v;
}
static uint64_t xorShift28Inv(uint64_t v) {
return v ^ (v >> 28) ^ (v >> 56);
}
uint64_t unrrmxmx(uint64_t v) {
uint64_t x = xorShift28Inv(v) * 0x2AB9C720D1024ADL;
x = xorShift28Inv(x) * 0x2AB9C720D1024ADL;
return xorRor24_49Inv(x);
}
static inline uint64_t ror64(uint64_t v, int r) {
return (v >> r) | (v << (64 - r));
}
uint64_t rrmxmx(uint64_t v) {
v ^= ror64(v, 49) ^ ror64(v, 24);
v *= 0x9FB21C651E98DF25L;
v ^= v >> 28;
v *= 0x9FB21C651E98DF25L;
return v ^ v >> 28;
}
// Code below only necessary if one wants to compute the inverse rrmxmx.
uint64_t xorRor24_49Inv(uint64_t v) {
const int rors[] = { 4, 8, 9, 11, 15, 16, 18, 20, 24,
25, 26, 29, 30, 32, 40, 41, 43, 44,
45, 48, 50, 54, 56, 57, 58, 60 };
uint64_t acc = 0;
for (int i = 0; i < sizeof(rors) / sizeof(int); i++) {
acc ^= ror64(v, rors[i]);
}
return acc ^ v;
}
static uint64_t xorShift28Inv(uint64_t v) {
return v ^ (v >> 28) ^ (v >> 56);
}
uint64_t unrrmxmx(uint64_t v) {
uint64_t x = xorShift28Inv(v) * 0x2AB9C720D1024ADL;
x = xorShift28Inv(x) * 0x2AB9C720D1024ADL;
return xorRor24_49Inv(x);
}
A.2 Java:
public static long rrmxmx(long v) {
v ^= Long.rotateRight(v, 49) ^ Long.rotateRight(v, 24);
v *= 0x9FB21C651E98DF25L;
v ^= v >>> 28;
v *= 0x9FB21C651E98DF25L;
return v ^ v >>> 28;
}
// Code below only necessary if one wants to compute the inverse rrmxmx.
public static long unrrmxmx(long v) {
long x = xorShift28Inv(v) * 0x2AB9C720D1024ADL;
x = xorShift28Inv(x) * 0x2AB9C720D1024ADL;
return invertRor2449(x);
}
private static long invertRor2449(long v) {
int[] rots = { 4, 8, 9, 11, 15, 16, 18, 20, 24,
25, 26, 29, 30, 32, 40, 41, 43, 44,
45, 48, 50, 54, 56, 57, 58, 60 };
long acc = 0;
for (int r : rots) {
acc ^= Long.rotateRight(v, r);
}
return acc ^ v;
}
private static long xorShift28Inv(long v) {
return v ^ (v >>> 28) ^ (v >>> 56);
}
v ^= Long.rotateRight(v, 49) ^ Long.rotateRight(v, 24);
v *= 0x9FB21C651E98DF25L;
v ^= v >>> 28;
v *= 0x9FB21C651E98DF25L;
return v ^ v >>> 28;
}
// Code below only necessary if one wants to compute the inverse rrmxmx.
public static long unrrmxmx(long v) {
long x = xorShift28Inv(v) * 0x2AB9C720D1024ADL;
x = xorShift28Inv(x) * 0x2AB9C720D1024ADL;
return invertRor2449(x);
}
private static long invertRor2449(long v) {
int[] rots = { 4, 8, 9, 11, 15, 16, 18, 20, 24,
25, 26, 29, 30, 32, 40, 41, 43, 44,
45, 48, 50, 54, 56, 57, 58, 60 };
long acc = 0;
for (int r : rots) {
acc ^= Long.rotateRight(v, r);
}
return acc ^ v;
}
private static long xorShift28Inv(long v) {
return v ^ (v >>> 28) ^ (v >>> 56);
}
B. Expected output/test vectors
0x0000000000000000
|
0x0000000000000000
|
0x0000000000000000
|
0x0000000000000000
| |
0x0000000000000001
|
0x23085d6f7a569905
|
0x56ed9162154faac0
|
0x0000000000000001
| |
0x0000000000000003
|
0xcaea878c77a59454
|
0x0ec1bfbe6983c5a0
|
0x0000000000000003
| |
0x0000000000000007
|
0xa77bd5a63a7785c5
|
0x1718113ac9a1f119
|
0x0000000000000007
| |
0x0101010101010101
|
0x36cb9e821eca6c5b
|
0xfa63351a390851cd
|
0x0101010101010101
| |
0x0123456789abcdef
|
0xc337a528d7e42497
|
0x7529d4da142b1f1c
|
0x0123456789abcdef
| |
0x084c2a6e195d3b7f
|
0x507d53f1ba22542c
|
0xec3694cd1c80b9cd
|
0x084c2a6e195d3b7f
| |
0x1000000000000001
|
0xedd3f3f24766de89
|
0xdb302dae3ad882e0
|
0x1000000000000001
| |
0x1111111111111111
|
0x7547f019c63c1df3
|
0xea6d9bbf167027c9
|
0x1111111111111111
| |
0x1fffffffffffffff
|
0x05e3c8367d6677d6
|
0x7fbbf24327033cf0
|
0x1fffffffffffffff
| |
0x3fffffffffffffff
|
0x47e7c1e973d349ff
|
0x240ba915bbb5e089
|
0x3fffffffffffffff
| |
0x6666666666666666
|
0xd9c6e8c9ecd1e30a
|
0xf4b9c6565f8d9529
|
0x6666666666666666
| |
0x7777777777777777
|
0x29823cb92ada0068
|
0xdca549733043f019
|
0x7777777777777777
| |
0x7f7f7f7f7f7f7f7f
|
0xc58024da69c2eb57
|
0xf1d5238b66aaaf5e
|
0x7f7f7f7f7f7f7f7f
| |
0x7ffffffffffffff7
|
0x30c8918fcb6b2b3c
|
0x3a836e49ca560dd8
|
0x7ffffffffffffff7
| |
0x7fffffffffffffff
|
0x91b750beb6849d8f
|
0x90354478a1b6e49d
|
0x7fffffffffffffff
| |
0x8000000000000000
|
0x5e2d59ded82568fc
|
0xa0f3362cbce5bedb
|
0x8000000000000000
| |
0x8000000000000008
|
0xae03d8a5f03d42bb
|
0xed1a6dc89b6e22d2
|
0x8000000000000008
| |
0x8080808080808080
|
0x269ed61ad0d4a3ad
|
0xcf8b0a0dccbf9da9
|
0x8080808080808080
| |
0x8888888888888888
|
0x2f6af135bf8e9d79
|
0x2c50b3a1d5c7a854
|
0x8888888888888888
| |
0x9999999999999999
|
0x50a99564c864eb28
|
0x6ae2b8e14b6d3c7c
|
0x9999999999999999
| |
0xc000000000000000
|
0xf5f0f95fcd968a80
|
0x6ae70fea73bd7a6d
|
0xc000000000000000
| |
0xe000000000000000
|
0x160c347d11027361
|
0x9a3d176b24d68305
|
0xe000000000000000
| |
0xeeeeeeeeeeeeeeee
|
0x9f9714241fb64d9e
|
0x0a40b8632cad4bfa
|
0xeeeeeeeeeeeeeeee
| |
0xeffffffffffffffe
|
0x742025f2e92e6aec
|
0xf7eaaefaaa16ddb8
|
0xeffffffffffffffe
| |
0xf7b3d591e6a2c480
|
0x60f421f08a38d500
|
0xf520f63f955ac204
|
0xf7b3d591e6a2c480
| |
0xfedcba9876543210
|
0x8fec24c21c6d66de
|
0xf18dbb478c6d3943
|
0xfedcba9876543210
| |
0xfefefefefefefefe
|
0x125c8836f02c998f
|
0xe4b673f0521ad37d
|
0xfefefefefefefefe
| |
0xfffffffffffffff8
|
0x6018ed12f08b6eec
|
0x1b32e354639f82f1
|
0xfffffffffffffff8
| |
0xfffffffffffffffc
|
0x420b85f7b23fa512
|
0xe317247fad148210
|
0xfffffffffffffffc
| |
0xfffffffffffffffe
|
0xc320bdd84877d048
|
0x31c9d93c42d48cea
|
0xfffffffffffffffe
| |
0xffffffffffffffff
|
0x8bc57fddf83265bd
|
0xb694bf1eaa6682c4
|
0xffffffffffffffff
|