Mostly mangling: On the mixing functions in "Fast Splittable Pseudorandom Number Generators" , MurmurHash3 and David Stafford's improved variants on the MurmurHash3 finalizer.

fredag 13 juli 2018

On the mixing functions in "Fast Splittable Pseudorandom Number Generators" , MurmurHash3 and David Stafford's improved variants on the MurmurHash3 finalizer.

0. Introduction

For a long time I have been looking for "good" permutations on $[0, 2^{32})$ and $[0, 2^{64})$. The definition of "good" in this context is a permutation that looks like it had been chosen at random from any of the $2^{32}!$ or $2^{64}!$ such possible functions and is reasonably fast. These design criteria are somewhat vague; the identity function is just as random as any other function on the domain. If speed wasn't an issue, any half decent block cipher would fit the bill.

A straightforward (but time consuming) test of "goodness" is to feed a low-entropy source such as a simple counter into the mixing function and passing the output to a test battery for randomness.

Perhaps the most well established set of tools for testing random number generation is L'Ecuyer and Simard's TestU01. It contains three predefined sets of tests ("batteries") that any stream of random words should expect to pass without being flagged as anomalous. For a 32-bit bijection, TestU01's batteries Crush and BigCrush should detect anomalies due to the short period, $2^{32}$, regardless of how well the permutation has been chosen.

A different test-suite that has a much more hackish flavour to it is PractRand. It has a tendency to fail some generators very quickly and takes a quite different approach to testing than TestU01 does; it runs until it fails spectacularly or $2^{46}$ bytes have been inspected.

A randomly chosen permutation $f: 2^b \rightarrow 2^b$ could be expected to have (at least) the following properties ($\oplus$ denoting exclusive or):

For any $0 \le x < 2^b, 0 \le i < b, \Delta_1 = f(x) \oplus f(x \oplus 2^i)$ should have Hamming weights ("number of set bits", a.k.a. population count) that are binomially distributed, $B(b, \frac{1}{2})$.
For any $0 \le x < 2^b$ and tuples $a$ of length $t, 0 \le t < b$ and $0 \le a_k < b, a_k < a_{k + 1}, 1 \le k \le t$, let $s = \sum_{i=1}^t 2^{a_i}$ and let $\Delta_t = f(x) \oplus f(x \oplus s)$.
$\Delta_t$ should now also have Hamming weights with a $B(b, \frac{1}{2})$ distribution.
When being fed a low entropy sequence, say $x_n = n$ or something of similar simplicty, $f(x_i) \Vert f(x_{i + 1}) \Vert \ldots$ should look like a random bit sequence when the output words are concatenated.

The first property I think follows from the strict avalanche criterion, i.e. a function fulfilling the SAC also has binomially distributed Hamming weight-deltas.

The second property should follow from a function fulfilling a $t$-order SAC.

The other day I came across "Fast Splittable Pseudorandom Number Generators" [1] by Guy L. Steele Jr., Doug Lea and Christine H. Flood. Obviously, I am not Mr. Current Affairs since the paper is 4 years old by now. Anyway, I noticed that the mixing operation looked very similar to a family of functions I've been tinkering with, trying to find something that is reasonably fast and that satisfies the properties above. The mixing functions in [1] were in turn borrowed from Austin Appleby's MurmurHash3 and David Stafford's improved variants [2] on MurmurHash3.

MurmurHash3 uses a 64-bit mixer in its final step, looking like this:

uint64_t mix(uint64_t v) {
v ^= (v >> S1);
v *= M1;
v ^= (v >> S2);
v *= M2;
return v ^ (v >> S3);
}

Where S1 = S2 = S3 = 33, M1 = 0xFF51AFD7ED558CCD and M2 = 0xC4CEB9FE1A85EC53.

I can't find Appleby's description on how the search for constants was performed but in section 4 of [1]:
"Appleby used this criterion [SAC, possibly first order?] as the “energy function” in a simulated annealing process to explore a specific space of potential 8-operation bit-mixing functions (both 32-bit and 64-bit versions."

Stafford chose a different search strategy than Appleby: trying low entropy data (instead of random numbers) in his inputs to the mixing function, assuming that a function that behaves well on low entropy inputs also would be well behaved for high entropy inputs. According to his search, that assumption was correct. He does, however, not specify what the low entropy inputs were or if he measured $f(x)$ or $f(x) \oplus f(x \oplus c)$ for low entropy (single bit?) constants $c$.

The authors of [1] give a short snippet in Fig. 18, that gives a sum-of-squares [first order] avalanche statistic with randomized inputs, flipping each bit to compute $\Delta_1$ as in property 1 above. They stick to the constants from MurmurHash3 and includes Variant13 from Stafford since they conclude that they may be close to optimal with regards to $\Delta_1$.

Fig. 18 of [1] , contains a glaring bug by the way, maybe that's not the code they used when doing their search but a simple error that got introduced when formatting the code for the paper:

long x = w ^ m.mix(v ^ (1 << i));

should of course be:

long x = w ^ m.mix(v ^ (1L << i));)

Anyhow, what is nice about xor-shifting is that it's guaranteed to be bijective on $2^b$ as long as the amount shifted, $s$, is $0 < s < b$. (This can be verified by checking that the binary matrices of $(I + L^s)$ and $(I + R^s)$ have inverses.)

What is not so nice, in the context of creating mixers, is that low input Hamming weights tends to stay low, thus not mixing things as fast as we would like.

Multiplication is a really nice operation in this context; the topmost bits are influence by all bits of lower significance. As long as the constant is odd and the multiplication is done modulo $2^b$, an inverse exists and the operation is thus bijective. What's not so nice is that high input bits can't influence low order bits (when the modulus is on the form $2^b$).

Doing an interleaving of multiplying and xorshifting (to the right) is a really brilliant idea; multiplying makes all low order bits influence the top most bits. Xorshifting folds the topmost bits down to the right. And the defect that xorshifting has w.r.t. Hamming weights is mostly remedied by multiplication.

Anyway, I was looking for a better way to increase the avalanche rate but without having more than 2 multiplication/xorshift steps and figured that

x ^= ror(x, a) ^ ror(x, b)

is bijective when $0 \le x < 2^{64}, 0 < a < 64, 0 < b < 64, a \ne b$, which looked promising as a preliminary step.
I first saw a very similar construction in SHA-512, but as y = ror(x, a) ^ ror(x, b) ^ x >> c. The snippet with two rors above is the special case where the last term is omitted.

1. A better mixer

Most recently I have come up with this:

M = 0x9fb21c651e98df25L, S = 28, R1 = 49, R2 = 24

uint64_t rrmxmx(uint64_t v) {
v ^= ror64(v, R1) ^ ror64(v, R2);
v *= M;
v ^= v >> S;
v *= M;

return v ^ v >> S;
}

Before I get to the advantages of using this over MurmurHash3's finalizer or Stafford's variants, I'll describe how I found the constants.

1.1 Search strategy

I'm kind of happy that I was unaware of how MurmurHash3 worked as I have pretty much always just whipped out a hash function when I've needed one based on the observations that "multiplication is good for mangling bits upwards and xorshift right works alright for mangling downwards". For me, this has been working well in practice.

As I was looking for constants I was using a different energy criteria based on properties 1 and 2 above:
Create a series of low entropy base constants such as $1, 3, 5, 7, 11_{16}, 55_{16}, 5555_{16}$ and so on. Create all 64 bit rotations of these constants along with their bitwise complements, bit order reversed and bit order reversed and complemented. Each base constant gives rise to 4 * 64 possible constants.
Also create all 64 bit values with Hamming weights 2 and 3, and their complements. Add all the constants to a set, $C$, making sure that each constant occurs exactly once. $\#C$ is the number of constants.

In pseudocode:

Let H be a histogram with #C * 65 bins.
for i in 1, ..., n // Do n experiments
x = rnd() // Get random number on [0, 2^64)
h = f(x) // f is our mixer
for c in C
y = h XOR f(x XOR c)
p = hammingWeight(y)
cIdx = indexOf(c in C)
h[cIdx][p]' = h[cIdx][p] + 1
end for
end for

When this is done, I compute a goodness-of-fit for each $h_c$. If $f(x)$ is a good approximation of a random permutation, $h_c$ should be binomially distributed with $B(n, \frac{1}{2})$.
The goodness of fit-values are stored in an array with $\#C$ values. From this array I compute the mean, $\bar{x}$ and sample standard deviation, $s$. My energy function is simply $\bar{x} + s$. No, I don't have a good theoretical justification for this...

A sounder metric would be to truncate each histogram (bin bits $0, \ldots, k$ and $64 - k, \ldots, 64$ into a low-bin and high-bin) so that the lowest expected frequency would be around 5, then do a chi-square test and note the p-values. The p-values should now look as they were drawn from a $U(0, 1)$ distribution. One problem with this approach is that it would be insensitive to errors where the Hamming distance is very low or very high.

As can be seen from the code above, the compilation of the histogram will take time that is linearly proportional to $n\#C$. If $C$ is large, this is really slow.
To speed things up a bit, for any M, S, R1, R2, I start by using a preliminary value for $n$ as well as a "small" set for $C$. Let's call this preliminary set of constants $C_0$ (currently of size 24384)
and the extended set $C_1$ (currently of size 621408). I now test with the preliminary values. If the energy from the preliminary test is at most 1.1 times the lowest energy seen so far, I run an extensive test with a larger $n$ and $C_1$ instead of $C_0$.

When a new set of constants shows up that are better or close to as good as the old set, the search program tries the M-value, mutated first by single bit flips, then by all possible 2-bit flips and so on. For each candidate M, shifts (S) in the range 24-52 are tried. During the initial search for good constants I noticed that the error landscape isn't smooth at all. In particular decreasing or increasing S by one might lead to a worse energy even if there are shifts that are better. Hence, for each M do a complete (preliminary) test for all S in the range.

2. Results

I plugged MurmurHash3 and Stafford's Variant 13 into my search program and both fared badly. The sample means weren't so bad but the sample standard deviations were terrible. Clearly, for some constants $c$ as described above, the Hamming weights of $f(x) \oplus f(x \oplus c)$ were far from binomially distributed. When isolating the tests to only 2-bit mutations, measuring the second order avalanche, both resulted in values far from the expected 1.0.

2.1 An avalanche metric

In [1], Figure 18, this snippet (in Java) is given to compute a first order sum-of-square avalanche statistic, with the bug with 1 vs. 1L as described above corrected:

long[][] A = new long[64][64];
for (long n = 0; n < N; n++) {
long v = rng.nextLong();
long w = mix64(v);
for (int i = 0; i < 64; i++) {
long x = w ^ m.mix(v ^ (1L << i));
for (int j = 0; j < 64; j++) {
if (((x >>> j) & 1) != 0) {
A[i][j] += 1;
}
}
}
}
double sumsq = 0.0;
for (int i = 0; i < 64; i++) {
for (int j = 0; j < 64; j++) {
double v = a[j][k] - N / 2.0;
sumsq += v*v;
}
}
double result = sumsq / ((N / 4.0) * 4096.0);

The code above is extended to compute higher order avalanche statistics like this (second order example in C):

double measureAvalanche2_x(uint64_t (*h) (uint64_t), int runExp,
uint64_t inc, int complemented, int bins) {
assert(2016 % bins != 0); // choose(64, 2) == 2016
uint64_t* A = allocA(bins);
uint64_t C = (complemented ? 0xFFFFFFFFFFFFFFFFUL : 0);
uint64_t N = 1L << runExp;
for (uint64_t n = 0; n < N; n++) {
uint64_t v = n * inc;
uint64_t w = h(v);
int p = 0;
for (int i = 0; i < 64; i++) {
for(int j = i + 1; j < 64; j++) {
uint64_t x = w ^ h(v ^ (1L << i) ^ (1L << j) ^ C);
for (int k = 0; k < 64; k++) {
if (((x >> k) & 1) != 0) {
A[p * 64 + k]++;
}
}
p = (p + 1) % bins;
}
}
}
double result = sumsq(A, bins, N * (2016 / bins));
freeA(A);
return result;
}

2.2 Measurements

The number of possible $k$-bit differences in a 64-bit word is ${64\choose k}$. To be able to run the tests for many instances I wrote a CUDA-program. To create less memory contention, each thread works on a small number of bins instead of the full number. Using the extensions to 2nd, 3rd and 4th order avalanche, with $bins_{1, \ldots, 4} = \{64, 288, 217, 217\}$, here's the avalanche sum-of-squares statistic of order 1, 2, 3, 4. Inputs are the products of $a \times [0, 2^{30})$ for $\Delta_1$, $a \times [0, 2^{25})$ for $\Delta_2$, $a \times [0, 2^{20})$ for $\Delta_3$ and $a \times [0, 2^{20})$ for $\Delta_4$:

Table 1: Avalanche statistics
Mixer	a = 0x40EAD42CA1CD0131
Mixer	$\Delta_1$	$\Delta_2$	$\Delta_3$	$\Delta_4$
rrmxmx	0.975	0.992	1.039	1.005
Murmur3	1.423	11049.99	1.003	3.004
Variant13	1.008	2131.30	25.46	1.271

Note that both MurmurHash3 and Variant13 have a second order sum-of-squares avalanche statistic that is very far from what should be expected from a randomly chosen permutation. Variant13 fares better in almost all tests than the original MurmurHash3 but much worse with regard to third order avalanche.

2.3 Testing as a PRNG with TestU01

Neither MurmurHash3 or Variant13 had property 3 in the introduction above ("feeding low entropy data should render a random-looking output sequence") as a design goal. Since Variant13 and partly, MurmurHash3, are being used as the mixing functions in Java 8's SplittableRandom, they deserve more scrutiny with regard to that property.
I ran TestU01's BigCrush on my own function, called rrmxmx, MurmurHash3 and Variant13 with a simple counter, similar to how the state update is done in SplittableRandom. p-values outside $[10^{−3} , 1 − 10^{−3}]$ were flagged as "suspect" and p-values outside $[10^{−10} , 1 − 10^{−10}]$ were flagged as "failed".

TestU01 is not geared towards testing random bit streams as such but is (mostly) word oriented with 30- and 31-bit words. If one assumes that the output to be tested is a stream of bits rather than a stream of words, it's prudent to at least also feed TestU01 with all output words bit-reversed.

The data in Table 2 below are from using rrmxmx, MurmurHash3 and Variant13 to generate a pseudo random stream in this fashion:

uint64_t next(uint64_t *state, uint64_t gamma) {
uint64_t s = *state;
*state += gamma;
return mix(s);
}

This is fed to TestU01 by

static void runTestsInt(unsigned int (*prng)(void), char name[]) {
unif01_Gen* gen = unif01_CreateExternGenBits(name, prng);
bbattery_BigCrush(gen);

unif01_DeleteExternGenBits(gen);
}

Note that I use all 64 bits of output, without floating, and that this is stricter/more revealing.
TestU01 works either on 32 bit words, using 30 to 31 bits, or on 64-bit doubles. In the tests below, I split each 64-bit word in two 32-bit halves and use the halves sequentially.

The procedure used in [1] is described in Section 5 of [1]:

"TestU01 expects to test 64-bit double values, so we use 53 out of every 64 bits generated by our algorithm to generate a double value as for our method nextDouble."

This masks some serious defects in the 11 least significant bits of MurmurHash3 and Variant13 when $\gamma$ is small. Except for large $\gamma$, BigCrush generates many more suspect/failure-values than expected for MurmurHash3 and Variant13. Note that rrmxmx seems to be unaffected by small $\gamma$.
Floating in [1] is done by

(nextLong() >>> 11) * DOUBLE_ULP

thus discarding the 11 least significant bits. This practice in itself is not any better than using the least significant 53 bits, as in

(nextLong() & ((1L << 53) - 1)) * DOUBLE_ULP

but seems to be common practice anyway.
For a linear congruential generator, right shifting makes sense due to the most significant bits having a longer period than the least significant bits. For a properly designed generator, all output bits should be equally useful.

I have chosen the same number of tests as in [1], 3 runs of BigCrush for each {mixer, gamma and word reversal}-tuple. The runs for a particular {mixer, gamma, reversal}-tuple differ in the starting point chosen; $runIdx \times \gamma \times 2^{40}$, the first run having $runIdx = 0$. By chosing the starting points this way, the runs will be free from overlap since BigCrush uses about $2^{38}$ values.

In [1], the authors give a slightly low number for the expected number of "suspect" tests;

“TestU01 regards any p-value outside the range $[10^{-10}, 1 - 10^{-10}]$ as indicating “clear failure” of the tested generator. The TestU01 software flags all p-values outside the range $[10^{-3}, 1 - 10^{-3}]$; such values, if they are not clear failures, are regarded as “suspect.” For each generator variant we ran BigCrush 3 times and we report the total number of suspect p-values and clear failures over all 3 runs; each run of BigCrush performs 160 distinct tests, so 3 runs collectively perform 480 tests. With 480 tests at two-sided p = 0.001, we expect even a perfect generator to exhibit 0.96 “suspect” values on average.”

I suspect "160" and "480" are typos. Pages 148--152 of The "Compact user's guide" [3] for TestU01 list the individual tests and parameters, giving a count of 106. According to page 147 of [3], several tests give more than one p-value:

"Some of the tests compute more than one statistic using the same stream of random numbers and these statistics are thus not independent. That is why the number of statistics in the summary reports is larger than the number of tests in the description of the batteries."

Assuming the p-values are independent, with 3 runs of with 254 values each and two-sided p = 0.001, we expect a perfect generator to exhibit 1.524 “suspect” values on average.

Table 2: TestU01 BigCrush Failures
Mixer	$\gamma$ = 1				$\gamma$ = 0x55555555				$\gamma$ = 0xC45A11730CC8FFE3
	Forward		Reverse		Forward		Reverse		Forward		Reverse
	suspect	failed	suspect	failed	suspect	failed	suspect	failed	suspect	failed	suspect	failed
rrmxmx	3	0	2	0	2	0	1	0	1	0	2	0
Murmur3	6	47	9	140	10	4	14	37	1	0	1	0
Varian13	9	24	14	38	9	3	10	1	0	0	1	0

As can be seen in Table 2, both MurmurHash3 and Variant13 fails horribly for small gamma. This is the code for getting a new gamma when splitting in SplittableRandom:

private static long mixGamma(long z) {
// MurmurHash3 mix constants z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
z = (z ^ (z >>> 33)) | 1L; // force to be odd
int n = Long.bitCount(z ^ (z >>> 1)); // ensure enough transitions
return (n < 24) ? z ^ 0xaaaaaaaaaaaaaaaaL : z;
}

Compared to [1], MurmurHash3 and Variant13 have switched purposes, MurmurHash3 is used for new gammas and Variant13 is used for the prng-stream. Alas, the mixGamma() function above does not prevent bad gammas from occuring. 0x55555555 is, as can be seen by testing, a gamma that fares very badly but is still entirely possible. I do not yet know how large a proportion of the values from mixGamma() are problematic and give rise to a prng-instance that have highly undesirable properties. Melissa O'Neill made the same reflection on bad gammas before me in her blog post "Bugs in SplitMix(es)". They are trivial to construct, the question is at what frequency they are showing up in practice?

2.4 Testing as a PRNG with PractRand

I also decided to test with PractRand, V0.93, with options "-tf 2 -te 1 -tlmin 1KB". All values of $\gamma$ in Table 1 of [1] marked "poor" and "okay" fail PractRand after a very small amount of output from MurmurHash3 and Variant13. The gammas marked as "random" in Table 1 of [1] fail with MurmurHash3 as the mixing function after at most $2^{41}$ bytes. It's clear that Variant13 is more well behaved than MurmurHash3, but Variant3 fails PractRand quite rapidly for many gammas. Are the gammas marked as "random" a fluke or the norm with regards to how well behaved Variant13 is? I'll try some experiments in a blog post to come, trying to estimate the likelihood that SplittableRandom gets a bad gamma.

Table 3: Test length where PractRand fails
$\gamma$	rrmxmx	Murmur3	Variant13
0x0000000000000001	$2^{44}$	$2^{16}$	$2^{21}$
0x0000000000000003	$2^{47.71}$	$2^{18}$	$2^{22}$
0x0000000000000005	$> 2^{48}$	$2^{19}$	$2^{21}$
0x0000000000000009	$> 2^{48}$	$2^{19}$	$2^{22}$
0x0000010000000001	$2^{46.41}$	$2^{18}$	$2^{20}$
0xffffffffffffffff	$2^{45.62}$	$2^{16}$	$2^{21}$
0x0000000000ffffff	$2^{45.39}$	$2^{17}$	$2^{22}$
0xffffff0000000001	$2^{46.55}$	$2^{18}$	$2^{22}$
0x0000000000555555	$2^{47.42}$	$2^{25}$	$2^{30}$
0x1111111111110001	$> 2^{48}$	$2^{27}$	$2^{32}$
0x7777777777770001	$> 2^{48}$	$2^{28}$	$2^{34}$
0x7f7f7f7f33333333	$> 2^{48}$	$2^{31}$	$2^{33}$
0x5555550000000001	$> 2^{48}$	$2^{24}$	$2^{30}$
0xc45a11730cc8ffe3	?	$2^{39}$	$> 2^{42}$
0x2b13b77d0b289bbd	?	$2^{39}$	$> 2^{42}$
0x40ead42ca1cd0131	?	$2^{40}$	$> 2^{42}$

After $2^{43}$ bytes, PractRand has yet not complained when running Variant13 with $\gamma = 40EAD42CA1CD0131_{16}$.

rrmxmx fails at the $2^{44}$ level with $\gamma = 1$. With $\gamma = 3$, it passes PractRand up to $2^{45}$ bytes but with one test marked as suspicious at that level;

rng=RNG_stdin, seed=0xca839ae9
length= 32 terabytes (2^45 bytes), time= 506449 seconds
  Test Name Raw Processed Evaluation
  BCFN_FF(2+2):freq R= +8.3 p~= 1e-7 suspicious
  ...and 1900 test result(s) without anomalies

Considering that the test at the $2^{45}$ level took 140 hours I won't run a test at $2^{46}$ level but some caution is recommended if running rrmxmx as a PRNG with very sparse (1 or 2-bit) gammas. Currently I am running TestU01 for gammas of the form $2^n$ for $0 < n < 25$. Assuming rrmxmx behaves as a randomly chosen permutation, these gammas should pass too. One would of course never use them for a PRNG considering that they shorten the period to $2^{64 - n}$.

3. Speed

The operations in rrmxmx are identical to MurmurHash3 except for the first step; instead of a right shift and xor there are two rotates and 2 xors. With my machine (i7-5820), ILP makes the speed of rrmxmx, MurmurHash3 and Variant13 identical as far as I can measure.

4. Further work

Assuming

x ^= ror64(x, a) ^ ror64(x, b)

is just as fast as

x ^= x >> a

(On my Intel i7-5820 and on AMD Ryzen 7 1700 it is, for many other processors, it isn't.)

A search should be made to see if there could be a better version looking like this:

uint64_t rrmxmx2(uint64_t v) {
uint64_t v2 = v;
v2 ^= ror64(v2, R1) ^ ror64(v2, R2);
v2 *= M1;
v2 ^= ror64(v2, R3) ^ ror64(v2, R4);
v2 *= M2;

return v2 ^ ror64(v2, R5) ^ ror64(v2, R6);
}

The indication that rrmxmx with the constants M = 0x9FB21C651E98DF25, S = 28, R1 = 49, R2 = 24 can be improved is that there is a clear difference when looking at the goodness-of-fit for different constants c as described in section 1.1 above.

5. Conclusion (TL;DR)

If you have the need for a reasonably fast permutation on 64-bit words that looks quite random, this function might be useful. MurmurHash3 and Variant13 are very easy to distinguish from random permutations as shown above.

There may be better constants for MurmurHash13 than David Stafford's improvements; I have not been searching using my energy function.

A. Source code

A.1 C:

#include <stdint.h>

static inline uint64_t ror64(uint64_t v, int r) {
return (v >> r) | (v << (64 - r));
}

uint64_t rrmxmx(uint64_t v) {
v ^= ror64(v, 49) ^ ror64(v, 24);
v *= 0x9FB21C651E98DF25L;
v ^= v >> 28;
v *= 0x9FB21C651E98DF25L;
return v ^ v >> 28;
}

// Code below only necessary if one wants to compute the inverse rrmxmx.
uint64_t xorRor24_49Inv(uint64_t v) {
const int rors[] = { 4, 8, 9, 11, 15, 16, 18, 20, 24,
25, 26, 29, 30, 32, 40, 41, 43, 44,
45, 48, 50, 54, 56, 57, 58, 60 };
uint64_t acc = 0;
for (int i = 0; i < sizeof(rors) / sizeof(int); i++) {
acc ^= ror64(v, rors[i]);
}

return acc ^ v;
}

static uint64_t xorShift28Inv(uint64_t v) {
return v ^ (v >> 28) ^ (v >> 56);
}

uint64_t unrrmxmx(uint64_t v) {
uint64_t x = xorShift28Inv(v) * 0x2AB9C720D1024ADL;
x = xorShift28Inv(x) * 0x2AB9C720D1024ADL;
return xorRor24_49Inv(x);
}

A.2 Java:

public static long rrmxmx(long v) {
v ^= Long.rotateRight(v, 49) ^ Long.rotateRight(v, 24);
v *= 0x9FB21C651E98DF25L;
v ^= v >>> 28;
v *= 0x9FB21C651E98DF25L;
return v ^ v >>> 28;
}

// Code below only necessary if one wants to compute the inverse rrmxmx.
public static long unrrmxmx(long v) {
long x = xorShift28Inv(v) * 0x2AB9C720D1024ADL;
x = xorShift28Inv(x) * 0x2AB9C720D1024ADL;
return invertRor2449(x);
}

private static long invertRor2449(long v) {
int[] rots = { 4, 8, 9, 11, 15, 16, 18, 20, 24,
25, 26, 29, 30, 32, 40, 41, 43, 44,
45, 48, 50, 54, 56, 57, 58, 60 };
long acc = 0;
for (int r : rots) {
acc ^= Long.rotateRight(v, r);
}

return acc ^ v;
}

private static long xorShift28Inv(long v) {
return v ^ (v >>> 28) ^ (v >>> 56);
}

B. Expected output/test vectors

0x0000000000000000	0x0000000000000000	0x0000000000000000	0x0000000000000000
0x0000000000000001	0x23085d6f7a569905	0x56ed9162154faac0	0x0000000000000001
0x0000000000000003	0xcaea878c77a59454	0x0ec1bfbe6983c5a0	0x0000000000000003
0x0000000000000007	0xa77bd5a63a7785c5	0x1718113ac9a1f119	0x0000000000000007
0x0101010101010101	0x36cb9e821eca6c5b	0xfa63351a390851cd	0x0101010101010101
0x0123456789abcdef	0xc337a528d7e42497	0x7529d4da142b1f1c	0x0123456789abcdef
0x084c2a6e195d3b7f	0x507d53f1ba22542c	0xec3694cd1c80b9cd	0x084c2a6e195d3b7f
0x1000000000000001	0xedd3f3f24766de89	0xdb302dae3ad882e0	0x1000000000000001
0x1111111111111111	0x7547f019c63c1df3	0xea6d9bbf167027c9	0x1111111111111111
0x1fffffffffffffff	0x05e3c8367d6677d6	0x7fbbf24327033cf0	0x1fffffffffffffff
0x3fffffffffffffff	0x47e7c1e973d349ff	0x240ba915bbb5e089	0x3fffffffffffffff
0x6666666666666666	0xd9c6e8c9ecd1e30a	0xf4b9c6565f8d9529	0x6666666666666666
0x7777777777777777	0x29823cb92ada0068	0xdca549733043f019	0x7777777777777777
0x7f7f7f7f7f7f7f7f	0xc58024da69c2eb57	0xf1d5238b66aaaf5e	0x7f7f7f7f7f7f7f7f
0x7ffffffffffffff7	0x30c8918fcb6b2b3c	0x3a836e49ca560dd8	0x7ffffffffffffff7
0x7fffffffffffffff	0x91b750beb6849d8f	0x90354478a1b6e49d	0x7fffffffffffffff
0x8000000000000000	0x5e2d59ded82568fc	0xa0f3362cbce5bedb	0x8000000000000000
0x8000000000000008	0xae03d8a5f03d42bb	0xed1a6dc89b6e22d2	0x8000000000000008
0x8080808080808080	0x269ed61ad0d4a3ad	0xcf8b0a0dccbf9da9	0x8080808080808080
0x8888888888888888	0x2f6af135bf8e9d79	0x2c50b3a1d5c7a854	0x8888888888888888
0x9999999999999999	0x50a99564c864eb28	0x6ae2b8e14b6d3c7c	0x9999999999999999
0xc000000000000000	0xf5f0f95fcd968a80	0x6ae70fea73bd7a6d	0xc000000000000000
0xe000000000000000	0x160c347d11027361	0x9a3d176b24d68305	0xe000000000000000
0xeeeeeeeeeeeeeeee	0x9f9714241fb64d9e	0x0a40b8632cad4bfa	0xeeeeeeeeeeeeeeee
0xeffffffffffffffe	0x742025f2e92e6aec	0xf7eaaefaaa16ddb8	0xeffffffffffffffe
0xf7b3d591e6a2c480	0x60f421f08a38d500	0xf520f63f955ac204	0xf7b3d591e6a2c480
0xfedcba9876543210	0x8fec24c21c6d66de	0xf18dbb478c6d3943	0xfedcba9876543210
0xfefefefefefefefe	0x125c8836f02c998f	0xe4b673f0521ad37d	0xfefefefefefefefe
0xfffffffffffffff8	0x6018ed12f08b6eec	0x1b32e354639f82f1	0xfffffffffffffff8
0xfffffffffffffffc	0x420b85f7b23fa512	0xe317247fad148210	0xfffffffffffffffc
0xfffffffffffffffe	0xc320bdd84877d048	0x31c9d93c42d48cea	0xfffffffffffffffe
0xffffffffffffffff	0x8bc57fddf83265bd	0xb694bf1eaa6682c4	0xffffffffffffffff

22 kommentarer:

cherio14 september 2018 kl. 10:40
This is a great writeup. I am having a terrible feeling the question I am about to ask may sound ridiculously stupid. Why is function "y = x ^ ror(x, a) ^ ror(x, b)" bijective? A simple case when x = 101, a = 1, b = 2 results in y = 000; similarly when x = 110 or x = 011 the function also produces y = 000. I am almost convinced I am missing something
SvaraRadera
Svar
Tommy Ettinger22 januari 2019 kl. 13:52
I found an interesting result based on this post, using xor-rotate-rotate but omitting one of the xorshifts. In this code, rotate64 is a left-rotation defined in a macro:

uint64_t state = UINT64_C(0xEB5F2AE4); // any should work, this was tested
uint64_t random64() {
uint64_t z = (++state ^ UINT64_C(0xDB4F0B9175AE2165)) * UINT64_C(0x4823A80B2006E21B);
z = (z ^ rotate64(z, 52) ^ rotate64(z, 21) ^ UINT64_C(0x9E3779B97F4A7C15)) * UINT64_C(0x81383173);
return z ^ z >> 28;
}

(I probably made some elementary C++ mistakes in the UINT64_C usage.) This is one of very few gamma=1 unary hashes I have tested that passes 32TB of PractRand, and is probably the fastest of the ones I have found due to omitting one of the xorshift-multiply lines. Any multiplications use an XLCG's step of a XOR with a constant that has the last three bits 101, then a a multiplication by a constant with last three bits 011 (this makes an input of 0 produce a non-zero output). The initial step acts like a multiplication by a large gamma, but the XOR preceding it means that the modular multiplicative inverse of the gamma can't be used accidentally as the stride between inputs. I don't know why this set of rotation amounts, xors, shifts, and multiplies works but most others, when substituted in, don't. Some multipliers were chosen for having a low population count, in the hope that they would be faster. The XOR constants are 2 to the 64 divided by generalizations of the golden ratio, with the last bit adjusted to be odd. I hope this hash can be useful or interesting!
SvaraRadera
Svar
orlp4 juli 2019 kl. 07:10
Den här kommentaren har tagits bort av skribenten.
SvaraRadera
Svar
orlp4 juli 2019 kl. 07:11
Hey Pelle, interesting work. Do you also have variants for 32-bit mixing?
SvaraRadera
Svar
Alex22 januari 2023 kl. 14:38
Hello, you're doing a great job and I enjoyed reading your block. What license do you publish your rmxmx code under? Can I use it in my projects?
SvaraRadera
Svar
John K6 januari 2025 kl. 07:27
Pelle,

Will you expand on "Create a series of low entropy base constants such as 1,3,5,7,1116,5516,555516 and so on" or perhaps just throw your constant file up on Github? It would be great to have a fully defined testing methodology and the constants are the last bit of secret sauce.

Even if not, thank you for writing as much as you did. It's an excellent reference.
SvaraRadera
Svar
Pelle Evensen8 januari 2025 kl. 09:00
Den här kommentaren har tagits bort av skribenten.
SvaraRadera
Svar
Daniel Philip Watson23 januari 2026 kl. 20:37
what speedup did you get with the cuda program?
SvaraRadera
Svar

Lägg till kommentar

Mostly mangling