fredag 13 juli 2018

On the mixing functions in "Fast Splittable Pseudorandom Number Generators" , MurmurHash3 and David Stafford's improved variants on the MurmurHash3 finalizer.


0. Introduction

For a long time I have been looking for "good" permutations on $[0, 2^{32})$ and $[0, 2^{64})$. The definition of "good" in this context is a permutation that looks like it had been chosen at random from any of the $2^{32}!$ or $2^{64}!$ such possible functions and is reasonably fast. These design criteria are somewhat vague; the identity function is just as random as any other function on the domain. If speed wasn't an issue, any half decent block cipher would fit the bill.

A straightforward (but time consuming) test of "goodness" is to feed a low-entropy source such as a simple counter into the mixing function and passing the output to a test battery for randomness.

Perhaps the most well established set of tools for testing random number generation is L'Ecuyer and Simard's TestU01. It contains three predefined sets of tests ("batteries") that any stream of random words should expect to pass without being flagged as anomalous. For a 32-bit bijection, TestU01's batteries Crush and BigCrush should detect anomalies due to the short period, $2^{32}$, regardless of how well the permutation has been chosen.

A different test-suite that has a much more hackish flavour to it is PractRand. It has a tendency to fail some generators very quickly and takes a quite different approach to testing than TestU01 does; it runs until it fails spectacularly or $2^{46}$ bytes have been inspected.

A randomly chosen permutation $f: 2^b \rightarrow 2^b$ could be expected to have (at least) the following properties ($\oplus$ denoting exclusive or):
  1. For any $0 \le x < 2^b, 0 \le i < b, \Delta_1 = f(x) \oplus f(x \oplus 2^i)$ should have Hamming weights ("number of set bits", a.k.a. population count) that are binomially distributed, $B(b, \frac{1}{2})$.
  2. For any $0 \le x < 2^b$ and tuples $a$ of length $t, 0 \le t < b$ and $0 \le a_k < b, a_k < a_{k + 1}, 1 \le k \le t$, let $s = \sum_{i=1}^t 2^{a_i}$ and let $\Delta_t = f(x) \oplus f(x \oplus s)$.
    $\Delta_t$ should now also have Hamming weights with a $B(b, \frac{1}{2})$ distribution.
  3. When being fed a low entropy sequence, say $x_n = n$ or something of similar simplicty, $f(x_i) \Vert f(x_{i + 1}) \Vert \ldots$ should look like a random bit sequence when the output words are concatenated. 
The first property I think follows from the strict avalanche criterion, i.e. a function fulfilling the SAC also has binomially distributed Hamming weight-deltas.
The second property should follow from a function fulfilling a $t$-order SAC.

The other day I came across "Fast Splittable Pseudorandom Number Generators" [1] by Guy L. Steele Jr., Doug Lea and Christine H. Flood. Obviously, I am not Mr. Current Affairs since the paper is 4 years old by now. Anyway, I noticed that the mixing operation looked very similar to a family of functions I've been tinkering with, trying to find something that is reasonably fast and that satisfies the properties above. The mixing functions in [1] were in turn borrowed from Austin Appleby's MurmurHash3 and David Stafford's improved variants [2] on MurmurHash3.

MurmurHash3 uses a 64-bit mixer in its final step, looking like this:
uint64_t mix(uint64_t v) {
  v ^= (v >> S1);
  v *= M1;
  v ^= (v >> S2);
  v *= M2;
  return v ^ (v >> S3);
}
Where S1 = S2 = S3 = 33, M1 = 0xFF51AFD7ED558CCD and M2 = 0xC4CEB9FE1A85EC53.
I can't find Appleby's description on how the search for constants was performed but in section 4 of [1]:
"Appleby used this criterion [SAC, possibly first order?] as the “energy function” in a simulated annealing process to explore a specific space of potential 8-operation bit-mixing functions (both 32-bit and 64-bit versions."

Stafford chose a different search strategy than Appleby: trying low entropy data (instead of random numbers) in his inputs to the mixing function, assuming that a function that behaves well on low entropy inputs also would be well behaved for high entropy inputs. According to his search, that assumption was correct. He does, however, not specify what the low entropy inputs were or if he measured $f(x)$ or $f(x) \oplus f(x \oplus c)$ for low entropy (single bit?) constants $c$.

The authors of [1] give a short snippet in Fig. 18, that gives a sum-of-squares [first order] avalanche statistic with randomized inputs, flipping each bit to compute $\Delta_1$ as in property 1 above. They stick to the constants from MurmurHash3 and includes Variant13 from Stafford since they conclude that they may be close to optimal with regards to $\Delta_1$.

Fig. 18 of [1] , contains a glaring bug by the way, maybe that's not the code they used when doing their search but a simple error that got introduced when formatting the code for the paper:
  long x = w ^ m.mix(v ^ (1 << i));
should of course be:
  long x = w ^ m.mix(v ^ (1L << i));)

Anyhow, what is nice about xor-shifting is that it's guaranteed to be bijective on $2^b$ as long as the amount shifted, $s$, is $0 < s < b$. (This can be verified by checking that the binary matrices of $(I + L^s)$ and $(I + R^s)$ have inverses.)

What is not so nice, in the context of creating mixers, is that low input Hamming weights tends to stay low, thus not mixing things as fast as we would like.

Multiplication is a really nice operation in this context; the topmost bits are influence by all bits of lower significance.  As long as the constant is odd and the multiplication is done modulo $2^b$, an inverse exists and the operation is thus bijective. What's not so nice is that high input bits can't influence low order bits (when the modulus is on the form $2^b$).

Doing an interleaving of multiplying and xorshifting (to the right) is a really brilliant idea; multiplying makes all low order bits influence the top most bits. Xorshifting folds the topmost bits down to the right. And the defect that xorshifting has w.r.t. Hamming weights is mostly remedied by multiplication.

Anyway, I was looking for a better way to increase the avalanche rate but without having more than 2 multiplication/xorshift steps and figured that
x ^= ror(x, a) ^ ror(x, b)
is bijective when $0 \le x < 2^{64}, 0 < a < 64, 0 < b < 64, a \ne b$, which looked promising as a preliminary step.
I first saw a very similar construction in SHA-512, but as y = ror(x, a) ^ ror(x, b) ^ x >> c. The snippet with two rors above is the special case where the last term is omitted.

1. A better mixer

Most recently I have come up with this:
M = 0x9fb21c651e98df25L, S = 28, R1 = 49, R2 = 24
uint64_t rrmxmx(uint64_t v) {
  v ^= ror64(v, R1) ^ ror64(v, R2);
  v *= M;
  v ^= v >> S;
  v *= M;

  return v ^ v >> S;
}

Before I get to the advantages of using this over MurmurHash3's finalizer or Stafford's variants, I'll describe how I found the constants.

1.1 Search strategy

I'm kind of happy that I was unaware of how MurmurHash3 worked as I have pretty much always just whipped out a hash function when I've needed one based on the observations that "multiplication is good for mangling bits upwards and xorshift right works alright for mangling downwards". For me, this has been working well in practice.

As I was looking for constants I was using a different energy criteria based on properties 1 and 2 above:
Create a series of low entropy base constants such as $1, 3, 5, 7, 11_{16}, 55_{16}, 5555_{16}$ and so on. Create all 64 bit rotations of these constants along with their bitwise complements, bit order reversed and bit order reversed and complemented. Each base constant gives rise to 4 * 64 possible constants.
Also create all 64 bit values with Hamming weights 2 and 3, and their complements. Add all the constants to a set, $C$, making sure that each constant occurs exactly once. $\#C$ is the number of constants.


In pseudocode:
Let H be a histogram with #C * 65 bins.
for i in 1, ..., n // Do n experiments
  x = rnd() // Get random number on [0, 2^64)
  h = f(x) // f is our mixer
  for c in C
    y = h XOR f(x XOR c)
    p = hammingWeight(y)
    cIdx = indexOf(c in C)
    h[cIdx][p]' = h[cIdx][p] + 1
  end for
end for

When this is done, I compute a goodness-of-fit for each $h_c$. If $f(x)$ is a good approximation of a random permutation, $h_c$ should be binomially distributed with $B(n, \frac{1}{2})$.
The goodness of fit-values are stored in an array with $\#C$ values. From this array I compute the mean, $\bar{x}$ and sample standard deviation, $s$. My energy function is simply $\bar{x} + s$. No, I don't have a good theoretical justification for this...

A sounder metric would be to truncate each histogram (bin bits $0, \ldots, k$ and $64 - k, \ldots, 64$ into a low-bin and high-bin) so that the lowest expected frequency would be around 5, then do a chi-square test and note the p-values. The p-values should now look as they were drawn from a $U(0, 1)$ distribution. One problem with this approach is that it would be insensitive to errors where the Hamming distance is very low or very high.

As can be seen from the code above, the compilation of the histogram will take time that is linearly proportional to $n\#C$. If $C$ is large, this is really slow.
To speed things up a bit, for any M, S, R1, R2, I start by using a preliminary value for $n$ as well as a "small" set for $C$. Let's call this preliminary set of constants $C_0$ (currently of size 24384)
and the extended set $C_1$ (currently of size 621408). I now test with the preliminary values. If the energy from the preliminary test is at most 1.1 times the lowest energy seen so far, I run an extensive test with a larger $n$ and $C_1$ instead of $C_0$.

When a new set of constants shows up that are better or close to as good as the old set, the search program tries the M-value, mutated first by single bit flips, then by all possible 2-bit flips and so on. For each candidate M, shifts (S) in the range 24-52 are tried. During the initial search for good constants I noticed that the error landscape isn't smooth at all. In particular decreasing or increasing S by one might lead to a worse energy even if there are shifts that are better. Hence, for each M do a complete (preliminary) test for all S in the range.

2. Results

I plugged MurmurHash3 and Stafford's Variant 13 into my search program and both fared badly. The sample means weren't so bad but the sample standard deviations were terrible. Clearly, for some constants $c$ as described above, the Hamming weights of $f(x) \oplus f(x \oplus c)$ were far from binomially distributed. When isolating the tests to only 2-bit mutations, measuring the second order avalanche, both resulted in values far from the expected 1.0.

2.1 An avalanche metric

In [1], Figure 18, this snippet (in Java) is given to compute a first order sum-of-square avalanche statistic, with the bug with 1 vs. 1L as described above corrected:
long[][] A = new long[64][64];
for (long n = 0; n < N; n++) {
  long v = rng.nextLong();
  long w = mix64(v);
  for (int i = 0; i < 64; i++) {
    long x = w ^ m.mix(v ^ (1L << i));
    for (int j = 0; j < 64; j++) {
      if (((x >>> j) & 1) != 0) {
        A[i][j] += 1;
      }
    }
  }
}
double sumsq = 0.0;
for (int i = 0; i < 64; i++) {
  for (int j = 0; j < 64; j++) {
    double v = a[j][k] - N / 2.0;
    sumsq += v*v;
  }
}
double result = sumsq / ((N / 4.0) * 4096.0);
The code above is extended to compute higher order avalanche statistics like this (second order example in C):
double measureAvalanche2_x(uint64_t (*h) (uint64_t), int runExp,
    uint64_t inc, int complemented, int bins) {
  assert(2016 % bins != 0); // choose(64, 2) == 2016
  uint64_t* A = allocA(bins);
  uint64_t C = (complemented ? 0xFFFFFFFFFFFFFFFFUL : 0);
  uint64_t N = 1L << runExp;
  for (uint64_t n = 0; n < N; n++) {
    uint64_t v = n * inc;
    uint64_t w = h(v);
    int p = 0;
    for (int i = 0; i < 64; i++) {
      for(int j = i + 1; j < 64; j++) {
        uint64_t x = w ^ h(v ^ (1L << i) ^ (1L << j) ^ C);
        for (int k = 0; k < 64; k++) {
          if (((x >> k) & 1) != 0) {
            A[p * 64 + k]++;
          }
        }
        p = (p + 1) % bins;
      }
    }
  }
  double result = sumsq(A, bins, N * (2016 / bins));
  freeA(A);
  return result;
}

2.2 Measurements

The number of possible $k$-bit differences in a 64-bit word is ${64\choose k}$. To be able to run the tests for many instances I wrote a CUDA-program. To create less memory contention, each thread works on a small number of bins instead of the full number. Using the extensions to 2nd, 3rd and 4th order avalanche, with $bins_{1, \ldots, 4} = \{64, 288, 217, 217\}$, here's the avalanche sum-of-squares statistic of order 1, 2, 3, 4. Inputs are the products of $a \times [0, 2^{30})$ for $\Delta_1$, $a \times [0, 2^{25})$ for $\Delta_2$, $a \times [0, 2^{20})$ for $\Delta_3$ and $a \times [0, 2^{20})$ for $\Delta_4$:
Table 1: Avalanche statistics
Mixer a = 0x40EAD42CA1CD0131
$\Delta_1$ $\Delta_2$ $\Delta_3$ $\Delta_4$
rrmxmx 0.975 0.992 1.039 1.005
Murmur3 1.423 11049.99 1.003 3.004
Variant13 1.008 2131.30 25.46 1.271

Note that both MurmurHash3 and Variant13 have a second order sum-of-squares avalanche statistic that is very far from what should be expected from a randomly chosen permutation. Variant13 fares better in almost all tests than the original MurmurHash3 but much worse with regard to third order avalanche.

2.3 Testing as a PRNG with TestU01

Neither MurmurHash3 or Variant13 had property 3 in the introduction above ("feeding low entropy data should render a random-looking output sequence") as a design goal. Since Variant13 and partly, MurmurHash3, are being used as the mixing functions in Java 8's SplittableRandom, they deserve more scrutiny with regard to that property.
I ran TestU01's BigCrush on my own function, called rrmxmx, MurmurHash3 and Variant13 with a simple counter, similar to how the state update is done in SplittableRandom. p-values outside $[10^{−3} , 1 − 10^{−3}]$ were flagged as "suspect" and p-values outside $[10^{−10} , 1 − 10^{−10}]$ were flagged as "failed".

TestU01 is not geared towards testing random bit streams as such but is (mostly) word oriented with 30- and 31-bit words. If one assumes that the output to be tested is a stream of bits rather than a stream of words, it's prudent to at least also feed TestU01 with all output words bit-reversed.

The data in Table 2 below are from using rrmxmx, MurmurHash3 and Variant13 to generate a pseudo random stream in this fashion:
uint64_t next(uint64_t *state, uint64_t gamma) {
  uint64_t s = *state;
  *state += gamma;
  return mix(s);
}
This is fed to TestU01 by
static void runTestsInt(unsigned int (*prng)(void), char name[]) {
  unif01_Gen* gen = unif01_CreateExternGenBits(name, prng);
  bbattery_BigCrush(gen);

  unif01_DeleteExternGenBits(gen);
}
Note that I use all 64 bits of output, without floating, and that this is stricter/more revealing.
TestU01 works either on 32 bit words, using 30 to 31 bits, or on 64-bit doubles. In the tests below, I split each 64-bit word in two 32-bit halves and use the halves sequentially.

The procedure used in [1] is described in Section 5 of [1]:
"TestU01 expects to test 64-bit double values, so we use 53 out of every 64 bits generated by our algorithm to generate a double value as for our method nextDouble."
This masks some serious defects in the 11 least significant bits of MurmurHash3 and Variant13 when $\gamma$ is small. Except for large $\gamma$, BigCrush generates many more suspect/failure-values than expected for MurmurHash3 and Variant13. Note that rrmxmx seems to be unaffected by small $\gamma$.
Floating in [1] is done by
(nextLong() >>> 11) * DOUBLE_ULP
thus discarding the 11 least significant bits. This practice in itself is not any better than using the least significant 53 bits, as in
(nextLong() & ((1L << 53) - 1)) * DOUBLE_ULP
but seems to be common practice anyway.
For a linear congruential generator, right shifting makes sense due to the most significant bits having a longer period than the least significant bits. For a properly designed generator, all output bits should be equally useful.

I have chosen the same number of tests as in [1], 3 runs of BigCrush for each {mixer, gamma and word reversal}-tuple. The runs for a particular {mixer, gamma, reversal}-tuple differ in the starting point chosen; $runIdx \times \gamma \times 2^{40}$, the first run having $runIdx = 0$. By chosing the starting points this way, the runs will be free from overlap since BigCrush uses about $2^{38}$ values.

In [1], the authors give a slightly low number for the expected number of "suspect" tests;
“TestU01 regards any p-value outside the range $[10^{-10}, 1 - 10^{-10}]$ as indicating “clear failure” of the tested generator. The TestU01 software flags all p-values outside the range $[10^{-3}, 1 - 10^{-3}]$; such values, if they are not clear failures, are regarded as “suspect.” For each generator variant we ran BigCrush 3 times and we report the total number of suspect p-values and clear failures over all 3 runs; each run of BigCrush performs 160 distinct tests, so 3 runs collectively perform 480 tests. With 480 tests at two-sided p = 0.001, we expect even a perfect generator to exhibit 0.96 “suspect” values on average.”
I suspect "160" and "480" are typos. Pages 148--152 of The "Compact user's guide" [3] for TestU01 list the individual tests and parameters, giving a count of 106. According to page 147 of [3], several tests give more than one p-value:
"Some of the tests compute more than one statistic using the same stream of random numbers and these statistics are thus not independent. That is why the number of statistics in the summary reports is larger than the number of tests in the description of the batteries."
Assuming the p-values are independent, with 3 runs of with 254 values each and two-sided p = 0.001, we expect a perfect generator to exhibit 1.524 “suspect” values on average.
Table 2: TestU01 BigCrush Failures
Mixer $\gamma$ = 1 $\gamma$ = 0x55555555 $\gamma$ = 0xC45A11730CC8FFE3
Forward Reverse Forward Reverse Forward Reverse
suspect failed suspect failed suspect failed suspect failed suspect failed suspect failed
rrmxmx 3 0 2 0 2 0 1 0 1 0 2 0
Murmur3 6 47 9 140 10 4 14 37 1 0 1 0
Varian13 9 24 14 38 9 3 10 1 0 0 1 0

As can be seen in Table 2, both MurmurHash3 and Variant13 fails horribly for small gamma. This is the code for getting a new gamma when splitting in SplittableRandom:
private static long mixGamma(long z) {
 // MurmurHash3 mix constants   z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
  z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
  z = (z ^ (z >>> 33)) | 1L; // force to be odd
  int n = Long.bitCount(z ^ (z >>> 1)); // ensure enough transitions
  return (n < 24) ? z ^ 0xaaaaaaaaaaaaaaaaL : z;
}
Compared to [1], MurmurHash3 and Variant13 have switched purposes, MurmurHash3 is used for new gammas and Variant13 is used for the prng-stream. Alas, the mixGamma() function above does not prevent bad gammas from occuring. 0x55555555 is, as can be seen by testing, a gamma that fares very badly but is still entirely possible. I do not yet know how large a proportion of the values from mixGamma() are problematic and give rise to a prng-instance that have highly undesirable properties. Melissa O'Neill made the same reflection on bad gammas before me in her blog post "Bugs in SplitMix(es)". They are trivial to construct, the question is at what frequency they are showing up in practice?

2.4 Testing as a PRNG with PractRand

I also decided to test with PractRand, V0.93, with options "-tf 2 -te 1 -tlmin 1KB". All values of $\gamma$ in Table 1 of [1] marked "poor" and "okay" fail PractRand after a very small amount of output from MurmurHash3 and Variant13. The gammas marked as "random" in Table 1 of [1] fail with MurmurHash3 as the mixing function after at most $2^{41}$ bytes. It's clear that Variant13 is more well behaved than MurmurHash3, but Variant3 fails PractRand quite rapidly for many gammas. Are the gammas marked as "random" a fluke or the norm with regards to how well behaved Variant13 is? I'll try some experiments in a blog post to come, trying to estimate the likelihood that SplittableRandom gets a bad gamma.

Table 3: Test length where PractRand fails
$\gamma$ rrmxmx Murmur3 Variant13
0x0000000000000001
$2^{44}$ $2^{16}$ $2^{21}$
0x0000000000000003
$2^{47.71}$ $2^{18}$ $2^{22}$
0x0000000000000005
$> 2^{48}$ $2^{19}$ $2^{21}$
0x0000000000000009
$> 2^{48}$ $2^{19}$ $2^{22}$
0x0000010000000001
$2^{46.41}$ $2^{18}$ $2^{20}$
0xffffffffffffffff
$2^{45.62}$ $2^{16}$ $2^{21}$
0x0000000000ffffff
$2^{45.39}$ $2^{17}$ $2^{22}$
0xffffff0000000001
$2^{46.55}$ $2^{18}$ $2^{22}$
0x0000000000555555
$2^{47.42}$ $2^{25}$ $2^{30}$
0x1111111111110001
$> 2^{48}$ $2^{27}$ $2^{32}$
0x7777777777770001
$> 2^{48}$ $2^{28}$ $2^{34}$
0x7f7f7f7f33333333
$> 2^{48}$ $2^{31}$ $2^{33}$
0x5555550000000001
$> 2^{48}$ $2^{24}$ $2^{30}$
0xc45a11730cc8ffe3
? $2^{39}$ $> 2^{42}$
0x2b13b77d0b289bbd
? $2^{39}$ $> 2^{42}$
0x40ead42ca1cd0131
? $2^{40}$ $> 2^{42}$
After $2^{43}$ bytes, PractRand has yet not complained when running Variant13 with $\gamma = 40EAD42CA1CD0131_{16}$.

rrmxmx fails at the $2^{44}$ level with $\gamma = 1$. With $\gamma = 3$, it passes PractRand up to $2^{45}$ bytes but with one test marked as suspicious at that level;
rng=RNG_stdin, seed=0xca839ae9
length= 32 terabytes (2^45 bytes), time= 506449 seconds
  Test Name Raw Processed Evaluation
  BCFN_FF(2+2):freq R= +8.3 p~= 1e-7 suspicious
  ...and 1900 test result(s) without anomalies
Considering that the test at the $2^{45}$ level took 140 hours I won't run a test at $2^{46}$ level but some caution is recommended if running rrmxmx as a PRNG with very sparse (1 or 2-bit) gammas. Currently I am running TestU01 for gammas of the form $2^n$ for $0 < n < 25$. Assuming rrmxmx behaves as a randomly chosen permutation, these gammas should pass too. One would of course never use them for a PRNG considering that they shorten the period to $2^{64 - n}$.

3. Speed

The operations in rrmxmx are identical to MurmurHash3 except for the first step; instead of a right shift and xor there are two rotates and 2 xors. With my machine (i7-5820), ILP makes the speed of rrmxmx, MurmurHash3 and Variant13 identical as far as I can measure.

4. Further work

Assuming
x ^= ror64(x, a) ^ ror64(x, b)
is just as fast as
x ^= x >> a
(On my Intel i7-5820 and on AMD Ryzen 7 1700 it is, for many other processors, it isn't.)
A search should be made to see if there could be a better version looking like this:
uint64_t rrmxmx2(uint64_t v) {
  uint64_t v2 = v;
  v2 ^= ror64(v2, R1) ^ ror64(v2, R2);
  v2 *= M1;
  v2 ^= ror64(v2, R3) ^ ror64(v2, R4);
  v2 *= M2;

  return v2 ^ ror64(v2, R5) ^ ror64(v2, R6);
}
The indication that rrmxmx with the constants M = 0x9FB21C651E98DF25, S = 28, R1 = 49, R2 = 24 can be improved is that there is a clear difference when looking at the goodness-of-fit for different constants c as described in section 1.1 above.

5. Conclusion (TL;DR)

If you have the need for a reasonably fast permutation on 64-bit words that looks quite random, this function might be useful. MurmurHash3 and Variant13 are very easy to distinguish from random permutations as shown above.

There may be better constants for MurmurHash13 than David Stafford's improvements; I have not been searching using my energy function.

A. Source code

A.1 C:

#include <stdint.h>

static inline uint64_t ror64(uint64_t v, int r) {
    return (v >> r) | (v << (64 - r));
}

uint64_t rrmxmx(uint64_t v) {
    v ^= ror64(v, 49) ^ ror64(v, 24);
    v *= 0x9FB21C651E98DF25L;
    v ^= v >> 28;
    v *= 0x9FB21C651E98DF25L;
    return v ^ v >> 28;
}

// Code below only necessary if one wants to compute the inverse rrmxmx.
uint64_t xorRor24_49Inv(uint64_t v) {
    const int rors[] = { 4, 8, 9, 11, 15, 16, 18, 20, 24,
        25, 26, 29, 30, 32, 40, 41, 43, 44,
        45, 48, 50, 54, 56, 57, 58, 60 };
    uint64_t acc = 0;
    for (int i = 0; i < sizeof(rors) / sizeof(int); i++) {
       acc ^= ror64(v, rors[i]);
    }

    return acc ^ v;
}

static uint64_t xorShift28Inv(uint64_t v) {
    return v ^ (v >> 28) ^ (v >> 56);
}

uint64_t unrrmxmx(uint64_t v) {
    uint64_t x = xorShift28Inv(v) * 0x2AB9C720D1024ADL;
    x = xorShift28Inv(x) * 0x2AB9C720D1024ADL;
    return xorRor24_49Inv(x);
}

A.2 Java:

public static long rrmxmx(long v) {
    v ^= Long.rotateRight(v, 49) ^ Long.rotateRight(v, 24);
    v *= 0x9FB21C651E98DF25L;
    v ^= v >>> 28;
    v *= 0x9FB21C651E98DF25L;
    return v ^ v >>> 28;
}

// Code below only necessary if one wants to compute the inverse rrmxmx.
public static long unrrmxmx(long v) {
    long x = xorShift28Inv(v) * 0x2AB9C720D1024ADL;
    x = xorShift28Inv(x) * 0x2AB9C720D1024ADL;
    return invertRor2449(x);
}

private static long invertRor2449(long v) {
    int[] rots = { 4, 8, 9, 11, 15, 16, 18, 20, 24,
        25, 26, 29, 30, 32, 40, 41, 43, 44,
        45, 48, 50, 54, 56, 57, 58, 60 };
    long acc = 0;
    for (int r : rots) {
        acc ^= Long.rotateRight(v, r);
    }

    return acc ^ v;
}

private static long xorShift28Inv(long v) {
    return v ^ (v >>> 28) ^ (v >>> 56);
}

B. Expected output/test vectors

0x0000000000000000
0x0000000000000000
0x0000000000000000
0x0000000000000000
0x0000000000000001
0x23085d6f7a569905
0x56ed9162154faac0
0x0000000000000001
0x0000000000000003
0xcaea878c77a59454
0x0ec1bfbe6983c5a0
0x0000000000000003
0x0000000000000007
0xa77bd5a63a7785c5
0x1718113ac9a1f119
0x0000000000000007
0x0101010101010101
0x36cb9e821eca6c5b
0xfa63351a390851cd
0x0101010101010101
0x0123456789abcdef
0xc337a528d7e42497
0x7529d4da142b1f1c
0x0123456789abcdef
0x084c2a6e195d3b7f
0x507d53f1ba22542c
0xec3694cd1c80b9cd
0x084c2a6e195d3b7f
0x1000000000000001
0xedd3f3f24766de89
0xdb302dae3ad882e0
0x1000000000000001
0x1111111111111111
0x7547f019c63c1df3
0xea6d9bbf167027c9
0x1111111111111111
0x1fffffffffffffff
0x05e3c8367d6677d6
0x7fbbf24327033cf0
0x1fffffffffffffff
0x3fffffffffffffff
0x47e7c1e973d349ff
0x240ba915bbb5e089
0x3fffffffffffffff
0x6666666666666666
0xd9c6e8c9ecd1e30a
0xf4b9c6565f8d9529
0x6666666666666666
0x7777777777777777
0x29823cb92ada0068
0xdca549733043f019
0x7777777777777777
0x7f7f7f7f7f7f7f7f
0xc58024da69c2eb57
0xf1d5238b66aaaf5e
0x7f7f7f7f7f7f7f7f
0x7ffffffffffffff7
0x30c8918fcb6b2b3c
0x3a836e49ca560dd8
0x7ffffffffffffff7
0x7fffffffffffffff
0x91b750beb6849d8f
0x90354478a1b6e49d
0x7fffffffffffffff
0x8000000000000000
0x5e2d59ded82568fc
0xa0f3362cbce5bedb
0x8000000000000000
0x8000000000000008
0xae03d8a5f03d42bb
0xed1a6dc89b6e22d2
0x8000000000000008
0x8080808080808080
0x269ed61ad0d4a3ad
0xcf8b0a0dccbf9da9
0x8080808080808080
0x8888888888888888
0x2f6af135bf8e9d79
0x2c50b3a1d5c7a854
0x8888888888888888
0x9999999999999999
0x50a99564c864eb28
0x6ae2b8e14b6d3c7c
0x9999999999999999
0xc000000000000000
0xf5f0f95fcd968a80
0x6ae70fea73bd7a6d
0xc000000000000000
0xe000000000000000
0x160c347d11027361
0x9a3d176b24d68305
0xe000000000000000
0xeeeeeeeeeeeeeeee
0x9f9714241fb64d9e
0x0a40b8632cad4bfa
0xeeeeeeeeeeeeeeee
0xeffffffffffffffe
0x742025f2e92e6aec
0xf7eaaefaaa16ddb8
0xeffffffffffffffe
0xf7b3d591e6a2c480
0x60f421f08a38d500
0xf520f63f955ac204
0xf7b3d591e6a2c480
0xfedcba9876543210
0x8fec24c21c6d66de
0xf18dbb478c6d3943
0xfedcba9876543210
0xfefefefefefefefe
0x125c8836f02c998f
0xe4b673f0521ad37d
0xfefefefefefefefe
0xfffffffffffffff8
0x6018ed12f08b6eec
0x1b32e354639f82f1
0xfffffffffffffff8
0xfffffffffffffffc
0x420b85f7b23fa512
0xe317247fad148210
0xfffffffffffffffc
0xfffffffffffffffe
0xc320bdd84877d048
0x31c9d93c42d48cea
0xfffffffffffffffe
0xffffffffffffffff
0x8bc57fddf83265bd
0xb694bf1eaa6682c4
0xffffffffffffffff

20 kommentarer:

  1. This is a great writeup. I am having a terrible feeling the question I am about to ask may sound ridiculously stupid. Why is function "y = x ^ ror(x, a) ^ ror(x, b)" bijective? A simple case when x = 101, a = 1, b = 2 results in y = 000; similarly when x = 110 or x = 011 the function also produces y = 000. I am almost convinced I am missing something

    SvaraRadera
    Svar
    1. I assume you mean 101 in binary, 5 in (hexa)decimal.
      First case, x = 101_2:
      ror64(0x5, 1) == 0x8000000000000002
      ror64(0x5, 2) == 0x4000000000000001

      0x0000000000000005
      0x8000000000000002
      ^ 0x4000000000000001
      --------------------
      0xC000000000000008

      Second case, x = 110_2
      ror64(0x6, 1) == 0x0000000000000003
      ror64(0x6, 2) == 0x8000000000000001

      0x0000000000000003
      0x8000000000000001
      ^ 0x0000000000000006
      --------------------
      0x800000000000000A != 0xC000000000000008

      That x ^ ror64(x, a) ^ ror64(x, b) is bijective might not be entirely obvious. Note that x ^ ror64(x, a) isn't, for any a.

      I checked by verifying that there exists an inverse for every {a, b} when using 64 bits. Since rotates, shifts and xor all can be expressed by multiplication with a binary matrix, I constructed the relevant matrices in Octave and checked that there exists an inverse for each matrix.

      Perhaps not so elegant but it got the job done.

      Coming up is a post for word sizes 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 20, 24, 30, 36, 40, 48, 56 & 64. Plus variants. I guess I should put the constants up on github, for the cases where not all pairs are bijective.

      One could of course also mix rotations and shifts or have more than 2 rotations/shifts for a single xor.
      I did compute all invertible
      pairs {a, b} for
      f_2(x, a, b) { return x ^ (x >> a) ^ (x << b); }

      triples {a, b, c} for
      f_3(x, a, b) { return x ^ (x >> a) ^ (x >> c) ^ (x << b); }

      and quadruples {a, b, c, d} for
      f_3(x, a, b) { return x ^ (x >> a) ^ (x >> c) ^ (x << b) ^ (x << d); }
      a for 32- and 64-bits a while ago.

      You might want to read this new post: http://mostlymangling.blogspot.com/2018/09/invertible-additions-mod-2.html

      Radera
    2. I should have guessed that "x ^= ror(x, a) ^ ror(x, b)" implies number of bits is 64; I erroneously assumed it was generic. My example above doesn't really apply to your calculations which are only dealing with 64 bit numbers. It looks like the function is bijective when a number of bits is not divisible by 3. I knew I was missing something, thanks for clarification.

      Radera
  2. I found an interesting result based on this post, using xor-rotate-rotate but omitting one of the xorshifts. In this code, rotate64 is a left-rotation defined in a macro:

    uint64_t state = UINT64_C(0xEB5F2AE4); // any should work, this was tested
    uint64_t random64() {
    uint64_t z = (++state ^ UINT64_C(0xDB4F0B9175AE2165)) * UINT64_C(0x4823A80B2006E21B);
    z = (z ^ rotate64(z, 52) ^ rotate64(z, 21) ^ UINT64_C(0x9E3779B97F4A7C15)) * UINT64_C(0x81383173);
    return z ^ z >> 28;
    }

    (I probably made some elementary C++ mistakes in the UINT64_C usage.) This is one of very few gamma=1 unary hashes I have tested that passes 32TB of PractRand, and is probably the fastest of the ones I have found due to omitting one of the xorshift-multiply lines. Any multiplications use an XLCG's step of a XOR with a constant that has the last three bits 101, then a a multiplication by a constant with last three bits 011 (this makes an input of 0 produce a non-zero output). The initial step acts like a multiplication by a large gamma, but the XOR preceding it means that the modular multiplicative inverse of the gamma can't be used accidentally as the stride between inputs. I don't know why this set of rotation amounts, xors, shifts, and multiplies works but most others, when substituted in, don't. Some multipliers were chosen for having a low population count, in the hope that they would be faster. The XOR constants are 2 to the 64 divided by generalizations of the golden ratio, with the last bit adjusted to be odd. I hope this hash can be useful or interesting!

    SvaraRadera
    Svar
    1. Something that you may want to check out, part of a post coming up Real Soon Now, is feeding a rotated and bit reversed unary counters. That is 128 possible counters, feeding each to the mixing function and run the output though PractRand. So far, many failures out of 128 tests (running with - tf 2 -tlmax 1TB or so) seems to be a strong indicator that the function will eventually fail for a non-rotated unary counters. The MurmurHash3 finalizer and Variant 13 fails quite spectacularly. So far I have one candidate that fails one out of 128 tests after 1TB on that particular test. It uses two (ror-ror-xor)-mul steps and one shift-xor. Not the same constants used above. I also have a candidate with no failures so far but it has only passed about 16 put of the 128 subtests.

      Radera
    2. I ran a complete test, 128 instances with counters as described above and the results were not encouraging:
      ---
      Will test ./ettinger --increment 1 --seed 0 | RNG_test stdin64 -tf 2 -te 0 -multithreaded -tlmax 1TB -tlmin 1KB
      00F:40 *01F:15* *02F:15* *03F:14* *04F:15* *05F:15* *06F:16* *07F:16*
      *08F:15* *09F:16* *10F:15* *11F:15* *12F:15* *13F:16* *14F:16* *15F:16*
      *16F:16* *17F:16* *18F:16* *19F:17* *20F:16* *21F:18* *22F:19* *23F:18*
      *24F:18* *25F:17* *26F:19* *27F:17* *28F:17* *29F:18* *30F:19* *31F:20*
      *32F:20* *33F:21* *34F:22* *35F:22* *36F:23* *37F:19* *38F:19* *39F:16*
      *40F:17* *41F:19* *42F:19* *43F:22* *44F:24* *45F:28* *46F:33* *47F:35*
      *48F:36* *49F:35* *50F:37* *51F:37* *52F:33* *53F:37* *54F:37* *55F:38*
      *56F:39* 57F:40 58F:40 59F:40 60F:40 61F:40 62F:40 63F:40
      *00R:14* *01R:15* *02R:15* *03R:16* *04R:15* *05R:16* *06R:16* *07R:16*
      *08R:16* *09R:16* *10R:15* *11R:15* *12R:17* *13R:17* *14R:17* *15R:17*
      *16R:18* *17R:19* *18R:18* *19R:19* *20R:20* *21R:18* *22R:17* *23R:21*
      *24R:25* *25R:25* *26R:25* *27R:23* *28R:19* *29R:19* *30R:21* *31R:16*
      *32R:16* *33R:16* *34R:21* *35R:19* *36R:19* *37R:19* *38R:21* *39R:22*
      *40R:25* *41R:27* *42R:26* *43R:25* *44R:24* *45R:23* *46R:22* *47R:21*
      *48R:20* *49R:19* *50R:18* *51R:17* *52R:17* *53R:17* *54R:17* *55R:17*
      *56R:17* *57R:17* *58R:17* *59R:17* *60R:16* *61R:15* *62R:15* *63R:15*
      120 failures out of 128 tests with max 2^40 bytes.
      ---
      *XXF:AA* indicates that mix(ror64(ctr++, XX)) failed after 2^AA analyzed bytes.
      XXR:AA indicates that mix(ror64(reverse(ctr++), XX)) passed (-tlmax at 1TB, 2^40).

      So it behaves quite well with gamma=1 but that is a rather special case. As an approximation of a random permutation, it's not particularly strong.

      I'll try to find time later tonight to post the best I've found so far, (1 failure after 2^39 bytes, at ror64(reverse(ctr), 14)).

      (Your proposal is still much better than MurmurHash3 and Variant13.)

      Radera
    3. So there: https://mostlymangling.blogspot.com/2019/01/better-stronger-mixer-and-test-procedure.html
      I'd be very interested to know what flags to PractRand you were using when it passed 32TB with a unitary counter.

      Radera
    4. I used the command line:
      $ RNG_test.exe Ettinger -multithreaded -seed 0xa6948409

      But my PractRand code has been modified (not the tests, just adding random number generators). I added this code to mult.c in the "not recommended" section of the RNGs (rotate64 is defined by PractRand as a left rotation, not a ror64 like with the other code you're comparing with):

      Uint64 Ettinger::raw64() {
      //// passes 32TB with one anomaly, unusual at only 2GB: [Low16/64]BCFN(2+2,13-2,T)
      uint64_t z = (++state ^ UINT64_C(0xDB4F0B9175AE2165)) * UINT64_C(0x4823A80B2006E21B);
      z = (z ^ rotate64(z, 52) ^ rotate64(z, 21) ^ UINT64_C(0x9E3779B97F4A7C15)) * UINT64_C(0x81383173);
      return z ^ z >> 28;
      }
      std::string Ettinger::get_name() const { return "Ettinger"; }
      void Ettinger::walk_state(StateWalkingObject *walker) {
      walker->handle(state);
      }

      I added this in mult.h:

      class Ettinger : public vRNG64 {
      Uint64 state;
      public:
      Uint64 raw64();
      std::string get_name() const;
      void walk_state(StateWalkingObject *);
      };

      And it's added to the listing in RNG_from_name.h, for the section related to mult.h :

      REGISTER_RNG_0(Ettinger)


      Hope that helps. It looks like this one might be strong only when the counter is 1 and nothing is rotated.

      Radera
    5. If you have the time, please test it with -tf 2 too. That is the "standard level" I have been using when testing MurmurHash3, Variant 13 and rrmxmx (as well as my latest and greatest).

      Sorry about the sloppiness, will rerun the test with *left*-rotation and post the results.

      Radera
  3. Den här kommentaren har tagits bort av skribenten.

    SvaraRadera
  4. Hey Pelle, interesting work. Do you also have variants for 32-bit mixing?

    SvaraRadera
  5. Hello, you're doing a great job and I enjoyed reading your block. What license do you publish your rmxmx code under? Can I use it in my projects?

    SvaraRadera
    Svar
    1. Consider it public domain, credit me if you want to.

      Note however that this was one of my first (and worst) attempts at a good 64 bit mixer.
      A better rrxmrrxmxs-mixer is here:
      https://mostlymangling.blogspot.com/2019/01/

      A much better (and slightly slower) alternative is here:
      https://mostlymangling.blogspot.com/2020/01/

      The best compromise speed/quality I've seen is Jon Maiga's mx3 mixer:
      http://jonkagstrom.com/mx3/mx3_rev2.html

      The extra multiply and shift in mx3 is slightly slower than Murmur3-style mixers (3 xs, 2ms) but provides vastly superior results from a statistical standpoint. According to my measurements, it's on most platforms about as fast as NASAM.

      If your main platform is older GPUs (< GeForce RTX), NASAM is probably faster due to 64 bit multiplications being expensive.
      On more recent GPUs and modern 64 bit CPUs, mx3 is faster.

      Radera
    2. Thank you very much! Yes, of course I wouldn't take anything without credit. I'm interested in applications like short string hashing and fast generation of pseudo-random values from randomized but not uniformly distributed sources. But nothing that requires gigabytes per second. For example, with strings like local file paths I was sure that MurmurHash3 will be more than good enough, but no, I've encountered more than a few collisions of 64-bit hashes. Which is, of course, always possible, but shouldn't be that common.
      Oh, and before that I was sure FNV-1a will be more than enough, but had to upgrade to Murmur3 quite quickly :)

      P. S. I have now read every entry in your blog, looking forward for more, and thanks a lot for your work and for making it public,

      Radera
  6. Pelle,

    Will you expand on "Create a series of low entropy base constants such as 1,3,5,7,1116,5516,555516 and so on" or perhaps just throw your constant file up on Github? It would be great to have a fully defined testing methodology and the constants are the last bit of secret sauce.

    Even if not, thank you for writing as much as you did. It's an excellent reference.

    SvaraRadera
    Svar
    1. The constants were just some gammas with trivial bit patterns, 7_10 = 111_2, 11_16 = 10001_2, 55_16 = 1010101_2 and so on.

      I did develop a much more discriminating test strategy which is described here: https://mostlymangling.blogspot.com/2019/01/better-stronger-mixer-and-test-procedure.html#TestProcedurehttps://mostlymangling.blogspot.com/2019/01/better-stronger-mixer-and-test-procedure.html#TestProcedure

      Later posts apply it to some well known mixers (and some of my own).
      https://mostlymangling.blogspot.com/2019/12/
      https://mostlymangling.blogspot.com/2021/11/

      Glad that the blog posts are somewhat helpful. I have a bunch of more recent results but other parts of my life keeps me busy these days.

      Radera
    2. Are you saying that you're using the RR test as your metric when searching for constants for a new variant? My impression was that you were searching for variants using the modified avalanche metric above, and then proving them with the RR test when you've got a good candidate.

      Radera
    3. The criteria described above works for relatively fast sieving.

      I have not found any constants for the construction
      x ^= x >> S1; // Could be changed to x ^= ~x >> S1 if one is annoyed by the fix point at 0.
      for(i = 1 .. rounds):
      x *= M1;
      x ^= x >> S2; // Or ~x >> S2, also bijective.
      end for
      when rounds = 2 that doesn't fail the RRC test very badly. By adding another multiplication and xor-shift things behave much more nicely.
      I don't think there are any "golden constants" that are vastly superior compared to each other.

      The simplest way to improve the behaviour is to go from 2 rounds to 3.

      Radera
    4. I expect nobody can beat your golden constants. That's why I'd like to repeat your process. I'm searching for optimal constants for different forms using other hardware ops. For example: mul, bswap, mul, bswap, mul. Maybe with an add thrown in for good luck.

      Radera
  7. Den här kommentaren har tagits bort av skribenten.

    SvaraRadera