fredag 13 juli 2018

On the mixing functions in "Fast Splittable Pseudorandom Number Generators" , MurmurHash3 and David Stafford's improved variants on the MurmurHash3 finalizer.


0. Introduction

For a long time I have been looking for "good" permutations on $[0, 2^{32})$ and $[0, 2^{64})$. The definition of "good" in this context is a permutation that looks like it had been chosen at random from any of the $2^{32}!$ or $2^{64}!$ such possible functions and is reasonably fast. These design criteria are somewhat vague; the identity function is just as random as any other function on the domain. If speed wasn't an issue, any half decent block cipher would fit the bill.

A straightforward (but time consuming) test of "goodness" is to feed a low-entropy source such as a simple counter into the mixing function and passing the output to a test battery for randomness.

Perhaps the most well established set of tools for testing random number generation is L'Ecuyer and Simard's TestU01. It contains three predefined sets of tests ("batteries") that any stream of random words should expect to pass without being flagged as anomalous. For a 32-bit bijection, TestU01's batteries Crush and BigCrush should detect anomalies due to the short period, $2^{32}$, regardless of how well the permutation has been chosen.

A different test-suite that has a much more hackish flavour to it is PractRand. It has a tendency to fail some generators very quickly and takes a quite different approach to testing than TestU01 does; it runs until it fails spectacularly or $2^{46}$ bytes have been inspected.

A randomly chosen permutation $f: 2^b \rightarrow 2^b$ could be expected to have (at least) the following properties ($\oplus$ denoting exclusive or):
  1. For any $0 \le x < 2^b, 0 \le i < b, \Delta_1 = f(x) \oplus f(x \oplus 2^i)$ should have Hamming weights ("number of set bits", a.k.a. population count) that are binomially distributed, $B(b, \frac{1}{2})$.
  2. For any $0 \le x < 2^b$ and tuples $a$ of length $t, 0 \le t < b$ and $0 \le a_k < b, a_k < a_{k + 1}, 1 \le k \le t$, let $s = \sum_{i=1}^t 2^{a_i}$ and let $\Delta_t = f(x) \oplus f(x \oplus s)$.
    $\Delta_t$ should now also have Hamming weights with a $B(b, \frac{1}{2})$ distribution.
  3. When being fed a low entropy sequence, say $x_n = n$ or something of similar simplicty, $f(x_i) \Vert f(x_{i + 1}) \Vert \ldots$ should look like a random bit sequence when the output words are concatenated. 
The first property I think follows from the strict avalanche criterion, i.e. a function fulfilling the SAC also has binomially distributed Hamming weight-deltas.
The second property should follow from a function fulfilling a $t$-order SAC.

The other day I came across "Fast Splittable Pseudorandom Number Generators" [1] by Guy L. Steele Jr., Doug Lea and Christine H. Flood. Obviously, I am not Mr. Current Affairs since the paper is 4 years old by now. Anyway, I noticed that the mixing operation looked very similar to a family of functions I've been tinkering with, trying to find something that is reasonably fast and that satisfies the properties above. The mixing functions in [1] were in turn borrowed from Austin Appleby's MurmurHash3 and David Stafford's improved variants [2] on MurmurHash3.

MurmurHash3 uses a 64-bit mixer in its final step, looking like this:
uint64_t mix(uint64_t v) {
  v ^= (v >> S1);
  v *= M1;
  v ^= (v >> S2);
  v *= M2;
  return v ^ (v >> S3);
}
Where S1 = S2 = S3 = 33, M1 = 0xFF51AFD7ED558CCD and M2 = 0xC4CEB9FE1A85EC53.
I can't find Appleby's description on how the search for constants was performed but in section 4 of [1]:
"Appleby used this criterion [SAC, possibly first order?] as the “energy function” in a simulated annealing process to explore a specific space of potential 8-operation bit-mixing functions (both 32-bit and 64-bit versions."

Stafford chose a different search strategy than Appleby: trying low entropy data (instead of random numbers) in his inputs to the mixing function, assuming that a function that behaves well on low entropy inputs also would be well behaved for high entropy inputs. According to his search, that assumption was correct. He does, however, not specify what the low entropy inputs were or if he measured $f(x)$ or $f(x) \oplus f(x \oplus c)$ for low entropy (single bit?) constants $c$.

The authors of [1] give a short snippet in Fig. 18, that gives a sum-of-squares [first order] avalanche statistic with randomized inputs, flipping each bit to compute $\Delta_1$ as in property 1 above. They stick to the constants from MurmurHash3 and includes Variant13 from Stafford since they conclude that they may be close to optimal with regards to $\Delta_1$.

Fig. 18 of [1] , contains a glaring bug by the way, maybe that's not the code they used when doing their search but a simple error that got introduced when formatting the code for the paper:
  long x = w ^ m.mix(v ^ (1 << i));
should of course be:
  long x = w ^ m.mix(v ^ (1L << i));)

Anyhow, what is nice about xor-shifting is that it's guaranteed to be bijective on $2^b$ as long as the amount shifted, $s$, is $0 < s < b$. (This can be verified by checking that the binary matrices of $(I + L^s)$ and $(I + R^s)$ have inverses.)

What is not so nice, in the context of creating mixers, is that low input Hamming weights tends to stay low, thus not mixing things as fast as we would like.

Multiplication is a really nice operation in this context; the topmost bits are influence by all bits of lower significance.  As long as the constant is odd and the multiplication is done modulo $2^b$, an inverse exists and the operation is thus bijective. What's not so nice is that high input bits can't influence low order bits (when the modulus is on the form $2^b$).

Doing an interleaving of multiplying and xorshifting (to the right) is a really brilliant idea; multiplying makes all low order bits influence the top most bits. Xorshifting folds the topmost bits down to the right. And the defect that xorshifting has w.r.t. Hamming weights is mostly remedied by multiplication.

Anyway, I was looking for a better way to increase the avalanche rate but without having more than 2 multiplication/xorshift steps and figured that
x ^= ror(x, a) ^ ror(x, b)
is bijective when $0 \le x < 2^{64}, 0 < a < 64, 0 < b < 64, a \ne b$, which looked promising as a preliminary step.
I first saw a very similar construction in SHA-512, but as y = ror(x, a) ^ ror(x, b) ^ x >> c. The snippet with two rors above is the special case where the last term is omitted.

1. A better mixer

Most recently I have come up with this:
M = 0x9fb21c651e98df25L, S = 28, R1 = 49, R2 = 24
uint64_t rrmxmx(uint64_t v) {
  v ^= ror64(v, R1) ^ ror64(v, R2);
  v *= M;
  v ^= v >> S;
  v *= M;

  return v ^ v >> S;
}

Before I get to the advantages of using this over MurmurHash3's finalizer or Stafford's variants, I'll describe how I found the constants.

1.1 Search strategy

I'm kind of happy that I was unaware of how MurmurHash3 worked as I have pretty much always just whipped out a hash function when I've needed one based on the observations that "multiplication is good for mangling bits upwards and xorshift right works alright for mangling downwards". For me, this has been working well in practice.

As I was looking for constants I was using a different energy criteria based on properties 1 and 2 above:
Create a series of low entropy base constants such as $1, 3, 5, 7, 11_{16}, 55_{16}, 5555_{16}$ and so on. Create all 64 bit rotations of these constants along with their bitwise complements, bit order reversed and bit order reversed and complemented. Each base constant gives rise to 4 * 64 possible constants.
Also create all 64 bit values with Hamming weights 2 and 3, and their complements. Add all the constants to a set, $C$, making sure that each constant occurs exactly once. $\#C$ is the number of constants.


In pseudocode:
Let H be a histogram with #C * 65 bins.
for i in 1, ..., n // Do n experiments
  x = rnd() // Get random number on [0, 2^64)
  h = f(x) // f is our mixer
  for c in C
    y = h XOR f(x XOR c)
    p = hammingWeight(y)
    cIdx = indexOf(c in C)
    h[cIdx][p]' = h[cIdx][p] + 1
  end for
end for

When this is done, I compute a goodness-of-fit for each $h_c$. If $f(x)$ is a good approximation of a random permutation, $h_c$ should be binomially distributed with $B(n, \frac{1}{2})$.
The goodness of fit-values are stored in an array with $\#C$ values. From this array I compute the mean, $\bar{x}$ and sample standard deviation, $s$. My energy function is simply $\bar{x} + s$. No, I don't have a good theoretical justification for this...

A sounder metric would be to truncate each histogram (bin bits $0, \ldots, k$ and $64 - k, \ldots, 64$ into a low-bin and high-bin) so that the lowest expected frequency would be around 5, then do a chi-square test and note the p-values. The p-values should now look as they were drawn from a $U(0, 1)$ distribution. One problem with this approach is that it would be insensitive to errors where the Hamming distance is very low or very high.

As can be seen from the code above, the compilation of the histogram will take time that is linearly proportional to $n\#C$. If $C$ is large, this is really slow.
To speed things up a bit, for any M, S, R1, R2, I start by using a preliminary value for $n$ as well as a "small" set for $C$. Let's call this preliminary set of constants $C_0$ (currently of size 24384)
and the extended set $C_1$ (currently of size 621408). I now test with the preliminary values. If the energy from the preliminary test is at most 1.1 times the lowest energy seen so far, I run an extensive test with a larger $n$ and $C_1$ instead of $C_0$.

When a new set of constants shows up that are better or close to as good as the old set, the search program tries the M-value, mutated first by single bit flips, then by all possible 2-bit flips and so on. For each candidate M, shifts (S) in the range 24-52 are tried. During the initial search for good constants I noticed that the error landscape isn't smooth at all. In particular decreasing or increasing S by one might lead to a worse energy even if there are shifts that are better. Hence, for each M do a complete (preliminary) test for all S in the range.

2. Results

I plugged MurmurHash3 and Stafford's Variant 13 into my search program and both fared badly. The sample means weren't so bad but the sample standard deviations were terrible. Clearly, for some constants $c$ as described above, the Hamming weights of $f(x) \oplus f(x \oplus c)$ were far from binomially distributed. When isolating the tests to only 2-bit mutations, measuring the second order avalanche, both resulted in values far from the expected 1.0.

2.1 An avalanche metric

In [1], Figure 18, this snippet (in Java) is given to compute a first order sum-of-square avalanche statistic, with the bug with 1 vs. 1L as described above corrected:
long[][] A = new long[64][64];
for (long n = 0; n < N; n++) {
  long v = rng.nextLong();
  long w = mix64(v);
  for (int i = 0; i < 64; i++) {
    long x = w ^ m.mix(v ^ (1L << i));
    for (int j = 0; j < 64; j++) {
      if (((x >>> j) & 1) != 0) {
        A[i][j] += 1;
      }
    }
  }
}
double sumsq = 0.0;
for (int i = 0; i < 64; i++) {
  for (int j = 0; j < 64; j++) {
    double v = a[j][k] - N / 2.0;
    sumsq += v*v;
  }
}
double result = sumsq / ((N / 4.0) * 4096.0);
The code above is extended to compute higher order avalanche statistics like this (second order example in C):
double measureAvalanche2_x(uint64_t (*h) (uint64_t), int runExp,
    uint64_t inc, int complemented, int bins) {
  assert(2016 % bins != 0); // choose(64, 2) == 2016
  uint64_t* A = allocA(bins);
  uint64_t C = (complemented ? 0xFFFFFFFFFFFFFFFFUL : 0);
  uint64_t N = 1L << runExp;
  for (uint64_t n = 0; n < N; n++) {
    uint64_t v = n * inc;
    uint64_t w = h(v);
    int p = 0;
    for (int i = 0; i < 64; i++) {
      for(int j = i + 1; j < 64; j++) {
        uint64_t x = w ^ h(v ^ (1L << i) ^ (1L << j) ^ C);
        for (int k = 0; k < 64; k++) {
          if (((x >> k) & 1) != 0) {
            A[p * 64 + k]++;
          }
        }
        p = (p + 1) % bins;
      }
    }
  }
  double result = sumsq(A, bins, N * (2016 / bins));
  freeA(A);
  return result;
}

2.2 Measurements

The number of possible $k$-bit differences in a 64-bit word is ${64\choose k}$. To be able to run the tests for many instances I wrote a CUDA-program. To create less memory contention, each thread works on a small number of bins instead of the full number. Using the extensions to 2nd, 3rd and 4th order avalanche, with $bins_{1, \ldots, 4} = \{64, 288, 217, 217\}$, here's the avalanche sum-of-squares statistic of order 1, 2, 3, 4. Inputs are the products of $a \times [0, 2^{30})$ for $\Delta_1$, $a \times [0, 2^{25})$ for $\Delta_2$, $a \times [0, 2^{20})$ for $\Delta_3$ and $a \times [0, 2^{20})$ for $\Delta_4$:
Table 1: Avalanche statistics
Mixer a = 0x40EAD42CA1CD0131
$\Delta_1$ $\Delta_2$ $\Delta_3$ $\Delta_4$
rrmxmx 0.975 0.992 1.039 1.005
Murmur3 1.423 11049.99 1.003 3.004
Variant13 1.008 2131.30 25.46 1.271

Note that both MurmurHash3 and Variant13 have a second order sum-of-squares avalanche statistic that is very far from what should be expected from a randomly chosen permutation. Variant13 fares better in almost all tests than the original MurmurHash3 but much worse with regard to third order avalanche.

2.3 Testing as a PRNG with TestU01

Neither MurmurHash3 or Variant13 had property 3 in the introduction above ("feeding low entropy data should render a random-looking output sequence") as a design goal. Since Variant13 and partly, MurmurHash3, are being used as the mixing functions in Java 8's SplittableRandom, they deserve more scrutiny with regard to that property.
I ran TestU01's BigCrush on my own function, called rrmxmx, MurmurHash3 and Variant13 with a simple counter, similar to how the state update is done in SplittableRandom. p-values outside $[10^{−3} , 1 − 10^{−3}]$ were flagged as "suspect" and p-values outside $[10^{−10} , 1 − 10^{−10}]$ were flagged as "failed".

TestU01 is not geared towards testing random bit streams as such but is (mostly) word oriented with 30- and 31-bit words. If one assumes that the output to be tested is a stream of bits rather than a stream of words, it's prudent to at least also feed TestU01 with all output words bit-reversed.

The data in Table 2 below are from using rrmxmx, MurmurHash3 and Variant13 to generate a pseudo random stream in this fashion:
uint64_t next(uint64_t *state, uint64_t gamma) {
  uint64_t s = *state;
  *state += gamma;
  return mix(s);
}
This is fed to TestU01 by
static void runTestsInt(unsigned int (*prng)(void), char name[]) {
  unif01_Gen* gen = unif01_CreateExternGenBits(name, prng);
  bbattery_BigCrush(gen);

  unif01_DeleteExternGenBits(gen);
}
Note that I use all 64 bits of output, without floating, and that this is stricter/more revealing.
TestU01 works either on 32 bit words, using 30 to 31 bits, or on 64-bit doubles. In the tests below, I split each 64-bit word in two 32-bit halves and use the halves sequentially.

The procedure used in [1] is described in Section 5 of [1]:
"TestU01 expects to test 64-bit double values, so we use 53 out of every 64 bits generated by our algorithm to generate a double value as for our method nextDouble."
This masks some serious defects in the 11 least significant bits of MurmurHash3 and Variant13 when $\gamma$ is small. Except for large $\gamma$, BigCrush generates many more suspect/failure-values than expected for MurmurHash3 and Variant13. Note that rrmxmx seems to be unaffected by small $\gamma$.
Floating in [1] is done by
(nextLong() >>> 11) * DOUBLE_ULP
thus discarding the 11 least significant bits. This practice in itself is not any better than using the least significant 53 bits, as in
(nextLong() & ((1L << 53) - 1)) * DOUBLE_ULP
but seems to be common practice anyway.
For a linear congruential generator, right shifting makes sense due to the most significant bits having a longer period than the least significant bits. For a properly designed generator, all output bits should be equally useful.

I have chosen the same number of tests as in [1], 3 runs of BigCrush for each {mixer, gamma and word reversal}-tuple. The runs for a particular {mixer, gamma, reversal}-tuple differ in the starting point chosen; $runIdx \times \gamma \times 2^{40}$, the first run having $runIdx = 0$. By chosing the starting points this way, the runs will be free from overlap since BigCrush uses about $2^{38}$ values.

In [1], the authors give a slightly low number for the expected number of "suspect" tests;
“TestU01 regards any p-value outside the range $[10^{-10}, 1 - 10^{-10}]$ as indicating “clear failure” of the tested generator. The TestU01 software flags all p-values outside the range $[10^{-3}, 1 - 10^{-3}]$; such values, if they are not clear failures, are regarded as “suspect.” For each generator variant we ran BigCrush 3 times and we report the total number of suspect p-values and clear failures over all 3 runs; each run of BigCrush performs 160 distinct tests, so 3 runs collectively perform 480 tests. With 480 tests at two-sided p = 0.001, we expect even a perfect generator to exhibit 0.96 “suspect” values on average.”
I suspect "160" and "480" are typos. Pages 148--152 of The "Compact user's guide" [3] for TestU01 list the individual tests and parameters, giving a count of 106. According to page 147 of [3], several tests give more than one p-value:
"Some of the tests compute more than one statistic using the same stream of random numbers and these statistics are thus not independent. That is why the number of statistics in the summary reports is larger than the number of tests in the description of the batteries."
Assuming the p-values are independent, with 3 runs of with 254 values each and two-sided p = 0.001, we expect a perfect generator to exhibit 1.524 “suspect” values on average.
Table 2: TestU01 BigCrush Failures
Mixer $\gamma$ = 1 $\gamma$ = 0x55555555 $\gamma$ = 0xC45A11730CC8FFE3
Forward Reverse Forward Reverse Forward Reverse
suspect failed suspect failed suspect failed suspect failed suspect failed suspect failed
rrmxmx 3 0 2 0 2 0 1 0 1 0 2 0
Murmur3 6 47 9 140 10 4 14 37 1 0 1 0
Varian13 9 24 14 38 9 3 10 1 0 0 1 0

As can be seen in Table 2, both MurmurHash3 and Variant13 fails horribly for small gamma. This is the code for getting a new gamma when splitting in SplittableRandom:
private static long mixGamma(long z) {
 // MurmurHash3 mix constants   z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
  z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
  z = (z ^ (z >>> 33)) | 1L; // force to be odd
  int n = Long.bitCount(z ^ (z >>> 1)); // ensure enough transitions
  return (n < 24) ? z ^ 0xaaaaaaaaaaaaaaaaL : z;
}
Compared to [1], MurmurHash3 and Variant13 have switched purposes, MurmurHash3 is used for new gammas and Variant13 is used for the prng-stream. Alas, the mixGamma() function above does not prevent bad gammas from occuring. 0x55555555 is, as can be seen by testing, a gamma that fares very badly but is still entirely possible. I do not yet know how large a proportion of the values from mixGamma() are problematic and give rise to a prng-instance that have highly undesirable properties. Melissa O'Neill made the same reflection on bad gammas before me in her blog post "Bugs in SplitMix(es)". They are trivial to construct, the question is at what frequency they are showing up in practice?

2.4 Testing as a PRNG with PractRand

I also decided to test with PractRand, V0.93, with options "-tf 2 -te 1 -tlmin 1KB". All values of $\gamma$ in Table 1 of [1] marked "poor" and "okay" fail PractRand after a very small amount of output from MurmurHash3 and Variant13. The gammas marked as "random" in Table 1 of [1] fail with MurmurHash3 as the mixing function after at most $2^{41}$ bytes. It's clear that Variant13 is more well behaved than MurmurHash3, but Variant3 fails PractRand quite rapidly for many gammas. Are the gammas marked as "random" a fluke or the norm with regards to how well behaved Variant13 is? I'll try some experiments in a blog post to come, trying to estimate the likelihood that SplittableRandom gets a bad gamma.

Table 3: Test length where PractRand fails
$\gamma$ rrmxmx Murmur3 Variant13
0x0000000000000001
$2^{44}$ $2^{16}$ $2^{21}$
0x0000000000000003
$2^{47.71}$ $2^{18}$ $2^{22}$
0x0000000000000005
$> 2^{48}$ $2^{19}$ $2^{21}$
0x0000000000000009
$> 2^{48}$ $2^{19}$ $2^{22}$
0x0000010000000001
$2^{46.41}$ $2^{18}$ $2^{20}$
0xffffffffffffffff
$2^{45.62}$ $2^{16}$ $2^{21}$
0x0000000000ffffff
$2^{45.39}$ $2^{17}$ $2^{22}$
0xffffff0000000001
$2^{46.55}$ $2^{18}$ $2^{22}$
0x0000000000555555
$2^{47.42}$ $2^{25}$ $2^{30}$
0x1111111111110001
$> 2^{48}$ $2^{27}$ $2^{32}$
0x7777777777770001
$> 2^{48}$ $2^{28}$ $2^{34}$
0x7f7f7f7f33333333
$> 2^{48}$ $2^{31}$ $2^{33}$
0x5555550000000001
$> 2^{48}$ $2^{24}$ $2^{30}$
0xc45a11730cc8ffe3
? $2^{39}$ $> 2^{42}$
0x2b13b77d0b289bbd
? $2^{39}$ $> 2^{42}$
0x40ead42ca1cd0131
? $2^{40}$ $> 2^{42}$
After $2^{43}$ bytes, PractRand has yet not complained when running Variant13 with $\gamma = 40EAD42CA1CD0131_{16}$.

rrmxmx fails at the $2^{44}$ level with $\gamma = 1$. With $\gamma = 3$, it passes PractRand up to $2^{45}$ bytes but with one test marked as suspicious at that level;
rng=RNG_stdin, seed=0xca839ae9
length= 32 terabytes (2^45 bytes), time= 506449 seconds
  Test Name Raw Processed Evaluation
  BCFN_FF(2+2):freq R= +8.3 p~= 1e-7 suspicious
  ...and 1900 test result(s) without anomalies
Considering that the test at the $2^{45}$ level took 140 hours I won't run a test at $2^{46}$ level but some caution is recommended if running rrmxmx as a PRNG with very sparse (1 or 2-bit) gammas. Currently I am running TestU01 for gammas of the form $2^n$ for $0 < n < 25$. Assuming rrmxmx behaves as a randomly chosen permutation, these gammas should pass too. One would of course never use them for a PRNG considering that they shorten the period to $2^{64 - n}$.

3. Speed

The operations in rrmxmx are identical to MurmurHash3 except for the first step; instead of a right shift and xor there are two rotates and 2 xors. With my machine (i7-5820), ILP makes the speed of rrmxmx, MurmurHash3 and Variant13 identical as far as I can measure.

4. Further work

Assuming
x ^= ror64(x, a) ^ ror64(x, b)
is just as fast as
x ^= x >> a
(On my Intel i7-5820 and on AMD Ryzen 7 1700 it is, for many other processors, it isn't.)
A search should be made to see if there could be a better version looking like this:
uint64_t rrmxmx2(uint64_t v) {
  uint64_t v2 = v;
  v2 ^= ror64(v2, R1) ^ ror64(v2, R2);
  v2 *= M1;
  v2 ^= ror64(v2, R3) ^ ror64(v2, R4);
  v2 *= M2;

  return v2 ^ ror64(v2, R5) ^ ror64(v2, R6);
}
The indication that rrmxmx with the constants M = 0x9FB21C651E98DF25, S = 28, R1 = 49, R2 = 24 can be improved is that there is a clear difference when looking at the goodness-of-fit for different constants c as described in section 1.1 above.

5. Conclusion (TL;DR)

If you have the need for a reasonably fast permutation on 64-bit words that looks quite random, this function might be useful. MurmurHash3 and Variant13 are very easy to distinguish from random permutations as shown above.

There may be better constants for MurmurHash13 than David Stafford's improvements; I have not been searching using my energy function.

A. Source code

A.1 C:

#include <stdint.h>

static inline uint64_t ror64(uint64_t v, int r) {
    return (v >> r) | (v << (64 - r));
}

uint64_t rrmxmx(uint64_t v) {
    v ^= ror64(v, 49) ^ ror64(v, 24);
    v *= 0x9FB21C651E98DF25L;
    v ^= v >> 28;
    v *= 0x9FB21C651E98DF25L;
    return v ^ v >> 28;
}

// Code below only necessary if one wants to compute the inverse rrmxmx.
uint64_t xorRor24_49Inv(uint64_t v) {
    const int rors[] = { 4, 8, 9, 11, 15, 16, 18, 20, 24,
        25, 26, 29, 30, 32, 40, 41, 43, 44,
        45, 48, 50, 54, 56, 57, 58, 60 };
    uint64_t acc = 0;
    for (int i = 0; i < sizeof(rors) / sizeof(int); i++) {
       acc ^= ror64(v, rors[i]);
    }

    return acc ^ v;
}

static uint64_t xorShift28Inv(uint64_t v) {
    return v ^ (v >> 28) ^ (v >> 56);
}

uint64_t unrrmxmx(uint64_t v) {
    uint64_t x = xorShift28Inv(v) * 0x2AB9C720D1024ADL;
    x = xorShift28Inv(x) * 0x2AB9C720D1024ADL;
    return xorRor24_49Inv(x);
}

A.2 Java:

public static long rrmxmx(long v) {
    v ^= Long.rotateRight(v, 49) ^ Long.rotateRight(v, 24);
    v *= 0x9FB21C651E98DF25L;
    v ^= v >>> 28;
    v *= 0x9FB21C651E98DF25L;
    return v ^ v >>> 28;
}

// Code below only necessary if one wants to compute the inverse rrmxmx.
public static long unrrmxmx(long v) {
    long x = xorShift28Inv(v) * 0x2AB9C720D1024ADL;
    x = xorShift28Inv(x) * 0x2AB9C720D1024ADL;
    return invertRor2449(x);
}

private static long invertRor2449(long v) {
    int[] rots = { 4, 8, 9, 11, 15, 16, 18, 20, 24,
        25, 26, 29, 30, 32, 40, 41, 43, 44,
        45, 48, 50, 54, 56, 57, 58, 60 };
    long acc = 0;
    for (int r : rots) {
        acc ^= Long.rotateRight(v, r);
    }

    return acc ^ v;
}

private static long xorShift28Inv(long v) {
    return v ^ (v >>> 28) ^ (v >>> 56);
}

B. Expected output/test vectors

0x0000000000000000
0x0000000000000000
0x0000000000000000
0x0000000000000000
0x0000000000000001
0x23085d6f7a569905
0x56ed9162154faac0
0x0000000000000001
0x0000000000000003
0xcaea878c77a59454
0x0ec1bfbe6983c5a0
0x0000000000000003
0x0000000000000007
0xa77bd5a63a7785c5
0x1718113ac9a1f119
0x0000000000000007
0x0101010101010101
0x36cb9e821eca6c5b
0xfa63351a390851cd
0x0101010101010101
0x0123456789abcdef
0xc337a528d7e42497
0x7529d4da142b1f1c
0x0123456789abcdef
0x084c2a6e195d3b7f
0x507d53f1ba22542c
0xec3694cd1c80b9cd
0x084c2a6e195d3b7f
0x1000000000000001
0xedd3f3f24766de89
0xdb302dae3ad882e0
0x1000000000000001
0x1111111111111111
0x7547f019c63c1df3
0xea6d9bbf167027c9
0x1111111111111111
0x1fffffffffffffff
0x05e3c8367d6677d6
0x7fbbf24327033cf0
0x1fffffffffffffff
0x3fffffffffffffff
0x47e7c1e973d349ff
0x240ba915bbb5e089
0x3fffffffffffffff
0x6666666666666666
0xd9c6e8c9ecd1e30a
0xf4b9c6565f8d9529
0x6666666666666666
0x7777777777777777
0x29823cb92ada0068
0xdca549733043f019
0x7777777777777777
0x7f7f7f7f7f7f7f7f
0xc58024da69c2eb57
0xf1d5238b66aaaf5e
0x7f7f7f7f7f7f7f7f
0x7ffffffffffffff7
0x30c8918fcb6b2b3c
0x3a836e49ca560dd8
0x7ffffffffffffff7
0x7fffffffffffffff
0x91b750beb6849d8f
0x90354478a1b6e49d
0x7fffffffffffffff
0x8000000000000000
0x5e2d59ded82568fc
0xa0f3362cbce5bedb
0x8000000000000000
0x8000000000000008
0xae03d8a5f03d42bb
0xed1a6dc89b6e22d2
0x8000000000000008
0x8080808080808080
0x269ed61ad0d4a3ad
0xcf8b0a0dccbf9da9
0x8080808080808080
0x8888888888888888
0x2f6af135bf8e9d79
0x2c50b3a1d5c7a854
0x8888888888888888
0x9999999999999999
0x50a99564c864eb28
0x6ae2b8e14b6d3c7c
0x9999999999999999
0xc000000000000000
0xf5f0f95fcd968a80
0x6ae70fea73bd7a6d
0xc000000000000000
0xe000000000000000
0x160c347d11027361
0x9a3d176b24d68305
0xe000000000000000
0xeeeeeeeeeeeeeeee
0x9f9714241fb64d9e
0x0a40b8632cad4bfa
0xeeeeeeeeeeeeeeee
0xeffffffffffffffe
0x742025f2e92e6aec
0xf7eaaefaaa16ddb8
0xeffffffffffffffe
0xf7b3d591e6a2c480
0x60f421f08a38d500
0xf520f63f955ac204
0xf7b3d591e6a2c480
0xfedcba9876543210
0x8fec24c21c6d66de
0xf18dbb478c6d3943
0xfedcba9876543210
0xfefefefefefefefe
0x125c8836f02c998f
0xe4b673f0521ad37d
0xfefefefefefefefe
0xfffffffffffffff8
0x6018ed12f08b6eec
0x1b32e354639f82f1
0xfffffffffffffff8
0xfffffffffffffffc
0x420b85f7b23fa512
0xe317247fad148210
0xfffffffffffffffc
0xfffffffffffffffe
0xc320bdd84877d048
0x31c9d93c42d48cea
0xfffffffffffffffe
0xffffffffffffffff
0x8bc57fddf83265bd
0xb694bf1eaa6682c4
0xffffffffffffffff
I've been interested in random number generation since the early 90's when I first took up programming in anything but 8-bit assembly.

Here I will try to share any own findings that I have stumbled across when searching for good random number generators and hash functions.