0. Introduction
For a long time I have been looking for "good" permutations on $[0, 2^{32})$ and $[0, 2^{64})$. The definition of "good" in this context is a permutation that looks like it had been chosen at random from any of the $2^{32}!$ or $2^{64}!$ such possible functions and is reasonably fast. These design criteria are somewhat vague; the identity function is just as random as any other function on the domain. If speed wasn't an issue, any half decent block cipher would fit the bill.A straightforward (but time consuming) test of "goodness" is to feed a lowentropy source such as a simple counter into the mixing function and passing the output to a test battery for randomness.
Perhaps the most well established set of tools for testing random number generation is L'Ecuyer and Simard's TestU01. It contains three predefined sets of tests ("batteries") that any stream of random words should expect to pass without being flagged as anomalous. For a 32bit bijection, TestU01's batteries Crush and BigCrush should detect anomalies due to the short period, $2^{32}$, regardless of how well the permutation has been chosen.
A different testsuite that has a much more hackish flavour to it is PractRand. It has a tendency to fail some generators very quickly and takes a quite different approach to testing than TestU01 does; it runs until it fails spectacularly or $2^{46}$ bytes have been inspected.
A randomly chosen permutation $f: 2^b \rightarrow 2^b$ could be expected to have (at least) the following properties ($\oplus$ denoting exclusive or):
 For any $0 \le x < 2^b, 0 \le i < b, \Delta_1 = f(x) \oplus f(x \oplus 2^i)$ should have Hamming weights ("number of set bits", a.k.a. population count) that are binomially distributed, $B(b, \frac{1}{2})$.
 For any $0 \le x < 2^b$ and tuples $a$ of length $t, 0 \le t < b$ and $0 \le a_k < b, a_k < a_{k + 1}, 1 \le k \le t$, let $s = \sum_{i=1}^t 2^{a_i}$ and let $\Delta_t = f(x) \oplus f(x \oplus s)$.
$\Delta_t$ should now also have Hamming weights with a $B(b, \frac{1}{2})$ distribution.  When being fed a low entropy sequence, say $x_n = n$ or something of similar simplicty, $f(x_i) \Vert f(x_{i + 1}) \Vert \ldots$ should look like a random bit sequence when the output words are concatenated.
The first property I think follows from the strict avalanche criterion, i.e. a function fulfilling the SAC also has binomially distributed Hamming weightdeltas.
The second property should follow from a function fulfilling a $t$order SAC.
The other day I came across "Fast Splittable Pseudorandom Number Generators" [1] by Guy L. Steele Jr., Doug Lea and Christine H. Flood. Obviously, I am not Mr. Current Affairs since the paper is 4 years old by now. Anyway, I noticed that the mixing operation looked very similar to a family of functions I've been tinkering with, trying to find something that is reasonably fast and that satisfies the properties above. The mixing functions in [1] were in turn borrowed from Austin Appleby's MurmurHash3 and David Stafford's improved variants [2] on MurmurHash3.
MurmurHash3 uses a 64bit mixer in its final step, looking like this:
"Appleby used this criterion [SAC, possibly first order?] as the “energy function” in a simulated annealing process to explore a specific space of potential 8operation bitmixing functions (both 32bit and 64bit versions."
Stafford chose a different search strategy than Appleby: trying low entropy data (instead of random numbers) in his inputs to the mixing function, assuming that a function that behaves well on low entropy inputs also would be well behaved for high entropy inputs. According to his search, that assumption was correct. He does, however, not specify what the low entropy inputs were or if he measured $f(x)$ or $f(x) \oplus f(x \oplus c)$ for low entropy (single bit?) constants $c$.
The authors of [1] give a short snippet in Fig. 18, that gives a sumofsquares [first order] avalanche statistic with randomized inputs, flipping each bit to compute $\Delta_1$ as in property 1 above. They stick to the constants from MurmurHash3 and includes Variant13 from Stafford since they conclude that they may be close to optimal with regards to $\Delta_1$.
MurmurHash3 uses a 64bit mixer in its final step, looking like this:
uint64_t mix(uint64_t v) {
v ^= (v >> S1);
v *= M1;
v ^= (v >> S2);
v *= M2;
return v ^ (v >> S3);
}
v ^= (v >> S1);
v *= M1;
v ^= (v >> S2);
v *= M2;
return v ^ (v >> S3);
}
Where S1 = S2 = S3 = 33, M1 = 0xFF51AFD7ED558CCD and M2 = 0xC4CEB9FE1A85EC53.
I can't find Appleby's description on how the search for constants was performed but in section 4 of [1]:"Appleby used this criterion [SAC, possibly first order?] as the “energy function” in a simulated annealing process to explore a specific space of potential 8operation bitmixing functions (both 32bit and 64bit versions."
Stafford chose a different search strategy than Appleby: trying low entropy data (instead of random numbers) in his inputs to the mixing function, assuming that a function that behaves well on low entropy inputs also would be well behaved for high entropy inputs. According to his search, that assumption was correct. He does, however, not specify what the low entropy inputs were or if he measured $f(x)$ or $f(x) \oplus f(x \oplus c)$ for low entropy (single bit?) constants $c$.
The authors of [1] give a short snippet in Fig. 18, that gives a sumofsquares [first order] avalanche statistic with randomized inputs, flipping each bit to compute $\Delta_1$ as in property 1 above. They stick to the constants from MurmurHash3 and includes Variant13 from Stafford since they conclude that they may be close to optimal with regards to $\Delta_1$.
Fig. 18 of [1] , contains a glaring bug by the way, maybe that's not the code they used when doing their search but a simple error that got introduced when formatting the code for the paper:
What is not so nice, in the context of creating mixers, is that low input Hamming weights tends to stay low, thus not mixing things as fast as we would like.
Multiplication is a really nice operation in this context; the topmost bits are influence by all bits of lower significance. As long as the constant is odd and the multiplication is done modulo $2^b$, an inverse exists and the operation is thus bijective. What's not so nice is that high input bits can't influence low order bits (when the modulus is on the form $2^b$).
Doing an interleaving of multiplying and xorshifting (to the right) is a really brilliant idea; multiplying makes all low order bits influence the top most bits. Xorshifting folds the topmost bits down to the right. And the defect that xorshifting has w.r.t. Hamming weights is mostly remedied by multiplication.
I first saw a very similar construction in SHA512, but as y = ror(x, a) ^ ror(x, b) ^ x >> c. The snippet with two rors above is the special case where the last term is omitted.
long x = w ^ m.mix(v ^ (1 << i));
should of course be:
long x = w ^ m.mix(v ^ (1L << i));)
Anyhow, what is nice about xorshifting is that it's guaranteed to be bijective on $2^b$ as long as the amount shifted, $s$, is $0 < s < b$. (This can be verified by checking that the binary matrices of $(I + L^s)$ and $(I + R^s)$ have inverses.)
What is not so nice, in the context of creating mixers, is that low input Hamming weights tends to stay low, thus not mixing things as fast as we would like.
Multiplication is a really nice operation in this context; the topmost bits are influence by all bits of lower significance. As long as the constant is odd and the multiplication is done modulo $2^b$, an inverse exists and the operation is thus bijective. What's not so nice is that high input bits can't influence low order bits (when the modulus is on the form $2^b$).
Doing an interleaving of multiplying and xorshifting (to the right) is a really brilliant idea; multiplying makes all low order bits influence the top most bits. Xorshifting folds the topmost bits down to the right. And the defect that xorshifting has w.r.t. Hamming weights is mostly remedied by multiplication.
Anyway, I was looking for a better way to increase the avalanche rate but without having more than 2 multiplication/xorshift steps and figured that
x ^= ror(x, a) ^ ror(x, b)
is bijective when $0 \le x < 2^{64}, 0 < a < 64, 0 < b < 64, a \ne b$, which looked promising as a preliminary step.I first saw a very similar construction in SHA512, but as y = ror(x, a) ^ ror(x, b) ^ x >> c. The snippet with two rors above is the special case where the last term is omitted.
1. A better mixer
Most recently I have come up with this:
M = 0x9fb21c651e98df25L, S = 28, R1 = 49, R2 = 24
uint64_t rrmxmx(uint64_t v) {
v ^= ror64(v, R1) ^ ror64(v, R2);
v *= M;
v ^= v >> S;
v *= M;
return v ^ v >> S;
}
v ^= ror64(v, R1) ^ ror64(v, R2);
v *= M;
v ^= v >> S;
v *= M;
return v ^ v >> S;
}
Before I get to the advantages of using this over MurmurHash3's finalizer or Stafford's variants, I'll describe how I found the constants.
As I was looking for constants I was using a different energy criteria based on properties 1 and 2 above:
Create a series of low entropy base constants such as $1, 3, 5, 7, 11_{16}, 55_{16}, 5555_{16}$ and so on. Create all 64 bit rotations of these constants along with their bitwise complements, bit order reversed and bit order reversed and complemented. Each base constant gives rise to 4 * 64 possible constants.
Also create all 64 bit values with Hamming weights 2 and 3, and their complements. Add all the constants to a set, $C$, making sure that each constant occurs exactly once. $\#C$ is the number of constants.
1.1 Search strategy
I'm kind of happy that I was unaware of how MurmurHash3 worked as I have pretty much always just whipped out a hash function when I've needed one based on the observations that "multiplication is good for mangling bits upwards and xorshift right works alright for mangling downwards". For me, this has been working well in practice.As I was looking for constants I was using a different energy criteria based on properties 1 and 2 above:
Create a series of low entropy base constants such as $1, 3, 5, 7, 11_{16}, 55_{16}, 5555_{16}$ and so on. Create all 64 bit rotations of these constants along with their bitwise complements, bit order reversed and bit order reversed and complemented. Each base constant gives rise to 4 * 64 possible constants.
Also create all 64 bit values with Hamming weights 2 and 3, and their complements. Add all the constants to a set, $C$, making sure that each constant occurs exactly once. $\#C$ is the number of constants.
In pseudocode:
When this is done, I compute a goodnessoffit for each $h_c$. If $f(x)$ is a good approximation of a random permutation, $h_c$ should be binomially distributed with $B(n, \frac{1}{2})$.
The goodness of fitvalues are stored in an array with $\#C$ values. From this array I compute the mean, $\bar{x}$ and sample standard deviation, $s$. My energy function is simply $\bar{x} + s$. No, I don't have a good theoretical justification for this...
A sounder metric would be to truncate each histogram (bin bits $0, \ldots, k$ and $64  k, \ldots, 64$ into a lowbin and highbin) so that the lowest expected frequency would be around 5, then do a chisquare test and note the pvalues. The pvalues should now look as they were drawn from a $U(0, 1)$ distribution. One problem with this approach is that it would be insensitive to errors where the Hamming distance is very low or very high.
As can be seen from the code above, the compilation of the histogram will take time that is linearly proportional to $n\#C$. If $C$ is large, this is really slow.
To speed things up a bit, for any M, S, R1, R2, I start by using a preliminary value for $n$ as well as a "small" set for $C$. Let's call this preliminary set of constants $C_0$ (currently of size 24384)
and the extended set $C_1$ (currently of size 621408). I now test with the preliminary values. If the energy from the preliminary test is at most 1.1 times the lowest energy seen so far, I run an extensive test with a larger $n$ and $C_1$ instead of $C_0$.
When a new set of constants shows up that are better or close to as good as the old set, the search program tries the Mvalue, mutated first by single bit flips, then by all possible 2bit flips and so on. For each candidate M, shifts (S) in the range 2452 are tried. During the initial search for good constants I noticed that the error landscape isn't smooth at all. In particular decreasing or increasing S by one might lead to a worse energy even if there are shifts that are better. Hence, for each M do a complete (preliminary) test for all S in the range.
Note that both MurmurHash3 and Variant13 have a second order sumofsquares avalanche statistic that is very far from what should be expected from a randomly chosen permutation. Variant13 fares better in almost all tests than the original MurmurHash3 but much worse with regard to third order avalanche.
I ran TestU01's BigCrush on my own function, called rrmxmx, MurmurHash3 and Variant13 with a simple counter, similar to how the state update is done in SplittableRandom. pvalues outside $[10^{−3} , 1 − 10^{−3}]$ were flagged as "suspect" and pvalues outside $[10^{−10} , 1 − 10^{−10}]$ were flagged as "failed".
The data in Table 2 below are from using rrmxmx, MurmurHash3 and Variant13 to generate a pseudo random stream in this fashion:
TestU01 works either on 32 bit words, using 30 to 31 bits, or on 64bit doubles. In the tests below, I split each 64bit word in two 32bit halves and use the halves sequentially.
The procedure used in [1] is described in Section 5 of [1]:
Floating in [1] is done by
For a linear congruential generator, right shifting makes sense due to the most significant bits having a longer period than the least significant bits. For a properly designed generator, all output bits should be equally useful.
I have chosen the same number of tests as in [1], 3 runs of BigCrush for each {mixer, gamma and word reversal}tuple. The runs for a particular {mixer, gamma, reversal}tuple differ in the starting point chosen; $runIdx \times \gamma \times 2^{40}$, the first run having $runIdx = 0$. By chosing the starting points this way, the runs will be free from overlap since BigCrush uses about $2^{38}$ values.
In [1], the authors give a slightly low number for the expected number of "suspect" tests;
As can be seen in Table 2, both MurmurHash3 and Variant13 fails horribly for small gamma. This is the code for getting a new gamma when splitting in SplittableRandom:
Let H be a histogram with #C * 65 bins.
for i in 1, ..., n // Do n experiments
x = rnd() // Get random number on [0, 2^64)
h = f(x) // f is our mixer
for c in C
y = h XOR f(x XOR c)
p = hammingWeight(y)
cIdx = indexOf(c in C)
h[cIdx][p]' = h[cIdx][p] + 1
end for
end for
for i in 1, ..., n // Do n experiments
x = rnd() // Get random number on [0, 2^64)
h = f(x) // f is our mixer
for c in C
y = h XOR f(x XOR c)
p = hammingWeight(y)
cIdx = indexOf(c in C)
h[cIdx][p]' = h[cIdx][p] + 1
end for
end for
When this is done, I compute a goodnessoffit for each $h_c$. If $f(x)$ is a good approximation of a random permutation, $h_c$ should be binomially distributed with $B(n, \frac{1}{2})$.
The goodness of fitvalues are stored in an array with $\#C$ values. From this array I compute the mean, $\bar{x}$ and sample standard deviation, $s$. My energy function is simply $\bar{x} + s$. No, I don't have a good theoretical justification for this...
A sounder metric would be to truncate each histogram (bin bits $0, \ldots, k$ and $64  k, \ldots, 64$ into a lowbin and highbin) so that the lowest expected frequency would be around 5, then do a chisquare test and note the pvalues. The pvalues should now look as they were drawn from a $U(0, 1)$ distribution. One problem with this approach is that it would be insensitive to errors where the Hamming distance is very low or very high.
As can be seen from the code above, the compilation of the histogram will take time that is linearly proportional to $n\#C$. If $C$ is large, this is really slow.
To speed things up a bit, for any M, S, R1, R2, I start by using a preliminary value for $n$ as well as a "small" set for $C$. Let's call this preliminary set of constants $C_0$ (currently of size 24384)
and the extended set $C_1$ (currently of size 621408). I now test with the preliminary values. If the energy from the preliminary test is at most 1.1 times the lowest energy seen so far, I run an extensive test with a larger $n$ and $C_1$ instead of $C_0$.
When a new set of constants shows up that are better or close to as good as the old set, the search program tries the Mvalue, mutated first by single bit flips, then by all possible 2bit flips and so on. For each candidate M, shifts (S) in the range 2452 are tried. During the initial search for good constants I noticed that the error landscape isn't smooth at all. In particular decreasing or increasing S by one might lead to a worse energy even if there are shifts that are better. Hence, for each M do a complete (preliminary) test for all S in the range.
2. Results
I plugged MurmurHash3 and Stafford's Variant 13 into my search program and both fared badly. The sample means weren't so bad but the sample standard deviations were terrible. Clearly, for some constants $c$ as described above, the Hamming weights of $f(x) \oplus f(x \oplus c)$ were far from binomially distributed. When isolating the tests to only 2bit mutations, measuring the second order avalanche, both resulted in values far from the expected 1.0.
2.1 An avalanche metric
In [1], Figure 18, this snippet (in Java) is given to compute a first order sumofsquare avalanche statistic, with the bug with 1 vs. 1L as described above corrected:
long[][] A = new long[64][64];
for (long n = 0; n < N; n++) {
long v = rng.nextLong();
long w = mix64(v);
for (int i = 0; i < 64; i++) {
long x = w ^ m.mix(v ^ (1L << i));
for (int j = 0; j < 64; j++) {
if (((x >>> j) & 1) != 0) {
A[i][j] += 1;
}
}
}
}
double sumsq = 0.0;
for (int i = 0; i < 64; i++) {
for (int j = 0; j < 64; j++) {
double v = a[j][k]  N / 2.0;
sumsq += v*v;
}
}
double result = sumsq / ((N / 4.0) * 4096.0);
The code above is extended to compute higher order avalanche statistics like this (second order example in C):
for (long n = 0; n < N; n++) {
long v = rng.nextLong();
long w = mix64(v);
for (int i = 0; i < 64; i++) {
long x = w ^ m.mix(v ^ (1L << i));
for (int j = 0; j < 64; j++) {
if (((x >>> j) & 1) != 0) {
A[i][j] += 1;
}
}
}
}
double sumsq = 0.0;
for (int i = 0; i < 64; i++) {
for (int j = 0; j < 64; j++) {
double v = a[j][k]  N / 2.0;
sumsq += v*v;
}
}
double result = sumsq / ((N / 4.0) * 4096.0);
double measureAvalanche2_x(uint64_t (*h) (uint64_t), int runExp,
uint64_t inc, int complemented, int bins) {
assert(2016 % bins != 0); // choose(64, 2) == 2016
uint64_t* A = allocA(bins);
uint64_t C = (complemented ? 0xFFFFFFFFFFFFFFFFUL : 0);
uint64_t N = 1L << runExp;
for (uint64_t n = 0; n < N; n++) {
uint64_t v = n * inc;
uint64_t w = h(v);
int p = 0;
for (int i = 0; i < 64; i++) {
for(int j = i + 1; j < 64; j++) {
uint64_t x = w ^ h(v ^ (1L << i) ^ (1L << j) ^ C);
for (int k = 0; k < 64; k++) {
if (((x >> k) & 1) != 0) {
A[p * 64 + k]++;
}
}
p = (p + 1) % bins;
}
}
}
double result = sumsq(A, bins, N * (2016 / bins));
freeA(A);
return result;
}
uint64_t inc, int complemented, int bins) {
assert(2016 % bins != 0); // choose(64, 2) == 2016
uint64_t* A = allocA(bins);
uint64_t C = (complemented ? 0xFFFFFFFFFFFFFFFFUL : 0);
uint64_t N = 1L << runExp;
for (uint64_t n = 0; n < N; n++) {
uint64_t v = n * inc;
uint64_t w = h(v);
int p = 0;
for (int i = 0; i < 64; i++) {
for(int j = i + 1; j < 64; j++) {
uint64_t x = w ^ h(v ^ (1L << i) ^ (1L << j) ^ C);
for (int k = 0; k < 64; k++) {
if (((x >> k) & 1) != 0) {
A[p * 64 + k]++;
}
}
p = (p + 1) % bins;
}
}
}
double result = sumsq(A, bins, N * (2016 / bins));
freeA(A);
return result;
}
2.2 Measurements
The number of possible $k$bit differences in a 64bit word is ${64\choose k}$. To be able to run the tests for many instances I wrote a CUDAprogram. To create less memory contention, each thread works on a small number of bins instead of the full number. Using the extensions to 2nd, 3rd and 4th order avalanche, with $bins_{1, \ldots, 4} = \{64, 288, 217, 217\}$, here's the avalanche sumofsquares statistic of order 1, 2, 3, 4. Inputs are the products of $a \times [0, 2^{30})$ for $\Delta_1$, $a \times [0, 2^{25})$ for $\Delta_2$, $a \times [0, 2^{20})$ for $\Delta_3$ and $a \times [0, 2^{20})$ for $\Delta_4$:
Mixer  a = 0x40EAD42CA1CD0131  

$\Delta_1$  $\Delta_2$  $\Delta_3$  $\Delta_4$  
rrmxmx  0.975  0.992  1.039  1.005 
Murmur3  1.423  11049.99  1.003  3.004 
Variant13  1.008  2131.30  25.46  1.271 
Note that both MurmurHash3 and Variant13 have a second order sumofsquares avalanche statistic that is very far from what should be expected from a randomly chosen permutation. Variant13 fares better in almost all tests than the original MurmurHash3 but much worse with regard to third order avalanche.
2.3 Testing as a PRNG with TestU01
Neither MurmurHash3 or Variant13 had property 3 in the introduction above ("feeding low entropy data should render a randomlooking output sequence") as a design goal. Since Variant13 and partly, MurmurHash3, are being used as the mixing functions in Java 8's SplittableRandom, they deserve more scrutiny with regard to that property.I ran TestU01's BigCrush on my own function, called rrmxmx, MurmurHash3 and Variant13 with a simple counter, similar to how the state update is done in SplittableRandom. pvalues outside $[10^{−3} , 1 − 10^{−3}]$ were flagged as "suspect" and pvalues outside $[10^{−10} , 1 − 10^{−10}]$ were flagged as "failed".
TestU01 is not geared towards testing random bit streams as such but is (mostly) word oriented with 30 and 31bit words. If one assumes that the output to be tested is a stream of bits rather than a stream of words, it's prudent to at least also feed TestU01 with all output words bitreversed.
The data in Table 2 below are from using rrmxmx, MurmurHash3 and Variant13 to generate a pseudo random stream in this fashion:
uint64_t next(uint64_t *state, uint64_t gamma) {
uint64_t s = *state;
*state += gamma;
return mix(s);
}
This is fed to TestU01 byuint64_t s = *state;
*state += gamma;
return mix(s);
}
static void runTestsInt(unsigned int (*prng)(void), char name[]) {
unif01_Gen* gen = unif01_CreateExternGenBits(name, prng);
bbattery_BigCrush(gen);
unif01_DeleteExternGenBits(gen);
}
Note that I use all 64 bits of output, without floating, and that this is stricter/more revealing.unif01_Gen* gen = unif01_CreateExternGenBits(name, prng);
bbattery_BigCrush(gen);
unif01_DeleteExternGenBits(gen);
}
TestU01 works either on 32 bit words, using 30 to 31 bits, or on 64bit doubles. In the tests below, I split each 64bit word in two 32bit halves and use the halves sequentially.
"TestU01 expects to test 64bit double values, so we use 53 out of every 64 bits generated by our algorithm to generate a double value as for our method nextDouble."This masks some serious defects in the 11 least significant bits of MurmurHash3 and Variant13 when $\gamma$ is small. Except for large $\gamma$, BigCrush generates many more suspect/failurevalues than expected for MurmurHash3 and Variant13. Note that rrmxmx seems to be unaffected by small $\gamma$.
Floating in [1] is done by
(nextLong() >>> 11) * DOUBLE_ULP
thus discarding the 11 least significant bits. This practice in itself is not any better than using the least significant 53 bits, as in
(nextLong() & ((1L << 53)  1)) * DOUBLE_ULP
but seems to be common practice anyway.For a linear congruential generator, right shifting makes sense due to the most significant bits having a longer period than the least significant bits. For a properly designed generator, all output bits should be equally useful.
I have chosen the same number of tests as in [1], 3 runs of BigCrush for each {mixer, gamma and word reversal}tuple. The runs for a particular {mixer, gamma, reversal}tuple differ in the starting point chosen; $runIdx \times \gamma \times 2^{40}$, the first run having $runIdx = 0$. By chosing the starting points this way, the runs will be free from overlap since BigCrush uses about $2^{38}$ values.
In [1], the authors give a slightly low number for the expected number of "suspect" tests;
“TestU01 regards any pvalue outside the range $[10^{10}, 1  10^{10}]$ as indicating “clear failure” of the tested generator. The TestU01 software flags all pvalues outside the range $[10^{3}, 1  10^{3}]$; such values, if they are not clear failures, are regarded as “suspect.” For each generator variant we ran BigCrush 3 times and we report the total number of suspect pvalues and clear failures over all 3 runs; each run of BigCrush performs 160 distinct tests, so 3 runs collectively perform 480 tests. With 480 tests at twosided p = 0.001, we expect even a perfect generator to exhibit 0.96 “suspect” values on average.”I suspect "160" and "480" are typos. Pages 148152 of The "Compact user's guide" [3] for TestU01 list the individual tests and parameters, giving a count of 106. According to page 147 of [3], several tests give more than one pvalue:
"Some of the tests compute more than one statistic using the same stream of random numbers and these statistics are thus not independent. That is why the number of statistics in the summary reports is larger than the number of tests in the description of the batteries."Assuming the pvalues are independent, with 3 runs of with 254 values each and twosided p = 0.001, we expect a perfect generator to exhibit 1.524 “suspect” values on average.
Mixer  $\gamma$ = 1  $\gamma$ = 0x55555555  $\gamma$ = 0xC45A11730CC8FFE3  

Forward  Reverse  Forward  Reverse  Forward  Reverse  
suspect  failed  suspect  failed  suspect  failed  suspect  failed  suspect  failed  suspect  failed  
rrmxmx  3  0  2  0  2  0  1  0  1  0  2  0 
Murmur3  6  47  9  140  10  4  14  37  1  0  1  0 
Varian13  9  24  14  38  9  3  10  1  0  0  1  0 
As can be seen in Table 2, both MurmurHash3 and Variant13 fails horribly for small gamma. This is the code for getting a new gamma when splitting in SplittableRandom:
private static long mixGamma(long z) {
// MurmurHash3 mix constants z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
z = (z ^ (z >>> 33))  1L; // force to be odd
int n = Long.bitCount(z ^ (z >>> 1)); // ensure enough transitions
return (n < 24) ? z ^ 0xaaaaaaaaaaaaaaaaL : z;
}
Compared to [1], MurmurHash3 and Variant13 have switched purposes, MurmurHash3 is used for new gammas and Variant13 is used for the prngstream.
Alas, the mixGamma() function above does not prevent bad gammas from occuring. 0x55555555 is, as can be seen by testing, a gamma that fares very badly but is still entirely possible. I do not yet know how large a proportion of the values from mixGamma() are problematic and give rise to a prnginstance that have highly undesirable properties. Melissa O'Neill made the same reflection on bad gammas before me in her blog post "Bugs in SplitMix(es)". They are trivial to construct, the question is at what frequency they are showing up in practice?
// MurmurHash3 mix constants z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
z = (z ^ (z >>> 33))  1L; // force to be odd
int n = Long.bitCount(z ^ (z >>> 1)); // ensure enough transitions
return (n < 24) ? z ^ 0xaaaaaaaaaaaaaaaaL : z;
}
2.4 Testing as a PRNG with PractRand
I also decided to test with PractRand, V0.93, with options "tf 2 te 1 tlmin 1KB". All values of $\gamma$ in Table 1 of [1] marked "poor" and "okay" fail PractRand after a very small amount of output from MurmurHash3 and Variant13. The gammas marked as "random" in Table 1 of [1] fail with MurmurHash3 as the mixing function after at most $2^{41}$ bytes.
It's clear that Variant13 is more well behaved than MurmurHash3, but Variant3 fails PractRand quite rapidly for many gammas. Are the gammas marked as "random" a fluke or the norm with regards to how well behaved Variant13 is? I'll try some experiments in a blog post to come, trying to estimate the likelihood that SplittableRandom gets a bad gamma.
$\gamma$  rrmxmx  Murmur3  Variant13 

0x0000000000000001
 $2^{44}$  $2^{16}$  $2^{21}$ 
0x0000000000000003
 $2^{47.71}$  $2^{18}$  $2^{22}$ 
0x0000000000000005
 $> 2^{48}$  $2^{19}$  $2^{21}$ 
0x0000000000000009
 $> 2^{48}$  $2^{19}$  $2^{22}$ 
0x0000010000000001
 $2^{46.41}$  $2^{18}$  $2^{20}$ 
0xffffffffffffffff
 $2^{45.62}$  $2^{16}$  $2^{21}$ 
0x0000000000ffffff
 $2^{45.39}$  $2^{17}$  $2^{22}$ 
0xffffff0000000001
 $2^{46.55}$  $2^{18}$  $2^{22}$ 
0x0000000000555555
 $2^{47.42}$  $2^{25}$  $2^{30}$ 
0x1111111111110001
 $> 2^{48}$  $2^{27}$  $2^{32}$ 
0x7777777777770001
 $> 2^{48}$  $2^{28}$  $2^{34}$ 
0x7f7f7f7f33333333
 $> 2^{48}$  $2^{31}$  $2^{33}$ 
0x5555550000000001
 $> 2^{48}$  $2^{24}$  $2^{30}$ 
0xc45a11730cc8ffe3
 ?  $2^{39}$  $> 2^{42}$ 
0x2b13b77d0b289bbd
 ?  $2^{39}$  $> 2^{42}$ 
0x40ead42ca1cd0131
 ?  $2^{40}$  $> 2^{42}$ 
After $2^{43}$ bytes, PractRand has yet not complained when running Variant13 with $\gamma = 40EAD42CA1CD0131_{16}$.
rrmxmx fails at the $2^{44}$ level with $\gamma = 1$. With $\gamma = 3$, it passes PractRand up to $2^{45}$ bytes but with one test marked as suspicious at that level;
rrmxmx fails at the $2^{44}$ level with $\gamma = 1$. With $\gamma = 3$, it passes PractRand up to $2^{45}$ bytes but with one test marked as suspicious at that level;
rng=RNG_stdin, seed=0xca839ae9
length= 32 terabytes (2^45 bytes), time= 506449 seconds
Test Name Raw Processed Evaluation
BCFN_FF(2+2):freq R= +8.3 p~= 1e7 suspicious
...and 1900 test result(s) without anomalies
length= 32 terabytes (2^45 bytes), time= 506449 seconds
Test Name Raw Processed Evaluation
BCFN_FF(2+2):freq R= +8.3 p~= 1e7 suspicious
...and 1900 test result(s) without anomalies
Considering that the test at the $2^{45}$ level took 140 hours I won't run a test at $2^{46}$ level but some caution is recommended if running rrmxmx as a PRNG with very sparse (1 or 2bit) gammas. Currently I am running TestU01 for gammas of the form $2^n$ for $0 < n < 25$. Assuming rrmxmx behaves as a randomly chosen permutation, these gammas should pass too. One would of course never use them for a PRNG considering that they shorten the period to $2^{64  n}$.
3. Speed
The operations in rrmxmx are identical to MurmurHash3 except for the first step; instead of a right shift and xor there are two rotates and 2 xors. With my machine (i75820), ILP makes the speed of rrmxmx, MurmurHash3 and Variant13 identical as far as I can measure.
4. Further work
Assuming
x ^= ror64(x, a) ^ ror64(x, b)
is just as fast as
x ^= x >> a
(On my Intel i75820 and on AMD Ryzen 7 1700 it is, for many other processors, it isn't.)
A search should be made to see if there could be a better version looking like this:
uint64_t rrmxmx2(uint64_t v) {
uint64_t v2 = v;
v2 ^= ror64(v2, R1) ^ ror64(v2, R2);
v2 *= M1;
v2 ^= ror64(v2, R3) ^ ror64(v2, R4);
v2 *= M2;
return v2 ^ ror64(v2, R5) ^ ror64(v2, R6);
}
uint64_t v2 = v;
v2 ^= ror64(v2, R1) ^ ror64(v2, R2);
v2 *= M1;
v2 ^= ror64(v2, R3) ^ ror64(v2, R4);
v2 *= M2;
return v2 ^ ror64(v2, R5) ^ ror64(v2, R6);
}
The indication that rrmxmx with the constants M = 0x9FB21C651E98DF25, S = 28, R1 = 49, R2 = 24 can be improved is that there is a clear difference when looking at the goodnessoffit for different constants c as described in section 1.1 above.
5. Conclusion (TL;DR)
If you have the need for a reasonably fast permutation on 64bit words that looks quite random, this function might be useful. MurmurHash3 and Variant13 are very easy to distinguish from random permutations as shown above.
There may be better constants for MurmurHash13 than David Stafford's improvements; I have not been searching using my energy function.
There may be better constants for MurmurHash13 than David Stafford's improvements; I have not been searching using my energy function.
A. Source code
A.1 C:
#include <stdint.h>
static inline uint64_t ror64(uint64_t v, int r) {
return (v >> r)  (v << (64  r));
}
uint64_t rrmxmx(uint64_t v) {
v ^= ror64(v, 49) ^ ror64(v, 24);
v *= 0x9FB21C651E98DF25L;
v ^= v >> 28;
v *= 0x9FB21C651E98DF25L;
return v ^ v >> 28;
}
// Code below only necessary if one wants to compute the inverse rrmxmx.
uint64_t xorRor24_49Inv(uint64_t v) {
const int rors[] = { 4, 8, 9, 11, 15, 16, 18, 20, 24,
25, 26, 29, 30, 32, 40, 41, 43, 44,
45, 48, 50, 54, 56, 57, 58, 60 };
uint64_t acc = 0;
for (int i = 0; i < sizeof(rors) / sizeof(int); i++) {
acc ^= ror64(v, rors[i]);
}
return acc ^ v;
}
static uint64_t xorShift28Inv(uint64_t v) {
return v ^ (v >> 28) ^ (v >> 56);
}
uint64_t unrrmxmx(uint64_t v) {
uint64_t x = xorShift28Inv(v) * 0x2AB9C720D1024ADL;
x = xorShift28Inv(x) * 0x2AB9C720D1024ADL;
return xorRor24_49Inv(x);
}
static inline uint64_t ror64(uint64_t v, int r) {
return (v >> r)  (v << (64  r));
}
uint64_t rrmxmx(uint64_t v) {
v ^= ror64(v, 49) ^ ror64(v, 24);
v *= 0x9FB21C651E98DF25L;
v ^= v >> 28;
v *= 0x9FB21C651E98DF25L;
return v ^ v >> 28;
}
// Code below only necessary if one wants to compute the inverse rrmxmx.
uint64_t xorRor24_49Inv(uint64_t v) {
const int rors[] = { 4, 8, 9, 11, 15, 16, 18, 20, 24,
25, 26, 29, 30, 32, 40, 41, 43, 44,
45, 48, 50, 54, 56, 57, 58, 60 };
uint64_t acc = 0;
for (int i = 0; i < sizeof(rors) / sizeof(int); i++) {
acc ^= ror64(v, rors[i]);
}
return acc ^ v;
}
static uint64_t xorShift28Inv(uint64_t v) {
return v ^ (v >> 28) ^ (v >> 56);
}
uint64_t unrrmxmx(uint64_t v) {
uint64_t x = xorShift28Inv(v) * 0x2AB9C720D1024ADL;
x = xorShift28Inv(x) * 0x2AB9C720D1024ADL;
return xorRor24_49Inv(x);
}
A.2 Java:
public static long rrmxmx(long v) {
v ^= Long.rotateRight(v, 49) ^ Long.rotateRight(v, 24);
v *= 0x9FB21C651E98DF25L;
v ^= v >>> 28;
v *= 0x9FB21C651E98DF25L;
return v ^ v >>> 28;
}
// Code below only necessary if one wants to compute the inverse rrmxmx.
public static long unrrmxmx(long v) {
long x = xorShift28Inv(v) * 0x2AB9C720D1024ADL;
x = xorShift28Inv(x) * 0x2AB9C720D1024ADL;
return invertRor2449(x);
}
private static long invertRor2449(long v) {
int[] rots = { 4, 8, 9, 11, 15, 16, 18, 20, 24,
25, 26, 29, 30, 32, 40, 41, 43, 44,
45, 48, 50, 54, 56, 57, 58, 60 };
long acc = 0;
for (int r : rots) {
acc ^= Long.rotateRight(v, r);
}
return acc ^ v;
}
private static long xorShift28Inv(long v) {
return v ^ (v >>> 28) ^ (v >>> 56);
}
v ^= Long.rotateRight(v, 49) ^ Long.rotateRight(v, 24);
v *= 0x9FB21C651E98DF25L;
v ^= v >>> 28;
v *= 0x9FB21C651E98DF25L;
return v ^ v >>> 28;
}
// Code below only necessary if one wants to compute the inverse rrmxmx.
public static long unrrmxmx(long v) {
long x = xorShift28Inv(v) * 0x2AB9C720D1024ADL;
x = xorShift28Inv(x) * 0x2AB9C720D1024ADL;
return invertRor2449(x);
}
private static long invertRor2449(long v) {
int[] rots = { 4, 8, 9, 11, 15, 16, 18, 20, 24,
25, 26, 29, 30, 32, 40, 41, 43, 44,
45, 48, 50, 54, 56, 57, 58, 60 };
long acc = 0;
for (int r : rots) {
acc ^= Long.rotateRight(v, r);
}
return acc ^ v;
}
private static long xorShift28Inv(long v) {
return v ^ (v >>> 28) ^ (v >>> 56);
}
B. Expected output/test vectors
0x0000000000000000

0x0000000000000000

0x0000000000000000

0x0000000000000000
 
0x0000000000000001

0x23085d6f7a569905

0x56ed9162154faac0

0x0000000000000001
 
0x0000000000000003

0xcaea878c77a59454

0x0ec1bfbe6983c5a0

0x0000000000000003
 
0x0000000000000007

0xa77bd5a63a7785c5

0x1718113ac9a1f119

0x0000000000000007
 
0x0101010101010101

0x36cb9e821eca6c5b

0xfa63351a390851cd

0x0101010101010101
 
0x0123456789abcdef

0xc337a528d7e42497

0x7529d4da142b1f1c

0x0123456789abcdef
 
0x084c2a6e195d3b7f

0x507d53f1ba22542c

0xec3694cd1c80b9cd

0x084c2a6e195d3b7f
 
0x1000000000000001

0xedd3f3f24766de89

0xdb302dae3ad882e0

0x1000000000000001
 
0x1111111111111111

0x7547f019c63c1df3

0xea6d9bbf167027c9

0x1111111111111111
 
0x1fffffffffffffff

0x05e3c8367d6677d6

0x7fbbf24327033cf0

0x1fffffffffffffff
 
0x3fffffffffffffff

0x47e7c1e973d349ff

0x240ba915bbb5e089

0x3fffffffffffffff
 
0x6666666666666666

0xd9c6e8c9ecd1e30a

0xf4b9c6565f8d9529

0x6666666666666666
 
0x7777777777777777

0x29823cb92ada0068

0xdca549733043f019

0x7777777777777777
 
0x7f7f7f7f7f7f7f7f

0xc58024da69c2eb57

0xf1d5238b66aaaf5e

0x7f7f7f7f7f7f7f7f
 
0x7ffffffffffffff7

0x30c8918fcb6b2b3c

0x3a836e49ca560dd8

0x7ffffffffffffff7
 
0x7fffffffffffffff

0x91b750beb6849d8f

0x90354478a1b6e49d

0x7fffffffffffffff
 
0x8000000000000000

0x5e2d59ded82568fc

0xa0f3362cbce5bedb

0x8000000000000000
 
0x8000000000000008

0xae03d8a5f03d42bb

0xed1a6dc89b6e22d2

0x8000000000000008
 
0x8080808080808080

0x269ed61ad0d4a3ad

0xcf8b0a0dccbf9da9

0x8080808080808080
 
0x8888888888888888

0x2f6af135bf8e9d79

0x2c50b3a1d5c7a854

0x8888888888888888
 
0x9999999999999999

0x50a99564c864eb28

0x6ae2b8e14b6d3c7c

0x9999999999999999
 
0xc000000000000000

0xf5f0f95fcd968a80

0x6ae70fea73bd7a6d

0xc000000000000000
 
0xe000000000000000

0x160c347d11027361

0x9a3d176b24d68305

0xe000000000000000
 
0xeeeeeeeeeeeeeeee

0x9f9714241fb64d9e

0x0a40b8632cad4bfa

0xeeeeeeeeeeeeeeee
 
0xeffffffffffffffe

0x742025f2e92e6aec

0xf7eaaefaaa16ddb8

0xeffffffffffffffe
 
0xf7b3d591e6a2c480

0x60f421f08a38d500

0xf520f63f955ac204

0xf7b3d591e6a2c480
 
0xfedcba9876543210

0x8fec24c21c6d66de

0xf18dbb478c6d3943

0xfedcba9876543210
 
0xfefefefefefefefe

0x125c8836f02c998f

0xe4b673f0521ad37d

0xfefefefefefefefe
 
0xfffffffffffffff8

0x6018ed12f08b6eec

0x1b32e354639f82f1

0xfffffffffffffff8
 
0xfffffffffffffffc

0x420b85f7b23fa512

0xe317247fad148210

0xfffffffffffffffc
 
0xfffffffffffffffe

0xc320bdd84877d048

0x31c9d93c42d48cea

0xfffffffffffffffe
 
0xffffffffffffffff

0x8bc57fddf83265bd

0xb694bf1eaa6682c4

0xffffffffffffffff

This is a great writeup. I am having a terrible feeling the question I am about to ask may sound ridiculously stupid. Why is function "y = x ^ ror(x, a) ^ ror(x, b)" bijective? A simple case when x = 101, a = 1, b = 2 results in y = 000; similarly when x = 110 or x = 011 the function also produces y = 000. I am almost convinced I am missing something
SvaraRaderaI assume you mean 101 in binary, 5 in (hexa)decimal.
RaderaFirst case, x = 101_2:
ror64(0x5, 1) == 0x8000000000000002
ror64(0x5, 2) == 0x4000000000000001
0x0000000000000005
0x8000000000000002
^ 0x4000000000000001

0xC000000000000008
Second case, x = 110_2
ror64(0x6, 1) == 0x0000000000000003
ror64(0x6, 2) == 0x8000000000000001
0x0000000000000003
0x8000000000000001
^ 0x0000000000000006

0x800000000000000A != 0xC000000000000008
That x ^ ror64(x, a) ^ ror64(x, b) is bijective might not be entirely obvious. Note that x ^ ror64(x, a) isn't, for any a.
I checked by verifying that there exists an inverse for every {a, b} when using 64 bits. Since rotates, shifts and xor all can be expressed by multiplication with a binary matrix, I constructed the relevant matrices in Octave and checked that there exists an inverse for each matrix.
Perhaps not so elegant but it got the job done.
Coming up is a post for word sizes 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 20, 24, 30, 36, 40, 48, 56 & 64. Plus variants. I guess I should put the constants up on github, for the cases where not all pairs are bijective.
One could of course also mix rotations and shifts or have more than 2 rotations/shifts for a single xor.
I did compute all invertible
pairs {a, b} for
f_2(x, a, b) { return x ^ (x >> a) ^ (x << b); }
triples {a, b, c} for
f_3(x, a, b) { return x ^ (x >> a) ^ (x >> c) ^ (x << b); }
and quadruples {a, b, c, d} for
f_3(x, a, b) { return x ^ (x >> a) ^ (x >> c) ^ (x << b) ^ (x << d); }
a for 32 and 64bits a while ago.
You might want to read this new post: http://mostlymangling.blogspot.com/2018/09/invertibleadditionsmod2.html
I should have guessed that "x ^= ror(x, a) ^ ror(x, b)" implies number of bits is 64; I erroneously assumed it was generic. My example above doesn't really apply to your calculations which are only dealing with 64 bit numbers. It looks like the function is bijective when a number of bits is not divisible by 3. I knew I was missing something, thanks for clarification.
RaderaI found an interesting result based on this post, using xorrotaterotate but omitting one of the xorshifts. In this code, rotate64 is a leftrotation defined in a macro:
SvaraRaderauint64_t state = UINT64_C(0xEB5F2AE4); // any should work, this was tested
uint64_t random64() {
uint64_t z = (++state ^ UINT64_C(0xDB4F0B9175AE2165)) * UINT64_C(0x4823A80B2006E21B);
z = (z ^ rotate64(z, 52) ^ rotate64(z, 21) ^ UINT64_C(0x9E3779B97F4A7C15)) * UINT64_C(0x81383173);
return z ^ z >> 28;
}
(I probably made some elementary C++ mistakes in the UINT64_C usage.) This is one of very few gamma=1 unary hashes I have tested that passes 32TB of PractRand, and is probably the fastest of the ones I have found due to omitting one of the xorshiftmultiply lines. Any multiplications use an XLCG's step of a XOR with a constant that has the last three bits 101, then a a multiplication by a constant with last three bits 011 (this makes an input of 0 produce a nonzero output). The initial step acts like a multiplication by a large gamma, but the XOR preceding it means that the modular multiplicative inverse of the gamma can't be used accidentally as the stride between inputs. I don't know why this set of rotation amounts, xors, shifts, and multiplies works but most others, when substituted in, don't. Some multipliers were chosen for having a low population count, in the hope that they would be faster. The XOR constants are 2 to the 64 divided by generalizations of the golden ratio, with the last bit adjusted to be odd. I hope this hash can be useful or interesting!
Something that you may want to check out, part of a post coming up Real Soon Now, is feeding a rotated and bit reversed unary counters. That is 128 possible counters, feeding each to the mixing function and run the output though PractRand. So far, many failures out of 128 tests (running with  tf 2 tlmax 1TB or so) seems to be a strong indicator that the function will eventually fail for a nonrotated unary counters. The MurmurHash3 finalizer and Variant 13 fails quite spectacularly. So far I have one candidate that fails one out of 128 tests after 1TB on that particular test. It uses two (rorrorxor)mul steps and one shiftxor. Not the same constants used above. I also have a candidate with no failures so far but it has only passed about 16 put of the 128 subtests.
RaderaI ran a complete test, 128 instances with counters as described above and the results were not encouraging:
Radera
Will test ./ettinger increment 1 seed 0  RNG_test stdin64 tf 2 te 0 multithreaded tlmax 1TB tlmin 1KB
00F:40 *01F:15* *02F:15* *03F:14* *04F:15* *05F:15* *06F:16* *07F:16*
*08F:15* *09F:16* *10F:15* *11F:15* *12F:15* *13F:16* *14F:16* *15F:16*
*16F:16* *17F:16* *18F:16* *19F:17* *20F:16* *21F:18* *22F:19* *23F:18*
*24F:18* *25F:17* *26F:19* *27F:17* *28F:17* *29F:18* *30F:19* *31F:20*
*32F:20* *33F:21* *34F:22* *35F:22* *36F:23* *37F:19* *38F:19* *39F:16*
*40F:17* *41F:19* *42F:19* *43F:22* *44F:24* *45F:28* *46F:33* *47F:35*
*48F:36* *49F:35* *50F:37* *51F:37* *52F:33* *53F:37* *54F:37* *55F:38*
*56F:39* 57F:40 58F:40 59F:40 60F:40 61F:40 62F:40 63F:40
*00R:14* *01R:15* *02R:15* *03R:16* *04R:15* *05R:16* *06R:16* *07R:16*
*08R:16* *09R:16* *10R:15* *11R:15* *12R:17* *13R:17* *14R:17* *15R:17*
*16R:18* *17R:19* *18R:18* *19R:19* *20R:20* *21R:18* *22R:17* *23R:21*
*24R:25* *25R:25* *26R:25* *27R:23* *28R:19* *29R:19* *30R:21* *31R:16*
*32R:16* *33R:16* *34R:21* *35R:19* *36R:19* *37R:19* *38R:21* *39R:22*
*40R:25* *41R:27* *42R:26* *43R:25* *44R:24* *45R:23* *46R:22* *47R:21*
*48R:20* *49R:19* *50R:18* *51R:17* *52R:17* *53R:17* *54R:17* *55R:17*
*56R:17* *57R:17* *58R:17* *59R:17* *60R:16* *61R:15* *62R:15* *63R:15*
120 failures out of 128 tests with max 2^40 bytes.

*XXF:AA* indicates that mix(ror64(ctr++, XX)) failed after 2^AA analyzed bytes.
XXR:AA indicates that mix(ror64(reverse(ctr++), XX)) passed (tlmax at 1TB, 2^40).
So it behaves quite well with gamma=1 but that is a rather special case. As an approximation of a random permutation, it's not particularly strong.
I'll try to find time later tonight to post the best I've found so far, (1 failure after 2^39 bytes, at ror64(reverse(ctr), 14)).
(Your proposal is still much better than MurmurHash3 and Variant13.)
So there: https://mostlymangling.blogspot.com/2019/01/betterstrongermixerandtestprocedure.html
RaderaI'd be very interested to know what flags to PractRand you were using when it passed 32TB with a unitary counter.
I used the command line:
Radera$ RNG_test.exe Ettinger multithreaded seed 0xa6948409
But my PractRand code has been modified (not the tests, just adding random number generators). I added this code to mult.c in the "not recommended" section of the RNGs (rotate64 is defined by PractRand as a left rotation, not a ror64 like with the other code you're comparing with):
Uint64 Ettinger::raw64() {
//// passes 32TB with one anomaly, unusual at only 2GB: [Low16/64]BCFN(2+2,132,T)
uint64_t z = (++state ^ UINT64_C(0xDB4F0B9175AE2165)) * UINT64_C(0x4823A80B2006E21B);
z = (z ^ rotate64(z, 52) ^ rotate64(z, 21) ^ UINT64_C(0x9E3779B97F4A7C15)) * UINT64_C(0x81383173);
return z ^ z >> 28;
}
std::string Ettinger::get_name() const { return "Ettinger"; }
void Ettinger::walk_state(StateWalkingObject *walker) {
walker>handle(state);
}
I added this in mult.h:
class Ettinger : public vRNG64 {
Uint64 state;
public:
Uint64 raw64();
std::string get_name() const;
void walk_state(StateWalkingObject *);
};
And it's added to the listing in RNG_from_name.h, for the section related to mult.h :
REGISTER_RNG_0(Ettinger)
Hope that helps. It looks like this one might be strong only when the counter is 1 and nothing is rotated.
If you have the time, please test it with tf 2 too. That is the "standard level" I have been using when testing MurmurHash3, Variant 13 and rrmxmx (as well as my latest and greatest).
RaderaSorry about the sloppiness, will rerun the test with *left*rotation and post the results.
Den här kommentaren har tagits bort av skribenten.
SvaraRaderaHey Pelle, interesting work. Do you also have variants for 32bit mixing?
SvaraRaderaHello, you're doing a great job and I enjoyed reading your block. What license do you publish your rmxmx code under? Can I use it in my projects?
SvaraRaderaConsider it public domain, credit me if you want to.
RaderaNote however that this was one of my first (and worst) attempts at a good 64 bit mixer.
A better rrxmrrxmxsmixer is here:
https://mostlymangling.blogspot.com/2019/01/
A much better (and slightly slower) alternative is here:
https://mostlymangling.blogspot.com/2020/01/
The best compromise speed/quality I've seen is Jon Maiga's mx3 mixer:
http://jonkagstrom.com/mx3/mx3_rev2.html
The extra multiply and shift in mx3 is slightly slower than Murmur3style mixers (3 xs, 2ms) but provides vastly superior results from a statistical standpoint. According to my measurements, it's on most platforms about as fast as NASAM.
If your main platform is older GPUs (< GeForce RTX), NASAM is probably faster due to 64 bit multiplications being expensive.
On more recent GPUs and modern 64 bit CPUs, mx3 is faster.
Thank you very much! Yes, of course I wouldn't take anything without credit. I'm interested in applications like short string hashing and fast generation of pseudorandom values from randomized but not uniformly distributed sources. But nothing that requires gigabytes per second. For example, with strings like local file paths I was sure that MurmurHash3 will be more than good enough, but no, I've encountered more than a few collisions of 64bit hashes. Which is, of course, always possible, but shouldn't be that common.
RaderaOh, and before that I was sure FNV1a will be more than enough, but had to upgrade to Murmur3 quite quickly :)
P. S. I have now read every entry in your blog, looking forward for more, and thanks a lot for your work and for making it public,