Monday, November 8, 2021

Applying the RRC-64-test to the degski64 and Lea64 mixers

In LXM: better splittable pseudorandom number generators (and almost as fast) by Guy L. Steele Jr. and Sebastiano Vigna, pseudo-random number generators are constructed by combining PRNGs of different structure through a non-linear mixer. The three 64-bit mixers considered in the paper are Austin Appleby's Murmur3 finalizer (along with Stafford's Variant 13 of it), which I have written about here and here, and the degski64 and Lea64 mixers.

degski64 and Lea64 are similar to Murmur3 in structure but use xor-shifts by 32 bits exclusively (Murmur3 included for reference):

uint64_t murmur64(uint64_t z) {
  z ^= (z >> 33);
  z *= 0xff51afd7ed558ccdull;
  z ^= (z >> 33);
  z *= 0xc4ceb9fe1a85ec53ull;
  return z ^ (z >> 33);
}

uint64_t lea64(uint64_t z) {
  z ^= (z >> 32);
  z *= 0xdaba0b6eb09322e3ull;
  z ^= (z >> 32);
  z *= 0xdaba0b6eb09322e3ull;
  return z ^ (z >> 32);
}

uint64_t degski64(uint64_t z) {
  z ^= (z >> 32);
  z *= 0xd6e8feb86659fd93ull;
  z ^= (z >> 32);
  z *= 0xd6e8feb86659fd93ull;
  return z ^ (z >> 32);
}
I am not aware of the design rationale behind using 32-bit shifts exclusively; presumably they are faster on some platforms. A short test with clang-12 on a Ryzen 5900X gives timings for the four functions that are within 0.5% of each other, so my guess is that they are equally fast (200 Gops sampled per mixer, ~2.57 Gops/s).
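
For reference, a throughput harness for such a measurement could look like the sketch below. This is a minimal sketch and not the exact program behind the numbers above; the xor accumulator and the final printout keep the compiler from optimizing the mixing loop away.

#include <stdint.h>
#include <stdio.h>
#include <time.h>

uint64_t murmur64(uint64_t z); // or any of the mixers above

int main(void) {
  const uint64_t N = 1ULL << 30; // ~10^9 mixer calls
  uint64_t acc = 0;
  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (uint64_t i = 0; i < N; i++)
    acc ^= murmur64(i);
  clock_gettime(CLOCK_MONOTONIC, &t1);
  double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
  printf("%.2f Gops/s (acc=%016llx)\n", (double)N / s * 1e-9,
         (unsigned long long)acc);
  return 0;
}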

Applying RRC-64 to degski64 and Lea64

Sadly, neither mixer is particularly strong: both fail RRC-64 about as badly as the original Murmur3/Variant 13. Moremur is, at least as far as RRC-64 goes, a much stronger mixer with the same number of operations.

As can be seen from the tables below, the constants used in degski64 and Lea64 are no stronger (w.r.t. RRC-64) than Murmur3/Variant 13 but much weaker than Moremur.

degski64 failures, $2^{40}$ bytes maximum. Each entry is the base-2 logarithm of the test length, in bytes, at which PractRand failed; the rotation r is the row offset plus the column index.
ror64(reverse(x, R) ^ 0x0000000000000000, r)
         FORWARD                                             REVERSED
Offset   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
     0  17 17 15 17 15 15 16 15 15 17 16 15 15 14 15 15   14 14 14 15 14 17 16 15 17 16 17 15 15 17 16 15
    16  14 15 15 15 15 15 15 15 15 15 15 15 15 15 15 16   17 15 16 15 16 16 15 17 15 15 15 15 17 15 17 16
    32  16 16 15 17 15 15 16 15 15 17 16 16 14 14 14 15   14 14 14 15 14 17 16 15 17 16 15 15 15 17 16 15
    48  14 15 15 15 15 15 15 15 15 15 15 15 15 15 15 17   17 15 16 15 17 16 15 17 15 15 15 15 17 16 17 17
ror64(reverse(x, R) ^ 0xFFFFFFFFFFFFFFFF, r)
         FORWARD                                             REVERSED
Offset   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
     0  17 17 15 17 15 15 16 16 15 16 16 15 14 14 14 15   14 14 16 15 14 17 15 15 17 16 15 15 15 17 16 15
    16  14 15 15 15 15 15 14 15 15 15 15 15 15 15 16 16   17 15 16 15 17 16 15 17 15 15 15 15 17 15 17 16
    32  15 16 15 17 15 15 16 15 15 16 15 16 14 14 15 15   14 14 14 15 14 17 15 15 17 16 15 15 15 17 16 15
    48  14 15 15 15 15 15 15 15 14 15 15 15 15 15 16 17   17 15 16 15 17 16 15 17 15 15 15 15 17 15 17 17
Lea64 failures, $2^{40}$ bytes maximum.
ror64(reverse(x, R) ^ 0x0000000000000000, r)
         FORWARD                                             REVERSED
Offset   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
     0  17 17 15 17 15 15 16 15 15 17 16 15 15 14 15 15   14 14 14 15 14 17 16 15 17 16 17 15 15 17 16 15
    16  14 15 15 15 15 15 15 15 15 15 15 15 15 15 15 16   17 15 16 15 16 16 15 17 15 15 15 15 17 15 17 16
    32  16 16 15 17 15 15 16 15 15 17 16 16 14 14 14 15   14 14 14 15 14 17 16 15 17 16 15 15 15 17 16 15
    48  14 15 15 15 15 15 15 15 15 15 15 15 15 15 15 17   17 15 16 15 17 16 15 17 15 15 15 15 17 16 17 17
ror64(reverse(x, R) ^ 0xFFFFFFFFFFFFFFFF, r)
         FORWARD                                             REVERSED
Offset   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
     0  17 17 15 17 15 15 16 16 15 16 16 15 14 14 14 15   14 14 16 15 14 17 15 15 17 16 15 15 15 17 16 15
    16  14 15 15 15 15 15 14 15 15 15 15 15 15 15 16 16   17 15 16 15 17 16 15 17 15 15 15 15 17 15 17 16
    32  15 16 15 17 15 15 16 15 15 16 15 16 14 14 15 15   14 14 14 15 14 17 15 15 17 16 15 15 15 17 16 15
    48  14 15 15 15 15 15 15 15 14 15 15 15 15 15 16 17   17 15 16 15 17 16 15 17 15 15 15 15 17 15 17 17
Moremur failures, $2^{40}$ bytes maximum.
ror64(reverse(x, R) ^ 0x0000000000000000, r)
         FORWARD                                             REVERSED
Offset   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
     0  33 19 19 19 22 22 24 25 28 27 28 29 16 16 16 17   16 17 19 19 20 28 27 28 34 29 29 35 36 37 37 38
    16  17 17 18 19 19 19 19 20 21 22 23 23 24 25 26 27   36 35 35 39 40 34 31 31 39 40 39 38 37 34 34 36
    32  28 29 30 31 31 32 33 33 34 34 34 34 34 33 32 32   28 34 33 32 32 31 33 33 34 34 34 35 35 36 37 36
    48  32 28 28 28 29 29 29 29 30 30 30 30 31 31 31 32   31 31 31 33 36 37 36 28 33 34 31 31 28 25 22 19
ror64(reverse(x, R) ^ 0xFFFFFFFFFFFFFFFF, r)
         FORWARD                                             REVERSED
Offset   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
     0  33 19 19 19 22 22 24 25 28 29 18 28 15 16 16 17   16 17 19 21 21 26 28 30 34 31 34 34 36 36 37 37
    16  17 18 18 19 19 19 19 19 21 22 23 24 24 25 26 27   36 35 35 39 40 34 31 31 39 40 39 38 37 34 34 36
    32  28 29 30 30 32 33 33 34 34 35 35 34 34 32 33 32   28 34 33 32 32 31 33 33 34 34 34 35 36 36 36 36
    48  32 28 28 28 28 28 29 29 29 30 30 30 31 31 32 32   31 31 31 33 35 35 34 28 32 32 31 30 28 25 22 19

Conclusion

It would be interesting to see to what extent a (much) stronger mixer would affect the results in the LXM paper. I believe that Moremur is a stronger drop-in replacement for Murmur3, Variant13, degski64 and Lea64 with comparable, if not identical, speed.

Friday, January 3, 2020

NASAM: Not Another Strange Acronym Mixer!

The various abbreviations became more and more unpronounceable. I give you "NASAM".

This is a new mixer that has been tested quite thoroughly. Since I have no more than two machines at my disposal (a Ryzen 7 1700 and an i7-5820), testing takes quite some time. I will complete this post with data on sparse gammas and other interesting things as I think of them.

This is the fastest permutation function I am aware of that passes RRC-64-42-TF2-0.94. Tommy Ettinger's function Pelican also passes RRC-64-42-TF2-0.94 and has similar speed.

NASAM mixer in C.
#include <stdint.h>

// ror64(a, r): rotate the 64-bit value a right by r bits (0 < r < 64).
static inline uint64_t ror64(uint64_t a, unsigned r) {
  return (a >> r) | (a << (64 - r));
}

uint64_t nasam(uint64_t x) {
  x ^= ror64(x, 25) ^ ror64(x, 47);
  x *= 0x9E6C63D0676A9A99UL;
  x ^= x >> 23 ^ x >> 51;
  x *= 0x9E6D62D06F6A9A9BUL;
  x ^= x >> 23 ^ x >> 51;

  return x;
}
Pelican has one advantage over NASAM: it has no fixed point at 0. There are at least three obvious ways of fixing this defect, should it be deemed necessary for a particular use case:
  1. Adding a constant after the first multiplication ("rrma2xsm2xs" in this post).
  2. Xoring a constant prior to rotating/xoring ("xNASAM" in this post).
  3. Xoring a constant prior to rotating/xoring and after the last xor-shift ("xNASAMx" in this post).

Assuming the statistical quality of 1--3 above is at least as good as NASAM's, the most interesting thing to analyze is of course speed. xNASAMx also has another property that might be of interest in some use cases.

For speed measurements, I've used Chris Wellons's shootout program, described here.

Speed in MB/s, compiled with -Ofast -march=native. The % column gives each mixer's best result relative to splitmix64's best result on the same machine.

                    i7-5820                      Ryzen 7 1700
Mixer          gcc 9.2.1  clang 8.0.1    %     gcc 9.2.1  clang 8.0.1    %
baseline           10165        10213    -          8639         8681    -
splitmix64          6276         7334  100.00%      6471         6971  100.00%
rrmxmx              6091         6064   83.04%      5949         6361   91.24%
NASAM               4412         4489   61.20%      4385         4816   69.08%
Pelican             4320         4382   59.74%      4256         4638   66.53%
xNASAM              4156         4313   58.81%      3959         4632   66.45%
rrma2xsm2xs         4278         4214   58.33%      4083         4475   64.20%
xNASAMx             3921         4067   55.46%      3762         3806   54.60%

rrma2xsm2xs

The preliminary name of NASAM was rrm2xsm2xs, for "2 rotations, multiply, 2-term xor-shift, multiply, 2-term xor-shift". The "rrma..." in rrma2xsm2xs simply denotes an addition after the first multiply:
rrma2xsm2xs mixer in C.
uint64_t rrma2xsm2xs(uint64_t x) {
  // ror64(a, r) is a 64-bit rotation of a by r bits.
  x ^= ror64(x, 25) ^ ror64(x, 47);
  x = x * 0x9E6C63D0676A9A99UL + C; // C is some fixed constant; avoids the trivial fixed point at 0.
  x ^= x >> 23 ^ x >> 51;
  x *= 0x9E6D62D06F6A9A9BUL;
  x ^= x >> 23 ^ x >> 51;

  return x;
}
Special care would have to be taken if the constant C above is to be used for anything more interesting than avoiding the fixed point at 0. In particular, it must not be used for "independent streams"; as far as I can tell, such streams would be strongly correlated, since C comes into play quite late in the mixing.

It is worth noting that the speed penalty compared to NASAM is small (4%--7%).

xNASAM

The same as NASAM but with a constant xored in before the mixing steps. Adding the constant instead of xoring it would accomplish nothing new: for a counter-based state, adding c before mixing is equivalent to a linear offset in the sequence, i.e. starting the same stream at a different point.

This is much more promising as a construction where the constant could be used for something resembling independent streams. Some care will have to be taken; I think there's a good way of segmenting a 64-bit stream index into an increment part and an xor-part for a total of $2^{64}$ distinct streams, each with period $2^{64}$.

xNASAM mixer in C.
uint64_t xnasam(uint64_t x, uint64_t c) {
  x ^= c;
  // ror64(a, r) is a 64-bit rotation of a by r bits.
  x ^= ror64(x, 25) ^ ror64(x, 47);
  x *= 0x9E6C63D0676A9A99UL;
  x ^= x >> 23 ^ x >> 51;
  x *= 0x9E6D62D06F6A9A9BUL;
  x ^= x >> 23 ^ x >> 51;

  return x;
}
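
As a usage sketch, here is one way xnasam might be embedded in a SplitMix-style generator. This is my own illustration (the names xnasam_gen and xnasam_next are mine), not a finished stream design; how best to split stream-selection bits between the increment and the xor constant is exactly the open question above.

#include <stdint.h>

uint64_t xnasam(uint64_t x, uint64_t c); // as defined above

typedef struct {
  uint64_t state; // counter
  uint64_t gamma; // increment; any odd value gives period 2^64
  uint64_t c;     // per-stream xor constant
} xnasam_gen;

uint64_t xnasam_next(xnasam_gen *g) {
  g->state += g->gamma; // wraps modulo 2^64
  return xnasam(g->state, g->c);
}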

xNASAMx

The same as NASAM with xor of a constant before and after the mixing steps.

This has the interesting property that it, to a small extent, masks the counter and the increment used. The idea is similar to the XEX encryption mode, using NASAM as an unkeyed block cipher. This has zero cryptographic security, of course. Still, it may be useful if you have a use case where making it slightly inconvenient to deduce the counter and/or increment is enough. The last xor step also makes the function harder to invert without knowledge of the constant used.

This version is also the slowest of the mixers published here, at 54%--56% of the speed of SplitMix and 80%--90% of the speed of NASAM. I doubt that it behaves any better than xNASAM when it comes to using the constant to generate distinct streams.

xNASAMx mixer in C.
uint64_t xnasamx(uint64_t x, uint64_t c) {
  x ^= c;
  // ror64(a, r) is a 64-bit rotation of a by r bits.
  x ^= ror64(x, 25) ^ ror64(x, 47);
  x *= 0x9E6C63D0676A9A99UL;
  x ^= x >> 23 ^ x >> 51;
  x *= 0x9E6D62D06F6A9A9BUL;
  x ^= x >> 23 ^ x >> 51;
  x ^= c;

  return x;
}

Below, 256 subtests of 4 TB each (1 PB in total) didn't show any anomalies. This of course doesn't mean that there aren't any; just that RRC-64-42-TF2-0.94 is too weak a test to detect them.

(Lack of) NASAM failures, $2^{42}$ bytes maximum; every subtest reached the $2^{42}$-byte cap without failing.
ror64(reverse(x, R) ^ 0x0000000000000000, r)
         FORWARD                                             REVERSED
Offset   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
     0  42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42   42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42
    16  42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42   42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42
    32  42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42   42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42
    48  42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42   42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42
ror64(reverse(x, R) ^ 0xFFFFFFFFFFFFFFFF, r)
         FORWARD                                             REVERSED
Offset   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
     0  42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42   42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42
    16  42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42   42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42
    32  42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42   42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42
    48  42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42   42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42

Monday, December 16, 2019

Stronger, better, morer, Moremur: a better Murmur3-type mixer.

I'll keep this brief: The original constants used in MurmurHash3 and Stafford's Variant 13 are not as good as they could have been. This affects the strength of SplitMix64.

The testing procedure I've used is described in an earlier post, with one extension: every rotated and reversed counter is also complemented, for a total of 256 subtests. I'll call the extended version "RRC-64-40-TF2-0.94" for "Reversed, rotated, complemented-64 bits, $2^{40}$ bytes, PractRand 0.94 with -tf 2". Without further ado, here's "Moremur", a statistically stronger but equally fast mixer:

uint64_t moremur(uint64_t x) {
  x ^= x >> 27;
  x *= 0x3C79AC492BA7B653UL;
  x ^= x >> 33;
  x *= 0x1C69B3F74AC4AE35UL;
  x ^= x >> 27;

  return x;
}
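
To make the test concrete, the input transformation behind the 256 subtests can be sketched as below. This is my reconstruction from the table headers that follow, and the helper names are mine: two directions, two xor constants and 64 rotations give 2 x 2 x 64 = 256 transformed counter sequences, each fed through the mixer into PractRand.

#include <stdint.h>

static inline uint64_t ror64(uint64_t x, unsigned r) {
  return r ? (x >> r) | (x << (64 - r)) : x;
}

static uint64_t bitreverse64(uint64_t x) {
  uint64_t y = 0;
  for (int i = 0; i < 64; i++) { y = (y << 1) | (x & 1); x >>= 1; }
  return y;
}

// R: reverse or not (2), C: 0 or ~0ULL (2), r: 0..63 (64) => 256 subtests.
static uint64_t rrc(uint64_t x, int R, uint64_t C, unsigned r) {
  if (R) x = bitreverse64(x);
  return ror64(x ^ C, r);
}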
One notable flaw of this construction, i.e. of the first operation being an xor-shift, is that there are many increments for which few or none of the less significant bits of $x \oplus (x \gg s)$ are ever one. This in turn implies that the less significant bits will still be zero after the first two operations, xor-shift and multiply, since multiplication by an odd constant preserves trailing zeroes. Particularly bad increments have this form: $\gamma = (2^s + 1)a, a \lt 2^s$, $s$ being the first shift and $a$ being sparse.
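
A small worked example of why such a $\gamma$ is bad, with $s = 27$ (Moremur's first shift) and $a = 1$, i.e. $\gamma = 2^{27} + 1$: for counter values $x = k\gamma = k + k \cdot 2^{27}$ with $k \lt 2^{27}$, both $x$ and $x \gg 27$ have $k$ in their low 27 bits, so the first xor-shift clears them. The sketch below just prints a few instances.

#include <stdint.h>
#include <stdio.h>

int main(void) {
  const uint64_t gamma = (1ULL << 27) + 1; // (2^s + 1)a with s = 27, a = 1
  for (uint64_t k = 1; k <= 4; k++) {
    uint64_t x = k * gamma;     // x = k + k*2^27
    uint64_t y = x ^ (x >> 27); // first step of moremur
    // the low 27 bits of y are zero for every k < 2^27
    printf("k=%llu  x=%016llx  x^(x>>27)=%016llx\n",
           (unsigned long long)k, (unsigned long long)x,
           (unsigned long long)y);
  }
  return 0;
}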

In the table below, Moremur outperforms Murmur3 and Variant13 by quite some margin.

Table 1: Test length where PractRand fails
$\gamma$              Murmur3     Variant13   Moremur
0x0000000000000001    $2^{17}$    $2^{19}$    $2^{33}$
0x0000000000000003    $2^{17}$    $2^{17}$    $2^{34}$
0x0000000000000005    $2^{17}$    $2^{18}$    $2^{34}$
0x0000000000000009    $2^{16}$    $2^{18}$    $2^{34}$
0x0000010000000001    $2^{18}$    $2^{21}$    $2^{41}$
0xffffffffffffffff    $2^{17}$    $2^{19}$    $2^{33}$
0x0000000000ffffff    $2^{17}$    $2^{23}$    $2^{37}$
0xffffff0000000001    $2^{18}$    $2^{23}$    $2^{43}$
0x0000000000555555    $2^{18}$    $2^{28}$    $2^{42}$
0x1111111111110001    $2^{25}$    $2^{30}$    $> 2^{46}$
0x7777777777770001    $2^{28}$    $2^{34}$    $> 2^{45}$
0x7f7f7f7f33333333    $2^{30}$    $2^{33}$    $> 2^{46}$
0x5555550000000001    $2^{21}$    $2^{27}$    $2^{45}$
0xc45a11730cc8ffe3    $2^{39}$    $> 2^{42}$  $2^{?}$
0x2b13b77d0b289bbd    $2^{39}$    $> 2^{42}$  $2^{?}$
0x40ead42ca1cd0131    $2^{40}$    $> 2^{42}$  $2^{?}$
As can be seen from the tables below, these constants are a great improvement over the ones I've been aware of so far (original MurmurHash3 and Variant13). Moremur still fails RRC-64-40-TF2-0.94 but far from as badly as the earlier constants do.
Moremur failures, $2^{40}$ bytes maximum.
ror64(reverse(x, R) ^ 0x0000000000000000, r)
         FORWARD                                             REVERSED
Offset   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
     0  33 19 19 19 22 22 24 25 28 27 28 29 16 16 16 17   16 17 19 19 20 28 27 28 34 29 29 35 36 37 37 38
    16  17 17 18 19 19 19 19 20 21 22 23 23 24 25 26 27   36 35 35 39 40 34 31 31 39 40 39 38 37 34 34 36
    32  28 29 30 31 31 32 33 33 34 34 34 34 34 33 32 32   28 34 33 32 32 31 33 33 34 34 34 35 35 36 37 36
    48  32 28 28 28 29 29 29 29 30 30 30 30 31 31 31 32   31 31 31 33 36 37 36 28 33 34 31 31 28 25 22 19
ror64(reverse(x, R) ^ 0xFFFFFFFFFFFFFFFF, r)
         FORWARD                                             REVERSED
Offset   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
     0  33 19 19 19 22 22 24 25 28 29 18 28 15 16 16 17   16 17 19 21 21 26 28 30 34 31 34 34 36 36 37 37
    16  17 18 18 19 19 19 19 19 21 22 23 24 24 25 26 27   36 35 35 39 40 34 31 31 39 40 39 38 37 34 34 36
    32  28 29 30 30 32 33 33 34 34 35 35 34 34 32 33 32   28 34 33 32 32 31 33 33 34 34 34 35 36 36 36 36
    48  32 28 28 28 28 28 29 29 29 30 30 30 31 31 32 32   31 31 31 33 35 35 34 28 32 32 31 30 28 25 22 19
Murmur3 failures, $2^{40}$ bytes maximum.
ror64(reverse(x, R) ^ 0x0000000000000000, r)
         FORWARD                                             REVERSED
Offset   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
     0  17 18 18 18 17 16 16 16 16 15 15 17 15 14 15 15   15 17 18 18 18 17 17 16 16 14 14 14 14 14 14 17
    16  14 14 14 14 14 15 15 15 15 15 16 16 16 16 15 15   17 17 16 17 17 17 17 18 19 17 18 18 17 16 14 14
    32  16 16 17 17 16 16 14 14 16 15 14 15 15 15 15 15   15 16 17 18 18 17 17 17 17 16 14 14 14 14 15 15
    48  15 14 14 14 14 14 15 15 15 15 15 16 17 17 17 17   17 17 17 17 17 17 17 18 18 19 19 18 18 18 17 16
ror64(reverse(x, R) ^ 0xFFFFFFFFFFFFFFFF, r)
         FORWARD                                             REVERSED
Offset   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
     0  17 17 18 17 17 15 16 16 16 15 15 17 14 15 15 15   14 17 18 18 17 17 16 17 16 15 14 14 14 14 16 17
    16  14 14 14 14 15 15 15 15 15 15 16 16 16 16 15 15   17 17 16 17 17 17 17 18 18 18 17 16 15 15 14 14
    32  16 17 17 17 16 14 14 14 15 15 14 15 14 15 15 14   15 15 17 18 18 17 17 17 17 16 15 14 14 14 14 15
    48  14 14 14 14 14 14 15 15 15 15 15 16 17 17 17 17   17 17 17 16 17 17 17 17 18 19 19 18 18 18 17 15
Variant 13 failures, $2^{40}$ bytes maximum.
ror64(reverse(x, R) ^ 0x0000000000000000, r)
         FORWARD                                             REVERSED
Offset   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
     0  19 17 18 18 18 17 18 18 18 18 18 19 19 16 17 17   16 17 18 18 18 19 19 20 20 19 20 20 19 20 20 19
    16  17 17 16 17 16 16 17 17 16 17 16 17 17 17 18 18   20 19 21 18 19 21 19 22 20 21 22 21 20 22 22 19
    32  19 19 19 19 19 19 19 19 19 19 20 20 19 20 19 19   18 18 18 19 19 20 20 19 20 20 19 20 20 20 20 19
    48  17 17 16 17 17 17 16 17 16 17 17 17 18 18 19 19   21 18 18 21 19 22 20 21 19 18 18 18 18 18 18 18
ror64(reverse(x, R) ^ 0xFFFFFFFFFFFFFFFF, r)
         FORWARD                                             REVERSED
Offset   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
     0  19 17 18 18 18 17 18 18 18 18 18 18 19 16 17 17   16 17 18 18 18 19 19 20 20 19 20 20 19 20 20 19
    16  17 17 16 17 16 17 17 17 16 17 16 17 17 17 17 18   20 19 21 18 18 21 19 22 20 21 22 21 20 22 22 19
    32  19 19 19 19 19 19 19 19 19 19 20 20 19 20 20 19   18 18 18 19 19 20 20 19 20 20 19 20 20 19 20 19
    48  16 17 16 16 17 17 16 17 16 17 17 17 18 18 19 19   21 19 19 21 19 22 20 21 19 18 18 18 18 18 18 18
The constants were found with a stochastic hill climber I wrote using CUDA. My first tests of Murmur3 and Variant13 revealed that they have fairly weak higher-order avalanche properties. Letting the fitness function include metrics for 1st- to 4th-order avalanche proved to be quite successful. The CUDA program was used for sieving; once some promising candidates turned up, I ran them through RRC-64-40-TF2-0.94. The constants above were the best I could find; the fitness function, combining several tests, was too insensitive to make any further improvements possible with this approach.
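
For illustration, a first-order avalanche metric can be estimated as in the sketch below. This is my own sketch of the standard estimation technique, not the CUDA fitness function; the function names and the sampling choices are mine. Higher-order variants flip tuples of input bits instead of single bits.

#include <stdint.h>
#include <string.h>
#include <math.h>

// Small xorshift64 generator, only used here to draw sample points.
static uint64_t xs64(uint64_t *s) {
  uint64_t x = *s;
  x ^= x << 13; x ^= x >> 7; x ^= x << 17;
  return *s = x;
}

// Worst-case first-order avalanche bias of mix over n samples: for each
// input bit i and output bit j, estimate |P(output bit j flips when
// input bit i is flipped) - 1/2|. 0 is ideal; the maximum over all
// (i, j) pairs is returned.
static double avalanche_bias(uint64_t (*mix)(uint64_t), int n) {
  static int cnt[64][64];
  memset(cnt, 0, sizeof cnt);
  uint64_t s = 0x123456789abcdef1ULL;
  for (int t = 0; t < n; t++) {
    uint64_t x = xs64(&s), y = mix(x);
    for (int i = 0; i < 64; i++) {
      uint64_t d = y ^ mix(x ^ (1ULL << i)); // flip input bit i
      for (int j = 0; j < 64; j++)
        cnt[i][j] += (int)((d >> j) & 1);
    }
  }
  double worst = 0.0;
  for (int i = 0; i < 64; i++)
    for (int j = 0; j < 64; j++) {
      double b = fabs((double)cnt[i][j] / n - 0.5);
      if (b > worst) worst = b;
    }
  return worst;
}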

Conclusion

The shifts and constants above give a permutation that is statistically closer to a pseudo-random permutation than the ones used in SplitMix64, java.util.SplittableRandom and java.util.ThreadLocalRandom. In particular, the number of bad gammas/increments should be considerably smaller.
