fredag 3 januari 2020

NASAM: Not Another Strange Acronym Mixer!

NASAM: Not Another Strange Acronym Mixer

The various abbreviations became more and more unpronouncable. I give you "NASAM".

This is a new mixer that has been tested quite thoroughly. Since I don't have more than 2 machines at my disposal (a Ryzen 7 1700 and an i7-5820), testing takes quite some time. I will complete this posting with data on sparse gammas and other interesting things I may think of.

This is the fastest permutation function I am aware of that passes RRC-64-42-TF2-0.94. Tommy Ettinger's function Pelican also passes RRC-64-42-TF2-0.94 and has similar speed.

NASAM mixer in C.
uint64_t nasam(uint64_t x) {
  // ror64(a, r) is a 64-bit rotation of a by r bits.
  x ^= ror64(x, 25) ^ ror64(x, 47);
  x *= 0x9E6C63D0676A9A99UL;
  x ^= x >> 23 ^ x >> 51;
  x *= 0x9E6D62D06F6A9A9BUL;
  x ^= x >> 23 ^ x >> 51;

  return x;
}
Pelican has one advantage over NASAM; Pelican has no fixed point at 0. There are at least three obvious ways of fixing this defect, should it be deemed necessary for a particular use-case:
  1. Adding a constant after the first multiplication ("rrma2xsm2xs" in this post).
  2. Xoring a constant prior to rotating/xoring ("xNASAM" in this post).
  3. Xoring a constant prior to rotating/xoring and after the last xor-shift ("xNASAMx" in this post).

Assuming the statistical quality of 1--3 above is at least as good as NASAM, the most interesting thing to analyze is of course speed. xNASAMx also has another property that might be of interest in some use cases.

For speed measurements, I've used Chris Wellon's shootout-program, described here.

Speed in MB/s, compiled with -Ofast -march=native. The highest speed for each mixer is marked in green, the lowest in red. The % column gives the speed relative to SplitMix64.
i7-5820Ryzen 7 1700
Mixergcc 9.2.1clang 8.0.1%gcc 9.2.1clang 8.0.1%
baseline1016510213-86398681-
splitmix6462767334100.00%64716971100.00%
rrmxmx6091606483.04%5949636191.24%
NASAM4412448961.20%4385481669.08%
Pelican4320438259.74%4256463866.53%
xNASAM4156431358.81%3959463266.45%
rrma2xsm2xs4278421458.33%4083447564.20%
xNASAMx3921406755.46%3762380654.60%

rrma2xsm2xs

The preliminary name of NASAM was rrm2xsm2xs for "2 rotations, multiply, 2 term xor-shift, multiply, 2 term xor-shift". The "rrma..." is simply an addition after the first multiply:
rrma2xsm2xs mixer in C.
uint64_t rrma2xsm2xs(uint64_t x) {
  // ror64(a, r) is a 64-bit rotation of a by r bits.
  x ^= ror64(x, 25) ^ ror64(x, 47);
  x = x * 0x9E6C63D0676A9A99UL + C; // Avoids trivial fixpoint at 0.
  x ^= x >> 23 ^ x >> 51;
  x *= 0x9E6D62D06F6A9A9BUL;
  x ^= x >> 23 ^ x >> 51;

  return x;
}
Special care would have to be taken for the constant C above to be used for anything interesting but avoiding the fix point at 0. In particular, it must not be used for "independent streams"; AFAICT, they will be strongly correlated due to the fact that C will come into play quite late in the mixing.

It is worth noting that the speed penalty compared to NASAM is small (4%--7%).

xNASAM

The same as NASAM with xor of a constant before the mixing steps. Adding instead of xoring would be equivalent to a linear offset in the sequence.

This is much more promising as a construction where the constant could be used for something resembling independent streams. Some care will have to be taken; I think there's a good way of segmenting a 64-bit stream index into an increment part and an xor-part for a total of $2^{64}$ distinct streams, each with period $2^{64}$.

xNASAM mixer in C.
uint64_t xnasam(uint64_t x, uint64_t c) {
  x ^= c;
  // ror64(a, r) is a 64-bit rotation of a by r bits.
  x ^= ror64(x, 25) ^ ror64(x, 47);
  x *= 0x9E6C63D0676A9A99UL;
  x ^= x >> 23 ^ x >> 51;
  x *= 0x9E6D62D06F6A9A9BUL;
  x ^= x >> 23 ^ x >> 51;

  return x;
}

xNASAMx

The same as NASAM with xor of a constant before and after the mixing steps.

This has the interesting property that it to a small extent masks the counter and the increment used. The idea is similar to the XEX encryption mode, using NASAM as an unkeyed block cipher. This has zero cryptographic security, of course. Still it may be useful if you have a use case where a minor inconvience with regards to deducing the counter and/or increment is appropriate. The last xorstep makes the function harder to invert, without knowing the constant used.

This version is also the slowest of the mixers published here at 54%--56% of the speed of SplitMix and 80%--90% of the speed of NASAM. I somehow doubt that it behaves any better than xNASAM when considering using the constant to generate distinct streams.

xNASAMx mixer in C.
uint64_t xnasam(uint64_t x, uint64_t c) {
  x ^= c;
  // ror64(a, r) is a 64-bit rotation of a by r bits.
  x ^= ror64(x, 25) ^ ror64(x, 47);
  x *= 0x9E6C63D0676A9A99UL;
  x ^= x >> 23 ^ x >> 51;
  x *= 0x9E6D62D06F6A9A9BUL;
  x ^= x >> 23 ^ x >> 51;
  x ^= c;

  return x;
}

Below, 256 subtests of 4 TB each (1 PB total) didn't show any anomalies. This of course doesn't mean that there aren't any; just that RRC-64-42-TF2-0.94 is a too weak test to detect them.

(Lack of) NASAM failures, $2^{42}$ bytes maximum.
ror64(reverse(x, R) ^ 0x0000000000000000, r)
FORWARDREVERSED
Offset 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 4242424242424242424242424242424242424242424242424242424242424242
16 4242424242424242424242424242424242424242424242424242424242424242
32 4242424242424242424242424242424242424242424242424242424242424242
48 4242424242424242424242424242424242424242424242424242424242424242
ror64(reverse(x, R) ^ 0xFFFFFFFFFFFFFFFF, r)
FORWARDREVERSED
Offset 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 4242424242424242424242424242424242424242424242424242424242424242
16 4242424242424242424242424242424242424242424242424242424242424242
32 4242424242424242424242424242424242424242424242424242424242424242
48 4242424242424242424242424242424242424242424242424242424242424242

måndag 16 december 2019

Stronger, better, morer, Moremur; a better Murmur3-type mixer.

I'll keep this brief: The original constants used in MurmurHash3 and Stafford's Variant 13 are not as good as they could have been. The testing procedure I've used is described in an earlier post. There is one extension, compared to the earlier post: every rotation and reversal is also complemented for a total of 256 subtests. I'll call the extended version "RRC-64-40-TF2-0.94" for "Reversed, rotated, complemented-64 bits, 2^40 bytes, PractRand 0.94 with -tf 2". Without further ado, here's "Moremur", a statistically stronger but equally fast mixer:
uint64_t moremur(uint64_t x) {
  x ^= x >> 27;
  x *= 0x3C79AC492BA7B653UL;
  x ^= x >> 33;
  x *= 0x1C69B3F74AC4AE35UL;
  x ^= x >> 27;

  return v;
}
One strong flaw with this construction, i.e. the first operation being an xor-shift, is that there are many increments that will lead to few or none of the less significant bits being one. This in turn implies that the less significant bits will be zero after the first two operations, xor-shift and multiply. Particularly bad increments have this form: $\gamma = 2^sa + a, a < 2^s$, $s$ being the first shift and $a$ being sparse. In the table below, Moremur outperforms Murmur3 and Variant13 by quite some margin.
Table 1: Test length where PractRand fails
$\gamma$ Murmur3 Variant13 Moremur
0x0000000000000001
$2^{17}$ $2^{19}$ $2^{33}$
0x0000000000000003
$2^{17}$ $2^{17}$ $2^{34}$
0x0000000000000005
$2^{17}$ $2^{18}$ $2^{34}$
0x0000000000000009
$2^{16}$ $2^{18}$ $2^{34}$
0x0000010000000001
$2^{18}$ $2^{21}$ $2^{41}$
0xffffffffffffffff
$2^{17}$ $2^{19}$ $2^{33}$
0x0000000000ffffff
$2^{17}$ $2^{23}$ $2^{37}$
0xffffff0000000001
$2^{18}$ $2^{23}$ $2^{43}$
0x0000000000555555
$2^{18}$ $2^{28}$ $2^{42}$
0x1111111111110001
$2^{25}$ $2^{30}$ $> 2^{46}$
0x7777777777770001
$2^{28}$ $2^{34}$ $> 2^{45}$
0x7f7f7f7f33333333
$2^{30}$ $2^{33}$ $> 2^{46}$
0x5555550000000001
$2^{21}$ $2^{27}$ $2^{45}$
0xc45a11730cc8ffe3
$2^{39}$ $> 2^{42}$ $2^{?}$
0x2b13b77d0b289bbd
$2^{39}$ $> 2^{42}$ $2^{?}$
0x40ead42ca1cd0131
$2^{40}$ $> 2^{42}$ $2^{?}$
As can be seen from the tables below, these constants are a great improvement over the ones I've been aware of so far (original MurmurHash3 and Variant13). Moremur still fails RRC-64-40-TF2-0.94 but far from as badly as the earlier constants do.
Moremur failures, $2^{40}$ bytes maximum.
ror64(reverse(x, R) ^ 0x0000000000000000, r)
FORWARDREVERSED
Offset 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 3319191922222425282728291616161716171919202827283429293536373738
16 1717181919191920212223232425262736353539403431313940393837343436
32 2829303131323333343434343433323228343332323133333434343535363736
48 3228282829292929303030303131313231313133363736283334313128252219
ror64(reverse(x, R) ^ 0xFFFFFFFFFFFFFFFF, r)
FORWARDREVERSED
Offset 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 3319191922222425282918281516161716171921212628303431343436363737
16 1718181919191919212223242425262736353539403431313940393837343436
32 2829303032333334343535343432333228343332323133333434343536363636
48 3228282828282929293030303131323231313133353534283232313028252219
Murmur3 failures, $2^{40}$ bytes maximum.
ror64(reverse(x, R) ^ 0x0000000000000000, r)
FORWARDREVERSED
Offset 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 1718181817161616161515171514151515171818181717161614141414141417
16 1414141414151515151516161616151517171617171717181917181817161414
32 1616171716161414161514151515151515161718181717171716141414141515
48 1514141414141515151515161717171717171717171717181819191818181716
ror64(reverse(x, R) ^ 0xFFFFFFFFFFFFFFFF, r)
FORWARDREVERSED
Offset 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 1717181717151616161515171415151514171818171716171615141414141617
16 1414141415151515151516161616151517171617171717181818171615151414
32 1617171716141414151514151415151415151718181717171716151414141415
48 1414141414141515151515161717171717171716171717171819191818181715
Variant 13, $2^{40}$ bytes maximum.
ror64(reverse(x, R) ^ 0x0000000000000000, r)
FORWARDREVERSED
Offset 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 1917181818171818181818191916171716171818181919202019202019202019
16 1717161716161717161716171717181820192118192119222021222120222219
32 1919191919191919191920201920191918181819192020192020192020202019
48 1717161717171617161717171818191921181821192220211918181818181818
ror64(reverse(x, R) ^ 0xFFFFFFFFFFFFFFFF, r)
FORWARDREVERSED
Offset 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 1917181818171818181818181916171716171818181919202019202019202019
16 1717161716171717161716171717171820192118182119222021222120222219
32 1919191919191919191920201920201918181819192020192020192020192019
48 1617161617171617161717171818191921191921192220211918181818181818
The constants were found through a stochastic hill climber I wrote using CUDA. The first testing I did of Murmur3 and Variant13 revealed that they had fairly weak higher order avalanche properties. Letting the fitness function include metrics for 1st to 4th order avalanche proved to be quite successful. The CUDA program was used for sieving; once some possibly good candidates turned up, I ran them through RRC-64-40-TF2-0.94. The constants above were the best I could find; the fitness function, combining several tests, was too insensitive to make any further improvements possible with the approach I used.

torsdag 24 januari 2019

Better, stronger mixer and a test procedure.


An intermediate result (much stronger mixer).

A frustrating search.

When working on finding good mixers, I got really frustrated by the extremely slow process of testing a new set of constants or other tweaks to functions that were "kind of good".
My criterion for "success" was that the output from a simple counter, starting at 0, passed to the mixing function, and then taking the output from the mixing function and feeding it to PractRand with -tf 2 should not generate any failures up to $2^{45}$ bytes.

Eventually, I came to realize that what I really wanted was not just a function that would satisfy the "generate PractRand-passing output when fed a unitary counter" but rather "a decent approximation of a random permutation". Such a permutation would give a higgledy-piggledy sequence from a very regular one, regardless of what the regularity looks like (just kidding; the regularities of interest are affine transformations, of course there are many other sequences one could think of as "regular"). Previosuly, I had been testing with feeding gray-code sequences to the mixing function, hoping that the unit Hamming distance between adjacent inputs would reveal flaws more quickly. For the functions I tested, I couldn't find that it made any difference w.r.t. how quickly testing failed.

As I was thinking about more sequences that would be "very well ordered" I came up with a very simple testing procedure that seems to work well: A function that pass the subtests described below is also very likely to work well on any kind of (affine) counter update.

The fact that many of the subtest input sequences intersect does not affect the tests since they are run independently.

Test procedure

Instead of testing the full output, which is very slow, test many short output sequences in the hopes that anomalies in the mixer will be revealed more quickly. There's quite some conjecturing here: the (lack of) failures on many short tests hopefully correlate with overall quality. If the mixer is good, this is also a very lengthy process (the way I have been doing it, 128 TB of data will be tested if there are no failures) but running many small tests (1 TB a piece) can be done in parallell on a machine with many cores but without a using a lot of RAM.

Naming

To avoid confusion, let's call this test procedure "RR-b-x-TF2-0.94". "RR" to indicate "rotate and bit reversal", "b" for the word size in input/output bits, x for the tlmax-exponent, TF2 to indicate that PractRand was run with "-tf 2" and "0.94" for the version of PractRand I am currently using. The same procedure can be used for any permutation of size $2^{8b}$. Pseudo-code for RR-b-x-TF2-0.94:
# The tests I have run have been with b = 64 and x = tlmax = 40,
# i.e. 1 TB per subtest.
# rfun is either the identity function or a b-bit bit-order reversal function.
for rfun in {identity, bitreversal}
  # Rotate counter 0 to (b - 1) bits.
  for r in {0, ..., b - 1}
    # No failure seen for this rfun & rotation so far.
    failure = 0
    # 2^x bytes, 2^x / b * 8 b-bit words.
    for ctr in {0, ..., (2^tlmax / b * 8) - 1}
      # Feed PractRand, break on failure.
      # RORb is a b-bit right rotate function.
      if feedPractRandFails(RORb(rfun(ctr), r))
        # Note the failure level.
        failure = ceil(log_2(ctr))
        break for
      end if
    end for
    emitTestScore(rfun, r, failure)
  end for
end for

Test results

MurmurHash3 finalizer

RR-64-40-TF2-0.94 fails the MurmurHash3 finalizer quickly. This is to be expected considering that it is known to fail quickly when fed a unitary counter.

MurmurHash3 failures, $2^{40}$ bytes maximum.
Offset IDENTIY
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 17181818171616161615151715141515
16 14141414141515151515161616161515
32 16161717161614141615141515151515
48 15141414141415151515151617171717
Offset REVERSE
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 15171818181717161614141414141417
16 17171617171717181917181817161414
32 15161718181717171716141414141515
48 17171717171717181819191818181716

MurmurHash3, Variant 13

RR-64-40-TF2-0.94 fails Variant 13 quickly too. This is also to be expected considering that it is known to fail almost as fast as the MurmurHash3 finalizer.
MurmurHash3/Variant 13 failures, $2^{40}$ bytes maximum.
Offset IDENTIY
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 19171818181718181818181919161717
16 17171617161617171617161717171818
32 19191919191919191919202019201919
48 17171617171716171617171718181919
Offset REVERSE
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 16171818181919202019202019202019
16 20192118192119222021222120222219
32 18181819192020192020192020202019
48 21181821192220211918181818181818

Ettinger's mixer

Tommy Ettinger came up with a quite different construction, compared to MurmurHash3 (and variants) and my own rrmxmx (in C):
uint64_t ettingerMixer(uint64_t v) {
  uint64_t z = (v ^ 0xDB4F0B9175AE2165UL) * 0x4823A80B2006E21BUL;
  z ^= rol64(z, 52) ^ rol64(z, 21) ^ 0x9E3779B97F4A7C15UL;
  z *= 0x81383173UL;
  return z ^ z >> 28;
}
According to him it passes 32 TB of PractRand but with -tf unspecificed, defaulting to 1 instead of 2 which is what I use for all tests. I was a bit too quick here and tested his function with RIGHT-rotates and not LEFT-rotates which is what Ettinger used. The table below is now updated to reflect a correct implementation of his mixer. The performace w.r.t. RR-64-40-TF2-0.94, compared to my incorrect implementation, is about the same. 122 failed tests out of 128:
"Ettinger's mixer" failures, $2^{40}$ bytes maximum.
Offset IDENTIY
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 40161615141516161512121314151415
16 15161516161718191717171718171718
32 19202123242526272828303032241916
48 16191919222224314040404036344040
Offset REVERSE
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 12131514151515151616161616161617
16 17181919222020172325252526303029
32 30343336211917161716161721211919
48 22222120212019201918161616151514
So this is a special case where an increment of one works particularly well. It would be interesting to know at what level it fails PractRand (with -tf 2, if at all) with a unitary counter.

Changing the rotation constants to {12, 43} (corresponding to right rotates instead of left rotatets) gives a function with about the same number of failures, 120 out of 128:
"Ettinger's mixer, ROR-version", $2^{40}$ bytes maximum.
Offset IDENTIY
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 40151514151516161515151515151515
16 16161616161616181817181718191918
32 17181921231917171919192124283135
48 36373834363635384040404040404040
Offset REVERSE
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 14141415151515161616171617171716
16 18171716242322212425252219171917
32 16161719191921222527262524232221
48 20191817171717171717171716161515

Testing rrmxmx

I already knew that rrmxmx fails PractRand -tf 2 -te 1 after $2^{44}$ bytes when fed a unitary counter, starting at 0. If I recall correctly that took about one week of computing time to find out. The table below took about 20 minutes to generate and it clearly shows that rrmxmx is far from optimal (note that x = 33, quick and dirty):
The original rrmxmx, $2^{33}$ bytes maximum.
Offset IDENTIY
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 33333333333333333325253333333333
16 33283333333333333333333333333333
32 33313333333330303033333333333333
48 33333333333333333333333333333333
Offset REVERSE
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 33333333333333333333312831333333
16 33333333333333333333332833333333
32 28283025223333333333333333333333
48 33333333333333333333333333333333
The somewhat good news to anyone who decides to use it are that an increment of 1 is the only odd integer I have found which fails PractRand. The even better news is that a variation where the first ror-ror-xor and first shift-xor are interchanged in general performs better. For the constants I have tried so far, it still fails RR-64-x-TF2-0.94 but to a lesser extent. Which leads us to the latest and greatest. Even more rotates.

Intermediate result, my best so far

Compared to my previous function, rrmxmx, this is much more well behaved and has passed 128 TB of PractRand -tf 2 with a unitary counter (one single run, inputs $\{0, \ldots, 2^{44} - 1\}$). Testing with RR-64-40-TF2-0.94, there is 1 failure:
rrxmrrxmsx_0 failures, $2^{40}$ bytes maximum.
Offset IDENTIY
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 40404040404040404040404040404040
16 40404040404040404040404040404040
32 40404040404040404040404040404040
48 40404040404040404040404040404040
Offset REVERSE
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 40404040404040404040404040403940
16 40404040404040404040404040404040
32 40404040404040404040404040404040
48 40404040404040404040404040404040
Trying to fix this failure (ROR(reversed, 14)) has so far only given me variants with more failures or a single failure happening in a different subtest with less than $2^{39}$ bytes.
Compared to rrmxmx it is slightly slower, doing 2 rotations instead of the shift in the middle of rrmxmx. Temporarily, let's call it "rrxmrrxmsx_0" (in C):
// Old mixer, my rrmxmx
uint64_t rrmxmx(uint64_t v) {
  v ^= ror64(v, 49) ^ ror64(v, 24);
  v *= 0x9FB21C651E98DF25L;
  v ^= v >> 28;
  v *= 0x9FB21C651E98DF25L;
  return v ^ v >> 28;
}

// New mixer, "rrxmrrxmsx_0", failing one subtest of RR-64-40-TF2-0.94.
// With a unit counter starting at 0, it has passed 128 TB of
// PractRand 0.94 -tf 2 without anomalies found past 2 TB.
uint64_t rrxmrrxmsx_0(uint64_t v) {
  v ^= ror64(v, 25) ^ ror64(v, 50);
  v *= 0xA24BAED4963EE407UL;
  v ^= ror64(v, 24) ^ ror64(v, 49);
  v *= 9FB21C651E98DF25UL;
  return v ^ v >> 28;
}
I found this function a few months ago but have been hoping to find something of similar speed that doesn't fail any of the 128 subtests. If I do find one, I'll post it here!

I kind of gave up on finding better constants for functions having the same form as the original rrmxmx. To check that there really is a reasonable correlation between "quite good behaviour" when feeding the function with a unitary counter and not failing many subtests, I am running RR-64-40-TF2-0.94 on it. It fails much much less badly than MurmurHash3 and Variant13 but still much worse than rrxmrrxmxs.

Conclusion

Running RR-b-x-TF2-0.94 can be relatively quick way of detecting if a particular 64-bit permutation function is a good approximation of a random permutation. The approach can be extended to any permutation of size $2^{8b}$. As a preliminary step, one could lower x significantly. In case the function under test gives many failures at, say, x = 30, it may not be worthwhile to examine the number of failures at x = 40 or above. Passing the test is a necessary but not sufficient requirement for any function that is expected to behave like a random permutation. There are probably bad permutations that do pass RR-b-x-TF2 but no good ones that does not pass it.

NASAM: Not Another Strange Acronym Mixer!

NASAM: Not Another Strange Acronym Mixer The various abbreviations became more and more unpronouncable. I give you "NASAM". T...