fredag 3 januari 2020

NASAM: Not Another Strange Acronym Mixer!

NASAM: Not Another Strange Acronym Mixer

The various abbreviations became more and more unpronouncable. I give you "NASAM".

This is a new mixer that has been tested quite thoroughly. Since I don't have more than 2 machines at my disposal (a Ryzen 7 1700 and an i7-5820), testing takes quite some time. I will complete this posting with data on sparse gammas and other interesting things I may think of.

This is the fastest permutation function I am aware of that passes RRC-64-42-TF2-0.94. Tommy Ettinger's function Pelican also passes RRC-64-42-TF2-0.94 and has similar speed.

NASAM mixer in C.
uint64_t nasam(uint64_t x) {
  // ror64(a, r) is a 64-bit rotation of a by r bits.
  x ^= ror64(x, 25) ^ ror64(x, 47);
  x *= 0x9E6C63D0676A9A99UL;
  x ^= x >> 23 ^ x >> 51;
  x *= 0x9E6D62D06F6A9A9BUL;
  x ^= x >> 23 ^ x >> 51;

  return x;
}
Pelican has one advantage over NASAM; Pelican has no fixed point at 0. There are at least three obvious ways of fixing this defect, should it be deemed necessary for a particular use-case:
  1. Adding a constant after the first multiplication ("rrma2xsm2xs" in this post).
  2. Xoring a constant prior to rotating/xoring ("xNASAM" in this post).
  3. Xoring a constant prior to rotating/xoring and after the last xor-shift ("xNASAMx" in this post).

Assuming the statistical quality of 1--3 above is at least as good as NASAM, the most interesting thing to analyze is of course speed. xNASAMx also has another property that might be of interest in some use cases.

For speed measurements, I've used Chris Wellon's shootout-program, described here.

Speed in MB/s, compiled with -Ofast -march=native. The highest speed for each mixer is marked in green, the lowest in red. The % column gives the speed relative to SplitMix64.
i7-5820Ryzen 7 1700
Mixergcc 9.2.1clang 8.0.1%gcc 9.2.1clang 8.0.1%
baseline1016510213-86398681-
splitmix6462767334100.00%64716971100.00%
rrmxmx6091606483.04%5949636191.24%
NASAM4412448961.20%4385481669.08%
Pelican4320438259.74%4256463866.53%
xNASAM4156431358.81%3959463266.45%
rrma2xsm2xs4278421458.33%4083447564.20%
xNASAMx3921406755.46%3762380654.60%

rrma2xsm2xs

The preliminary name of NASAM was rrm2xsm2xs for "2 rotations, multiply, 2 term xor-shift, multiply, 2 term xor-shift". The "rrma..." is simply an addition after the first multiply:
rrma2xsm2xs mixer in C.
uint64_t rrma2xsm2xs(uint64_t x) {
  // ror64(a, r) is a 64-bit rotation of a by r bits.
  x ^= ror64(x, 25) ^ ror64(x, 47);
  x = x * 0x9E6C63D0676A9A99UL + C; // Avoids trivial fixpoint at 0.
  x ^= x >> 23 ^ x >> 51;
  x *= 0x9E6D62D06F6A9A9BUL;
  x ^= x >> 23 ^ x >> 51;

  return x;
}
Special care would have to be taken for the constant C above to be used for anything interesting but avoiding the fix point at 0. In particular, it must not be used for "independent streams"; AFAICT, they will be strongly correlated due to the fact that C will come into play quite late in the mixing.

It is worth noting that the speed penalty compared to NASAM is small (4%--7%).

xNASAM

The same as NASAM with xor of a constant before the mixing steps. Adding instead of xoring would be equivalent to a linear offset in the sequence.

This is much more promising as a construction where the constant could be used for something resembling independent streams. Some care will have to be taken; I think there's a good way of segmenting a 64-bit stream index into an increment part and an xor-part for a total of $2^{64}$ distinct streams, each with period $2^{64}$.

xNASAM mixer in C.
uint64_t xnasam(uint64_t x, uint64_t c) {
  x ^= c;
  // ror64(a, r) is a 64-bit rotation of a by r bits.
  x ^= ror64(x, 25) ^ ror64(x, 47);
  x *= 0x9E6C63D0676A9A99UL;
  x ^= x >> 23 ^ x >> 51;
  x *= 0x9E6D62D06F6A9A9BUL;
  x ^= x >> 23 ^ x >> 51;

  return x;
}

xNASAMx

The same as NASAM with xor of a constant before and after the mixing steps.

This has the interesting property that it to a small extent masks the counter and the increment used. The idea is similar to the XEX encryption mode, using NASAM as an unkeyed block cipher. This has zero cryptographic security, of course. Still it may be useful if you have a use case where a minor inconvience with regards to deducing the counter and/or increment is appropriate. The last xorstep makes the function harder to invert, without knowing the constant used.

This version is also the slowest of the mixers published here at 54%--56% of the speed of SplitMix and 80%--90% of the speed of NASAM. I somehow doubt that it behaves any better than xNASAM when considering using the constant to generate distinct streams.

xNASAMx mixer in C.
uint64_t xnasamx(uint64_t x, uint64_t c) {
  x ^= c;
  // ror64(a, r) is a 64-bit rotation of a by r bits.
  x ^= ror64(x, 25) ^ ror64(x, 47);
  x *= 0x9E6C63D0676A9A99UL;
  x ^= x >> 23 ^ x >> 51;
  x *= 0x9E6D62D06F6A9A9BUL;
  x ^= x >> 23 ^ x >> 51;
  x ^= c;

  return x;
}

Below, 256 subtests of 4 TB each (1 PB total) didn't show any anomalies. This of course doesn't mean that there aren't any; just that RRC-64-42-TF2-0.94 is a too weak test to detect them.

(Lack of) NASAM failures, $2^{42}$ bytes maximum.
ror64(reverse(x, R) ^ 0x0000000000000000, r)
FORWARDREVERSED
Offset 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 4242424242424242424242424242424242424242424242424242424242424242
16 4242424242424242424242424242424242424242424242424242424242424242
32 4242424242424242424242424242424242424242424242424242424242424242
48 4242424242424242424242424242424242424242424242424242424242424242
ror64(reverse(x, R) ^ 0xFFFFFFFFFFFFFFFF, r)
FORWARDREVERSED
Offset 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 4242424242424242424242424242424242424242424242424242424242424242
16 4242424242424242424242424242424242424242424242424242424242424242
32 4242424242424242424242424242424242424242424242424242424242424242
48 4242424242424242424242424242424242424242424242424242424242424242

8 kommentarer:

  1. Hej Pelle! I have a need for fast lookups and insertions of 64bit integer keys in hashtables. As such I've been reading your awesome articles on 64bit mixing functions.

    It's possible to construct a lot of different mixing functions, but for practical applications there is a trade off between quality of the hash function and speed. We can use a zero overhead identiy hash function or use a very strong cryptographic hash function at the other end of the speed spectrum. Depending on the hash function quality we will see different collision rates in a hash table. There is some application dependent optimal trade off between collision rate and hashing speed that maximizes performance. The applications optimal hash functions must lie on the Pareto frontier. As such you can construct this Pareto frontier and discard any mixing function that is strictly dominated.

    I would love to see a article analyzing the Pareto efficiency of all the mixing functions you have presented.

    Thank you for your interesting articles!

    SvaraRadera
    Svar
    1. Thanks for your kind words. Indeed it is a trade off.

      IIRC, someone already did just that, tracing the Pareto frontier with NASAM included but I can't seem to find the URL now. It *might* have been Tommy Ettinger but Google doesn't help me right now.

      Radera
    2. I found this paper (https://bigdata.uni-saarland.de/publications/p249-richter.pdf) were they show that even the overhead of Murmur finalizer over multiply-and-shift is not always worth it.

      Radera
    3. I'm using a linear probing hash table with back shift deletion (https://github.com/rigtorp/HashMap) and have been using the Murmur3 mixer to hash 64bit integers. Since it's a linear probing hash table it's important to have a good hash.

      I tried using your improved moremur, splitmix, rrmxmx and in addition hardware assisted crc32:
      struct Hash {
      size_t operator()(uint64_t h) const noexcept {
      return _mm_crc32_u64(0, h);
      }
      };

      It turns out that crc32 results in slightly better performance on my workload. The other ones gives identical performance. In the end it seems that hash table performance is not very sensitive to the choice of hash function, it just needs to be good enough.

      I tired using murmur but replacing the constants with truncated values of sqrt(2) and the golden ratio and it's just as good:
      struct Hash {
      // primes near 32 + sqrt(2) + golden ratio
      size_t operator()(uint64_t x) const noexcept {
      x ^= x >> 23;
      x *= 0xf553b2c6e459cdafUL;
      x ^= x >> 29;
      x *= 0x8c579260172965b1UL;
      x ^= x >> 31;
      return x;
      }
      };

      It seems that the xor-shift-mult construction is quite robust to the choice of constants. Using shifts that are close to 32 and constants that are random seems to result in a decent hash function.

      Radera
  2. Also have you looked at constructions using carry less multiplication (https://www.felixcloutier.com/x86/pclmulqdq) like https://github.com/lemire/clhash ?

    SvaraRadera
    Svar
    1. The operations used are restricted since I don't want to use any x86-intrinsics. Regardless of hardware/software platform, the performance should be at least decent.

      Maybe I should post a generator that uses 2 or 3 rounds of AES that at least on Ryzen is very fast (faster than Romu2jr, SFC64, Xoroshiro*) with a guaranteed period 2^128 and about 0.35 cpb on a Ryzen 1700X.

      I did some basic experimenting with AES but Other Parts Of Life (most likely "my day job") came in between. I'll post some snippets with benchmark data as soon as I've verified that I didn't abandon the experiments due to statistical failures.



      Radera
  3. Hello Pelle! Thanks a lot for your work on mixers (and for this blog), I enjoyed it very much.

    I have a weird question: is there any "guaranteed" avalanche in NASAM/xNASAM (or any other good mixer) -- that for any x, mix(x) is at least N bits different from x? NASAM has a fixpoint at 0, but how many other "bad" inputs exist?

    SvaraRadera
    Svar
    1. Glad that you enjoy it!

      In short: for NASAM, there are no guarantees except that it passes RRC-64-42-TF2-0.94. It does fail RRC-64-44-TF2-0.94 (1 failure at 13TB).

      It's anyones guess if there are other fixpoints than 0. I would expect there to be as many as would be expected from a random permutation on [1, 2^64).

      I also have an test (to be published) that runs on GPUs that flags NASAM as a failure after ~800PB analysed.
      That test also fails Pelican after ~1.7PB analysed - on a GTX1060 this happens after ~45 minutes of testing.

      There's no indication that the GPU test fails generators based on a permutation of a simple counter; an ad-hoc strengthened version of NASAM shows no signs of failing after having analysed ~790PB.
      One can conclude that NASAM does have some structure that wouldn't be expected from a random permutation (as does Pelican, but to a much larger extent).

      Radera