# Introduction

xorshift* generators are very fast, high-quality PRNGs (pseudorandom number generators) obtained by scrambling the output of a Marsaglia xorshift generator with a 64-bit invertible multiplier (as suggested by Marsaglia in his paper). They are an excellent choice for all non-cryptographic applications, as they are incredibly fast, have long periods and their output passes strong statistical test suites.

xorshift+ generators are a 64-bit version of Saito and Matsumoto's XSadd generator. Instead of using a multiplication, we return the sum (in $$\mathbb{Z}/2^{64}\mathbb{Z}$$) of two consecutive outputs of a xorshift generator. The reverse of XSadd systematically fails several BigCrush tests, whereas our xorshift128+ generator generates a top-quality 64-bit value at incredible speed (see below). While its period is too short for large-scale parallel simulations (use xorshift1024* for that purpose), it is the perfect replacement for the low-quality generators commonly found in programming languages.

# A PRNG Shootout

This page contains a shootout of a few recent PRNGs (pseudorandom number generators) that are quite widely used. The purpose is to provide a consistent, reproducible assessment of two properties of the generators: speed and quality. The code used to perform the tests and all the output from the statistical test suites is available for download.

## Speed

The speed reported in this page is the time required to emit 64 random bits. If a generator is 32-bit in nature, we glue two consecutive outputs. Note that we do not report results using GPUs or SSE instructions: for that to be meaningful, we should have implementations for all generators (which will happen, ultimately). The tests were performed on an Intel® Core™ i7-4770 CPU @3.40GHz (Haswell).

A few caveats:

• Timings are taken by running a generator billions of times in a loop; but this is not the way you use generators in practice.
• There is some looping overhead, which is about 0.25 ns, but subtracting it from the timings is not going to be particularly meaningful due to instruction rescheduling, etc.
• Relative speed might be different on different CPUs and on different scenarios.
• Code has been compiled using gcc's -fno-move-loop-invariants and -fno-unroll-loops options. These options are essential to get a sensible result: without them, the compiler can move constant loads (e.g., multiplicative constants) outside the testing loop, and may perform different loop unrolling depending on the generator. For this reason, we cannot provide timings with clang: at the time of this writing, clang has no such options. If you find timings that are significantly better than those shown here on comparable hardware, they are likely to be unreliable and just due to compiler artifacts.

To ease replicability, I distribute a harness performing the measurement. You just have to define a next() function and include the harness. But the only realistic suggestion is to try different generators in your application and see what happens.

## Quality

This is probably the most elusive property of a PRNG. Here quality is measured using the powerful BigCrush suite of tests. BigCrush is part of TestU01, a monumental framework for testing PRNGs developed by Pierre L'Ecuyer and Richard Simard (“TestU01: A C library for empirical testing of random number generators”, ACM Trans. Math. Softw. 33(4), Article 22, 2007).

We run BigCrush starting from 100 equispaced points of the state space of the generator and collect failures—tests in which the p-value statistic falls outside the interval [0.001, 0.999]. A failure is systematic if it happens at all points.

Note that TestU01 is a 32-bit test suite. Thus, two 32-bit integer values are passed to the test suite for each generated 64-bit value. Floating-point numbers are instead generated by dividing the unsigned output of the generator by $$2^{64}$$. Since this implies a bias towards the high bits (which is anyway a known characteristic of TestU01), we also run the test suite on the reverse generator and add up the resulting failures.

More detail about the whole process can be found in this paper.

| PRNG | Period | Failures (std) | Failures (rev) | Overall | Systematic | ns/64 bits |
|---|---|---|---|---|---|---|
| xorshift1024* | $$2^{1024}-1$$ | 29 | 22 | 51 | — | 1.34 |
| xorshift128+ | $$2^{128}-1$$ | 34 | 30 | 64 | — | 1.12 |
| xorshift116+ | $$2^{116}-1$$ | 33 | 33 | 66 | — | — |
| xorshift4096* | $$2^{4096}-1$$ | 33 | 34 | 67 | — | 1.34 |
| Ran | $$\approx 2^{191}$$ | 32 | 42 | 74 | — | 2.06 |
| MCG128 | $$2^{127}$$ | 39 | 38 | 77 | — | 1.31 |
| xorgens4096 | $$2^{4096}-1$$ | 42 | 40 | 82 | — | 1.83 |
| SplitMix64 | $$2^{64}$$ | 35 | 45 | 80 | — | 1.93 |
| xorshift64* | $$2^{64}-1$$ | 230 | 133 | 363 | MatrixRank | 1.56 |
| MT19937-64 (Mersenne Twister) | $$2^{19937}-1$$ | 258 | 258 | 516 | LinearComp | 2.66 |
| xorshift4096 | $$2^{4096}-1$$ | 326 | 333 | 659 | MatrixRank, LinearComp | 1.18 |
| xorshift1024 | $$2^{1024}-1$$ | 440 | 442 | 882 | MatrixRank, LinearComp | 1.18 |
| WELL1024a | $$2^{1024}-1$$ | 441 | 441 | 882 | MatrixRank, LinearComp | 5.54 |
| XSadd | $$2^{128}-1$$ | 38 | 850 | 888 | LinearComp, MatrixRank, MaxOft, Permutation | 2.06 |
| xorshift64 | $$2^{64}-1$$ | 762 | 750 | 1512 | BirthdaySpacings, MatrixRank, LinearComp | 1.57 |
| java.util.Random | $$2^{48}$$ | 4078 | 9486 | 13564 | Almost all tests | 2.60 |

I'll be happy to add to the list any PRNG that can improve significantly on the ones already listed. You can also install TestU01, download my code and start from there to do your own tests.

# Remarks

## A long period does not imply high quality

This is a common misconception. The generator x++ has period $$2^k$$, for any $$k\geq0$$, provided that x is represented using $$k$$ bits: nonetheless, it is a horrible generator. The generator returning $$k-1$$ zeroes followed by a one has period $$k$$.

It is however important that the period is long enough. A first heuristic rule of thumb is that if you need to use $$t$$ values, you need a generator with period at least $$t^2$$. Moreover, if you run $$n$$ independent computations starting at random seeds, the sequences used by each computation should not overlap. We can stay on the safe side and require that the period is long enough so that the probability that $$n$$ sequences of $$t^2$$ elements starting at random positions overlap is very low.

Now, given a generator with period $$P$$, the probability that $$n$$ subsequences of length $$L$$ starting at random points in the state space overlap is $1 - \left( 1 - \frac{nL}{P-1}\right)^{n-1} \approx 1 - \left(e^{-Ln}\right)^{\frac{n-1}{P-1}} \approx \frac{Ln^2}P,$ assuming that $$P$$ is large and $$nL/P$$ is close to zero. If your generator has period $$2^{512}$$ and you run on $$2^{100}$$ cores (you will never have them) a computation using $$2^{100}$$ pseudorandom numbers (you'll never have the time) the probability of overlap would be less than $$2^{-112}$$.

In other words: any generator with a period beyond, say, $$2^{1024}$$ (just to stay again on the safe side) has a period sufficient for every imaginable application. Unless there are other motivations (e.g., provably increased quality), a generator with a larger period is only a waste of memory (as it needs a large state space), of cache lines, and of precious high-entropy random bits for seeding (unless you're using small seeds, but then it's not clear why you would want a very long period in the first place—the computation above is valid only if you seed all bits of the state space with independent, uniformly distributed random bits).

## Equidistribution

A xorshift* generator with an n-bit state space is n/64-dimensionally equidistributed: every n/64-tuple of consecutive 64-bit values appears exactly once in the output, except for the zero tuple (and this is the best possible for 64-bit values). A xorshift+ generator is however only (n/64 − 1)-dimensionally equidistributed: every (n/64 − 1)-tuple of consecutive 64-bit values appears exactly $$2^{64}$$ times in the output, except for a missing zero tuple.

## How can a xorshift64* generator be slower than a xorshift1024* generator?

Dependencies. The three xor/shifts of a xorshift64* generator must be executed sequentially, as each one is dependent on the result of the previous one. In a xorshift1024* generator two of the xor/shifts are completely independent and can be parallelized internally by the CPU. I also suspect that the larger state space makes it possible for the CPU to perform more aggressive speculative execution (indeed, a xorshift128* generator is slower than a xorshift1024* generator).

## Why just sizes 64, 128, 1024 and 4096?

First of all, using powers of two in high dimension is advantageous because the modulo operation used to update the cyclic counter that simulates the block shift of the whole state space becomes a logical and.

64 is the smallest possible size if you use 64-bit shifts, and it makes it very easy to embed a generator in a Java object without creating an additional object. A generator with 64 bits of state can also be used to generate the seed for a generator with a larger state. In this case we suggest using a SplitMix64 generator.

128 works very well because you can just swap the two halves of the state space instead of keeping a circular counter. This makes for the incredible speed of a xorshift128+ generator, which is a very reasonable choice for a general-purpose generator if no parallelism is involved. Also in this case it is very easy to embed the generator without creating an additional Java object, as the state space can be represented by two long variables.

1024 seems the most reasonable choice for a general-purpose generator for massively parallel applications: it is large enough, but not uselessly large.

4096 is the largest currently possible state space (this will change when larger Fermat numbers are factored), so it's nice to have it here even if it is not really useful.

## What about the generator used in C, say by gcc?

Applying the same tests to the C API is impossible: the generator is not standardized, rand() returns a weird number of bits (usually 31), and there's no way to manipulate the state of the generator directly. Traditionally, C standard libraries used a linear congruential generator similar to java.util.Random. Apparently, recent versions of glibc use a trinomial-based linear-feedback shift register or even the SIMD version of MT19937.