# Introduction

`xorshift*`

generators are very fast, high-quality PRNGs (pseudorandom number
generators) obtained by scrambling the output of a Marsaglia `xorshift`

generator with a 64-bit invertible multiplier (as suggested by
Marsaglia in his paper). They are an excellent choice for all
non-cryptographic applications, as they are incredibly fast, have long
periods and their output passes strong statistical test suites.

`xorshift+`

generators are a 64-bit version of Saito and Matsumoto's `XSadd`

generator. Instead of using a multiplication, we return the sum (in **Z**/2^{64}**Z**)
of two consecutive output of a `xorshift`

generator. The reverse of `XSadd`

fails systematically several BigCrush
tests, whereas our `xorshift128+`

generator generates a top-quality 64-bit value at
incredible speed (see below). While its period is too short for
large-scale parallel simulations (use `xorshift1024*`

for
that purpose), it is the perfect replacement for low-quality generators commonly
found in programming languagues.

# A PRNG Shootout

This page contains a shootout of a few recent PRNGs (pseudorandom number generators) that are quite widely used. The purpose is that of providing a consistent, reproducible assessment of two properties of the generators: speed and quality. The code used to perform the tests and all the output from statistical test suites is available for download.

## Speed

The speed reported in this page is the time required to emit 64 random bits. If a generator is 32-bit in nature, we glue two consecutive outputs. Note that we do not report results using GPUs or SSE instructions: for that to be meaningful, we should have implementations for all generators (which will happen, ultimately). The tests were performed on an Intel® Core™ i7-4770 CPU @3.40GHz (Haswell).

A few *caveats*:

- Timings are taken running a generator for billions of times in a loop; but this is not the way you use generators.
- There is some looping overhead, which is about 0.25 ns, but subtracting it from the timings is not going to be particularly meaningful due to instruction rescheduling, etc.
- Relative speed might be different on different CPUs and on different scenarios.
- Code has been compiled using
`gcc`

's`-fno-move-loop-invariants`

and`-fno-unroll-loops`

options. These options are*essential*to get a sensible result: without them, the compiler can move outside the testing loop constant loads (e.g., multiplicative constants) and may perform different loop unrolling depending on the generator. For this reason, we cannot provide timings with`clang`

: there are at the time of this writing no such options. If you find timings that are significantly better than those shown here on comparable hardware, they are likely to be unrealiable and just due to compiler artifacts.

To ease replicability, I distribute a *harness* performing the measurement. You just
have to define a `next()`

function and include the harness. But the only realistic
suggestion is to try different generators in your application and see what happens.

## Quality

This is probably the more elusive property
of a PRNG. Here quality is measured using the powerful
BigCrush suite of tests. BigCrush is part of TestU01,
a monumental framework for testing PRNGs developed by Pierre L'Ecuyer
and Richard Simard (“TestU01: A C library for empirical testing
of random number generators”, *ACM Trans. Math. Softw.*
33(4), Article 22, 2007).

We run BigCrush starting from 100 equispaced points of the state space
of the generator and collect *failures*—tests in which the
*p*-value statistics is outside the interval [0.001..0.999]. A failure
is *systematic* if it happens at all points.

Note that TestU01 is a 32-bit test suite. Thus, two 32-bit integer values
are passed to the test suite for each generated 64-bit value. Floating point numbers
are generated instead by dividing the unsigned output of the generator by 2^{64}.
Since this implies a bias towards the high bits (which is anyway a known characteristic
of TestU01), we run the test suite also on the *reverse* generator and
add up the resulting failures.

More detail about the whole process can be found in this paper.

PRNG | Period | Failures (std) | Failures (rev) | Overall | Systematic | ns/64 bits |
---|---|---|---|---|---|---|

`xorshift1024*` | 2^{1024} − 1 | 29 | 22 | 51 | — | 1.34 |

`xorshift128+` | 2^{128} − 1 | 34 | 30 | 64 | — | 1.12 |

`xorshift4096*` | 2^{4096} − 1 | 33 | 34 | 67 | — | 1.34 |

`Ran` | ≈2^{191} | 32 | 42 | 74 | — | 2.06 |

`MCG128` | 2^{127} | 39 | 38 | 77 | — | 1.31 |

`xorgens4096` | 2^{4096} − 1 | 42 | 40 | 82 | — | 1.83 |

SplitMix64 | 2^{64} | 35 | 45 | 80 | — | 1.93 |

`xorshift64*` | 2^{64} − 1 | 230 | 133 | 363 | MatrixRank | 1.56 |

`MT19937-64` (Mersenne Twister) | 2^{19937} − 1 | 258 | 258 | 516 | LinearComp | 2.66 |

`xorshift4096` | 2^{4096} − 1 | 326 | 333 | 659 | MatrixRank, LinearComp | 1.18 |

`xorshift1024` | 2^{1024} − 1 | 440 | 442 | 882 | MatrixRank, LinearComp | 1.18 |

`WELL1024a` | 2^{1024} − 1 | 441 | 441 | 882 | MatrixRank, LinearComp | 5.54 |

`XSadd` | 2^{128} − 1 | 38 | 850 | 888 | LinearComp, MatrixRank, MaxOft, Permutation | 2.06 |

`xorshift64` | 2^{64} − 1 | 762 | 750 | 1512 | BirthdaySpacings, MatrixRank, LinearComp | 1.57 |

`java.util.Random` | 2^{48} − 1 | 4078 | 9486 | 13564 | Almost all tests | 2.60 |

I'll be happy to add to the list any PRNG that can improve significantly on the ones already listed. You can also install TestU01, download my code and start from there to do your own tests.

# Remarks

## A long period does not imply high quality

This is a common misconception. The generator `x++`

has
period \(2^k\), for any \(k\geq0\), provided that `x`

is
represented using \(k\) bits: nonetheless, it is a horrible generator.
The generator returning \(k-1\) zeroes followed by a one has period
\(k\).

It is however important that the period is long enough. A first heuristic rule of thumb is that if you need to use \(t\) values, you need a generator with period at least \(t^2\). Moreover, if you run \(n\) independent computations starting at random seeds, the sequences used by each computation should not overlap. We can stay on the safe side and require that the period is long enough so that the probability that \(n\) sequences of \(t^2\) elements starting at random positions overlap is very low.

Now, given a generator with period \(P\), the probability that \(n\) subsequences of length \(L\) starting at random points in the state space overlap is \[ 1 - \left( 1 - \frac{nL}{P-1}\right)^{n-1} \approx 1 - \left(e^{-Ln}\right)^{\frac{n-1}{P-1}} \approx \frac{Ln^2}P, \] assuming that \(P\) is large and \(nL/P\) is close to zero. If your generator has period \(2^{512}\) and you run on \(2^{100}\) cores (you will never have them) a computation using \(2^{100}\) pseudorandom numbers (you'll never have the time) the probability of overlap would be less than \(2^{-112}\).

In other words: any generator with a period beyond, say, \(2^{1024}\) (just to stay again on the safe side) has a period is sufficient for every imaginable application. Unless there are other motivations (e.g., provably increased quality), a generator with a larger period is only a waste of memory (as it needs a large state space), of cache lines, and of precious high-entropy random bits for seeding (unless you're using small seeds, but then it's not clear why you would want a very long period in the first place—the computation above is valid only if you seed all bits of the state space with independent, uniformly distributed random bits).

## Equidistribution

A `xorshift*`

generator with an `n`-bit state space is `n`/64-dimensionally equidistributed: every
`n`/64-tuple of consecutive 64-bit values appears exactly once in the output, except for the zero tuple (and this is the
best possible for 64-bit values).
A `xorshift+`

generator is however only (`n`/64 − 1)-dimensionally equidistributed: every
(`n`/64 − 1)-tuple of consecutive 64-bit values appears exactly 2^{64} times in the output, except for
a missing zero tuple.

## How can a `xorshift64*`

generator be *slower* than a `xorshift1024*`

generator?

Dependencies. The three xor/shifts of a `xorshift64*`

generator must be executed sequentially, as
each one is dependent on the result of the previous one. In a `xorshift1024*`

generator two of the
xor/shifts are completely independent and can be parallelized internally by the CPU. I also suspect that
the larger state space makes it possible for the CPU to perform more aggressively speculative execution
(indeed, a `xorshift128*`

generator is slower than a `xorshift1024*`

generator).

## Why just sizes 64, 128, 1024 and 4096?

First of all, using powers of two in high dimension is advantageous because the modulo operation used to update the cyclic counter that simulates the block shift of the whole state space becomes a logical and.

**64** is the smallest possible size if you use 64-bit shifts, and it
makes it very easy to have a generator embedded in a Java object without
creating an additional object. A generator with 64 bits of state can also be
used to generate the seed for a generator with a larger state. We suggest
in this case to use a SplitMix64 generator.

**128** works very well because you can just swap the two halves of the state
space instead of keeping a circular counter. This makes for the incredible speed of a `xorshift128+`

generator,
which is a very reasonable choice for a general-purpose generator if no parallelism is involved.
Also in this case it is very easy to embed the generator without creating an additional Java object,
as the state space can be represented by two `long`

variables.

**1024** seems the most reasonable choice for a general-purpose generator for massively
parallel applications: it is large enough, but not uselessly large.

**4096** is the largest currently possible state space (this will change
when we will factor larger Fermat's numbers), so it's nice to have it
here even if it is not really useful.

## What about the generator used in C, say by `gcc`

?

Applying the same tests from the C API is impossible: the generator
is not standardized, `rand()`

returns a weird number of bits (usually 31) and there's
not way to manipulate directly the state of the generator.
Traditionally, C standard libraries used a linear congruential
generator similar to `java.util.Random`

. Apparently, recent
versions of `glibc`

use a trinomial-based linear-feedback shift
register or even the SIMD version of `MT19937`

.