sqlpostgresqlrandomsampling

RANDOM() in SQL: Random Sampling, Integers, and Seeding

How RANDOM() works in PostgreSQL, why ORDER BY RANDOM() is slow at scale, and what to use instead.

2 min readReferencesql · postgresql · random · sampling · mysql

RANDOM() returns a pseudo-random floating-point value in [0, 1) and powers almost every "random" query you'll write: picking a sample user, assigning A/B buckets, or shuffling rows. Let's see how to use it well and where it becomes a bottleneck.

The base value and random rows

In PostgreSQL, RANDOM() yields a fresh number from 0 (inclusive) up to but not including 1:

SELECT random();              -- e.g. 0.8473920192
SELECT random() AS r1, random() AS r2;  -- two different values

The classic way to grab random rows is to sort by RANDOM() and take the top:

SELECT id, email, country
FROM users
ORDER BY random()
LIMIT 5;

Each row gets a random sort key, so the result differs every time. On small tables this reads cleanly and runs instantly.

Why ORDER BY RANDOM() is costly

The gotcha is cost. To sort by RANDOM(), the engine must assign a random number to every row and sort the entire table before it can apply the LIMIT. On millions of rows that's a full scan plus an on-disk sort.

  • An index on the expression can't help: the key changes on every call.
  • Cost grows with table size, not with LIMIT.

For large tables, use TABLESAMPLE, which reads random pages instead of sorting everything:

-- roughly 5% of physical pages, very cheap
SELECT id, amount, status
FROM orders TABLESAMPLE BERNOULLI (5);

-- block sampling: even faster, coarser
SELECT * FROM orders TABLESAMPLE SYSTEM (1);

Gotcha: TABLESAMPLE returns an approximate percentage, not an exact row count, and samples at the page level, so the distribution isn't perfectly uniform. When you need exactly N rows, combine the two: sample coarsely, then refine.

SELECT id, email
FROM users TABLESAMPLE SYSTEM (10)
ORDER BY random()
LIMIT 5;

Random integers

RANDOM() gives a fraction, but you often need an int in a range. The formula is floor(random() * n) for [0, n):

-- integer in [0, 100)
SELECT floor(random() * 100)::int AS n;

-- integer in [1, 6] like a dice roll
SELECT floor(random() * 6)::int + 1 AS dice;

A practical use is scattering employees into random buckets or assigning a random discount:

SELECT id, name, dept,
       floor(random() * 3)::int AS shard
FROM employees;

Gotcha: don't write round(random() * n) for a uniform integer. round pulls toward the endpoints half as often, so the edges (0 and n) get half the probability. Always use floor.

Reproducibility with setseed

Randomness is awkward for tests: results change on every run. setseed() pins the generator for the current session, taking a value in [-1, 1]:

SELECT setseed(0.42);
SELECT id FROM users ORDER BY random() LIMIT 3;  -- same order on replay

After setseed, the RANDOM() sequence is deterministic until the next setseed or the end of the session, which is handy for repeatable tests and demos.

MySQL and ClickHouse

Syntax diverges across engines:

  • MySQL uses RAND(), also [0, 1). Sampling is ORDER BY RAND() LIMIT 5; there's no TABLESAMPLE. The seed is an argument: RAND(42).
  • ClickHouse has rand(), which is a UInt32 in [0, 2^32), not a fraction. For [0,1) divide by 4294967295.0; for sampling it has a native SELECT ... SAMPLE 0.1.
-- MySQL: random rows + reproducible seed
SELECT id, email FROM users ORDER BY RAND(42) LIMIT 5;

-- ClickHouse: native sampling, no full sort
SELECT id, amount FROM orders SAMPLE 0.1;

The takeaway: ORDER BY RANDOM() is convenient and correct but doesn't scale. On large data move to page-level sampling, and keep setseed/RAND(seed) for reproducible output.

Practice on real tasks

Solve tasks in the SQL trainer with instant grading and hints.

Open trainer