RANDOM() returns a pseudo-random floating-point value in [0, 1) and powers almost every "random" query you'll write: picking a sample user, assigning A/B buckets, or shuffling rows. Let's see how to use it well and where it becomes a bottleneck.
The base value and random rows
In PostgreSQL, RANDOM() yields a fresh number from 0 (inclusive) up to but not including 1:
SELECT random();
SELECT random() AS r1, random() AS r2;
The classic way to grab random rows is to sort by RANDOM() and take the top:
SELECT id, email, country
FROM users
ORDER BY random()
LIMIT 5;
Each row gets a random sort key, so the result differs every time. On small tables this reads cleanly and runs instantly.
Why ORDER BY RANDOM() is costly
The gotcha is cost. To sort by RANDOM(), the engine must assign a random number to every row and sort the entire table before it can apply the LIMIT. On millions of rows that's a full scan plus an on-disk sort.
- An index on the expression can't help: the key changes on every call.
- Cost grows with table size, not with
LIMIT.
For large tables, use TABLESAMPLE, which reads random pages instead of sorting everything:
SELECT id, amount, status
FROM orders TABLESAMPLE BERNOULLI (5);
SELECT * FROM orders TABLESAMPLE SYSTEM (1);
Gotcha: TABLESAMPLE returns an approximate percentage, not an exact row count, and samples at the page level, so the distribution isn't perfectly uniform. When you need exactly N rows, combine the two: sample coarsely, then refine.
SELECT id, email
FROM users TABLESAMPLE SYSTEM (10)
ORDER BY random()
LIMIT 5;
Random integers
RANDOM() gives a fraction, but you often need an int in a range. The formula is floor(random() * n) for [0, n):
SELECT floor(random() * 100)::int AS n;
SELECT floor(random() * 6)::int + 1 AS dice;
A practical use is scattering employees into random buckets or assigning a random discount:
SELECT id, name, dept,
floor(random() * 3)::int AS shard
FROM employees;
Gotcha: don't write round(random() * n) for a uniform integer. round pulls toward the endpoints half as often, so the edges (0 and n) get half the probability. Always use floor.
Reproducibility with setseed
Randomness is awkward for tests: results change on every run. setseed() pins the generator for the current session, taking a value in [-1, 1]:
SELECT setseed(0.42);
SELECT id FROM users ORDER BY random() LIMIT 3;
After setseed, the RANDOM() sequence is deterministic until the next setseed or the end of the session, which is handy for repeatable tests and demos.
MySQL and ClickHouse
Syntax diverges across engines:
- MySQL uses
RAND(), also [0, 1). Sampling is ORDER BY RAND() LIMIT 5; there's no TABLESAMPLE. The seed is an argument: RAND(42).
- ClickHouse has
rand(), which is a UInt32 in [0, 2^32), not a fraction. For [0,1) divide by 4294967295.0; for sampling it has a native SELECT ... SAMPLE 0.1.
SELECT id, email FROM users ORDER BY RAND(42) LIMIT 5;
SELECT id, amount FROM orders SAMPLE 0.1;
The takeaway: ORDER BY RANDOM() is convenient and correct but doesn't scale. On large data move to page-level sampling, and keep setseed/RAND(seed) for reproducible output.
RANDOM()returns a pseudo-random floating-point value in[0, 1)and powers almost every "random" query you'll write: picking a sample user, assigning A/B buckets, or shuffling rows. Let's see how to use it well and where it becomes a bottleneck.The base value and random rows
In PostgreSQL,
RANDOM()yields a fresh number from 0 (inclusive) up to but not including 1:SELECT random(); -- e.g. 0.8473920192 SELECT random() AS r1, random() AS r2; -- two different valuesThe classic way to grab random rows is to sort by
RANDOM()and take the top:SELECT id, email, country FROM users ORDER BY random() LIMIT 5;Each row gets a random sort key, so the result differs every time. On small tables this reads cleanly and runs instantly.
Why ORDER BY RANDOM() is costly
The gotcha is cost. To sort by
RANDOM(), the engine must assign a random number to every row and sort the entire table before it can apply theLIMIT. On millions of rows that's a full scan plus an on-disk sort.LIMIT.For large tables, use
TABLESAMPLE, which reads random pages instead of sorting everything:-- roughly 5% of physical pages, very cheap SELECT id, amount, status FROM orders TABLESAMPLE BERNOULLI (5); -- block sampling: even faster, coarser SELECT * FROM orders TABLESAMPLE SYSTEM (1);Gotcha:
TABLESAMPLEreturns an approximate percentage, not an exact row count, and samples at the page level, so the distribution isn't perfectly uniform. When you need exactly N rows, combine the two: sample coarsely, then refine.SELECT id, email FROM users TABLESAMPLE SYSTEM (10) ORDER BY random() LIMIT 5;Random integers
RANDOM()gives a fraction, but you often need anintin a range. The formula isfloor(random() * n)for[0, n):-- integer in [0, 100) SELECT floor(random() * 100)::int AS n; -- integer in [1, 6] like a dice roll SELECT floor(random() * 6)::int + 1 AS dice;A practical use is scattering employees into random buckets or assigning a random discount:
SELECT id, name, dept, floor(random() * 3)::int AS shard FROM employees;Gotcha: don't write
round(random() * n)for a uniform integer.roundpulls toward the endpoints half as often, so the edges (0 and n) get half the probability. Always usefloor.Reproducibility with setseed
Randomness is awkward for tests: results change on every run.
setseed()pins the generator for the current session, taking a value in[-1, 1]:SELECT setseed(0.42); SELECT id FROM users ORDER BY random() LIMIT 3; -- same order on replayAfter
setseed, theRANDOM()sequence is deterministic until the nextsetseedor the end of the session, which is handy for repeatable tests and demos.MySQL and ClickHouse
Syntax diverges across engines:
RAND(), also[0, 1). Sampling isORDER BY RAND() LIMIT 5; there's noTABLESAMPLE. The seed is an argument:RAND(42).rand(), which is aUInt32in[0, 2^32), not a fraction. For[0,1)divide by4294967295.0; for sampling it has a nativeSELECT ... SAMPLE 0.1.-- MySQL: random rows + reproducible seed SELECT id, email FROM users ORDER BY RAND(42) LIMIT 5; -- ClickHouse: native sampling, no full sort SELECT id, amount FROM orders SAMPLE 0.1;The takeaway:
ORDER BY RANDOM()is convenient and correct but doesn't scale. On large data move to page-level sampling, and keepsetseed/RAND(seed)for reproducible output.