COUNT(DISTINCT) in SQL: Counting Unique Values and Its Cost

COUNT(DISTINCT col) answers one of the most common analytical questions: how many unique values are in a column. It sounds trivial, but on large tables it is one of the most expensive aggregations in SQL, and it pays to understand why.

Basic unique-value counting

A plain COUNT(*) counts rows, while COUNT(DISTINCT ...) first removes duplicates and then counts what remains. Compare:

-- Total order rows
SELECT COUNT(*) FROM orders;

-- How many distinct customers actually ordered
SELECT COUNT(DISTINCT user_id) FROM orders;

-- Distinct countries among registered users
SELECT COUNT(DISTINCT country) FROM users;

A few behaviors worth knowing:

NULL is ignored: COUNT(DISTINCT col) does not treat NULL as a value.
It is commonly paired with GROUP BY, for example "unique buyers per order status":

SELECT status, COUNT(DISTINCT user_id) AS unique_buyers
FROM orders
GROUP BY status;

Why it is expensive at scale

COUNT(*) can lean on statistics and indexes and sometimes barely reads data. COUNT(DISTINCT) cannot: to drop duplicates the engine must materialize every value and either sort it or load it into a hash table.

No index-only shortcut: even an index on user_id does not store a ready-made count of distinct values.
Memory: with millions of distinct values the hash table outgrows work_mem, and PostgreSQL spills to an on-disk sort.
Multiple COUNT(DISTINCT) calls in one query are computed separately, each building its own structure:

-- Each DISTINCT is computed independently -> heavy
SELECT
  COUNT(DISTINCT user_id)  AS buyers,
  COUNT(DISTINCT country)  AS countries
FROM orders o
JOIN users u ON u.id = o.user_id;

Gotcha: in MySQL COUNT(DISTINCT a, b) works out of the box. In PostgreSQL that syntax is invalid; you must wrap the columns in a ROW(...) tuple or combine them another way.

Multi-column distinct

When "uniqueness" is defined by a pair of columns, use a tuple in PostgreSQL:

-- Unique (user, country) pairs in PostgreSQL
SELECT COUNT(DISTINCT (user_id, country)) AS uniq_pairs
FROM orders o
JOIN users u ON u.id = o.user_id;

An alternative is a GROUP BY subquery, which often reads better and optimizes more predictably:

SELECT COUNT(*) AS uniq_pairs
FROM (
  SELECT DISTINCT user_id, country
  FROM orders o
  JOIN users u ON u.id = o.user_id
) t;

Be careful with string concatenation (user_id || '-' || country): distinct inputs can collide into the same string and give wrong results. A tuple or GROUP BY is safer.

DISTINCT inside other aggregates

DISTINCT is not limited to COUNT. You can place it inside SUM, AVG, array_agg, and string_agg to aggregate only unique values:

-- Sum of distinct salaries per department (dedupes equal salaries first)
SELECT dept, SUM(DISTINCT salary) AS sum_unique_salaries
FROM employees
GROUP BY dept;

-- Comma-separated list of distinct departments per manager
SELECT manager_id, string_agg(DISTINCT dept, ', ') AS depts
FROM employees
GROUP BY manager_id;

Remember that SUM(DISTINCT salary) is rarely meaningful for money: two separate 1000 payments collapse into one. Only use DISTINCT inside an aggregate when duplicates genuinely must be removed.

Approximate alternatives: APPROX and HLL

When you need an order-of-magnitude answer over billions of rows, an exact COUNT(DISTINCT) is overkill. The HyperLogLog algorithm estimates distinct counts in fixed memory (kilobytes) with roughly 1-2% error.

ClickHouse: uniq() and uniqHLL12() are approximate; uniqExact() is the exact equivalent of COUNT(DISTINCT).
BigQuery / Snowflake: APPROX_COUNT_DISTINCT(col).
PostgreSQL: the core has none, but the postgresql-hll extension adds it.

-- ClickHouse: fast approximate distinct
SELECT uniq(user_id) AS approx_buyers FROM orders;

-- BigQuery / Snowflake style
SELECT APPROX_COUNT_DISTINCT(user_id) AS approx_buyers FROM orders;

The key advantage of HLL is mergeability: partial sketches can be combined. That lets you store daily sketches and derive monthly uniques without re-scanning raw data, something an exact COUNT(DISTINCT) cannot do. Reach for approximate counts on dashboards and trend charts, and keep the exact count for billing and reporting where every unit must be correct.