sqlpostgresqlstatisticsaggregate

CORR in SQL: Pearson Correlation as a Single Aggregate

How CORR(y, x) computes Pearson correlation, what its sign and magnitude mean, how NULL pairs behave, and how to add a trend line with REGR_*.

3 min readReferencesql · postgresql · statistics · aggregate · analytics

CORR(y, x) returns the Pearson correlation coefficient between two numeric columns as a single aggregate. The result always sits between -1 and 1 and tells you how strongly the two values move together in a linear way.

What CORR computes and how to read it

The function measures a linear relationship: how well the cloud of (x, y) points falls on a straight line. You read the value by its sign and its magnitude:

  • +1 is a perfect direct relationship: more x, more y.
  • -1 is a perfect inverse one: more x, less y.
  • 0 means no linear link (a non-linear one may still exist).
  • |r| of about 0.7 and up is usually called strong, 0.3-0.7 moderate.

Argument order does not change the coefficient itself: CORR(a, b) = CORR(b, a). Order will start to matter later, with the regression functions.

-- Correlation between order amount and the user's account age in days
SELECT CORR(
         o.amount,
         EXTRACT(EPOCH FROM (o.created_at - u.created_at)) / 86400
       ) AS amount_vs_age
FROM orders o
JOIN users u ON u.id = o.user_id
WHERE o.status = 'paid';

Correlation is not causation

The big trap. A high r only says the numbers move together, not that one drives the other. Classic sources of spurious correlation:

  • A hidden common factor (seasonality or a campaign pushes up both orders and traffic).
  • Scale effects: large customers place more orders and have a bigger average ticket.
  • Plain noise on a small sample: r = 0.9 over 5 rows means almost nothing.
-- Per-country correlation, but only where the sample is large enough to trust
SELECT u.country,
       COUNT(*)                  AS n,
       CORR(o.amount, u.id)      AS r
FROM orders o
JOIN users u ON u.id = o.user_id
GROUP BY u.country
HAVING COUNT(*) >= 30
ORDER BY r DESC;

Always report COUNT(*) alongside it: a coefficient without a sample size is easy to misread.

NULL handling and edge cases

CORR only uses pairs where both values are non-NULL. If either x or y is NULL, the whole row is silently dropped from the calculation. This is the default behaviour, not an error.

  • If there are fewer than two valid pairs, the result is NULL.
  • If x or y is constant (zero variance), the division by zero yields NULL rather than raising an exception.
-- Salary vs manager presence: rows with NULL manager_id are dropped automatically
SELECT dept,
       COUNT(*)                       AS rows_total,
       COUNT(salary)                  AS rows_with_salary,
       CORR(salary, manager_id)       AS r
FROM employees
GROUP BY dept;

Gotcha: COUNT(*) and the number of pairs CORR actually used can differ. To know the honest sample size, count the pairs explicitly, for example with COUNT(*) FILTER (WHERE salary IS NOT NULL AND manager_id IS NOT NULL).

Pairing with REGR_* for a trend line

CORR answers "how strong is the link" but gives you no equation for the line. The REGR_* family does, and it is cheap to compute in the same single pass:

  • REGR_SLOPE(y, x) is the slope of the trend line.
  • REGR_INTERCEPT(y, x) is where it crosses the y axis.
  • REGR_R2(y, x) is the coefficient of determination, equal to r squared.
-- Trend line for amount as a function of account age, plus strength of fit
SELECT REGR_SLOPE(o.amount, age_days)     AS slope,
       REGR_INTERCEPT(o.amount, age_days) AS intercept,
       CORR(o.amount, age_days)           AS r,
       REGR_R2(o.amount, age_days)        AS r_squared
FROM (
  SELECT o.amount,
         EXTRACT(EPOCH FROM (o.created_at - u.created_at)) / 86400 AS age_days
  FROM orders o
  JOIN users u ON u.id = o.user_id
) o;

Forecast: predicted_amount = intercept + slope * age_days.

Differences across databases

  • PostgreSQL: full support for CORR, COVAR_POP, COVAR_SAMP and the whole REGR_* set.
  • ClickHouse: has corr(x, y) but no REGR_*; you derive the slope by hand from covarPop and varPop.
  • MySQL: no CORR at all. You build the formula from AVG, STDDEV_POP and the mean product, or compute it in the application.
-- Portable fallback that works even without a built-in CORR
SELECT (AVG(x * y) - AVG(x) * AVG(y))
       / (STDDEV_POP(x) * STDDEV_POP(y)) AS r
FROM (SELECT amount AS x, user_id AS y FROM orders) t;

Bottom line: CORR is a cheap one-pass way to gauge the strength of a linear link, and a nearby REGR_* turns that estimate into a ready-to-use trend equation.

Practice on real tasks

Solve tasks in the SQL trainer with instant grading and hints.

Open trainer