CORR(y, x) returns the Pearson correlation coefficient between two numeric columns as a single aggregate. The result always sits between -1 and 1 and tells you how strongly the two values move together in a linear way.
What CORR computes and how to read it
The function measures a linear relationship: how well the cloud of (x, y) points falls on a straight line. You read the value by its sign and its magnitude:
+1 is a perfect direct relationship: more x, more y.
-1 is a perfect inverse one: more x, less y.
0 means no linear link (a non-linear one may still exist).
|r| of about 0.7 and up is usually called strong, 0.3-0.7 moderate.
Argument order does not change the coefficient itself: CORR(a, b) = CORR(b, a). Order will start to matter later, with the regression functions.
SELECT CORR(
o.amount,
EXTRACT(EPOCH FROM (o.created_at - u.created_at)) / 86400
) AS amount_vs_age
FROM orders o
JOIN users u ON u.id = o.user_id
WHERE o.status = 'paid';
Correlation is not causation
The big trap. A high r only says the numbers move together, not that one drives the other. Classic sources of spurious correlation:
- A hidden common factor (seasonality or a campaign pushes up both orders and traffic).
- Scale effects: large customers place more orders and have a bigger average ticket.
- Plain noise on a small sample:
r = 0.9 over 5 rows means almost nothing.
SELECT u.country,
COUNT(*) AS n,
CORR(o.amount, u.id) AS r
FROM orders o
JOIN users u ON u.id = o.user_id
GROUP BY u.country
HAVING COUNT(*) >= 30
ORDER BY r DESC;
Always report COUNT(*) alongside it: a coefficient without a sample size is easy to misread.
NULL handling and edge cases
CORR only uses pairs where both values are non-NULL. If either x or y is NULL, the whole row is silently dropped from the calculation. This is the default behaviour, not an error.
- If there are fewer than two valid pairs, the result is
NULL.
- If
x or y is constant (zero variance), the division by zero yields NULL rather than raising an exception.
SELECT dept,
COUNT(*) AS rows_total,
COUNT(salary) AS rows_with_salary,
CORR(salary, manager_id) AS r
FROM employees
GROUP BY dept;
Gotcha: COUNT(*) and the number of pairs CORR actually used can differ. To know the honest sample size, count the pairs explicitly, for example with COUNT(*) FILTER (WHERE salary IS NOT NULL AND manager_id IS NOT NULL).
Pairing with REGR_* for a trend line
CORR answers "how strong is the link" but gives you no equation for the line. The REGR_* family does, and it is cheap to compute in the same single pass:
REGR_SLOPE(y, x) is the slope of the trend line.
REGR_INTERCEPT(y, x) is where it crosses the y axis.
REGR_R2(y, x) is the coefficient of determination, equal to r squared.
SELECT REGR_SLOPE(o.amount, age_days) AS slope,
REGR_INTERCEPT(o.amount, age_days) AS intercept,
CORR(o.amount, age_days) AS r,
REGR_R2(o.amount, age_days) AS r_squared
FROM (
SELECT o.amount,
EXTRACT(EPOCH FROM (o.created_at - u.created_at)) / 86400 AS age_days
FROM orders o
JOIN users u ON u.id = o.user_id
) o;
Forecast: predicted_amount = intercept + slope * age_days.
Differences across databases
- PostgreSQL: full support for
CORR, COVAR_POP, COVAR_SAMP and the whole REGR_* set.
- ClickHouse: has
corr(x, y) but no REGR_*; you derive the slope by hand from covarPop and varPop.
- MySQL: no
CORR at all. You build the formula from AVG, STDDEV_POP and the mean product, or compute it in the application.
SELECT (AVG(x * y) - AVG(x) * AVG(y))
/ (STDDEV_POP(x) * STDDEV_POP(y)) AS r
FROM (SELECT amount AS x, user_id AS y FROM orders) t;
Bottom line: CORR is a cheap one-pass way to gauge the strength of a linear link, and a nearby REGR_* turns that estimate into a ready-to-use trend equation.
CORR(y, x)returns the Pearson correlation coefficient between two numeric columns as a single aggregate. The result always sits between -1 and 1 and tells you how strongly the two values move together in a linear way.What CORR computes and how to read it
The function measures a linear relationship: how well the cloud of
(x, y)points falls on a straight line. You read the value by its sign and its magnitude:+1is a perfect direct relationship: morex, morey.-1is a perfect inverse one: morex, lessy.0means no linear link (a non-linear one may still exist).|r|of about0.7and up is usually called strong,0.3-0.7moderate.Argument order does not change the coefficient itself:
CORR(a, b) = CORR(b, a). Order will start to matter later, with the regression functions.-- Correlation between order amount and the user's account age in days SELECT CORR( o.amount, EXTRACT(EPOCH FROM (o.created_at - u.created_at)) / 86400 ) AS amount_vs_age FROM orders o JOIN users u ON u.id = o.user_id WHERE o.status = 'paid';Correlation is not causation
The big trap. A high
ronly says the numbers move together, not that one drives the other. Classic sources of spurious correlation:r = 0.9over 5 rows means almost nothing.-- Per-country correlation, but only where the sample is large enough to trust SELECT u.country, COUNT(*) AS n, CORR(o.amount, u.id) AS r FROM orders o JOIN users u ON u.id = o.user_id GROUP BY u.country HAVING COUNT(*) >= 30 ORDER BY r DESC;Always report
COUNT(*)alongside it: a coefficient without a sample size is easy to misread.NULL handling and edge cases
CORRonly uses pairs where both values are non-NULL. If eitherxoryisNULL, the whole row is silently dropped from the calculation. This is the default behaviour, not an error.NULL.xoryis constant (zero variance), the division by zero yieldsNULLrather than raising an exception.-- Salary vs manager presence: rows with NULL manager_id are dropped automatically SELECT dept, COUNT(*) AS rows_total, COUNT(salary) AS rows_with_salary, CORR(salary, manager_id) AS r FROM employees GROUP BY dept;Gotcha:
COUNT(*)and the number of pairsCORRactually used can differ. To know the honest sample size, count the pairs explicitly, for example withCOUNT(*) FILTER (WHERE salary IS NOT NULL AND manager_id IS NOT NULL).Pairing with REGR_* for a trend line
CORRanswers "how strong is the link" but gives you no equation for the line. TheREGR_*family does, and it is cheap to compute in the same single pass:REGR_SLOPE(y, x)is the slope of the trend line.REGR_INTERCEPT(y, x)is where it crosses theyaxis.REGR_R2(y, x)is the coefficient of determination, equal torsquared.-- Trend line for amount as a function of account age, plus strength of fit SELECT REGR_SLOPE(o.amount, age_days) AS slope, REGR_INTERCEPT(o.amount, age_days) AS intercept, CORR(o.amount, age_days) AS r, REGR_R2(o.amount, age_days) AS r_squared FROM ( SELECT o.amount, EXTRACT(EPOCH FROM (o.created_at - u.created_at)) / 86400 AS age_days FROM orders o JOIN users u ON u.id = o.user_id ) o;Forecast:
predicted_amount = intercept + slope * age_days.Differences across databases
CORR,COVAR_POP,COVAR_SAMPand the wholeREGR_*set.corr(x, y)but noREGR_*; you derive the slope by hand fromcovarPopandvarPop.CORRat all. You build the formula fromAVG,STDDEV_POPand the mean product, or compute it in the application.-- Portable fallback that works even without a built-in CORR SELECT (AVG(x * y) - AVG(x) * AVG(y)) / (STDDEV_POP(x) * STDDEV_POP(y)) AS r FROM (SELECT amount AS x, user_id AS y FROM orders) t;Bottom line:
CORRis a cheap one-pass way to gauge the strength of a linear link, and a nearbyREGR_*turns that estimate into a ready-to-use trend equation.