REGR_R2 returns the coefficient of determination (R-squared) of a linear regression: a number between 0 and 1 telling you how much of the variance in the dependent variable the regression line actually explains. It is the value that answers the question "can I even trust the slope that REGR_SLOPE just gave me?".
Syntax and core idea
The arguments match every other REGR_* function: the dependent variable y first, then the independent variable x.
SELECT REGR_R2(y, x) FROM points;
The intuition is simple: R-squared compares your model against the dumbest possible model, "always predict the mean of y". If the fitted line does no better than that mean, R-squared sits near 0. If the line passes exactly through every point, R-squared is 1.
1.0 — perfect line, every point lands on it.
0.0 — the slope is useless, x explains nothing about y.
0.7 — the line explains 70% of the variance in y.
Qualifying a revenue trend
Say we aggregate daily order totals and want to know whether revenue genuinely climbs over time or it is just noise. Compute the slope first, then the goodness of fit.
WITH daily AS (
SELECT
date_trunc('day', created_at) AS day,
SUM(amount) AS revenue
FROM orders
WHERE status = 'paid'
GROUP BY 1
)
SELECT
REGR_SLOPE(revenue, EXTRACT(EPOCH FROM day)) AS slope,
REGR_R2(revenue, EXTRACT(EPOCH FROM day)) AS r2,
COUNT(*) AS days
FROM daily;
If slope > 0 but r2 is, say, 0.04, the "growth" is an illusion: the day-to-day spread is enormous and the line explains almost nothing. A slope > 0 paired with r2 = 0.8, on the other hand, is a genuinely stable trend.
Relation to CORR squared
For a simple (single predictor) linear regression there is an exact identity: REGR_R2(y, x) equals the square of the Pearson correlation coefficient CORR(y, x). That makes for a handy sanity check.
SELECT
REGR_R2(salary, manager_id) AS r2,
POWER(CORR(salary, manager_id), 2) AS corr_squared
FROM employees
WHERE manager_id IS NOT NULL;
Both columns will match down to floating-point error. This also exposes the key limitation: R-squared measures linear dependence only. A perfect parabola can return an R-squared near zero even when the variables are rigidly related.
Segmenting by country
R-squared shines inside a GROUP BY: instead of one trend for everyone, you get a per-segment reliability score and can immediately drop the noisy ones.
SELECT
u.country,
REGR_SLOPE(o.amount, EXTRACT(EPOCH FROM o.created_at)) AS slope,
REGR_R2(o.amount, EXTRACT(EPOCH FROM o.created_at)) AS r2,
COUNT(*) AS n
FROM orders o
JOIN users u ON u.id = o.user_id
GROUP BY u.country
HAVING COUNT(*) >= 30
ORDER BY r2 DESC;
The HAVING COUNT(*) >= 30 clause is not cosmetic, it is a guard: with two points R-squared is always 1, and without a minimum threshold you will mistake random coincidences for patterns.
Gotchas
- A high R-squared is not causation. It only says the line describes the data well, not that
x drives y.
- NULLs are dropped pairwise. If
x or y is NULL, that row simply does not participate, exactly as in REGR_SLOPE/CORR.
- Missing in MySQL. Neither
REGR_R2 nor the other REGR_* functions ship with standard MySQL; you must compute it by hand via a CORR equivalent or raw variances. ClickHouse also lacks REGR_R2, but its windowed statistics let you assemble R-squared from a formula.
- Linearity only. An R-squared near zero means "no linear relationship", not "no relationship at all". Glance at the actual plot before drawing conclusions.
Practical rule: always compute REGR_SLOPE and REGR_R2 together. The slope gives direction and magnitude; R-squared tells you whether that slope is worth looking at at all.
REGR_R2returns the coefficient of determination (R-squared) of a linear regression: a number between 0 and 1 telling you how much of the variance in the dependent variable the regression line actually explains. It is the value that answers the question "can I even trust the slope thatREGR_SLOPEjust gave me?".Syntax and core idea
The arguments match every other
REGR_*function: the dependent variableyfirst, then the independent variablex.SELECT REGR_R2(y, x) FROM points;The intuition is simple: R-squared compares your model against the dumbest possible model, "always predict the mean of
y". If the fitted line does no better than that mean, R-squared sits near 0. If the line passes exactly through every point, R-squared is 1.1.0— perfect line, every point lands on it.0.0— the slope is useless,xexplains nothing abouty.0.7— the line explains 70% of the variance iny.Qualifying a revenue trend
Say we aggregate daily order totals and want to know whether revenue genuinely climbs over time or it is just noise. Compute the slope first, then the goodness of fit.
WITH daily AS ( SELECT date_trunc('day', created_at) AS day, SUM(amount) AS revenue FROM orders WHERE status = 'paid' GROUP BY 1 ) SELECT REGR_SLOPE(revenue, EXTRACT(EPOCH FROM day)) AS slope, REGR_R2(revenue, EXTRACT(EPOCH FROM day)) AS r2, COUNT(*) AS days FROM daily;If
slope > 0butr2is, say,0.04, the "growth" is an illusion: the day-to-day spread is enormous and the line explains almost nothing. Aslope > 0paired withr2 = 0.8, on the other hand, is a genuinely stable trend.Relation to CORR squared
For a simple (single predictor) linear regression there is an exact identity:
REGR_R2(y, x)equals the square of the Pearson correlation coefficientCORR(y, x). That makes for a handy sanity check.SELECT REGR_R2(salary, manager_id) AS r2, POWER(CORR(salary, manager_id), 2) AS corr_squared FROM employees WHERE manager_id IS NOT NULL;Both columns will match down to floating-point error. This also exposes the key limitation: R-squared measures linear dependence only. A perfect parabola can return an R-squared near zero even when the variables are rigidly related.
Segmenting by country
R-squared shines inside a
GROUP BY: instead of one trend for everyone, you get a per-segment reliability score and can immediately drop the noisy ones.SELECT u.country, REGR_SLOPE(o.amount, EXTRACT(EPOCH FROM o.created_at)) AS slope, REGR_R2(o.amount, EXTRACT(EPOCH FROM o.created_at)) AS r2, COUNT(*) AS n FROM orders o JOIN users u ON u.id = o.user_id GROUP BY u.country HAVING COUNT(*) >= 30 ORDER BY r2 DESC;The
HAVING COUNT(*) >= 30clause is not cosmetic, it is a guard: with two points R-squared is always 1, and without a minimum threshold you will mistake random coincidences for patterns.Gotchas
xdrivesy.xoryis NULL, that row simply does not participate, exactly as inREGR_SLOPE/CORR.REGR_R2nor the otherREGR_*functions ship with standard MySQL; you must compute it by hand via aCORRequivalent or raw variances. ClickHouse also lacksREGR_R2, but its windowed statistics let you assemble R-squared from a formula.Practical rule: always compute
REGR_SLOPEandREGR_R2together. The slope gives direction and magnitude; R-squared tells you whether that slope is worth looking at at all.