REGR_R2 in SQL: R-squared and Goodness of Fit for Regression

REGR_R2 returns the coefficient of determination (R-squared) of a linear regression: a number between 0 and 1 telling you how much of the variance in the dependent variable the regression line actually explains. It is the value that answers the question "can I even trust the slope that REGR_SLOPE just gave me?".

Syntax and core idea

The arguments match every other REGR_* function: the dependent variable y first, then the independent variable x.

SELECT REGR_R2(y, x) FROM points;

The intuition is simple: R-squared compares your model against the dumbest possible model, "always predict the mean of y". If the fitted line does no better than that mean, R-squared sits near 0. If the line passes exactly through every point, R-squared is 1.

1.0 — perfect line, every point lands on it.
0.0 — the slope is useless, x explains nothing about y.
0.7 — the line explains 70% of the variance in y.

Qualifying a revenue trend

Say we aggregate daily order totals and want to know whether revenue genuinely climbs over time or it is just noise. Compute the slope first, then the goodness of fit.

WITH daily AS (
    SELECT
        date_trunc('day', created_at) AS day,
        SUM(amount) AS revenue
    FROM orders
    WHERE status = 'paid'
    GROUP BY 1
)
SELECT
    REGR_SLOPE(revenue, EXTRACT(EPOCH FROM day)) AS slope,
    REGR_R2(revenue, EXTRACT(EPOCH FROM day))    AS r2,
    COUNT(*)                                      AS days
FROM daily;

If slope > 0 but r2 is, say, 0.04, the "growth" is an illusion: the day-to-day spread is enormous and the line explains almost nothing. A slope > 0 paired with r2 = 0.8, on the other hand, is a genuinely stable trend.

Relation to CORR squared

For a simple (single predictor) linear regression there is an exact identity: REGR_R2(y, x) equals the square of the Pearson correlation coefficient CORR(y, x). That makes for a handy sanity check.

SELECT
    REGR_R2(salary, manager_id)        AS r2,
    POWER(CORR(salary, manager_id), 2) AS corr_squared
FROM employees
WHERE manager_id IS NOT NULL;

Both columns will match down to floating-point error. This also exposes the key limitation: R-squared measures linear dependence only. A perfect parabola can return an R-squared near zero even when the variables are rigidly related.

Segmenting by country

R-squared shines inside a GROUP BY: instead of one trend for everyone, you get a per-segment reliability score and can immediately drop the noisy ones.

SELECT
    u.country,
    REGR_SLOPE(o.amount, EXTRACT(EPOCH FROM o.created_at)) AS slope,
    REGR_R2(o.amount, EXTRACT(EPOCH FROM o.created_at))    AS r2,
    COUNT(*)                                                AS n
FROM orders o
JOIN users u ON u.id = o.user_id
GROUP BY u.country
HAVING COUNT(*) >= 30
ORDER BY r2 DESC;

The HAVING COUNT(*) >= 30 clause is not cosmetic, it is a guard: with two points R-squared is always 1, and without a minimum threshold you will mistake random coincidences for patterns.

Gotchas

A high R-squared is not causation. It only says the line describes the data well, not that x drives y.
NULLs are dropped pairwise. If x or y is NULL, that row simply does not participate, exactly as in REGR_SLOPE/CORR.
Missing in MySQL. Neither REGR_R2 nor the other REGR_* functions ship with standard MySQL; you must compute it by hand via a CORR equivalent or raw variances. ClickHouse also lacks REGR_R2, but its windowed statistics let you assemble R-squared from a formula.
Linearity only. An R-squared near zero means "no linear relationship", not "no relationship at all". Glance at the actual plot before drawing conclusions.

Practical rule: always compute REGR_SLOPE and REGR_R2 together. The slope gives direction and magnitude; R-squared tells you whether that slope is worth looking at at all.