Slope and intercept

I recently discovered something very nice. Postgres has several functions geared towards helping you with statistics. I will cover two of them here.

Let's use it to solve something familiar: $$y = mx + b$$

You can use it to draw a straight line through a set and pretend you've made an AI.

Functions

We'll be using two functions for this.

Slope

regr_slope(Y, X)

The description says "slope of the least-squares-fit linear equation determined by the (X, Y) pairs". In other words, this is our \( m \).

Note that Y comes before X.

Intercept

regr_intercept(Y, X)

The description says "y-intercept of the least-squares-fit linear equation determined by the (X, Y) pairs". In other words, this is our \( b \).

Note that Y comes before X.

Example

Let's make a table and push some data into it:

select
    generate_series(1, 10) x,
    generate_series(4, 31, 3) y
into
    toy_example;
 x  | y  
----+----
  1 |  4
  2 |  7
  3 | 10
  4 | 13
  5 | 16
  6 | 19
  7 | 22
  8 | 25
  9 | 28
 10 | 31
(10 rows)

Now, just from looking at it, we can tell two things; \(m = 3\) and \(b = 1\). Like I said, it's a toy example. Let's still verify our assumptions by having Postgres give us those numbers.

select
    regr_slope(y, x) as m,
    regr_intercept(y, x) as b
from
    toy_example;
 m | b 
---+---
 3 | 1
(1 row)

Looks good so far. Let's use it to calculate the next few values

select
    x, m * x + b as y
from
    generate_series(9, 14) as x,
    (
        select
            regr_slope(y, x) as m,
            regr_intercept(y, x) as b
        from
            toy_example
    ) as mb;
 x  | y  
----+----
  9 | 28
 10 | 31
 11 | 34
 12 | 37
 13 | 40
 14 | 43
(6 rows)

Congratulations, you can now predict the future. Go forth and play the lottery!


Notes

Reddit user /u/spinur1848, in a comment:

Probably shouldn't be playing with those without regr_r2, which gives you square of the correlation coefficient.

The slope and intercept functions will (almost) always give you values, even if the quality of the regression is garbage. The squared correlation coefficient is a measure of goodness of fit. You should probably calculate this number any time you would want to calculate a slope and/or an intercept.