Regression and Least Squares: A MATLAB Tutorial
Dr. Michael D. Porter
[email protected]
Department of Statistics, North Carolina State University and SAMSI
Tuesday May 20, 2008
Introduction to Regression
Goal: Express the relationship between two (or more) variables by a mathematical formula.
x is the predictor (independent) variable
y is the response (dependent) variable
We specifically want to indicate how y varies as a function of x.
y(x) is considered a random variable, so it can never be predicted perfectly.
Example: Relating Shoe Size to Height
The problem
Footwear impressions are commonly observed at crime scenes. While there are numerous forensic properties that can be obtained from these impressions, one in particular is the shoe size. The detectives would like to be able to estimate the height of the impression maker from the shoe size.
Example: Relating Shoe Size to Height
The data
[Figure: scatter plot titled "Determining Height from Shoe Size", with Height (in) from 60 to 76 on the vertical axis and Shoe Size (Mens) from 6 to 15 on the horizontal axis.]
Data taken from: http://staff.imsa.edu/~brazzle/E2Kcurr/Forensic/Tracks/TracksSummary.html
Example: Relating Shoe Size to Height
Your answers
1. What is the predictor? What is the response?
2. Can the height of the impression maker be accurately estimated from the shoe size?
3. If a shoe is size 11, what would you advise the police?
4. What if the size is 7? Size 12.5?
[Figure: the same "Determining Height from Shoe Size" scatter plot.]
General Regression Model
Assume the true model is of the form:
y(x) = m(x) + ε(x)
The systematic part, m(x), is deterministic.
The error, ε(x), is a random variable:
Measurement error
Natural variations due to exogenous factors
Therefore, y(x) is also a random variable.
The error is additive.
Example: Sinusoid Function
y(x) = A · sin(ωx + φ) + ε(x)
Amplitude A
Angular frequency ω
Phase φ
Random error ε(x) ∼ N(0, σ²)
[Figure: plot of y(x) and m(x) over x from 0 to 10, with A = 1, ω = π/2, φ = π, σ = 0.5.]
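A minimal MATLAB sketch of simulating data from this model (parameter values taken from the slide; the variable names are illustrative):

% Simulate noisy observations from y(x) = A*sin(w*x + phi) + eps(x)
A     = 1;       % amplitude
w     = pi/2;    % angular frequency
phi   = pi;      % phase
sigma = 0.5;     % error standard deviation

x = linspace(0, 10, 200)';        % design points
m = A * sin(w*x + phi);           % systematic part m(x)
y = m + sigma * randn(size(x));   % add N(0, sigma^2) errors

plot(x, y, '.', x, m, '-');
legend('y(x)', 'm(x)'); xlabel('x'); ylabel('y(x)');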
Regression Modeling
We want to estimate m(x) and possibly the distribution of ε(x).
There are two general situations:
Theoretical Models: m(x) is of some known (or hypothesized) form but with some parameters unknown (e.g. the sinusoid function with A, ω, φ unknown).
Empirical Models: m(x) is constructed from the observed data (e.g. shoe size and height).
We often end up using both: constructing models from the observed data and prior knowledge.
The Standard Assumptions
y(x) = m(x) + ε(x)
A1: E[ε(x)] = 0 ∀x (Mean 0)
A2: Var[ε(x)] = σ² ∀x (Homoskedastic)
A3: Cov[ε(x), ε(x′)] = 0 ∀x ≠ x′ (Uncorrelated)
These assumptions are only on the error term:
ε(x) = y(x) − m(x)
Residuals
The residuals
e(xi) = y(xi) − m̂(xi)
can be used to check the estimated model, m̂(x).
If the model fit is good, the residuals should satisfy our three assumptions.
A1 - Mean 0
[Figure: two residual plots of e(x) versus x on [0, 1]: one panel that satisfies A1 and one that violates it.]
A2 - Constant Variance
[Figure: two residual plots of e(x) versus x on [0, 1]: one panel that satisfies A2 and one that violates it with non-constant spread.]
A3 - Uncorrelated
[Figure: two residual plots of e(x) versus x on [0, 1]: one panel that violates A3 and one that satisfies it.]
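A quick way to produce such a diagnostic plot in MATLAB (a sketch; x is assumed to hold the design points and e the residuals):

% Residual diagnostics: plot residuals against the predictor.
% Under A1-A3 the points should scatter evenly around zero,
% with constant spread and no systematic pattern.
plot(x, e, 'o');
hold on;
plot(xlim, [0 0], 'k--');   % reference line at e(x) = 0
hold off;
xlabel('x'); ylabel('e(x)'); title('Residual Plot');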
Back to the Shoes
How can we estimate m(x) for the shoe example?
(Non-parametric): For each shoe size, take the mean of the observed heights.
(Parametric): Assume the trend is linear.
[Figure: the "Determining Height from Shoe Size" scatter plot overlaid with the local mean and the linear trend.]
Simple Linear Regression
Simple linear regression assumes that m(x) is of the parametric form
m(x) = β0 + β1 x
which is the equation for a line.
Simple Linear Regression
Which line is the best estimate?
m(x) = β0 + β1 x

          β0     β1
Line #1   48.6   1.9
Line #2   51.5   1.6
Line #3   45.0   2.3

[Figure: the shoe-size scatter plot with the three candidate lines overlaid.]
Estimating Parameters in Linear Regression
Data
Write the observed data:
yi = β0 + β1 xi + εi,   i = 1, 2, . . . , n
where
yi ≡ y(xi) is the response value for observation i
β0 and β1 are the unknown parameters (regression coefficients)
xi is the predictor value for observation i
εi ≡ ε(xi) is the random error for observation i
Estimating Parameters in Linear Regression
Statistical Decision Theory
Let g(x) ≡ g(x; β) be an estimator for y(x).
Define a Loss Function, L(y(x), g(x)), which describes how far g(x) is from y(x).
Example: Squared Error Loss L(y(x), g(x)) = (y(x) − g(x))²
The best predictor minimizes the Risk (or expected Loss)
R(x) = E[L(y(x), g(x))]
g*(x) = arg min_{g∈G} E[L(y(x), g(x))]
Estimating Parameters in Linear Regression
Method of Least Squares
If we assume a squared error loss function
L(yi, mi) = (yi − (β0 + β1 xi))²
an approximation to the Risk function is the Sum of Squared Errors (SSE):
R(β0, β1) = Σ_{i=1}^n (yi − (β0 + β1 xi))²
Then it makes sense to estimate (β0, β1) as the values that minimize R(β0, β1):
(β̂0, β̂1) = arg min_{β0, β1} R(β0, β1)
Estimating Parameters in Linear Regression
Derivation of Linear Least Squares Solution
R(β0, β1) = Σ_{i=1}^n (yi − (β0 + β1 xi))²
Differentiate the Risk function with respect to the unknown parameters and equate to 0:
∂R/∂β0 = −2 Σ_{i=1}^n (yi − (β0 + β1 xi)) = 0
∂R/∂β1 = −2 Σ_{i=1}^n xi (yi − (β0 + β1 xi)) = 0
Estimating Parameters in Linear Regression
Linear Least Squares Solution
R(β0, β1) = Σ_{i=1}^n (yi − (β0 + β1 xi))²
The least squares estimates are
β̂1 = (Σ_{i=1}^n xi yi − n x̄ ȳ) / (Σ_{i=1}^n xi² − n x̄²)
β̂0 = ȳ − β̂1 x̄
where x̄ and ȳ are the sample means of the xi's and yi's.
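These formulas translate directly into MATLAB (a sketch; x and y are assumed to be column vectors of observed predictor and response values):

% Closed-form simple linear regression estimates
n    = length(y);
xbar = mean(x);
ybar = mean(y);

b1 = (sum(x .* y) - n*xbar*ybar) / (sum(x.^2) - n*xbar^2);   % slope
b0 = ybar - b1 * xbar;                                       % intercept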
And the winner is ... Line #2!
For these data: x̄ = 11.03, ȳ = 69.31
β̂0 = 51.46
β̂1 = 1.62
[Figure: the shoe-size scatter plot with the three candidate lines; Line #2 is the least squares fit.]
Residuals
The fitted value, ŷi, for the ith observation is
ŷi = β̂0 + β̂1 xi
The residual, ei, is the difference between the observed and fitted value
ei = yi − ŷi
The residuals are used to check if our three assumptions appear valid.
Residuals for shoe size data
[Figure: residual plot titled "Determining Height from Shoe Size", residuals (about −5 to 5) versus Shoe Size (Mens).]
Example of poor fit
[Figure: two panels over x on [−1, 1]: a scatter plot of y(x) and the residual plot from a linear fit, which shows a systematic pattern.]
Adding Polynomial Terms in the Linear Model
Modeling the mean trend as a line doesn't seem to fit extremely well in the above example. There is a systematic lack of fit.
Consider a polynomial form for the mean:
m(x) = β0 + β1 x + β2 x² + . . . + βp x^p = Σ_{k=0}^p βk x^k
This is still considered a linear model: m(x) is a linear combination of the βk.
Danger of over-fitting.
Quadratic Fit: y(x) = β0 + β1 x + β2 x² + ε(x)
[Figure: left panel, the scatter plot with the 1st order and quadratic fits overlaid; right panel, the residual plot from the quadratic fit.]
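In MATLAB, polynomial least squares fits like these can be obtained with polyfit and polyval (a sketch, assuming data vectors x and y):

% Fit 1st-order (line) and 2nd-order (quadratic) polynomials
p1 = polyfit(x, y, 1);      % coefficients [b1 b0], highest power first
p2 = polyfit(x, y, 2);      % coefficients [b2 b1 b0]
e2 = y - polyval(p2, x);    % residuals from the quadratic fit

xs = sort(x);               % sorted grid so the fitted curves plot smoothly
subplot(1,2,1);
plot(x, y, '.', xs, polyval(p1, xs), '-', xs, polyval(p2, xs), '--');
legend('data', '1st order', 'quadratic');
subplot(1,2,2);
plot(x, e2, 'o'); xlabel('x'); ylabel('e(x)');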
Matrix Approach to Linear Least Squares
Setup
Previously, we wrote our data as yi = Σ_{k=0}^p βk xi^k + εi. In matrix notation this becomes
Y = Xβ + ε
where
Y = (y1, y2, . . . , yn)ᵀ
X is the n × (p + 1) matrix whose ith row is (1, xi, xi², . . . , xi^p)
β = (β0, β1, . . . , βp)ᵀ
ε = (ε1, ε2, . . . , εn)ᵀ
How many unknown parameters are in the model?
Matrix Approach to Linear Least Squares
Solution
To minimize SSE (Sum of Squared Errors), use the Risk function
R(β) = (Y − Xβ)ᵀ(Y − Xβ)
Taking the derivative w.r.t. β gives the Normal Equations
XᵀXβ = XᵀY
The least squares solution for β is ...
Hint: See "Linear Inverse Problems: A MATLAB Tutorial" by Qin Zhang
β̂ = (XᵀX)⁻¹XᵀY
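In MATLAB the backslash operator solves this least squares problem in a numerically stable way (a sketch; x and y are assumed to be column vectors and p an illustrative polynomial degree):

% Build the design matrix and solve the least squares problem
p = 2;                          % polynomial degree (example)
X = ones(length(x), p+1);       % first column: intercept
for k = 1:p
    X(:, k+1) = x.^k;           % columns x, x.^2, ..., x.^p
end

beta_hat = X \ y;               % preferred: QR-based least squares
% Equivalent in exact arithmetic, but less stable numerically:
% beta_hat = inv(X'*X) * X' * y;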
STRETCH BREAK!!!
MATLAB Demonstration: Linear Least Squares
MATLAB Demo #1: Open Regression_Intro.m
Model Selection
How can we compare and select a final model?
How many terms should be included in polynomial models?
What is the danger of over-fitting? (Including too many terms)
What is the problem with under-fitting? (Not including enough terms)
Estimating Variance
Recall assumptions A1, A2, and A3.
For our fitted model, the residuals ei = yi − ŷi can be used to estimate Var[ε(x)].
An estimator for the variance is ...
Hint: See "Basic Statistical Concepts and Some Probability Essentials" by Justin Shows and Betsy Enstrom
The Sample Variance
s_z² = (1/(n − 1)) Σ_{i=1}^n (zi − z̄)²
Estimating Variance
Sample Variance for a rv z:
s_z² = (1/(n − 1)) Σ_{i=1}^n (zi − z̄)²
The estimator for the regression problem is similar:
σ̂ε² = (1/(n − (p + 1))) Σ_{i=1}^n ei² = SSE/df
where the degrees of freedom df = n − (p + 1). There are p + 1 unknown parameters in the model.
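A sketch of this estimator in MATLAB, reusing X, y, and beta_hat from the matrix setup above:

% Estimate the error variance from the residuals
[n, ncol] = size(X);            % ncol = p + 1 parameters
e   = y - X * beta_hat;         % residuals
SSE = sum(e.^2);                % sum of squared errors
df  = n - ncol;                 % degrees of freedom
sigma2_hat = SSE / df;          % estimate of Var[eps(x)]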
Statistical Inference
An additional assumption
In order to calculate confidence intervals (C.I.), we need a distributional assumption on ε(x). Up to now, we haven't needed one.
The standard assumption is a Normal or Gaussian distribution:
A4: ε(x) ∼ N(0, σ²)
Statistical Inference
Distributions
Using y(x0) = x0ᵀβ + ε(x0), β̂ = (XᵀX)⁻¹XᵀY, and the 4 assumptions, where x0 is a point in design space, we find
y(x0) ∼ N(x0ᵀβ, σ²)
β̂ ∼ MVN(β, σ²(XᵀX)⁻¹)
m̂(x0) = x0ᵀβ̂ ∼ N(x0ᵀβ, σ² x0ᵀ(XᵀX)⁻¹x0)
ŷ(x0) ∼ N(x0ᵀβ, σ²(1 + x0ᵀ(XᵀX)⁻¹x0))
From these we can find CI's and perform hypothesis tests.
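For example, a 95% confidence interval for m(x0) can be sketched as follows (continuing the quantities computed above; tinv is the Statistics Toolbox t-quantile function, and the point x0 is illustrative):

% 95% confidence interval for the mean m(x0) at a new design point
x0 = [1; 11; 11^2];                             % example point (degree-2 model)
m0 = x0' * beta_hat;                            % estimated mean at x0
se = sqrt(sigma2_hat * (x0' * ((X'*X) \ x0)));  % standard error of m0
tcrit = tinv(0.975, df);                        % t critical value
ci = [m0 - tcrit*se, m0 + tcrit*se];            % confidence interval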
Model Comparison
R²
Sum of Squares Error:
SSE = Σ_{i=1}^n (yi − ŷi)² = Σ_{i=1}^n ei² = e′e
Sum of Squares Total:
SST = Σ_{i=1}^n (yi − ȳ)²
This is the error from the model with intercept only, ŷ(x) = ȳ.
Coefficient of Determination:
R² = 1 − SSE/SST
R² is a measure of how much better a regression model is than the intercept only.
Model Comparison
Adjusted R²
What happens to R² if you add more terms to the model?
R² = 1 − SSE/SST
Adjusted R² penalizes by the number of terms (p + 1) in the model:
R²adj = 1 − (SSE/(n − (p + 1)))/(SST/(n − 1)) = 1 − σ̂ε²/(SST/(n − 1))
Also see residual plots, Mallow's Cp, PRESS (cross-validation), AIC, etc.
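These quantities take only a few lines of MATLAB (a sketch, reusing SSE, df, and y from before):

% R-squared and adjusted R-squared for a fitted model
n     = length(y);
SST   = sum((y - mean(y)).^2);              % intercept-only error
R2    = 1 - SSE / SST;
R2adj = 1 - (SSE / df) / (SST / (n - 1));   % df = n - (p+1)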
MATLAB Demonstration: cftool
MATLAB Demo #2: Type cftool
Nonlinear Regression
A linear regression model can be written
y(x) = Σ_{k=0}^p βk hk(x) + ε(x)
The mean, m(x), is a linear combination of the β's.
Nonlinear regression takes the general form
y(x) = m(x; β) + ε(x)
for some specified function m(x; β) with unknown parameters β.
Example: The sinusoid we looked at earlier
y(x) = A · sin(ωx + φ) + ε(x)
with parameters β = (A, ω, φ) is a nonlinear model.
Nonlinear Regression
Parameter Estimation
Making the same assumptions as in linear regression (A1-A3), the least squares solution is still valid:
β̂ = arg min_β Σ_{i=1}^n (yi − m(xi; β))²
Unfortunately, this usually doesn't have a closed form solution (like in the linear case).
Approaches to finding the solution will be discussed later in the workshop.
But that won't stop us from using nonlinear (and nonparametric) regression in MATLAB!
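One way to compute such a fit is lsqcurvefit from the Optimization Toolbox (a sketch for the sinusoid; the starting values are illustrative and matter, since the objective is non-convex):

% Nonlinear least squares fit of the sinusoid model
model = @(b, x) b(1) * sin(b(2)*x + b(3));   % m(x; beta), beta = [A w phi]
b0    = [1; 1.5; 3];                         % starting guess (illustrative)
beta_hat = lsqcurvefit(model, b0, x, y);     % minimizes the sum of squared errors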
Off again to cftool: MATLAB Demo #3
Weighted Regression
Consider the risk functions we have considered so far:
R(β) = Σ_{i=1}^n (yi − m(xi; β))²
Each observation contributes equally to the risk.
Weighted regression uses the risk function
Rw(β) = Σ_{i=1}^n wi (yi − m(xi; β))²
so observations with larger weights are more important.
Some examples:
wi = 1/σi²   Heteroskedastic (Non-constant variance)
wi = 1/xi
wi = 1/yi
wi = k/|ei|   Robust Regression
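For linear models, weighted least squares is available through lscov (a sketch; sigma_i is an assumed vector of known per-observation standard deviations):

% Weighted linear least squares: minimizes sum(w .* (y - X*beta).^2)
w = 1 ./ sigma_i.^2;        % example weights for the heteroskedastic case
beta_w = lscov(X, y, w);    % weighted least squares solution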
Transformations
Sometimes transformations are used to obtain better models:
Transform predictors x → x′
Transform response y → y′
Make sure assumptions A1-A3, A4 are still valid.
Standardized: x′ = (x − x̄)/s_x
Log: y′ = log(y)
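Both transformations are one-liners in MATLAB (a sketch, assuming data vectors x and y):

% Standardize a predictor and log-transform the response
x_std = (x - mean(x)) / std(x);   % standardized predictor
y_log = log(y);                   % log response (requires y > 0)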
The Competition
Contest to see who can construct the best model in cftool
Get into groups
Data can be found in competition_data.m
Scoring will be performed on a testing set
Want to minimize the sum of squared errors
When your group is ready, enter the model into this computer
MATLAB Help
There is lots of good assistance in the MATLAB help window.
Specifically, look at the Demos tab on the help window.
The Toolboxes of Statistics (Regression) and Optimization may be particularly useful for this workshop.