Regression and Least Squares: A MATLAB Tutorial
Dr. Michael D. Porter
[email protected]
Department of Statistics, North Carolina State University and SAMSI
Tuesday May 20, 2008
Introduction to Regression
Goal: Express the relationship between two (or more) variables by a mathematical formula.
x is the predictor (independent) variable
y is the response (dependent) variable
We specifically want to indicate how y varies as a function of x.
y(x) is considered a random variable, so it can never be predicted perfectly.
Example: Relating Shoe Size to Height
The problem
Footwear impressions are commonly observed at crime scenes. While there are numerous forensic properties that can be obtained from these impressions, one in particular is the shoe size. The detectives would like to be able to estimate the height of the impression maker from the shoe size.
Example: Relating Shoe Size to Height
The data
[Figure: scatter plot titled "Determining Height from Shoe Size", with Height (in) from 60 to 76 on the vertical axis and Shoe Size (Mens) from 6 to 15 on the horizontal axis.]
Data taken from: http://staff.imsa.edu/~brazzle/E2Kcurr/Forensic/Tracks/TracksSummary.html
Example: Relating Shoe Size to Height
Your answers
1. What is the predictor? What is the response?
2. Can the height of the impression maker be accurately estimated from the shoe size?
3. If a shoe is size 11, what would you advise the police?
4. What if the size is 7? Size 12.5?
[Figure: the same "Determining Height from Shoe Size" scatter plot.]
General Regression Model
Assume the true model is of the form:
y(x) = m(x) + ε(x)
The systematic part, m(x), is deterministic.
The error, ε(x), is a random variable:
Measurement error
Natural variations due to exogenous factors
Therefore, y(x) is also a random variable.
The error is additive.
Example: Sinusoid Function
y(x) = A · sin(ωx + φ) + ε(x)
Amplitude A
Angular frequency ω
Phase φ
Random error ε(x) ∼ N(0, σ²)
[Figure: plot of y(x) and m(x) over x from 0 to 10, with A = 1, ω = π/2, φ = π, σ = 0.5.]
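A minimal MATLAB sketch of simulating data from this model (parameter values taken from the slide; the variable names are illustrative):

% Simulate noisy observations from y(x) = A*sin(w*x + phi) + eps(x)
A     = 1;       % amplitude
w     = pi/2;    % angular frequency
phi   = pi;      % phase
sigma = 0.5;     % error standard deviation

x = linspace(0, 10, 200)';        % design points
m = A * sin(w*x + phi);           % systematic part m(x)
y = m + sigma * randn(size(x));   % add N(0, sigma^2) errors

plot(x, y, '.', x, m, '-');
legend('y(x)', 'm(x)'); xlabel('x'); ylabel('y(x)');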
Regression Modeling
We want to estimate m(x) and possibly the distribution of ε(x).
There are two general situations:
Theoretical Models: m(x) is of some known (or hypothesized) form but with some parameters unknown (e.g. the sinusoid function with A, ω, φ unknown).
Empirical Models: m(x) is constructed from the observed data (e.g. shoe size and height).
We often end up using both: constructing models from the observed data and prior knowledge.
The Standard Assumptions
y(x) = m(x) + ε(x)
A1: E[ε(x)] = 0 ∀x (Mean 0)
A2: Var[ε(x)] = σ² ∀x (Homoskedastic)
A3: Cov[ε(x), ε(x′)] = 0 ∀x ≠ x′ (Uncorrelated)
These assumptions are only on the error term:
ε(x) = y(x) − m(x)
Residuals
The residuals
e(xi) = y(xi) − m̂(xi)
can be used to check the estimated model, m̂(x).
If the model fit is good, the residuals should satisfy our three assumptions.
A1 - Mean 0
[Figure: two residual plots of e(x) versus x on [0, 1]: one panel that satisfies A1 and one that violates it.]
A2 - Constant Variance
[Figure: two residual plots of e(x) versus x on [0, 1]: one panel that satisfies A2 and one that violates it with non-constant spread.]
A3 - Uncorrelated
[Figure: two residual plots of e(x) versus x on [0, 1]: one panel that violates A3 and one that satisfies it.]
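A quick way to produce such a diagnostic plot in MATLAB (a sketch; x is assumed to hold the design points and e the residuals):

% Residual diagnostics: plot residuals against the predictor.
% Under A1-A3 the points should scatter evenly around zero,
% with constant spread and no systematic pattern.
plot(x, e, 'o');
hold on;
plot(xlim, [0 0], 'k--');   % reference line at e(x) = 0
hold off;
xlabel('x'); ylabel('e(x)'); title('Residual Plot');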
Back to the Shoes
How can we estimate m(x) for the shoe example?
(Non-parametric): For each shoe size, take the mean of the observed heights.
(Parametric): Assume the trend is linear.
[Figure: the "Determining Height from Shoe Size" scatter plot overlaid with the local mean and the linear trend.]
Simple Linear Regression
Simple linear regression assumes that m(x) is of the parametric form
m(x) = β0 + β1 x
which is the equation for a line.
Simple Linear Regression
Which line is the best estimate?
m(x) = β0 + β1 x

          β0     β1
Line #1   48.6   1.9
Line #2   51.5   1.6
Line #3   45.0   2.3

[Figure: the shoe-size scatter plot with the three candidate lines overlaid.]
Estimating Parameters in Linear Regression
Data
Write the observed data:
yi = β0 + β1 xi + εi,   i = 1, 2, . . . , n
where
yi ≡ y(xi) is the response value for observation i
β0 and β1 are the unknown parameters (regression coefficients)
xi is the predictor value for observation i
εi ≡ ε(xi) is the random error for observation i
Estimating Parameters in Linear Regression
Statistical Decision Theory
Let g(x) ≡ g(x; β) be an estimator for y(x).
Define a Loss Function, L(y(x), g(x)), which describes how far g(x) is from y(x).
Example: Squared Error Loss L(y(x), g(x)) = (y(x) − g(x))²
The best predictor minimizes the Risk (or expected Loss)
R(x) = E[L(y(x), g(x))]
g*(x) = arg min_{g∈G} E[L(y(x), g(x))]
Estimating Parameters in Linear Regression
Method of Least Squares
If we assume a squared error loss function
L(yi, mi) = (yi − (β0 + β1 xi))²
an approximation to the Risk function is the Sum of Squared Errors (SSE):
R(β0, β1) = Σ_{i=1}^n (yi − (β0 + β1 xi))²
Then it makes sense to estimate (β0, β1) as the values that minimize R(β0, β1):
(β̂0, β̂1) = arg min_{β0, β1} R(β0, β1)
Estimating Parameters in Linear Regression
Derivation of Linear Least Squares Solution
R(β0, β1) = Σ_{i=1}^n (yi − (β0 + β1 xi))²
Differentiate the Risk function with respect to the unknown parameters and equate to 0:
∂R/∂β0 = −2 Σ_{i=1}^n (yi − (β0 + β1 xi)) = 0
∂R/∂β1 = −2 Σ_{i=1}^n xi (yi − (β0 + β1 xi)) = 0
Estimating Parameters in Linear Regression
Linear Least Squares Solution
R(β0, β1) = Σ_{i=1}^n (yi − (β0 + β1 xi))²
The least squares estimates are
β̂1 = (Σ_{i=1}^n xi yi − n x̄ ȳ) / (Σ_{i=1}^n xi² − n x̄²)
β̂0 = ȳ − β̂1 x̄
where x̄ and ȳ are the sample means of the xi's and yi's.
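These formulas translate directly into MATLAB (a sketch; x and y are assumed to be column vectors of observed predictor and response values):

% Closed-form simple linear regression estimates
n    = length(y);
xbar = mean(x);
ybar = mean(y);

b1 = (sum(x .* y) - n*xbar*ybar) / (sum(x.^2) - n*xbar^2);   % slope
b0 = ybar - b1 * xbar;                                       % intercept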
And the winner is ... Line #2!
For these data: x̄ = 11.03, ȳ = 69.31
β̂0 = 51.46
β̂1 = 1.62
[Figure: the shoe-size scatter plot with the three candidate lines; Line #2 is the least squares fit.]
Residuals
The fitted value, ŷi, for the ith observation is
ŷi = β̂0 + β̂1 xi
The residual, ei, is the difference between the observed and fitted value
ei = yi − ŷi
The residuals are used to check if our three assumptions appear valid.
Residuals for shoe size data
[Figure: residual plot titled "Determining Height from Shoe Size", residuals (about −5 to 5) versus Shoe Size (Mens).]
Example of poor fit
[Figure: two panels over x on [−1, 1]: a scatter plot of y(x) and the residual plot from a linear fit, which shows a systematic pattern.]
Adding Polynomial Terms in the Linear Model
Modeling the mean trend as a line doesn't seem to fit extremely well in the above example. There is a systematic lack of fit.
Consider a polynomial form for the mean:
m(x) = β0 + β1 x + β2 x² + . . . + βp x^p = Σ_{k=0}^p βk x^k
This is still considered a linear model: m(x) is a linear combination of the βk.
Danger of over-fitting.
Quadratic Fit: y(x) = β0 + β1 x + β2 x² + ε(x)
[Figure: left panel, the scatter plot with the 1st order and quadratic fits overlaid; right panel, the residual plot from the quadratic fit.]
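In MATLAB, polynomial least squares fits like these can be obtained with polyfit and polyval (a sketch, assuming data vectors x and y):

% Fit 1st-order (line) and 2nd-order (quadratic) polynomials
p1 = polyfit(x, y, 1);      % coefficients [b1 b0], highest power first
p2 = polyfit(x, y, 2);      % coefficients [b2 b1 b0]
e2 = y - polyval(p2, x);    % residuals from the quadratic fit

xs = sort(x);               % sorted grid so the fitted curves plot smoothly
subplot(1,2,1);
plot(x, y, '.', xs, polyval(p1, xs), '-', xs, polyval(p2, xs), '--');
legend('data', '1st order', 'quadratic');
subplot(1,2,2);
plot(x, e2, 'o'); xlabel('x'); ylabel('e(x)');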
Matrix Approach to Linear Least Squares
Setup
Previously, we wrote our data as yi = Σ_{k=0}^p βk xi^k + εi. In matrix notation this becomes
Y = Xβ + ε
where
Y = (y1, y2, . . . , yn)ᵀ
X is the n × (p + 1) matrix whose ith row is (1, xi, xi², . . . , xi^p)
β = (β0, β1, . . . , βp)ᵀ
ε = (ε1, ε2, . . . , εn)ᵀ
How many unknown parameters are in the model?
Matrix Approach to Linear Least Squares
Solution
To minimize SSE (Sum of Squared Errors), use the Risk function
R(β) = (Y − Xβ)ᵀ(Y − Xβ)
Taking the derivative w.r.t. β gives the Normal Equations
XᵀXβ = XᵀY
The least squares solution for β is ...
Hint: See "Linear Inverse Problems: A MATLAB Tutorial" by Qin Zhang
β̂ = (XᵀX)⁻¹XᵀY
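In MATLAB the backslash operator solves this least squares problem in a numerically stable way (a sketch; x and y are assumed to be column vectors and p an illustrative polynomial degree):

% Build the design matrix and solve the least squares problem
p = 2;                          % polynomial degree (example)
X = ones(length(x), p+1);       % first column: intercept
for k = 1:p
    X(:, k+1) = x.^k;           % columns x, x.^2, ..., x.^p
end

beta_hat = X \ y;               % preferred: QR-based least squares
% Equivalent in exact arithmetic, but less stable numerically:
% beta_hat = inv(X'*X) * X' * y;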
STRETCH BREAK!!!
MATLAB Demonstration: Linear Least Squares
MATLAB Demo #1: Open Regression_Intro.m
Model Selection
How can we compare and select a final model?
How many terms should be included in polynomial models?
What is the danger of over-fitting? (Including too many terms)
What is the problem with under-fitting? (Not including enough terms)
Estimating Variance
Recall assumptions A1, A2, and A3.
For our fitted model, the residuals ei = yi − ŷi can be used to estimate Var[ε(x)].
An estimator for the variance is ...
Hint: See "Basic Statistical Concepts and Some Probability Essentials" by Justin Shows and Betsy Enstrom
The Sample Variance
s_z² = (1/(n − 1)) Σ_{i=1}^n (zi − z̄)²
Estimating Variance
Sample Variance for a rv z:
s_z² = (1/(n − 1)) Σ_{i=1}^n (zi − z̄)²
The estimator for the regression problem is similar:
σ̂ε² = (1/(n − (p + 1))) Σ_{i=1}^n ei² = SSE/df
where the degrees of freedom df = n − (p + 1). There are p + 1 unknown parameters in the model.
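A sketch of this estimator in MATLAB, reusing X, y, and beta_hat from the matrix setup above:

% Estimate the error variance from the residuals
[n, ncol] = size(X);            % ncol = p + 1 parameters
e   = y - X * beta_hat;         % residuals
SSE = sum(e.^2);                % sum of squared errors
df  = n - ncol;                 % degrees of freedom
sigma2_hat = SSE / df;          % estimate of Var[eps(x)]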
Statistical Inference
An additional assumption
In order to calculate confidence intervals (C.I.), we need a distributional assumption on ε(x). Up to now, we haven't needed one.
The standard assumption is a Normal or Gaussian distribution:
A4: ε(x) ∼ N(0, σ²)
Statistical Inference
Distributions
Using y(x0) = x0ᵀβ + ε(x0), β̂ = (XᵀX)⁻¹XᵀY, and the 4 assumptions, where x0 is a point in design space, we find
y(x0) ∼ N(x0ᵀβ, σ²)
β̂ ∼ MVN(β, σ²(XᵀX)⁻¹)
m̂(x0) = x0ᵀβ̂ ∼ N(x0ᵀβ, σ² x0ᵀ(XᵀX)⁻¹x0)
ŷ(x0) ∼ N(x0ᵀβ, σ²(1 + x0ᵀ(XᵀX)⁻¹x0))
From these we can find CI's and perform hypothesis tests.
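For example, a 95% confidence interval for m(x0) can be sketched as follows (continuing the quantities computed above; tinv is the Statistics Toolbox t-quantile function, and the point x0 is illustrative):

% 95% confidence interval for the mean m(x0) at a new design point
x0 = [1; 11; 11^2];                             % example point (degree-2 model)
m0 = x0' * beta_hat;                            % estimated mean at x0
se = sqrt(sigma2_hat * (x0' * ((X'*X) \ x0)));  % standard error of m0
tcrit = tinv(0.975, df);                        % t critical value
ci = [m0 - tcrit*se, m0 + tcrit*se];            % confidence interval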
Model Comparison
R²
Sum of Squares Error:
SSE = Σ_{i=1}^n (yi − ŷi)² = Σ_{i=1}^n ei² = e′e
Sum of Squares Total:
SST = Σ_{i=1}^n (yi − ȳ)²
This is the error from the model with intercept only, ŷ(x) = ȳ.
Coefficient of Determination:
R² = 1 − SSE/SST
R² is a measure of how much better a regression model is than the intercept only.
Model Comparison
Adjusted R²
What happens to R² if you add more terms to the model?
R² = 1 − SSE/SST
Adjusted R² penalizes by the number of terms (p + 1) in the model:
R²adj = 1 − (SSE/(n − (p + 1)))/(SST/(n − 1)) = 1 − σ̂ε²/(SST/(n − 1))
Also see residual plots, Mallow's Cp, PRESS (cross-validation), AIC, etc.
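These quantities take only a few lines of MATLAB (a sketch, reusing SSE, df, and y from before):

% R-squared and adjusted R-squared for a fitted model
n     = length(y);
SST   = sum((y - mean(y)).^2);              % intercept-only error
R2    = 1 - SSE / SST;
R2adj = 1 - (SSE / df) / (SST / (n - 1));   % df = n - (p+1)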
MATLAB Demonstration: cftool
MATLAB Demo #2: Type cftool
Nonlinear Regression
A linear regression model can be written
y(x) = Σ_{k=0}^p βk hk(x) + ε(x)
The mean, m(x), is a linear combination of the β's.
Nonlinear regression takes the general form
y(x) = m(x; β) + ε(x)
for some specified function m(x; β) with unknown parameters β.
Example: The sinusoid we looked at earlier
y(x) = A · sin(ωx + φ) + ε(x)
with parameters β = (A, ω, φ) is a nonlinear model.
Nonlinear Regression
Parameter Estimation
Making the same assumptions as in linear regression (A1-A3), the least squares solution is still valid:
β̂ = arg min_β Σ_{i=1}^n (yi − m(xi; β))²
Unfortunately, this usually doesn't have a closed form solution (like in the linear case).
Approaches to finding the solution will be discussed later in the workshop.
But that won't stop us from using nonlinear (and nonparametric) regression in MATLAB!
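One way to compute such a fit is lsqcurvefit from the Optimization Toolbox (a sketch for the sinusoid; the starting values are illustrative and matter, since the objective is non-convex):

% Nonlinear least squares fit of the sinusoid model
model = @(b, x) b(1) * sin(b(2)*x + b(3));   % m(x; beta), beta = [A w phi]
b0    = [1; 1.5; 3];                         % starting guess (illustrative)
beta_hat = lsqcurvefit(model, b0, x, y);     % minimizes the sum of squared errors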
Off again to cftool: MATLAB Demo #3
Weighted Regression
Consider the risk functions we have considered so far:
R(β) = Σ_{i=1}^n (yi − m(xi; β))²
Each observation contributes equally to the risk.
Weighted regression uses the risk function
Rw(β) = Σ_{i=1}^n wi (yi − m(xi; β))²
so observations with larger weights are more important.
Some examples:
wi = 1/σi²   Heteroskedastic (Non-constant variance)
wi = 1/xi
wi = 1/yi
wi = k/|ei|   Robust Regression
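For linear models, weighted least squares is available through lscov (a sketch; sigma_i is an assumed vector of known per-observation standard deviations):

% Weighted linear least squares: minimizes sum(w .* (y - X*beta).^2)
w = 1 ./ sigma_i.^2;        % example weights for the heteroskedastic case
beta_w = lscov(X, y, w);    % weighted least squares solution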
Transformations
Sometimes transformations are used to obtain better models:
Transform predictors x → x′
Transform response y → y′
Make sure assumptions A1-A3, A4 are still valid.
Standardized: x′ = (x − x̄)/s_x
Log: y′ = log(y)
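Both transformations are one-liners in MATLAB (a sketch, assuming data vectors x and y):

% Standardize a predictor and log-transform the response
x_std = (x - mean(x)) / std(x);   % standardized predictor
y_log = log(y);                   % log response (requires y > 0)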
The Competition
Contest to see who can construct the best model in cftool
Get into groups
Data can be found in competition_data.m
Scoring will be performed on a testing set
Want to minimize the sum of squared errors
When your group is ready, enter the model into this computer
MATLAB Help
There is lots of good assistance in the MATLAB help window.
Specifically, look at the Demos tab on the help window.
The Toolboxes of Statistics (Regression) and Optimization may be particularly useful for this workshop.