2. Multiple Linear Regression

- {b,bse,bstan,bpval} = linreg(x, y {,opt,om})
  estimates the coefficients for a linear problem from data x and y
  and calculates the ANOVA table
- {xs,bs,pvalue} = linregfs(x, y {,alpha})
  computes the forward selection and estimates the coefficients for a
  linear problem from data x and y
- {b,bse,bstan,bpval} = linregfs2(x, y, colname {,opt})
  computes the forward selection, estimates the coefficients for a
  linear problem from data x and y, and calculates the ANOVA table
- {b,bse,bstan,bpval} = linregbs(x, y, colname {,opt})
  computes the backward elimination, estimates the coefficients for a
  linear problem from data x and y, and calculates the ANOVA table
- {b,bse,bstan,bpval} = linregstep(x, y, colname {,opt})
  computes the stepwise selection, estimates the coefficients for a
  linear problem from data x and y, and calculates the ANOVA table
In this section, we consider the linear model
Y = beta0 + beta1 X1 + ... + betap Xp + eps.
Looking at this model, we are faced with two problems:
- Estimating the parameter vector beta = (beta0, ..., betap)
- Testing the significance of the components Xi
From the mathematical point of view, the second problem is about
reducing the dimension of the model. From the user's point of view,
it tells us how to build a parsimonious model. To this end, we have
to find a way to handle the selection or removal of a variable.
We want to use a simulated data set to demonstrate the solution to these
two problems. This example is stored in
regr3.xpl
where we generate five uniform[0,1] distributed variables X1, ..., X5.
Only three of them influence Y:
Y = 2 + 2 X1 - 10 X3 + 0.5 X4 + eps/10
Here, eps is a standard normally distributed error term.
randomize(1) ; sets a seed for the random generator
eps=normal(10) ; generates 10 standard normal errors
x1=uniform(10) ; generates 10 uniformly distributed values
x2=uniform(10)
x3=uniform(10)
x4=uniform(10)
x5=uniform(10)
x=x1~x2~x3~x4~x5 ; creates the x data matrix
y=2+2*x1-10*x3+0.5*x4+eps/10 ; creates y
z=x~y ; creates the data matrix z
z ; returns z
This shows
Contents of z
[ 1,] 0.98028 0.35235 0.29969 0.85909 0.62176 1.3936
[ 2,] 0.83795 0.82747 0.13025 0.79595 0.59754 2.7269
[ 3,] 0.15873 0.93534 0.91259 0.72789 0.43156 -6.5193
[ 4,] 0.67269 0.67909 0.28156 0.20918 0.19878 0.69022
[ 5,] 0.50166 0.97112 0.39945 0.57865 0.19337 -0.66278
[ 6,] 0.94527 0.36003 0.77747 0.029797 0.40124 -3.9237
[ 7,] 0.18426 0.29004 0.24534 0.44418 0.35116 0.11605
[ 8,] 0.36232 0.35453 0.53022 0.4497 0.8062 -2.3026
[ 9,] 0.50832 0.00516 0.90669 0.16523 0.75683 -5.9188
[10,] 0.76022 0.17825 0.37929 0.093234 0.17747 -0.20187
Let us start with the first problem and use the quantlet
linreg
to estimate the parameters of the model:
{beta,bse,bstan,bpval}=linreg(x,y) ; computes the linear
; regression
produces
A N O V A SS df MSS F-test P-value
______________________________________________________________
Regression 87.241 5 17.448 4700.763 0.0000
Residuals 0.015 4 0.004
Total Variation 87.255 9 9.695
Multiple R = 0.99991
R^2 = 0.99983
Adjusted R^2 = 0.99962
Standard Error = 0.06092
PARAMETERS Beta SE StandB t-test P-value
______________________________________________________________
b[ 0,]= 2.0745 0.0941 0.0000 22.056 0.0000
b[ 1,]= 1.9672 0.0742 0.1875 26.517 0.0000
b[ 2,]= 0.0043 0.0995 0.0005 0.043 0.9677
b[ 3,]= -10.0887 0.0936 -0.9201 -107.759 0.0000
b[ 4,]= 0.3991 0.1203 0.0387 3.318 0.0294
b[ 5,]= 0.0708 0.1355 0.0053 0.523 0.6289
We obtain the ANOVA and parameter tables, which return the same
values as found in the previous section. Substituting the estimated
parameters, we get for our generated data set
Yhat = 2.0745 + 1.9672 X1 + 0.0043 X2 - 10.0887 X3 + 0.3991 X4 + 0.0708 X5
We know that X2 and X5 do not have any influence on Y. This is
reflected by the fact that the estimates of beta2 and beta5 are close
to zero.
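We can check this fit numerically by computing the fitted values from the
returned coefficients ourselves. A minimal sketch, assuming (as the
parameter table suggests) that beta contains the constant as its first
element and that matrix(n) creates a column vector of n ones:
yhat = (matrix(rows(x))~x)*beta ; fitted values: constant plus slopes
res = y-yhat                    ; residuals
res'*res                        ; residual sum of squares, cf. ANOVA table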
Now we face our second problem: how to eliminate these variables. We can
get a first impression by considering the parameter estimates and their
t-values in the parameter table. The t-value is small if the corresponding
variable has no influence. This is reflected in the p-value, which is the
significance level
for testing the hypothesis that a parameter equals zero. From the above table,
we can see that only the p-values for the constant, X1, X3 and X4
are smaller than 0.05, the typical significance level for hypothesis testing.
The p-values of X2 and X5 are much larger than 0.05 which means
that they are not significantly different from zero.
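In XploRe, this check can be automated by comparing the returned p-values
with the chosen level. A sketch, assuming the comparison operator works
element-wise as in other matrix languages:
bpval<0.05 ; returns 1 for significant parameters, 0 otherwise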
The above way of choosing variables is convenient, but we want to know
whether the elimination or selection of a variable actually improves the
result. This leads immediately to the stepwise model selection methods.
Let us first consider forward selection. The idea is to start with one
``good'' variable Xj and calculate the simple linear regression
Y = beta0 + betaj Xj + eps.
Then we decide stepwise, for each of the remaining variables, whether its
inclusion in the model improves the fit. The algorithm is as follows (a
sample XploRe call is shown after the list):
- FS1: Choose the variable Xj with the highest t- or F-value and
calculate the simple linear regression.
- FS2: Of the remaining variables, add the variable Xk which fulfills
one of the three (equivalent) criteria below:
  - Xk has the highest sample partial correlation.
  - The model with Xk increases the R^2-value the most.
  - Xk has the highest t- or F-value.
- FS3: Repeat FS2 until one of the stopping rules applies:
  - The order p of the model has reached a predetermined p*.
  - The F-value is smaller than a predetermined value Fin.
  - Xk does not significantly improve the model fit.
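XploRe implements this procedure in the quantlets linregfs and linregfs2
listed at the top of this section. A minimal sketch for our simulated data
(the optional alpha argument of linregfs is presumably the significance
level for entering a variable):
colname=string("X%.f",1:cols(x))           ; sets the names X1,...,X5
{xs,bs,pvalue}=linregfs(x,y)               ; selected regressors,
                                           ; coefficients and p-values
{b,bse,bstan,bpval}=linregfs2(x,y,colname) ; forward selection with step
                                           ; report and ANOVA table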
A similar idea leads to backward elimination. We start with the linear
regression for the full model and stepwise eliminate variables without
influence (again, a sample call follows the list).
- BE1: Calculate the linear regression for the full model.
- BE2: Eliminate the variable Xk with one of the following (equivalent)
properties:
  - Xk has the smallest sample partial correlation among all remaining
variables.
  - Removing Xk causes the smallest change in R^2.
  - Of the remaining variables, Xk has the smallest t- or F-value.
- BE3: Repeat BE2 until one of the following stopping rules applies:
  - The order p of the model has reached a predetermined p*.
  - The F-value is larger than a predetermined value Fout.
  - Removing Xk does not significantly change the model fit.
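Backward elimination is implemented by the quantlet linregbs, whose syntax
matches that of linregstep demonstrated below. A minimal sketch:
colname=string("X%.f",1:cols(x))          ; sets the names X1,...,X5
{b,bse,bstan,bpval}=linregbs(x,y,colname) ; backward elimination with
                                          ; step report and ANOVA table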
A kind of compromise between forward selection and backward elimination
is given by the stepwise selection method. Beginning with one variable,
just as in forward selection, we choose at each step one of four
alternatives:
1. Add a variable.
2. Remove a variable.
3. Exchange two variables.
4. Stop the selection.
This can be done with the following rules:
- ST1: Add the variable Xk if one of the forward selection criteria
in FS2 is satisfied.
- ST2: Remove the variable Xk with the smallest F-value if there are
one or more variables with an F-value smaller than Fout.
- ST3: Remove the variable Xk with the smallest F-value if this removal
results in a larger R^2-value than was obtained with the same number of
variables before.
- ST4: Exchange a variable Xk in the model with a variable Xl not in
the model if this exchange increases the R^2-value.
- ST5: Stop the selection if none of ST1, ST2, ST3 and ST4 is
satisfied.
Remarks:
- The rules ST1, ST2 and ST3 only make sense if there are two or
more variables in the model; they are therefore only applied in this
case.
- Because of ST3, the same variable may be added and removed in
several steps of the procedure.
In XploRe we find the quantlets
linregfs
and
linregfs2
for
forward selection,
linregbs
for backward elimination, and
linregstep
for stepwise selection. Whereas
linregfs
only returns the selected regressors Xi, the regression coefficients,
and the p-values, the other three quantlets report each step, the ANOVA
table, and the parameter table. Because both the syntax and the output
format of these three quantlets are the same, we illustrate only one of
them with an example.
We use the data set generated above from the model
Y = 2 + 2 X1 - 10 X3 + 0.5 X4 + eps/10
to demonstrate the usage of stepwise selection. Before computing the
regression, we need to store the names of the variables in a column vector:
colname=string("X%.f",1:cols(x))
; sets the column names to X1,...,X5
{beta,se,betastan,p} = linregstep(x,y,colname)
; computes the stepwise selection
linregstep
returns the same values as
linreg.
It shows the following output:
Contents of EnterandOut
Stepwise Regression
-------------------
F-to-enter 5.19
probability of F-to-enter 0.96
F-to-remove 3.26
probability of F-to-remove 0.90
Variables entered and dropped in the following Steps:
Step Multiple R R^2 F SigF Variable(s)
1 0.9843 0.9688 248.658 0.000 In : X3
2 0.9992 0.9984 2121.111 0.000 In : X1
3 0.9999 0.9998 10572.426 0.000 In : X4
A N O V A SS df MSS F-test P-value
_____________________________________________________________
Regression 87.239 3 29.080 10572.426 0.0000
Residuals 0.017 6 0.003
Total Variation 87.255 9 9.695
Multiple R = 0.99991
R^2 = 0.99981
Adjusted R^2 = 0.99972
Standard Error = 0.05245
Contents of Summary
Variables in the Equation for Y:
PARAMETERS Beta SE StandB t-test P-value Variable
_____________________________________________________________
b[ 0,]= 2.0796 0.0742 0.0000 28.0417 0.0000 Constant
b[ 1,]= 1.9752 0.0630 0.1883 31.3494 0.0000 X1
b[ 2,]= -10.0622 0.0690 -0.9177 -145.7845 0.0000 X3
b[ 3,]= 0.4257 0.0626 0.0413 6.8014 0.0005 X4
First, the quantlet
linregstep
reports the Fin value as F-to-enter and the Fout value as F-to-remove.
Then each step is reported, and we obtain again the ANOVA and parameter
tables described in the previous section.
As expected,
linregstep
selects the variables X1, X3 and X4 and estimates the model as
Yhat = 2.0796 + 1.9752 X1 - 10.0622 X3 + 0.4257 X4
Recalling the results of the previous ordinary regression, we see that
the accuracy of the estimated parameters has been improved by the
selection method (especially for beta4). In addition, we obtain the
information as to which variables can be ignored because the model does
not depend on them.
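As a cross-check, we can refit the reduced model directly: running linreg
on the selected columns only should reproduce the parameter table above
(a sketch reusing the data generated earlier):
xsel=x1~x3~x4                      ; keep only the selected regressors
{b,bse,bstan,bpval}=linreg(xsel,y) ; refit of the reduced model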