2. Multiple Linear Regression


{b,bse,bstan,bpval} = linreg(x, y {,opt,om})
estimates the coefficients $\beta_0,\ldots,\beta_p$ for a linear problem from data x and y and calculates the ANOVA table
{xs,bs,pvalue} = linregfs(x, y {,alpha})
performs forward selection, estimates the coefficients $\beta_0,\ldots,\beta_p$ for a linear problem from data x and y and calculates the ANOVA table
{b,bse,bstan,bpval} = linregfs2(x, y, colname {,opt})
performs forward selection and estimates the coefficients $\beta_0,\ldots,\beta_p$ for a linear problem from data x and y
{b,bse,bstan,bpval} = linregbs(x, y, colname {,opt})
performs backward elimination, estimates the coefficients $\beta_0,\ldots,\beta_p$ for a linear problem from data x and y and calculates the ANOVA table
{b,bse,bstan,bpval} = linregstep(x, y, colname {,opt})
performs stepwise selection, estimates the coefficients $\beta_0,\ldots,\beta_p$ for a linear problem from data x and y and calculates the ANOVA table

In this section, we consider the linear model

\begin{displaymath}Y=\beta_0+\beta_1 X_1+\ldots+\beta_p X_p\,.\end{displaymath}

Looking at this model, we are faced with two problems: the first is to estimate the parameters $\beta_0,\ldots,\beta_p$ from the data; the second, from the mathematical point of view, is to reduce the dimension of the model. Seen with the eyes of the user, the second problem gives us information about building a parsimonious model. To this end, we have to find a way to handle the selection or removal of a variable.
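
The first problem is solved by least squares: collecting the observations in a response vector $Y$ and a design matrix $X$ (with a leading column of ones for the intercept), the coefficients are estimated by

\begin{displaymath}\widehat\beta=(X^\top X)^{-1}X^\top Y\,,\end{displaymath}

the standard least-squares solution that underlies linreg.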

We want to use a simulated data set to demonstrate the solution to these two problems. This example is stored in regr3.xpl, where we generate five uniform[0,1] distributed variables $X_1,\ldots,X_5$. Only three of them influence Y:

\begin{displaymath}Y=2+2\,X_1-10\,X_3+0.5\,X_4+\varepsilon\,.\end{displaymath}

Here, $\varepsilon$ is a normally distributed error term (scaled to standard deviation 0.1 in the code below).

  randomize(1)        ; sets a seed for the random generator
  eps=normal(10)      ; generates 10 standard normal errors
  x1=uniform(10)      ; generates 10 uniformly distributed values
  x2=uniform(10)
  x3=uniform(10)
  x4=uniform(10)
  x5=uniform(10)
  x=x1~x2~x3~x4~x5    ; creates the x data matrix
  y=2+2*x1-10*x3+0.5*x4+eps/10 ; creates y
  z=x~y               ; creates the data matrix z
  z                   ; returns z

This shows

  Contents of z
  [ 1,]  0.98028  0.35235  0.29969  0.85909  0.62176   1.3936
  [ 2,]  0.83795  0.82747  0.13025  0.79595  0.59754   2.7269
  [ 3,]  0.15873  0.93534  0.91259  0.72789  0.43156  -6.5193
  [ 4,]  0.67269  0.67909  0.28156  0.20918  0.19878   0.69022
  [ 5,]  0.50166  0.97112  0.39945  0.57865  0.19337  -0.66278
  [ 6,]  0.94527  0.36003  0.77747  0.029797 0.40124  -3.9237
  [ 7,]  0.18426  0.29004  0.24534  0.44418  0.35116   0.11605
  [ 8,]  0.36232  0.35453  0.53022  0.4497   0.8062   -2.3026
  [ 9,]  0.50832  0.00516  0.90669  0.16523  0.75683  -5.9188
  [10,]  0.76022  0.17825  0.37929  0.093234 0.17747  -0.20187

Let us start with the first problem and use the quantlet linreg to estimate the parameters of the model

\begin{displaymath}Y=\beta_0+\beta_1X_1+\ldots+\beta_5X_5+\varepsilon\end{displaymath}


  {beta,bse,bstan,bpval}=linreg(x,y)  ; computes the linear
                                      ;    regression

produces

  A  N  O  V  A            SS    df     MSS     F-test   P-value
  ______________________________________________________________
  Regression             87.241   5    17.448  4700.763   0.0000
  Residuals               0.015   4     0.004
  Total Variation        87.255   9     9.695

  Multiple R      = 0.99991
  R^2             = 0.99983
  Adjusted R^2    = 0.99962
  Standard Error  = 0.06092

  PARAMETERS        Beta      SE     StandB     t-test   P-value
  ______________________________________________________________
  b[ 0,]=         2.0745    0.0941   0.0000     22.056   0.0000
  b[ 1,]=         1.9672    0.0742   0.1875     26.517   0.0000
  b[ 2,]=         0.0043    0.0995   0.0005      0.043   0.9677
  b[ 3,]=       -10.0887    0.0936  -0.9201   -107.759   0.0000
  b[ 4,]=         0.3991    0.1203   0.0387      3.318   0.0294
  b[ 5,]=         0.0708    0.1355   0.0053      0.523   0.6289

We obtain the ANOVA and parameter tables, which show the same values as found in the previous section. Substituting the estimated parameters $\widehat\beta_0,\ldots,\widehat\beta_5$, we get for our generated data set

\begin{displaymath}\widehat Y(x)= 2.0745+1.9672x_1+0.0043x_2-10.0887x_3+0.3991x_4+0.0708x_5\,.\end{displaymath}

We know that X2 and X5 have no influence on Y. This is reflected by the fact that the estimates of $\beta_2$ and $\beta_5$ are close to zero. Now we reach our second problem: how to eliminate these variables. We can get a first impression by considering the parameter estimates and their t-values in the parameter table. A t-value is small if the corresponding variable has no influence. This is reflected in the p-value, the significance level for testing the hypothesis that a parameter equals zero. From the above table, we see that only the p-values for the constant, X1, X3 and X4 are smaller than 0.05, the typical significance level for hypothesis testing. The p-values of X2 and X5 are much larger than 0.05, which means that these coefficients are not significantly different from zero.
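
These t-values are simply the parameter estimates divided by their standard errors; for $\widehat\beta_2$, for instance,

\begin{displaymath}t=\frac{\widehat\beta_2}{\mbox{SE}(\widehat\beta_2)}=\frac{0.0043}{0.0995}\approx 0.043\,,\end{displaymath}

which matches the t-test column of the parameter table.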

The above way of choosing variables is convenient, but we would like to know whether the elimination or selection of a variable actually improves the result. This leads immediately to the stepwise model selection methods.

Let us first consider forward selection. The idea is to start from one ``good'' variable Xj and calculate the simple linear regression for

\begin{displaymath}Y=\beta_0+\beta_j X_j+\varepsilon\,.\end{displaymath}

Then we decide stepwise, for each of the remaining variables, whether its inclusion in the model improves the fit. The algorithm is as follows; a call sketch is given after the list.
FS1
Choose the variable $X_j$ with the highest t- or F-value and calculate the simple linear regression.
FS2
Of the remaining variables, add the variable $X_k$ which fulfills one of the three (equivalent) criteria:
$\bullet$ $X_k$ has the highest sample partial correlation.
$\bullet$ The model including $X_k$ increases the $R^2$-value the most.
$\bullet$ $X_k$ has the highest t- or F-value.
FS3
Repeat FS2 until one of the stopping rules applies:
$\bullet$ The order $p$ of the model has reached a predetermined $p^*$.
$\bullet$ The F-value is smaller than a predetermined value $F_{in}$.
$\bullet$ $X_k$ does not significantly improve the model fit.
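
In XploRe, forward selection as in FS1-FS3 is provided by the quantlet linregfs from the overview above. A minimal call for our simulated data, sketched here with the optional significance level alpha left at its default, is

  {xs,bs,pvalue} = linregfs(x,y)  ; forward selection: returns the
                                  ;    selected regressors, their
                                  ;    coefficients and the p-values
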
A similar idea leads to backward elimination. We start with the linear regression for the full model and eliminate, step by step, variables without influence; again, a call sketch follows the list.
BE1
Calculate the linear regression for the full model.
BE2
Eliminate the variable $X_k$ with one of the following (equivalent) properties:
$\bullet$ $X_k$ has the smallest sample partial correlation among all remaining variables.
$\bullet$ Removing $X_k$ causes the smallest change in the $R^2$-value.
$\bullet$ Of the remaining variables, $X_k$ has the smallest t- or F-value.
BE3
Repeat BE2 until one of the following stopping rules applies:
$\bullet$ The order $p$ of the model has reached a predetermined $p^*$.
$\bullet$ The F-value is larger than a predetermined value $F_{out}$.
$\bullet$ Removing $X_k$ does not significantly change the model fit.
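
Backward elimination is implemented by the quantlet linregbs. Following its syntax from the overview, which additionally expects the variable names, a call for our data would look like

  colname=string("X%.f",1:cols(x))           ; column names X1,...,X5
  {b,bse,bstan,bpval}=linregbs(x,y,colname)  ; backward elimination
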
A kind of compromise between forward selection and backward elimination is given by the stepwise selection method. Beginning with one variable, just as in forward selection, we choose at each step one of four alternatives:
1. Add a variable.
2. Remove a variable.
3. Exchange two variables.
4. Stop the selection.
This is done with the following rules:
ST1
Add the variable $X_k$ if one of the forward selection criteria of FS2 is satisfied.
ST2
Remove the variable $X_k$ with the smallest F-value if there are (possibly several) variables with an F-value smaller than $F_{out}$.
ST3
Remove the variable $X_k$ with the smallest F-value if this removal results in a larger $R^2$-value than was obtained with the same number of variables before.
ST4
Exchange the variable $X_k$ in the model with a variable $X_\ell$ not in the model if this increases the $R^2$-value.
ST5
Stop the selection if none of ST1, ST2, ST3 and ST4 applies.

Remarks: In XploRe we find the quantlets linregfs and linregfs2 for forward selection, linregbs for backward elimination, and linregstep for stepwise selection. Whereas linregfs only returns the selected regressors $X_i$, the regression coefficients $\beta_i$ and the p-values, the other three quantlets report each step as well as the ANOVA and parameter tables. Because both the syntax and the output format of these three quantlets are the same, we illustrate only one of them with an example.

We use the data set generated above of the model

\begin{displaymath}Y\,=\,2+2X_1-10X_3+0.5X_4+\varepsilon\end{displaymath}

to demonstrate the usage of stepwise elimination. Before computing the regression, we need to store the names of the variables in a column vector:

  colname=string("X%.f",1:cols(x))  ; sets the column names to X1,...,X5
  {beta,se,betastan,p} = linregstep(x,y,colname)
                                    ; computes the stepwise selection

linregstep returns the same values as linreg. It shows the following output:

  Contents of EnterandOut

  Stepwise Regression
  -------------------
  F-to-enter 5.19
  probability of F-to-enter 0.96
  F-to-remove 3.26
  probability of F-to-remove 0.90

  Variables entered and dropped in the following Steps:

  Step  Multiple R      R^2        F        SigF       Variable(s)
   1     0.9843       0.9688    248.658    0.000  In : X3
   2     0.9992       0.9984   2121.111    0.000  In : X1
   3     0.9999       0.9998  10572.426    0.000  In : X4

  A  N  O  V  A       SS      df     MSS       F-test   P-value
  _____________________________________________________________
  Regression        87.239     3    29.080   10572.426   0.0000
  Residuals          0.017     6     0.003
  Total Variation   87.255     9     9.695

  Multiple R      = 0.99991
  R^2             = 0.99981
  Adjusted R^2    = 0.99972
  Standard Error  = 0.05245

  Contents of Summary

  Variables in the Equation for Y:

  PARAMETERS    Beta    SE    StandB    t-test P-value Variable
  _____________________________________________________________
  b[ 0,]=     2.0796  0.0742  0.0000   28.0417 0.0000  Constant
  b[ 1,]=     1.9752  0.0630  0.1883   31.3494 0.0000  X1
  b[ 2,]=   -10.0622  0.0690 -0.9177 -145.7845 0.0000  X3
  b[ 3,]=     0.4257  0.0626  0.0413    6.8014 0.0005  X4

First, the quantlet linregstep reports the value $F_{in}$ as F-to-enter and $F_{out}$ as F-to-remove. Then each step is reported, and we obtain again the ANOVA and parameter tables described in the previous section.

As expected, linregstep selects the variables X1, X3 and X4 and estimates the model as

\begin{displaymath}\widehat Y(x)= 2.0796+ 1.9752x_1- 10.0622x_3+ 0.4257x_4\,.\end{displaymath}

Recall the results of the ordinary regression above. We can see that the accuracy of the estimated parameters is improved by the selection method (especially for $\widehat\beta_4$). In addition, we learn which variables can be ignored because the model does not depend on them.
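
As a quick check, we can reproduce the fitted values of the selected model directly from the estimated coefficients; here yhat is our own name for the vector of fitted values:

  yhat=2.0796+1.9752*x1-10.0622*x3+0.4257*x4 ; fitted values of the
                                             ;    selected model
  y~yhat                                     ; observed vs. fitted values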


