2. Kernel Regression

Regression smoothing investigates the association between an explanatory variable X and a response variable Y. This section explains how to apply Nadaraya-Watson and local polynomial kernel regression.

Nonparametric regression aims to estimate the functional relation between Y and X, i.e. the conditional expectation

E(Y|X)=m(X).

The function $m(\bullet)$ is the regression function to estimate. Alternatively, we can write

\begin{displaymath}Y= m(X) + \varepsilon, \quad E(\varepsilon) = 0.\end{displaymath}

The conditional variance of Y need not be constant. Typically one assumes

\begin{displaymath}
\textrm{Var}(Y\vert X) = \sigma^2(X),
\end{displaymath} (11)

where $\sigma(\bullet)$ is a continuous and bounded function.

Suppose that we have independent observations $(x_1,y_1),\ldots,(x_n,y_n)$. The Nadaraya-Watson estimator is defined as

\begin{displaymath}
\widehat m_{h}(x)=
\frac{\sum\limits_{i=1}^n K\left(\frac{\displaystyle x_{i}-x}{\displaystyle h}\right)\,y_{i}}
{\sum\limits_{i=1}^n K\left(\frac{\displaystyle x_{i}-x}{\displaystyle h}\right)}\,.
\end{displaymath}

It is easy to see that this kernel regression estimator is just a weighted sum of the observed responses yi. The denominator ensures that the weights sum up to 1. The kernel function $K(\bullet)$ can be any kernel function from Table 1.
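To make the formula concrete, here is a minimal sketch of the Nadaraya-Watson estimator in plain Python/NumPy (an illustration of the formula only, not the XploRe implementation; the quartic kernel is picked arbitrarily as one of the kernels from Table 1):

  import numpy as np

  def quartic(u):
      # quartic (biweight) kernel: 15/16 (1-u^2)^2 on |u| <= 1
      return np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2) ** 2, 0.0)

  def nadaraya_watson(x, y, grid, h, kernel=quartic):
      # kernel weights K((x_i - x)/h) for every grid point x and observation x_i
      w = kernel((x[None, :] - grid[:, None]) / h)
      # weighted average of the responses; the denominator makes the weights sum to 1
      return (w @ y) / w.sum(axis=1)

The estimate at a point x is simply the kernel-weighted average of those y_i whose x_i lie within roughly one bandwidth of x.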


2.1 Computational Aspects

The computational effort for calculating a Nadaraya-Watson or local polynomial regression is of the same order as for kernel density estimation (see Section 1.1). As in density estimation, all routines are offered in an exact and in a WARPing version; a sketch of the binning idea behind WARPing follows the table below:

Functionality                            Exact       WARPing
Nadaraya-Watson regression               regxest     regest
Nadaraya-Watson confidence intervals     regxci      regci
Nadaraya-Watson confidence bands         regxcb      regcb
Nadaraya-Watson bandwidth selection      regxbwsel   regbwsel
local polynomial regression              lpregxest   lpregest
local polynomial derivatives             lpderxest   lpderest
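To illustrate why the WARPing versions are fast: the observations are first rounded to bins of small width d, so that the kernel only has to be evaluated at the few distinct bin distances instead of at all pairs of points. A rough Python/NumPy sketch of this binning idea for the Nadaraya-Watson case (a sketch of the principle only, not the XploRe routine; `quartic` is the kernel from the sketch above):

  import numpy as np

  def binned_nw(x, y, h, d, kernel):
      # 1. round every observation to a bin of width d (the "rounded points")
      origin = x.min()
      idx = np.floor((x - origin) / d).astype(int)
      nbins = idx.max() + 1
      counts = np.bincount(idx, minlength=nbins)            # binned counts
      ysums = np.bincount(idx, weights=y, minlength=nbins)  # binned sums of y
      # 2. kernel weights at the distinct bin distances only
      m = int(np.ceil(h / d))
      wts = kernel(np.arange(-m, m + 1) * d / h)
      # 3. numerator and denominator of the estimator as discrete convolutions
      num = np.convolve(ysums, wts, mode="same")
      den = np.convolve(counts, wts, mode="same")
      grid = origin + (np.arange(nbins) + 0.5) * d
      return grid, num / np.maximum(den, 1e-12)

The effort is of order n plus (number of bins) times (number of kernel weights), instead of n times the number of evaluation points for the exact estimator.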


2.2 Computing Kernel Regression Estimates


mh = regest (x {,h {,K} {,d}})
computes the kernel regression on a grid using the WARPing method
mh = regxest (x {,h {,K} {,v}})
computes the kernel regression estimate for all observations or on a grid v by exact computation

The WARPing-based function regest offers the fastest way to compute the Nadaraya-Watson regression estimator for exploratory purposes. We apply this routine to the nicfoo data, which contain observations on household netincome in the first column and on food expenditures in the second column. The following quantlet computes and plots the regression curve together with the data:


  nicfoo=read("nicfoo")
  h=0.2*(max(nicfoo[,1])-min(nicfoo[,1]))
  mh=regest(nicfoo,h)
  mh=setmask(mh,"line","blue")
  xy=setmask(nicfoo,"cross","small")
  plot(xy,mh)
  setgopt(plotdisplay,1,1,"title","regression estimate")

smoo08.xpl

The function regest needs to be replaced by regxest if one is interested in the fitted values at all observations $x_1,\ldots,x_n$. The following quantlet code computes the regression function at all observations and shows a residual plot together with the zero line:

  mh=regxest(nicfoo,0.2)
  res=nicfoo[,1] ~ (nicfoo[,2]-mh[,2])
  res=setmask(res,"cross")
  zline=(min(nicfoo[,1])|max(nicfoo[,1])) ~ (0|0)
  zline=setmask(zline,"line","red")
  plot(res,zline)
  setgopt(plotdisplay,1,1,"title","regression residuals")

smoo09.xpl

Figure 8: Regression estimate for the netincome vs. food expenditures data.
\includegraphics[scale=0.6]{smootherreg1.ps}

Figure 9: Residuals of kernel regression.
\includegraphics[scale=0.6]{smootherreg2.ps}

The resulting regression function is shown in Figure 8; Figure 9 shows the corresponding residual plot. We observe that most of the nonlinear structure of the data is captured by the nonparametric regression function. However, the residual graph shows that the data are heteroskedastic, in that the residual variance increases with netincome.


2.3 Bandwidth Selection


{hcrit,crit} = regbwsel ()
starts an interactive tool for kernel regression bandwidth selection using the WARPing method
{hcrit,crit} = regxbwsel ()
starts an interactive tool for kernel regression bandwidth selection using exact computation

As in kernel density estimation, kernel regression requires choosing a kernel function and a bandwidth. The same phenomenon as in kernel density estimation can be observed here: the difference between estimates obtained with two different kernel functions K is almost negligible once the bandwidths are appropriately rescaled. To make the bandwidths for two different kernels comparable, the same technique as described in Subsection 1.3 can be used; a small numerical illustration follows.
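One common way to do this rescaling (presumably the technique of Subsection 1.3) works with canonical bandwidths $\delta_0(K)=\left\{\Vert K\Vert_2^2/\mu_2^2(K)\right\}^{1/5}$: a bandwidth chosen for one kernel is translated into an equivalent bandwidth for another kernel by multiplying with the ratio of canonical bandwidths. A small Python illustration (the constants below are standard textbook values for the Gaussian and Quartic kernels, given here only as an aside, not as part of the XploRe interface):

  import numpy as np

  # canonical bandwidths delta_0(K) = (||K||_2^2 / mu_2(K)^2)^(1/5)
  delta0 = {
      "gau": (1 / (2 * np.sqrt(np.pi))) ** 0.2,   # Gaussian, approx. 0.776
      "qua": ((5 / 7) / (1 / 7) ** 2) ** 0.2,     # Quartic,  approx. 2.036
  }

  def rescale_bandwidth(h, from_kernel, to_kernel):
      # translate a bandwidth between two kernels via their canonical bandwidths
      return h * delta0[to_kernel] / delta0[from_kernel]

  # e.g. a Quartic bandwidth of 0.5 corresponds roughly to a Gaussian bandwidth of 0.19
  h_gauss = rescale_bandwidth(0.5, "qua", "gau")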

Consequently, we now concentrate on the problem of bandwidth selection. In the regression case, typically the averaged squared error

\begin{displaymath}
\textrm{ASE}(h) = \sum_{i=1}^n \left\{m(x_i)-\widehat{m}_h(x_i)\right\}^2
\end{displaymath} (12)

is used as the criterion for goodness of fit. A simple estimate of the ASE would be obtained by replacing the unknown function values m(xi) by the observations yi, which yields the residual sum of squares

\begin{displaymath}
\textrm{RSS}(h) = \sum_{i=1}^n \left\{y_i-\widehat{m}_h(x_i)\right\}^2.
\end{displaymath}

However, in the case of a nonparametric kernel regression, the values $\widehat{m}_h(x_i)$ approach yi as the bandwidth h approaches zero, so that RSS(h) can be made arbitrarily small. A more appropriate estimate of the ASE is given by the cross-validation criterion

\begin{displaymath}
\textrm{CV}(h)= \sum_{i=1}^n \left\{y_i-\widehat{m}_{h,-i}(x_i)\right\}^2,
\end{displaymath} (13)

where $\widehat{m}_{h,-i}(\bullet)$ denotes the kernel regression estimate which is obtained without using the i-th observation (xi,yi). Some calculation shows (Härdle; 1990) that CV(h) can be written as

\begin{displaymath}
\textrm{CV}(h) = \sum_{i=1}^n \{y_{i}-\widehat{m}_{h}(x_{i})\}^2
\,\Xi\left(W_{hi}(x_{i})\right)
\end{displaymath} (14)

with $\Xi(u)=(1-u)^{-2}$ and

\begin{displaymath}
W_{hi}(x_{i})= \frac{K(0)}{\sum\limits_{j=1}^n K\left(\frac{x_{i}-x_{j}}{h}\right)}.
\end{displaymath}

Therefore, cross-validation can also be interpreted as a penalized residual sum of squares, where small values of h are penalized by the function $\Xi$. The function $\Xi(\bullet)$ coincides with the penalty function that Craven and Wahba (1979) proposed for their generalized cross-validation criterion. Table 3 lists some more penalty functions.


Table 3: Penalizing functions.

Criterion                          Penalty Function
(Generalized) Cross-validation     $\Xi_{\textrm{GCV}}(u)=(1-u)^{-2}$
Shibata's model selector           $\Xi_{\textrm{S}}(u)=1+2u$
Akaike's Information Criterion     $\Xi_{\textrm{AIC}}(u)=\exp\,(2u)$
Akaike's Finite Prediction Error   $\Xi_{\textrm{FPE}}(u)=(1+u)/(1-u)$
Rice's T                           $\Xi_{\textrm{T}}(u)=(1-2u)^{-1}$

All the mentioned penalty functions have the same asymptotic properties. In finite samples, however, the functions differ in the relative weight they give to variance and bias of $\widehat{m}_{h}(x)$. Rice's T gives the most weight to variance reduction while Shibata's model selector stresses bias reduction the most.
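To make (13) and (14) concrete, here is a short Python/NumPy sketch that evaluates the penalized residual sum of squares for a Nadaraya-Watson fit and picks the bandwidth minimizing it over a grid (an illustration of the criterion only, not of regbwsel or regxbwsel; `quartic` is the kernel from the first sketch):

  import numpy as np

  def penalized_rss(x, y, h, kernel, penalty=lambda u: (1 - u) ** -2):
      # criterion (14): RSS weighted by the penalty Xi(W_hi(x_i))
      w = kernel((x[None, :] - x[:, None]) / h)   # K((x_i - x_j)/h)
      mh = (w @ y) / w.sum(axis=1)                # fit at the observations
      whi = kernel(0.0) / w.sum(axis=1)           # weight of observation i at x_i
      return np.sum((y - mh) ** 2 * penalty(whi))

  def select_bandwidth(x, y, kernel, hgrid):
      # grid search over candidate bandwidths
      crit = np.array([penalized_rss(x, y, h, kernel) for h in hgrid])
      return hgrid[int(np.argmin(crit))], crit

With the default penalty $\Xi_{\textrm{GCV}}(u)=(1-u)^{-2}$ this reproduces the cross-validation criterion (14); plugging in one of the other penalty functions from Table 3 yields the corresponding selector.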

In XploRe, all penalizing functions can be applied via the functions regbwsel and regxbwsel. As can be seen from (13), criteria like CV need to be evaluated at all observations. Thus, the function regxbwsel, which uses exact computations, is to be preferred here. regbwsel uses the WARPing approximation and may select a bandwidth far from the optimal one if the discretization binwidth d is large. Note that both regbwsel and regxbwsel may suffer from numerical problems if the studied bandwidths are too small.

The following quantlet gives an example for calling regxbwsel, using the nicfoo data again:


  nicfoo=read("nicfoo")
  tmp=regxbwsel(nicfoo)

smoo10.xpl

At first, a selection box opens which offers the choice between the different penalty functions. Let us choose the first item, Cross Validation, which optimizes the CV criterion (13). A graphical display appears which shows the CV criterion in the upper left, the chosen optimal bandwidth in the upper right, the resulting kernel regression in the lower left, and information about the search grid and the kernel in the lower right.

Figure 10 shows this graphical display. The menu now allows the modification of the search grid and the kernel or the usage of other bandwidth selectors.



Figure 10: Cross-validation bandwidth selector for Netincome vs. Food data.
\includegraphics[scale=0.6]{smootherregcv.ps}


2.4 Confidence Intervals and Bands


{mh, mhl, mhu} = regci (x {,h {,alpha {,K} {,d}}})
computes the kernel regression estimate and pointwise confidence intervals on a grid using the WARPing method
{mh, mhl, mhu} = regcb (x {,h {,alpha {,K} {,d}}})
computes the kernel regression estimate and uniform confidence bands on a grid using the WARPing method
{mh, mhl, mhu} = regxci (x {,h {,alpha {,K} {,v}}})
computes the kernel regression estimate and pointwise confidence intervals for all observations or on a grid v by exact computation
{mh, mhl, mhu} = regxcb (x {,h {,alpha {,K} {,v}}})
computes the kernel regression estimate and uniform confidence bands for all observations or on a grid v by exact computation

As in the case of density estimation, it can be shown that the regression estimates $\widehat{m}_{h}(x)$ have an asymptotic normal distribution. Suppose that m and f (the density of the explanatory variable X) are twice differentiable, and that $h = c\,n^{-1/5}$. Then

\begin{displaymath}
n^{2/5} \left\{\widehat{m}_{h}(x)-m(x)\right\}
\mathrel{\mathop{\longrightarrow}\limits^{\cal L}}
N\left(\frac{c^{2}}{2}\,\mu_{2}(K)\,b(x),\;
\frac{1}{c}\,\frac{\sigma^2(x)}{f(x)}\,\Vert K \Vert^{2}_{2} \right),
\end{displaymath} (15)

with

\begin{displaymath}b(x)=m''(x)+2\,\frac{m'(x)f'(x)}{f(x)}. \end{displaymath}

Thus, similar to the density case, an approximate confidence interval can be obtained by

\begin{displaymath}
\left[\widehat{m}_h (x) - z_{1-\frac{\alpha}{2}}
\sqrt{\frac{\widehat{\sigma}_h^2 (x)\Vert K \Vert^2_2 }{nh\widehat{f}_h(x)}}\,,\;
\widehat{m}_h (x) + z_{1-\frac{\alpha}{2}}
\sqrt{\frac{\widehat{\sigma}_h^2 (x)\Vert K \Vert^2_2 }{nh\widehat{f}_h(x)}}\, \right]\,,
\end{displaymath} (16)

where $z_{1-\frac{\alpha}{2}}$ is the $(1-\frac{\alpha}{2})$-quantile of the standard normal distribution and the estimate of the variance $Var(Y\vert x)= \sigma^2(x)$ is given by

\begin{displaymath}
\widehat{\sigma}_h^2(x)=
\frac{\sum\limits_{i=1}^n K\left(\frac{x_{i}-x}{h}\right)\{y_{i}-\widehat{m}_h(x_{i})\}^2}
{\sum\limits_{i=1}^n K\left(\frac{x_{i}-x}{h}\right)}\,.
\end{displaymath}
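A sketch of how the pointwise interval (16) can be assembled from these pieces in Python/NumPy (illustrative only, not the regci implementation; it reuses `quartic` and `nadaraya_watson` from the first sketch, and $\Vert K\Vert_2^2 = 5/7$ for the quartic kernel):

  import numpy as np
  from scipy.stats import norm

  def nw_pointwise_ci(x, y, grid, h, alpha=0.05, kernel=quartic, ksq=5 / 7):
      n = len(x)
      w = kernel((x[None, :] - grid[:, None]) / h)   # K((x_i - x)/h)
      sw = w.sum(axis=1)
      mh = (w @ y) / sw                              # regression estimate m_h(x)
      fh = sw / (n * h)                              # density estimate f_h(x)
      resid2 = (y - nadaraya_watson(x, y, x, h, kernel)) ** 2
      sigma2 = (w @ resid2) / sw                     # local variance estimate
      half = norm.ppf(1 - alpha / 2) * np.sqrt(sigma2 * ksq / (n * h * fh))
      return mh, mh - half, mh + half                # estimate, lower, upper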

Also similar to the density case, uniform confidence bands for $m(\bullet)$ need rather restrictive assumptions (Härdle; 1990, p. 116). Suppose that f is a density on [0,1] and $h=n^{-\delta}$, $\delta \in (\frac{1}{5},\frac{1}{2})$. Then it holds under some regularity conditions for all $x \in [0,1]$:

\begin{eqnarray*}
\lefteqn{P\Bigg(\widehat{m}_{h}(x)- z_{n,\alpha}
\sqrt{\frac{\widehat{\sigma}_h^2(x)\,\Vert K \Vert^2_2}{nh\widehat{f}_h(x)}}
\;\le\; m(x) \;\le\;} \\
& & \widehat{m}_{h}(x)+ z_{n,\alpha}
\sqrt{\frac{\widehat{\sigma}_h^2(x)\,\Vert K \Vert^2_2}{nh\widehat{f}_h(x)}}
\ \textrm{ for all } x \in [0,1]\Bigg)
\mathrel{\mathop{\approx}\limits_{\textrm{\small as.}}^{}} 1-\alpha,
\end{eqnarray*}


where

\begin{displaymath}
z_{n,\alpha}=\left\{\frac{-\log\{-\frac12 \log(1-\alpha)\}}{(2\delta\log{n})^{1/2}}+d_{n}\right\}^{1/2},
\end{displaymath}

\begin{displaymath}
d_{n}=(2\delta\log{n})^{1/2}+(2\delta\log{n})^{-1/2}
\log{\left(\frac{1}{2\pi}\frac{\Vert K'\Vert _{2}}{\Vert K \Vert _{2}}\right)}^{1/2}.
\end{displaymath}

Pointwise confidence intervals and uniform confidence bands using the WARPing approximation are provided by regci and regcb, respectively. The equivalents for exact computation are regxci and regxcb. The functions regcb and regxcb can be directly applied to the original data $x_1,\ldots,x_n$; the transformation to [0,1] is performed internally. The following quantlet code extends the above regression function by confidence intervals and confidence bands:


  {mh,mli,mui}=regci(nicfoo,0.18)  ; intervals
  {mh,mlb,mub}=regcb(nicfoo,0.18)  ; bands
  mh =setmask(mh,"line","blue","thick")
  mli=setmask(mli,"line","blue","thin","dashed")
  mui=setmask(mui,"line","blue","thin","dashed")
  mlb=setmask(mlb,"line","blue","thin")
  mub=setmask(mub,"line","blue","thin")
  plot(mh,mli,mui,mlb,mub)
  setgopt(plotdisplay,1,1,"title","Confidence Intervals & Bands")

smoo11.xpl

Both confidence intervals and bands were computed using bandwidth 0.18. The significance level and the kernel are not specified, so the defaults 0.05 and "qua" are used. The resulting plot is shown in Figure 11.

Figure 11: Pointwise confidence intervals (thin dashed) and uniform confidence bands (thin solid) in regression.
\includegraphics[scale=0.6]{smootherrcb.ps}


2.5 Local Polynomial Regression and Derivative Estimation


mh = lpregest (x {,h {,p {,K} {,d}}})
computes the local polynomial kernel regression on a grid using the WARPing method
mh = lpregxest (x {,h {,p {,v}}})
computes the local polynomial kernel regression estimate for all observations or on a grid v by exact computation

Note that the Nadaraya-Watson estimator is a local constant estimator, i.e. the solution of

\begin{displaymath}
\min_{\beta_{0}} \sum_{i=1}^{n} \left\{ y_{i} - \beta_{0} \right\}^{2}
K\left(\frac{x_i-x}{h}\right).
\end{displaymath}

Replacing $\beta_{0}$ by a p-th order polynomial in xi-x yields the local polynomial kernel regression estimator. For details, see Fan and Gijbels (1996). Let us illustrate this with the example of a local linear regression estimate (p=1). The minimization problem is

\begin{displaymath}
\min_{\beta_{0},\beta_{1}}
\sum_{i=1}^{n} \left\{ y_{i} - \beta_{0} - \beta_{1}(x_{i}-x) \right\}^{2}
K\left(\frac{x_i-x}{h}\right).
\end{displaymath}

The solution of the problem can be written as

\begin{displaymath}
\widehat{\beta} = (\widehat{\beta}_{0},\widehat{\beta}_{1})^{T}
= \left({\cal X}^{T} {\cal W} {\cal X}\right)^{-1} {\cal X}^{T} {\cal W} {\cal Y}
\end{displaymath} (17)

using the notations

\begin{displaymath}
{\cal X}= \left(\begin{array}{cc}
1&(x_{1}-x)\\
\vdots&\vdots\\
1&(x_{n}-x)
\end{array}\right), \quad
{\cal Y}= \left(\begin{array}{c}
y_{1}\\ \vdots \\ y_{n}
\end{array}\right),
\end{displaymath}

and ${\cal W}=\textrm{diag}\left\{K\left(\frac{x_{1}-x}{h}\right),
\ldots,K\left(\frac{x_{n}-x}{h}\right)\right\}$. In (17), $\widehat{m}_h(x)=\widehat{\beta}_{0}$ estimates the regression function itself, whereas $\widehat{m}_{1,h}(x)=\widehat{\beta}_{1}$ estimates the first derivative of the regression function.
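The weighted least squares solution (17) is easy to write down directly. A minimal Python/NumPy sketch (illustrative only, not the lpregest or lpderest implementation; `quartic` is again the kernel from the first sketch) that returns both $\widehat{\beta}_{0}$ (the local linear fit) and $\widehat{\beta}_{1}$ (the slope, i.e. the derivative estimate) at every grid point:

  import numpy as np

  def local_linear(x, y, grid, h, kernel=quartic):
      b0, b1 = [], []
      for x0 in grid:
          w = kernel((x - x0) / h)                        # diagonal of W
          X = np.column_stack([np.ones_like(x), x - x0])  # design matrix
          XtW = X.T * w                                   # X^T W
          beta = np.linalg.solve(XtW @ X, XtW @ y)        # (X^T W X)^{-1} X^T W y
          b0.append(beta[0])
          b1.append(beta[1])
      return np.array(b0), np.array(b1)

(In practice the local system can be singular if too few observations fall within the bandwidth around a grid point; a serious implementation has to handle such cases.)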

The functions lpregest and lpregxest for local polynomial regression take essentially the same input as their Nadaraya-Watson equivalents, except that an additional parameter p can be given to specify the degree of the polynomial. For local polynomial regression, an odd value of p is recommended, since odd-order local polynomial regressions outperform even-order local polynomial regressions.

Derivatives of regression functions are computed with lpderest or lpderxest. For derivative estimation, a polynomial order should be used whose difference to the derivative order is odd. Typically one uses p=1 (local linear) for the estimation of the regression function and p=2 (local quadratic) for the estimation of its first derivative.

lpregxest and lpderxest automatically use local linear and local quadratic estimation, respectively, if no order is specified. The default kernel function is the Quartic kernel "qua". Appropriate bandwidths can be found by means of rules of thumb that replace the unknown regression function by a higher-order polynomial (Fan and Gijbels; 1996). The following quantlet code estimates the regression function and its first derivative by the local polynomial method. Both functions and the data are plotted together in Figure 12.


  motcyc=read("motcyc")
  hh=lpregrot(motcyc)     ; rule-of-thumb bandwidth
  hd=lpderrot(motcyc)     ; rule-of-thumb bandwidth
  mh=lpregest(motcyc,hh)  ; local linear regression
  md=lpderest(motcyc,hd)  ; local quadratic derivative
  mh=setmask(mh,"line","black")
  md=setmask(md,"line","blue","dashed")
  xy=setmask(motcyc,"cross","small","red")
  plot(xy,mh,md)
  setgopt(plotdisplay,1,1,"title","Local Polynomial Estimation")

smoo12.xpl

Figure 12: Local linear regression (solid), derivative estimate (dashed), and data.
\includegraphics[scale=0.6]{smootherlld.ps}


