1. Kernel Density Estimation

The goal of density estimation is to approximate the probability density function $f(\bullet)$ of a random variable X. Assume we have n independent observations $x_1,\ldots,x_n$ from the random variable X. The kernel density estimator $\widehat{f}_{h}(x)$ for the estimation of the density value f(x) at point x is defined as

\begin{displaymath}
\widehat{f}_{h}(x)=
\frac{1}{nh}\sum_{i=1}^{n}
K\left(\frac{x_i-x}{h}\right),
\end{displaymath} (1)

with $K(\bullet)$ denoting a so-called kernel function and h denoting the bandwidth. A number of possible kernel functions are listed in the following table.
Table 1: Kernel functions.
Kernel K(u) XploRe function
Uniform $\frac{1}{2} \,I(\vert u \vert \le 1)$ uni
Triangle $(1-\vert u \vert) \,I(\vert u \vert \le 1)$ trian
Epanechnikov $\frac{3}{4}(1-u^{2}) \,I(\vert u \vert \le 1)$ epa
Quartic $\frac{15}{16}(1-u^{2})^{2} \,I(\vert u \vert \le 1)$ qua
Triweight $\frac{35}{32}(1-u^{2})^{3} \,I(\vert u \vert \le 1)$ tri
Gaussian $\frac{1}{\sqrt{2\pi}} \exp(-\frac{1}{2}u^2)$ gau
Cosinus $\frac{\pi}{4}\cos(\frac{\pi}{2}u) \,I(\vert u \vert \le 1)$ cosi
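For readers who want to see equation (1) outside XploRe, here is a minimal sketch in Python/NumPy. The function names and the NumPy vectorization are our own; XploRe users should use the quantlets described below.

```python
import numpy as np

# Two kernels from Table 1, defined on u = (x_i - x)/h
def gaussian(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def epanechnikov(u):
    return 0.75 * (1 - u**2) * (np.abs(u) <= 1)

def kde(x, data, h, kernel=gaussian):
    """Kernel density estimate f_h(x) as in equation (1)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    u = (data[None, :] - x[:, None]) / h   # (x_i - x)/h for every x
    return kernel(u).sum(axis=1) / (len(data) * h)
```

Summing the kernel values over the observations and dividing by nh reproduces the "sum of bumps" construction shown in Figure 1.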

In order to illustrate how the kernel density estimator works, let us rewrite equation (1) for the Uniform kernel:

\begin{eqnarray*}
\widehat{f}_{h}(x)
&=& \frac{1}{2nh}\,\sum_{i=1}^{n}
\,I\left(\left\vert\frac{x_i-x}{h}\right\vert \le 1\right) \\
&=& \frac{1}{2nh}\,\#\{\textrm{number of observations } x_i \textrm{ that fall in } [x-h,x+h]\}.
\end{eqnarray*}


It is easy to see that estimating the density at a point x amounts to counting the relative frequency of all observations $x_i$ that fall into an interval around x. The factor 1/(2nh) ensures that the resulting density estimate integrates to one, $\int \widehat{f}_{h}(x)\,dx=1$.

The Uniform kernel assigns the same weight to all observations $x_i$ in the neighborhood of x. Since observations close to x carry more information about f(x), it is reasonable to give more weight to observations near x and less weight to distant ones. All other kernels in Table 1 have this property. Figure 1 visualizes the construction of a kernel density estimate from 10 data points (red +) using the Gaussian kernel.

Figure 1: Kernel density estimate as a sum of bumps. smoo01.xpl
\includegraphics[scale=0.6]{smootherbumps.ps}


1.1 Computational Aspects

To understand the structure of the smoother library, we need to explain briefly how kernel estimates are computed in practice. Consider equation (1) again. To obtain the kernel density estimates for all observations $x_1,\ldots,x_n$, we must compute

\begin{displaymath}
\widehat{f}_{h}(x_j)=
\frac{1}{nh}\sum_{i=1}^{n}
K\left(\frac{x_i-x_j}{h}\right)\,, \quad j=1,\ldots,n.
\end{displaymath}

For an exact computation of these expressions, the kernel function $K(\bullet)$ must be evaluated $O(h\cdotp n^2)$ times, i.e. a number of times proportional to $n^{2}$ (the squared sample size). Hence, computation time grows considerably with the sample size n.

For exploratory purposes, i.e. for graphing the density estimate, it is not necessary to evaluate $\widehat{f}_h(\bullet)$ at all observations $x_1,\ldots,x_n$. Instead, the estimate can be computed, for example, on an equidistant grid $v_1,\ldots,v_m$:

\begin{displaymath}v_k = x_{\min} + \frac{k}{m} \left(x_{\max}-x_{\min}\right),
\quad k=1,\ldots,m, \quad m \ll n.\end{displaymath}

The evaluation of $\widehat{f}_h(v_k)$, $k=1,\ldots,m$, then requires only $O(h\cdotp n\cdotp m)$ steps.

Even $O(h\cdotp n\cdotp m)$ evaluations of the kernel function $K(\bullet)$ are, however, time consuming if the sample size is large. An alternative and faster way is to approximate the kernel density estimate by the WARPing method (Härdle and Scott; 1992).

The basic idea of WARPing (Weighted Average of Rounded Points) is the ``binning'' of the data into bins of length d starting at the origin x0=0. Each observation is then replaced by the center of its bin (in effect, each point is rounded to the precision given by d). Usual choices for d are h/5 or $(x_{\max}-x_{\min})/100$. In the latter case, the effective sample size r for the computation (the number of nonempty bins) can be at most 101.

For the WARPing method, the kernel function needs to be evaluated only at the points $\ell\cdotp d / h$, $\ell=1,\ldots,s$, where s is the number of bins that cover the support of the kernel function. The calculation of the estimated density then reduces to

\begin{displaymath}\widetilde{f}_h(w_j) = \frac{1}{nh}
\sum_{i=1}^{r} n_i K\left\{\frac{(i-j)\cdotp d}{h}\right\}\,,
\quad j=1,\ldots,r,
\end{displaymath}

computed on the grid $w_j = (j+0.5)\cdotp d$ (j integer), with $n_i$ denoting the number of observations in the i-th bin. The WARPing approximation requires $O(h\cdotp r/d)$ evaluations of the kernel function and $O(n) +O(h\cdotp r/d)$ steps in total. This is considerably faster than the exact computation when the sample size is large.
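The binning idea can be illustrated compactly in Python. This is a sketch of the WARPing principle, not the smoother library's implementation; the Quartic kernel and the default d = h/5 follow the text above, while the helper names are our own.

```python
import numpy as np

def warping_kde(data, h, d=None):
    """Approximate KDE by binning (WARPing): round each point to a bin
    center, then weight the bin counts with kernel values on a lattice."""
    if d is None:
        d = h / 5                               # usual choice from the text
    # bin counts n_i on bins of width d starting at the origin 0
    idx = np.floor(data / d).astype(int)
    lo, hi = idx.min(), idx.max()
    counts = np.bincount(idx - lo, minlength=hi - lo + 1)
    centers = (np.arange(lo, hi + 1) + 0.5) * d
    # kernel evaluated only on the lattice l*d/h covering its support
    s = int(np.ceil(h / d))
    u = np.arange(-s, s + 1) * d / h
    w = np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2) ** 2, 0.0)  # Quartic
    fh = np.convolve(counts, w, mode="same") / (len(data) * h)
    return centers, fh
```

The expensive double loop over observations is replaced by one pass over the data (the binning) plus a short convolution of the bin counts with the 2s+1 lattice kernel values.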

To offer a choice between the exact and the WARPing computation of kernel estimates and related functions, the smoother library provides most functionality in pairs of functions:

Functionality Exact WARPing
density estimate denxest denest
density confidence intervals denxci denci
density confidence bands denxcb dencb
density bandwidth selection - denbwsel


1.2 Computing Kernel Density Estimates


fh = denest (x {,h {,K} {,d}})
computes the kernel density estimate on a grid using the WARPing method
fh = denxest (x {,h {,K} {,v}})
computes the kernel density estimate for all observations or on a grid v by exact computation

The easiest way to compute kernel density estimates is the function denest. This function is based on the WARPing method, which makes the computation very fast. Let us apply the denest routine to the nicfoo data, which contain observations on household netincome in the first column (see Data Sets). denest requires the data as its minimum input. Optionally, the bandwidth h, the kernel function K and the discretization parameter d used for the WARPing can be specified. For example,


  netinc=read("nicfoo")

  netinc=netinc[,1]

  h=(max(netinc)-min(netinc))*0.15

  fh=denest(netinc,h)

smoo02.xpl

calculates the density estimate using 15% of the range of the data as bandwidth h. If h is not given, the rule-of-thumb bandwidth (6) is used; see Subsection 1.4 for more information. Note that we did not specify the kernel function in the above code. Therefore, the default Quartic kernel "qua" and the default binsize d are used.

To display the estimated function you may use these lines of code:


  fh=setmask(fh,"line")

  plot(fh)

  tl="Density Estimate"

  xl="netincome"

  yl="density fh"

  setgopt(plotdisplay,1,1,"title",tl,"xlabel",xl,"ylabel",yl)

The resulting density is plotted in Figure 2.

Figure 2: Kernel density estimate for the Netincome data.
\includegraphics[scale=0.6]{smootherdens1.ps}

As an alternative to denest, the function denxest can be used to obtain a density estimate by exact computation. As with denest, the data are the minimum input. Optionally, the bandwidth h, the kernel function K and a grid v for the computation can be specified. The following code computes the density estimate at all observations:


  h=(max(netinc)-min(netinc))*0.15

  fh=denxest(netinc,h)

To compute the density on a grid of 30 equidistant points within the range of x, the previous example can be modified as follows:

  h=(max(netinc)-min(netinc))*0.15

  v=grid( min(netinc), (max(netinc)-min(netinc))/29, 30)

  fh=denxest(netinc,h,"qua",v)

smoo03.xpl

Finally, let us calculate the density estimate for a family of different bandwidths in order to see the dependence on the bandwidth parameter. We modify our first code in the following way:


  hfac=(max(netinc)-min(netinc))

  fh =setmask( denest(netinc,0.15*hfac) ,"line")

  fh1=setmask( denest(netinc,0.05*hfac) ,"line","thin","magenta")

  fh2=setmask( denest(netinc,0.10*hfac) ,"line","thin","red")

  fh3=setmask( denest(netinc,0.20*hfac) ,"line","thin","blue")

  fh4=setmask( denest(netinc,0.25*hfac) ,"line","thin","green")

  plot(fh1,fh2,fh3,fh4,fh)

  tl="Family of Density Estimates"

  xl="netincome"

  yl="density fh"

  setgopt(plotdisplay,1,1,"title",tl,"xlabel",xl,"ylabel",yl)

smoo04.xpl

As you can see from the resulting graph (Figure 3), smaller bandwidths reveal more structure of the data, whereas larger bandwidths give smoother functions.

Figure 3: Family of kernel density estimates for the Netincome data.
\includegraphics[scale=0.6]{smootherdens2.ps}


1.3 Kernel Choice


h2 = canker (h1, K1, K2)
does the canonical bandwidth transformation of a bandwidth value for kernel K1 into an equivalent bandwidth for kernel K2

Kernel density estimation requires two parameters: the kernel function K and the bandwidth h. In practice, the choice of the kernel is not nearly as important as the choice of the bandwidth. The theoretical background of this observation is that kernel functions can be rescaled such that the difference between two kernel density estimates using different kernels is almost negligible (Marron and Nolan; 1988).

Instead of rescaling kernel functions, we can also rescale the bandwidths. This rescaling works as follows: Suppose we have estimated an unknown density f using some kernel $K_A$ and bandwidth $h_A$, and we are considering estimating f with a different kernel, say $K_B$. Then the appropriate bandwidth $h_B$ to use with $K_B$ is

\begin{displaymath}
h_B=\frac{\delta_{0}^{B}}{\delta_{0}^{A}}\,h_{A}\,,
\end{displaymath} (2)

where the scale factors $\delta_{0}^{A}$ and $\delta_{0}^{B}$ (the so-called canonical bandwidths) can be found in Table 2 for some selected kernel functions.


Table 2: $\delta _{0}$ for different kernels.
Kernel XploRe function $\delta _{0}$
Uniform uni $\left(\frac{9}{2}\right)^{1/5} \cong 1.3510$
Epanechnikov epa $15^{1/5} \cong 1.7188$
Quartic qua $35^{1/5} \cong 2.0362$
Gaussian gau $\left(\frac{1}{4\pi}\right)^{1/10} \cong 0.7764$

XploRe offers the function canker to do this recomputation of bandwidths:


  hqua=canker(1.0,"gau","qua")

  hqua

computes the bandwidth

  Contents of hqua

  [1,]   2.6226

that can be used with the Quartic kernel "qua" in order to find the kernel density estimate that is equivalent to using the Gaussian kernel "gau" and bandwidth 1.0.
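The canonical bandwidth transformation is a one-line computation once the constants of Table 2 are available. Here is a hedged Python transcription (our own names, not the XploRe quantlet):

```python
import numpy as np

# Canonical bandwidths delta_0 from Table 2
DELTA0 = {
    "uni": (9 / 2) ** (1 / 5),
    "epa": 15 ** (1 / 5),
    "qua": 35 ** (1 / 5),
    "gau": (1 / (4 * np.pi)) ** (1 / 10),
}

def canker(h1, k1, k2):
    """Equivalent bandwidth for kernel k2 given bandwidth h1 for kernel k1,
    via equation (2): h2 = h1 * delta0[k2] / delta0[k1]."""
    return h1 * DELTA0[k2] / DELTA0[k1]
```

For example, `canker(1.0, "gau", "qua")` reproduces the value 2.6226 shown above.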


1.4 Bandwidth Selection


hrot = denrot (x {,K {,opt}})
computes a rule-of-thumb bandwidth for density estimation
{hcrit,crit} = denbwsel(x {,h {,K {,d}}})
starts an interactive tool for kernel density bandwidth selection using the WARPing method

Now we can consider the problem of bandwidth selection. The optimal bandwidth for a kernel density estimate is typically calculated on the basis of an estimate for the integrated squared error

\begin{displaymath}ISE(\widehat{f}_h) = \int \{\widehat{f}_h(x)-f(x)\}^2\,dx\end{displaymath}

or the mean integrated squared error

\begin{displaymath}MISE(\widehat{f}_h) = \int E \{\widehat{f}_h(x)-f(x)\}^2\,dx.\end{displaymath}

Both criteria should be minimized to obtain a good approximation of the unknown density f. For ISE, it is easy to see that

\begin{displaymath}
ISE(\widehat f_{h})=
\int \widehat{f}^{2}_{h}(x)\,dx - 2 \int
\{\widehat{f}_{h}f\}(x)\,dx + \int f^{2}(x)\,dx.
\end{displaymath} (3)

Since the third term does not depend on h, the classical least squares cross-validation criterion (LSCV) estimates the first and second terms of (3). Cross-validation variants based on MISE are the biased cross-validation criterion (BCV), the smoothed cross-validation criterion (SCV) and the Jones, Marron and Park cross-validation criterion (JMP).
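To make the LSCV idea concrete, here is a sketch in Python for the Gaussian kernel. The helper names are our own, and the closed form used for $\int \widehat{f}_h^2$ (a normal density with doubled variance evaluated at all pairwise differences) is valid for the Gaussian kernel only.

```python
import numpy as np

def lscv(data, h):
    """Least squares cross-validation score for a Gaussian-kernel KDE:
    estimates the first two terms of (3); the best h minimizes it."""
    n = len(data)
    diff = data[:, None] - data[None, :]
    phi = lambda t, s: np.exp(-0.5 * (t / s) ** 2) / (s * np.sqrt(2 * np.pi))
    term1 = phi(diff, np.sqrt(2) * h).mean()        # int fh^2 dx (Gaussian)
    # leave-one-out term (2/n) sum_i fh_{-i}(x_i)
    kij = phi(diff, h)
    loo = (kij.sum(axis=1) - phi(0.0, h)) / (n - 1)
    return term1 - 2 * loo.mean()

def lscv_bandwidth(data, grid):
    """Pick the bandwidth on a search grid that minimizes the LSCV score."""
    scores = [lscv(data, h) for h in grid]
    return grid[int(np.argmin(scores))]
```

As h shrinks, the diagonal of the first term blows up, which prevents the degenerate choice h = 0; the minimizer lies in the interior of a reasonable search grid.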

Plug-in approaches attempt to estimate MISE. Under the asymptotic conditions $h\to 0$ and $nh\to\infty$, it is possible to approximate MISE by

\begin{displaymath}
MISE(\widehat f_{h})
\mathrel{\mathop{\approx}\limits_{\textrm{\small as.}}^{}}
\frac{1}{nh} \Vert K \Vert^{2}_{2}
+ \frac{h^{4}}{4} \{\mu_{2}(K)\}^{2} \, \Vert f'' \Vert^{2}_{2}\,.
\end{displaymath} (4)

The terms $\Vert K \Vert ^{2}_{2}$ and $\{\mu_{2}(K)\}^{2}$ are constants depending on the kernel function K. Analogously, $\Vert f'' \Vert^{2}_{2}$ denotes a constant depending on the second derivative of the unknown density f. Optimizing (4) with respect to h yields the following optimal bandwidth

\begin{displaymath}
h_{\textrm{opt}}=\left( \frac{\Vert K \Vert^{2}_{2}}{\Vert
f'' \Vert^{2}_{2}\,\{\mu_{2}(K)\}^{2}\,n}\right)^{1/5}\,.
\end{displaymath} (5)

$\Vert f'' \Vert^{2}_{2}$ is the only unknown term in (5). The idea behind plug-in estimates is to replace f'' by an estimate. Silverman's rule of thumb computes f'' as if f were the density of the normal distribution $N(\mu,\sigma^2)$. This gives

\begin{displaymath}\widehat{h}_0 = \left( \frac{4 \widehat{\sigma}^5}{3n} \right)^{1/5}
\cong 1.06\,\widehat{\sigma}\,n^{-1/5}\end{displaymath}

if the kernel function K is also assumed to be the Gaussian kernel.

If a different kernel function is used, the bandwidth $\widehat{h}_0$ can be rescaled by using equation (2). For example, the rule-of-thumb bandwidth for the Quartic kernel computes as

\begin{displaymath}
\widehat{h}_0 = 2.78\,\widehat{\sigma}\,n^{-1/5}\,.
\end{displaymath} (6)

This rule of thumb is implemented in denrot.
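The rule of thumb amounts to one formula; the following Python sketch illustrates the idea behind denrot (it is not the quantlet's exact implementation), using the Gaussian factor 1.06 and the rescaled Quartic factor 2.78 from (6):

```python
import numpy as np

def denrot(x, kernel="qua"):
    """Silverman rule-of-thumb bandwidth (a sketch of the idea behind
    the denrot quantlet).  Gaussian: h0 = 1.06 * sigma * n^(-1/5);
    the Quartic factor 2.78 comes from rescaling via equation (2)."""
    factor = {"gau": 1.06, "qua": 2.78}[kernel]
    return factor * np.std(x, ddof=1) * len(x) ** (-1 / 5)
```

Note that the ratio 2.78/1.06 is exactly the canonical bandwidth ratio $\delta_0^{qua}/\delta_0^{gau}$ from Table 2 (up to rounding).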

Of course, $\widehat{h}_0$ is a very crude estimate, in particular if the true density f is far from a normal density. Refined plug-in methods such as the Park and Marron plug-in and the Sheather and Jones plug-in use nonparametric estimates for f''.

The quantlet denbwsel needs the data vector as input:


  netinc=read("nicfoo")

  netinc=netinc[,1]

  tmp=denbwsel(netinc)

smoo05.xpl

This opens a selection box which offers you the choice between the different bandwidth selectors:


Figure 4: Least Squares CV bandwidth selector for Netincome data.
\includegraphics[scale=0.6]{smootherdenlscv.ps}

Choose for instance the Least Squares CV item to use LSCV. Here, a graphical display appears which shows the LSCV criterion in the upper left, the chosen optimal bandwidth in the upper right, the resulting kernel density estimate in the lower left, and some information about the search grid and the kernel in the lower right. This graphical display is shown in Figure 4.

You can play around now with the different bandwidth selectors. Try out the Park and Marron plug-in for example. Here, in contrast to the cross-validation methods, the aim is to find the root of the bandwidth selection criterion. The graphical display for the Park and Marron selector is shown in Figure 5.

Figure 5: Park and Marron plug-in for Netincome data.
\includegraphics[scale=0.6]{smootherdenpm.ps}

During the use of denbwsel, all currently obtained results are displayed in the output window. Note that for some selectors it may be necessary to enlarge the search grid for h. This can be done via the menu item (Change h range or K).

It is difficult to say in general which bandwidth selector is best. As mentioned above, Silverman's rule of thumb works well when the true density is approximately normal, but it may oversmooth bimodal or multimodal densities. A nice feature of the cross-validation methods is that the selected bandwidth automatically adapts to the smoothness of f. From a theoretical point of view, the BCV and SCV bandwidth selectors have the optimal asymptotic rate of convergence, but they need either very large sample sizes or pay with a large variance. As a consequence, we recommend determining bandwidths by different selection methods and comparing the resulting density estimates.

For detailed overviews on plug-in and cross-validation methods of bandwidth selection, see Härdle (1991), Park and Turlach (1992), and Turlach (1993). Practical experiences are available from the simulation studies in Marron (1989), Jones et al. (1992), Park and Turlach (1992), and Cao et al. (1992).


1.5 Confidence Intervals and Bands


{fh, fhl, fhu} = denci (x {,h {,alpha {,K} {,d}}})
computes the kernel density estimate and pointwise confidence intervals on a grid using the WARPing method
{fh, fhl, fhu} = dencb (x {,h {,alpha {,K} {,d}}})
computes the kernel density estimate and uniform confidence bands on a grid using the WARPing method
{fh, fhl, fhu} = denxci (x {,h {,alpha {,K} {,v}}})
computes the kernel density estimate and pointwise confidence intervals for all observations or on a grid v by exact computation
{fh, fhl, fhu} = denxcb (x {,h {,alpha {,K} {,v}}})
computes the kernel density estimate and uniform confidence bands for all observations or on a grid v by exact computation

To derive confidence intervals or confidence bands for $\widehat{f}_{h}(x)$, one has to know its sampling distribution. The following result about the asymptotic distribution of $\widehat{f}_{h}(x)$ is known from the literature (Härdle; 1991, p. 62). Suppose that f'' exists and that $h=cn^{-1/5}$. Then

\begin{displaymath}
n^{2/5} \left\{\widehat{f}_{h}(x)-f(x)\right\}
\mathrel{\mathop{\sim}\limits_{\textrm{\small as.}}^{}}
N\left( \frac{c^{2}}{2}\,f''(x)\,\mu_{2}(K)\,,\,
\frac{1}{c}\,f(x)\,\Vert K \Vert^{2}_{2} \right),
\end{displaymath} (7)

which can be rewritten as

\begin{displaymath}
\left\{\widehat{f}_{h}(x)-f(x)\right\}
\mathrel{\mathop{\sim}\limits_{\textrm{\small as.}}^{}}
N\left( \frac{h^{2}}{2}\,f''(x)\,\mu_{2}(K)\,,\,
\frac{1}{nh}\,f(x)\,\Vert K \Vert^{2}_{2} \right).
\end{displaymath} (8)

Assuming that the bias term $\frac{ h^{2} }{2} f''(x)\mu_{2}(K)$ is negligible with respect to the standard deviation, the resulting confidence interval for f(x) can be written as

\begin{displaymath}
\left[\;\widehat{f}_{h}(x)-z_{1-\frac{\alpha}{2}}\sqrt{\frac{\widehat{f}_{h}(x)\Vert K
\Vert _{2}^{2}}{nh}}\;,\;\;
\widehat{f}_{h}(x)+z_{1-\frac{\alpha}{2}}\sqrt{\frac{\widehat{f}_{h}(x)\Vert K
\Vert _{2}^{2}}{nh}}\;\right].
\end{displaymath} (9)

The value $z_{1-\frac{\alpha}{2}}$ is the $(1-\frac{\alpha}{2})$-quantile of the standard normal distribution.
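Equation (9) is straightforward to evaluate once the density estimate is available. Here is a hedged Python sketch (our own helper; the default $\Vert K \Vert_2^2 = 1/(2\sqrt{\pi})$ is the Gaussian kernel's value, and the bias term is ignored as in the text):

```python
import numpy as np
from statistics import NormalDist

def den_ci(fh, n, h, alpha=0.05, k_norm2=1 / (2 * np.sqrt(np.pi))):
    """Pointwise confidence intervals from equation (9), ignoring bias.
    fh: density estimates on a grid; k_norm2 = ||K||_2^2
    (default: the Gaussian kernel's value 1/(2 sqrt(pi)))."""
    z = NormalDist().inv_cdf(1 - alpha / 2)     # z_{1 - alpha/2}
    half = z * np.sqrt(fh * k_norm2 / (n * h))
    return fh - half, fh + half
```

Since the interval half-width grows with $\sqrt{\widehat{f}_h(x)}$, the intervals are widest where the estimated density is largest.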

The functions denci and denxci compute pointwise confidence intervals on a grid of points (denci, using WARPing) or at all observations (denxci, using exact computation). The following quantlet code computes the confidence intervals for bandwidth 0.2 (Gaussian kernel) and significance level $\alpha =0.05$. For graphical exploration, the pointwise intervals are drawn as curves and displayed in a plot. The resulting graph is shown in Figure 6.


  {fh,fl,fu}=denci(netinc,0.2,0.05,"gau")

  fh=setmask(fh,"line","blue")

  fl=setmask(fl,"line","blue","thin","dashed")

  fu=setmask(fu,"line","blue","thin","dashed")

  plot(fl,fu,fh)

  setgopt(plotdisplay,1,1,"title","Density Confidence Intervals")

smoo06.xpl

Figure 6: Pointwise confidence intervals.
\includegraphics[scale=0.6]{smootherdci.ps}

Uniform confidence bands for f can only be derived under some rather restrictive assumptions, see Härdle (1991, p. 65). Suppose that f is a density on [0,1] and $h=n^{-\delta}$, $\delta \in (\frac{1}{5},\frac{1}{2})$. Then the following formula holds under some regularity conditions for all $x \in [0,1]$ (Bickel and Rosenblatt; 1973):

\begin{displaymath}
P\Bigg(\widehat{f}_{h}(x)- z_{n,\alpha}
\sqrt{\frac{\widehat{f}_{h}(x)\Vert K \Vert _{2}^{2}}{nh}}
\le f(x) \le
\widehat{f}_{h}(x)+ z_{n,\alpha}
\sqrt{\frac{\widehat{f}_{h}(x)\Vert K \Vert _{2}^{2}}{nh}}
\;\textrm{ for all } x \in [0,1]\Bigg)
\mathrel{\mathop{\sim}\limits_{\textrm{\small as.}}^{}} 1-\alpha,
\end{displaymath} (10)

where

\begin{displaymath}z_{n,\alpha}=\left\{\frac{-\log\{-\frac12 \log(1-\alpha)\}}{(2\delta
\log{n})^{1/2}}+d_{n}\right\}^{1/2}\end{displaymath}

and

\begin{displaymath}
d_{n}=(2\delta\log{n})^{1/2}+(2\delta\log{n})^{-1/2}
\log\left\{\left(\frac{1}{2\pi}\,\frac{\Vert K'\Vert _{2}}{\Vert K
\Vert _{2}}\right)^{1/2}\right\}.\end{displaymath}
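The two formulas above for $z_{n,\alpha}$ and $d_n$ can be evaluated numerically. The following Python sketch is ours: `k_ratio` stands for $\Vert K'\Vert_2/\Vert K\Vert_2$ (kernel dependent, here left as a parameter), and $\delta$ must be chosen in $(\frac{1}{5},\frac{1}{2})$.

```python
import numpy as np

def z_band(n, alpha=0.05, delta=0.25, k_ratio=1.0):
    """Band quantile z_{n,alpha} from the Bickel-Rosenblatt formula,
    with delta in (1/5, 1/2) and k_ratio = ||K'||_2 / ||K||_2
    (a sketch; the constant placement follows the formulas above)."""
    a = (2 * delta * np.log(n)) ** 0.5
    dn = a + np.log((k_ratio / (2 * np.pi)) ** 0.5) / a
    return (-np.log(-0.5 * np.log(1 - alpha)) / a + dn) ** 0.5
```

The quantile grows slowly with n and with the confidence level, reflecting that a band must hold simultaneously over the whole interval.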

In practice, the data $x_1,\ldots,x_n$ are transformed to the interval [0,1]; the confidence bands are then computed and rescaled back to the original scale of $x_1,\ldots,x_n$. The functions dencb and denxcb can hence be applied directly to the original data $x_1,\ldots,x_n$. The following quantlet code computes the confidence bands for the netinc data and displays them in a plot:

  {fh,fl,fu}=dencb(netinc,0.2,0.05,"gau")

  fh=setmask(fh,"line","blue")

  fl=setmask(fl,"line","blue","thin","dashed")

  fu=setmask(fu,"line","blue","thin","dashed")

  plot(fl,fu,fh)

  setgopt(plotdisplay,1,1,"title","Density Confidence Bands")

smoo07.xpl

The resulting graph is shown in Figure 7. It can be clearly seen that the confidence bands are wider than the confidence intervals (Figure 6). This is explained by the fact that the whole function $f(\bullet)$ must lie within the bands with probability 95%.

Figure 7: Uniform confidence bands.
\includegraphics[scale=0.6]{smootherdcb.ps}



MD*TECH Method and Data Technologies
http://www.mdtech.de   mdtech@mdtech.de