5. Examples


5.1 Simulated Example

Let us generate the observations

$\displaystyle
Y_i=f(X_i)+\epsilon_i, \, \, \, i=1,\ldots , 100,
$

where
$\displaystyle
f(x_1,x_2) = 100 \, I(0 \leq x_1 \leq 0.5, \; 0.5 < x_2 \leq 1)
+ 120 \, I(0.5 < x_1 \leq 1, \; 0.5 < x_2 \leq 1) ,
$

$ X_i$ are independently uniformly distributed on $ [0,1] \times [0,1]$, and $ \epsilon_i$ are independent standard normal random variables. Figure 1 shows the data simulated from the function $ f$.

Figure 1: Plot of 100 simulated data points from the function $ f(x_1,x_2)$. The data points in the upper left (marked with crosses) lie in the region where $ f(x_1,x_2)=100$, the data points in the upper right (marked with triangles) in the region where $ f(x_1,x_2)=120$, and the data points in the lower half (marked with circles) in the region where $ f(x_1,x_2)=0$.
\includegraphics[scale=0.6]{tutosim1.ps}

The quantlet for generating the observations is

  proc(y)=tuto(seed,n)
    randomize(seed)
    ; draw n points uniformly on the unit square
    xdat=uniform(n,2)
    ; region index: 1 = lower half, 2 = upper left, 0 = upper right
    index=(xdat[,2]<=0.5)+(xdat[,2]>0.5).*(xdat[,1]<=0.5)*2
    ; plotting symbols for the three regions
    layout=3*(index==1)+4.*(index==0)+5.*(index==2)
    ; response: 100 in the upper left, 120 in the upper right, 0 below
    ydat=100.*(index==2)+120.*(index==0)
    y=list(xdat,ydat,layout)
  endp

  library("xclust")
  d=createdisplay(1,1)
  data=tuto(1,100)
  x=data.xdat
  setmaskp(x, data.layout, data.layout, 8)
  show(d,1,1,x)
cart01.xpl
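
For readers who want to reproduce the setup outside XploRe, here is a minimal NumPy sketch of the same data-generating process (an illustration only, not the quantlet above; the seed and generator are arbitrary):

  import numpy as np

  rng = np.random.default_rng(1)          # arbitrary seed
  n = 100
  X = rng.uniform(size=(n, 2))            # uniform on the unit square
  # f = 100 on the upper-left quarter, 120 on the upper-right, 0 below
  f = (100.0 * ((X[:, 0] <= 0.5) & (X[:, 1] > 0.5))
       + 120.0 * ((X[:, 0] > 0.5) & (X[:, 1] > 0.5)))
  y = f + rng.standard_normal(n)          # add standard normal noise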

Let us grow a tree such that the number of observations in a leaf node is less than or equal to $ 5$ (mincut), the deviance in a leaf node is greater than or equal to $ 0$ (mindev), and a cut is only made if the number of observations in the resulting nodes is larger than $ 1$ (minsize). Both predictor variables are continuous.

  library("xclust")
  data=tuto(1,100)
  type=#(1,1)
  opt=cartsplitopt("minsize",1,"mindev",0,"mincut",5)
  tr=cartsplit(data.xdat,data.ydat,type,opt)
  totleaves=leafnum(tr,1)
  totleaves
  plotcarttree(tr)
cart02.xpl
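
As a hedged aside for readers without XploRe, the growth step can be imitated with scikit-learn's DecisionTreeRegressor; the mapping between cartsplit's mincut/mindev/minsize and scikit-learn's stopping rules is only approximate, so the sketch below uses generic options that likewise allow a very deep tree (X and y come from the NumPy sketch above):

  from sklearn.tree import DecisionTreeRegressor

  deep = DecisionTreeRegressor(
      min_samples_split=2,         # split any node with at least 2 observations
      min_samples_leaf=1,          # allow very small leaves
      min_impurity_decrease=0.0,   # do not require a deviance reduction
      random_state=0,
  ).fit(X, y)                      # X, y from the NumPy sketch above
  print(deep.get_n_leaves())       # an overgrown tree with many leaves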

Figure 2 shows the regression tree tr with $ 41$ leaves. Judging from this figure, we would prefer the tree with $ 3$ leaves, since the data essentially fall into three groups.

Figure 2: Initial regression tree for 100 simulated data from function $ f(x_1,x_2)$ (left). The total number of leaves ($ 41$) is shown at the right.
\includegraphics[scale=0.6]{tutosim2.ps}


Let us choose the tree with $ 3$ leaves with the following commands.

  trfin=prunetot(tr,3)
  plotcarttree(trfin)
cart03.xpl

Figure 3: Final regression tree for 100 simulated data from function $ f(x_1,x_2)$ after pruning. The final tree consists of three leaves which separate the $ x_1,x_2$-plane into three parts.
\includegraphics[scale=0.6]{tutosim3.ps}

Figure 3 shows the final tree for the simulated data.
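
A corresponding pruning step in scikit-learn works through cost-complexity pruning rather than a prunetot-style call. The sketch below (continuing with X and y from the sketches above; identifiers are assumptions) simply takes the smallest ccp_alpha along the pruning path that leaves at most three leaves:

  from sklearn.tree import DecisionTreeRegressor

  # pruning path of a tree grown on X, y from the sketches above
  path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
  pruned = None
  for alpha in path.ccp_alphas:
      pruned = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0).fit(X, y)
      if pruned.get_n_leaves() <= 3:
          break
  print(pruned.get_n_leaves())     # typically 3 leaves, one per region of f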


5.2 Boston Housing Data

The Boston housing data set bostonh.dat was collected by Harrison and Rubinfeld (1978). The following variables are in the data:
1. crime rate
2. percent of land zoned for large lots
3. percent non-retail business
4. Charles River indicator, $ 1$ if on the Charles River, 0 otherwise
5. nitrogen oxide concentration
6. average number of rooms
7. percent built before $ 1940$
8. weighted distance to employment centers
9. accessibility to radial highways
10. tax rate
11. pupil-teacher ratio
12. percent black
13. percent lower status
14. median value of owner-occupied homes in thousands of dollars.
Variable 14 is the response variable and variables 1-13 are the predictor variables. The 4th and 9th variables are categorical, the others are continuous. There are 506 observations. Let us grow a tree such that the number of observations in a leaf node is less than or equal to $ 8$.

  library("xclust")
  randomize(10)
  boston=read("bostonh")
  boston=paf(boston,uniform(rows(boston))<0.20)
  yvar=boston[,14]
  xvar=boston[,1:13]
  type=matrix(13)
  type[4]=0
  type[9]=0
  opt=cartsplitopt("minsize",1,"mindev",0,"mincut",8)
  tr=cartsplit(xvar,yvar,type,opt)
  totleaves=leafnum(tr,1)
  totleaves
  plotcarttree(tr)
cart04.xpl
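
For orientation, a rough Python version of this preprocessing might look as follows. It is a sketch under stated assumptions: the file name and whitespace delimiter of bostonh.dat are guesses, and since scikit-learn trees do not handle unordered categorical predictors directly (as cartsplit does via the type vector), the two categorical columns are one-hot encoded; min_samples_leaf is only a stand-in for mincut:

  import numpy as np
  import pandas as pd
  from sklearn.tree import DecisionTreeRegressor

  rng = np.random.default_rng(10)
  # assumption: bostonh.dat is whitespace-delimited with 14 unnamed columns
  boston = pd.read_csv("bostonh.dat", sep=r"\s+", header=None)
  boston = boston[rng.uniform(size=len(boston)) < 0.20]   # ~20% random subsample

  y = boston.iloc[:, 13]                  # variable 14: median home value
  X = boston.iloc[:, 0:13]
  X = pd.get_dummies(X, columns=[3, 8])   # one-hot encode the 4th and 9th variables

  tree = DecisionTreeRegressor(min_samples_leaf=8, random_state=0).fit(X, y)
  print(tree.get_n_leaves())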

We can observe that the tree tr with $ 29$ leaves is large.

Figure 4: Initial regression tree for Boston housing data. The total number of leaves (29) is shown at the right.
\includegraphics[scale=0.6]{tutobos1.ps}


It is not so easy to read Figure 4. We can look at the optimal subtree consisting of $ 10$ leaves by using these commands:

  prtr=prunetot(tr,10)
  plotcarttree(prtr)
cart05.xpl

Figure 5 shows the pruned tree for the Boston housing data.

Figure 5: Subtree consisting of 10 leaves for a 20% sample of the Boston housing data
\includegraphics[scale=0.6]{tutobos2.ps}

Let us try to choose the optimal number of leaves with $ 10$-fold cross-validation.

  cval=cartcv(xvar,yvar,type,opt,10)
  res=cval.lnumber~cval.alfa~cval.cv~cval.cvstd
  res=sort(res,1)
  res=res[1:12,]
  title=" no   alfa    cv   cvstd"
  restxt=title|string("%3.0f %6.2f %6.2f %6.2f",
res[,1], res[,2], res[,3], res[,4])

  dd=createdisplay(2,2)
  show(dd, 1, 1, cval.lnumber~cval.alfa)
  setgopt(dd, 1, 1, "title","number of leaves vs alpha")
  show(dd, 1, 2, cval.lnumber~cval.cv)
  setgopt(dd, 1, 2, "title","number of leaves vs cv")
  show(dd, 2, 1, cval.lnumber~cval.cvstd)
  setgopt(dd, 2, 1, "title","number of leaves vs cvstd")
  show(dd, 2, 2, restxt)
cart06.xpl

We get the result shown in Figure 6.

Figure 6: Cross-validation for 20% sample of Boston housing data.
\includegraphics[scale=0.6]{tutobos3.ps}

The first column gives the number of leaves in the sequence of pruned subtrees and the second column gives the corresponding sequence $ \alpha_i$. The estimates of the expectation of the mean of squared residuals, $ ER(\hat{f}_{\alpha_i})$, are in the third column, and the fourth column gives the estimates of the standard error of the corresponding estimators. There is a clear minimum in the estimated expected mean of squared residuals, so it seems reasonable to choose the tree with $ 7$ leaves as the final estimate. Let us choose $ \alpha = 0.9$ and form the corresponding tree.

  fin=prune(tr,0.9)
  plotcarttree(fin)
cart07.xpl

The final estimate is shown in Figure 7.

Figure 7: Final tree for 20% sample of Boston housing data
\includegraphics[scale=0.6]{tutobos4.ps}
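
For comparison, the same idea, computing the sequence of pruning parameters, scoring each by 10-fold cross-validation and refitting with the chosen value, can be sketched with scikit-learn's cost-complexity pruning. This is only an analogue of cartcv and prune, not the XploRe code; X and y are assumed to be the predictor matrix and response constructed in the previous sketch:

  import numpy as np
  from sklearn.model_selection import cross_val_score
  from sklearn.tree import DecisionTreeRegressor

  # candidate pruning parameters (the analogue of the alpha_i sequence)
  path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
  cv_mse = [
      -cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                       X, y, cv=10, scoring="neg_mean_squared_error").mean()
      for a in path.ccp_alphas
  ]
  best = path.ccp_alphas[int(np.argmin(cv_mse))]   # alpha with smallest CV error
  final = DecisionTreeRegressor(ccp_alpha=best, random_state=0).fit(X, y)
  print(best, final.get_n_leaves())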

Let us look at the number of observations and the mean value in each node with the following commands.

  plotcarttree(fin,"nelem")
  plotcarttree(fin,"mean")
cart08.xpl

The results are displayed in Figure 8 and Figure 9, respectively.

Figure 8: Final tree for 20% sample of Boston housing data with numbers of observations
\includegraphics[scale=0.6]{tutobos5.ps}

Figure 9: Final tree for 20% sample of Boston housing data with mean values
\includegraphics[scale=0.6]{tutobos6.ps}
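
The per-node counts and means shown by plotcarttree are also easy to read off a fitted tree in other implementations; the following self-contained scikit-learn sketch (with made-up data, purely illustrative) prints the number of observations and the mean response stored in every node:

  import numpy as np
  from sklearn.tree import DecisionTreeRegressor

  rng = np.random.default_rng(0)
  X = rng.uniform(size=(100, 2))
  y = 10.0 * (X[:, 0] > 0.5) + rng.standard_normal(100)

  fit = DecisionTreeRegressor(ccp_alpha=1.0, random_state=0).fit(X, y)
  t = fit.tree_
  for node in range(t.node_count):
      kind = "leaf" if t.children_left[node] == -1 else "split"
      # n_node_samples: observations in the node; value: mean response in the node
      print(node, kind, int(t.n_node_samples[node]), float(t.value[node, 0, 0]))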


5.3 Density Estimation


regdat = dentoreg(dendat, binlkm)
transforms density data to regression data using a variance-stabilizing transform
Instead of writing separate procedures for the estimation of density functions, we will transform density data to regression data and use a regression tree to estimate density functions. The basic idea is to divide the sample space into bins, calculate the number of observations in every bin, and consider these frequencies as the dependent regression variable. The independent regression variables are the midpoints of the bins. To be more precise, after we have calculated the frequencies of the bins $ Z_i$, we will transform these to

$\displaystyle
Y_i = \sqrt{Z_i + 3/8}.
$

This was suggested by Anscombe (1948) and Donoho, Johnstone, Kerkyacharian, and Picard (1995, page 327). The procedure first makes a histogram-type estimator of the density. This estimator has a large number of equally sized bins and is therefore not a good density estimator by itself, but we then combine some of these bins in an optimal way using CART. The new regression data set contains as many observations as the number of bins raised to the power of the number of variables. Given current computing capability, 9 is probably the maximum number of variables for this method.

As an example we will analyze data consisting of $ 200$ measurements on Swiss bank notes, taken from Flury and Riedwyl (1988). One half of these bank notes are genuine, the other half are forged. The following variables are in the data:
1. length of the note (width)
2. height of the note (left)
3. height of the note (right)
4. distance of the inner frame to the lower border (bottom)
5. distance of the inner frame to the upper border (top)
6. length of the diagonal of the central picture (diagonal)
The macro dentoreg transforms density data to regression data. Let us choose $ 9$ bins for each coordinate axis and use only the last $ 3$ variables in the data.
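
The following Python sketch illustrates the idea behind dentoreg rather than its actual implementation (the data, bin handling and parameter names are assumptions): bin the observations, take the transformed counts $ \sqrt{Z_i + 3/8}$ as the response, and use the bin midpoints as regressors.

  import numpy as np
  from sklearn.tree import DecisionTreeRegressor

  rng = np.random.default_rng(1)
  dendat = rng.normal(size=(200, 3))     # stand-in for the three banknote variables
  binlkm = 9                             # 9 bins per coordinate axis

  counts, edges = np.histogramdd(dendat, bins=binlkm)
  mids = [0.5 * (e[:-1] + e[1:]) for e in edges]               # bin midpoints
  ind = np.stack(np.meshgrid(*mids, indexing="ij"), axis=-1).reshape(-1, 3)
  dep = np.sqrt(counts.reshape(-1) + 3.0 / 8.0)                # sqrt(Z + 3/8)

  # regression tree on the transformed bin counts (minsize=50 only loosely
  # corresponds to min_samples_split here)
  tr = DecisionTreeRegressor(min_samples_split=50, random_state=0).fit(ind, dep)
  print(tr.get_n_leaves())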

  ; load library xclust and plot
  library("xclust")
  library("plot")

  ; set random seed
  randomize(1)
  ; read swiss banknote data
  dendat=read("bank2")
  ; select the last three variables
  dendat=dendat[,4:6]
  ; choose 9 bins in each dimension
  binlkm=9

  ; transform density data to regression data
  regdat=dentoreg(dendat,binlkm)

  ; compute CART and tree
  type=matrix(cols(dendat))
  opt=cartsplitopt("minsize",50,"mindev",0,"mincut",1)
  tr=cartsplit(regdat.ind,regdat.dep,type,opt)
  ; color data points according to the node they fall in
  g=cartregr(tr, dendat, "node")
  {gcat,gind}=groupcol(g, rows(g))
  ; compute cuts up to level 2 for (X1,X2)
  xdat=regdat.ind
  gr12=grcart2(xdat, tr, 1, 2, 10, 0)
  xdat12=dendat[,1|2]
  setmaskp(xdat12, gind)
  ; compute cuts up to level 2 for (X1,X3)
  gr13=grcart2(xdat, tr, 1, 3, 10, 0)
  xdat13=dendat[,1|3]
  setmaskp(xdat13, gind)
  ; compute cuts up to level 2 for (X2,X3)
  gr23=grcart2(xdat, tr, 2, 3, 10, 0)
  xdat23=dendat[,2|3]
  setmaskp(xdat23, gind)

  ; compute tree and its labels
  {tree, treelabel}=grcarttree(tr)
  ; show all projections and the tree in a display
  setsize(640, 480)
  d=createdisplay(2,2)
  show(d, 1,1, xdat12, gr12)
  setgopt(d,1,1, "xlabel", "top (X1)", "ylabel", "bottom (X2)")

  show(d,2,1, xdat13, gr13)
  setgopt(d,2,1, "xlabel", "top (X1)")
  setgopt(d,2,1, "ylabel", "diagonal (X3)")
  show(d, 2,2, xdat23, gr23)
  setgopt(d,2,2, "xlabel", "bottom (X2)")
  setgopt(d,2,2, "ylabel", "diagonal (X3)")
  axesoff()
  show(d, 1,2, tree, treelabel)
  axeson()
cart09.xpl

Figure 10: The upper left plot gives the cuts in the bottom-top plane, the lower left plot the cuts in the bottom-diagonal plane and the lower right plot the cuts in the top-diagonal plane. The CART tree is shown in the upper right window.
\includegraphics[scale=0.6]{bankdens.ps}

The result is shown in Figure 10. The upper left plot gives the cuts in the bottom-top plane, the lower left plot the cuts in the bottom-diagonal plane and the lower right plot the cuts in the top-diagonal plane. The CART tree is shown in the upper right window. All splits are done in the bottom-diagonal plane. The lower right plot shows that the CART algorithm just cuts away from the main bulk of the data. Note the different colors in the left plots, which show that there are some cuts which are not visible in the top-bottom or top-diagonal projection. Since we have chosen to stop splitting when a node contains fewer than $ 50$ observations (see the parameters of cartsplitopt in cart09.xpl), we may also choose a smaller number. In cart10.xpl we have chosen a smaller number ($ 20$), do not color the data points and omit the tree labels. The main result is again that the CART algorithm cuts away the tails of the distribution and generates at least $ 4$ different groups of nodes.

  ; load library xclust and plot
  library("xclust")
  library("plot")
  ; set random seed
  randomize(1)
  ; read swiss banknote data
  dendat=read("bank2")
  ; select the last three variables
  dendat=dendat[,4:6]
  ; choose 9 bins in each dimension
  binlkm=9
  ; transform density data to regression data
  regdat=dentoreg(dendat,binlkm)
  ; compute CART and tree
  type=matrix(cols(dendat))
  opt=cartsplitopt("minsize",20,"mindev",0,"mincut",1)
  tr=cartsplit(regdat.ind,regdat.dep,type,opt)
  ; compute cuts up to level 2 for (X1,X2)
  xdat=regdat.ind
  gr12=grcart2(xdat, tr, 1, 2, 10, 0)
  xdat12=dendat[,1|2]
  ; compute cuts up to level 2 for (X1,X3)
  gr13=grcart2(xdat, tr, 1, 3, 10, 0)
  xdat13=dendat[,1|3]
  ; compute cuts up to level 2 for (X2,X3)
  gr23=grcart2(xdat, tr, 2, 3, 10, 0)
  xdat23=dendat[,2|3]
  ; compute tree and its labels
  {tree, treelabel}=grcarttree(tr)
  ; show all projections and the tree in a display
  setsize(640, 480)
  d=createdisplay(2,2)
  show(d, 1,1, xdat12, gr12)
  setgopt(d,1,1, "xlabel", "top (X1)", "ylabel", "bottom (X2)")
  show(d, 2,1, xdat13, gr13)
  setgopt(d,2,1, "xlabel", "top (X1)", "ylabel", "diagonal (X3)")
  show(d, 2,2, xdat23, gr23)
  setgopt(d,2,2, "xlabel", "bottom (X2)", "ylabel", "diagonal (X3)")
  show(d, 1,2, tree)
  setgopt(d,1,2, "xlabel", " ", "ylabel", "log10(1+SSR)")
cart10.xpl

Figure 11: The upper left plot gives the cuts in the bottom-top plane, the lower left plot the cuts in the bottom-diagonal plane and the lower right plot the cuts in the top-diagonal plane. The CART tree is shown in the upper right window.
\includegraphics[scale=0.6]{bankdens2.ps}


