5. Examples
5.1 Simulated Example
Let us generate the observations $Y_i = m(X_{i1}, X_{i2}) + \varepsilon_i$, $i = 1, \ldots, n$, where the $(X_{i1}, X_{i2})$ are independently uniformly distributed on $[0,1]^2$ and the $\varepsilon_i$ are independent standard normal variables. Figure 1 shows data simulated from the piecewise constant function
$$
m(x_1, x_2) =
\begin{cases}
0, & x_2 \le 0.5, \\
100, & x_2 > 0.5,\ x_1 \le 0.5, \\
120, & x_2 > 0.5,\ x_1 > 0.5.
\end{cases}
$$
The quantlet for generating the observations is
proc(y)=tuto(seed,n)
randomize(seed)
xdat=uniform(n,2)
index=(xdat[,2]<=0.5)+(xdat[,2]>0.5).*(xdat[,1]<=0.5)*2
layout=3*(index==1)+4.*(index==0)+5.*(index==2)
ydat=100.*(index==2)+120.*(index==0)
y=list(xdat,ydat,layout)
endp
library("xclust")
d=createdisplay(1,1)
data=tuto(1,100)
x=data.xdat
setmaskp(x, data.layout, data.layout, 8)
show(d,1,1,x)
Let us grow a tree such that the number of observations in a leaf
node is less than or equal to 5 (mincut), the deviance in a
leaf node is greater than or equal to 0 (mindev), and a cut is
only carried out if the size of the resulting nodes is at least
1 (minsize). The type of both predictor variables is continuous.
library("xclust")
data=tuto(1,100)
type=#(1,1)
opt=cartsplitopt("minsize",1,"mindev",0,"mincut",5)
tr=cartsplit(data.xdat,data.ydat,type,opt)
totleaves=leafnum(tr,1)
totleaves
plotcarttree(tr)
Figure 2 shows the regression tree tr with 41 leaves.
From this figure, we prefer to choose the tree with 3 leaves,
because the data essentially form three groups.
Figure 2:
Initial regression tree for the 100 simulated data points from the function $m$ (left).
The total number of leaves (41) is shown at the right.
Let us choose the tree with 3 leaves with the following commands.
trfin=prunetot(tr,3)
plotcarttree(trfin)
Figure 3:
Final regression tree for the 100 simulated data points from the function $m$ after pruning.
The final tree consists of three leaves, which separate the $(x_1, x_2)$-plane into three parts.
Figure 3 shows the final tree for the simulated data.
5.2 Boston Housing Data
The Boston housing data set bostonh.dat
was collected by
Harrison and Rubinfeld (1978).
The following variables are in the data:
- 1. crime rate
- 2. percent of land zoned for large lots
- 3. percent non-retail business
- 4. Charles River indicator: 1 if on the Charles River, 0 otherwise
- 5. nitrogen oxide concentration
- 6. average number of rooms
- 7. percent built before 1940
- 8. weighted distance to employment centers
- 9. accessibility to radial highways
- 10. tax rate
- 11. pupil-teacher ratio
- 12. percent black
- 13. percent lower status
- 14. median value of owner-occupied homes in thousands of dollars
Variable 14 is the response variable; variables 1-13 are the predictor variables.
The 4th and 9th variables are categorical, the others are continuous.
There are 506 observations.
Let us generate a tree such that the number of observations in the leaf
nodes is less than or equal to 8.
library("xclust")
randomize(10)
boston=read("bostonh")
boston=paf(boston,uniform(rows(boston))<0.20)
yvar=boston[,14]
xvar=boston[,1:13]
type=matrix(13)
type[4]=0
type[9]=0
opt=cartsplitopt("minsize",1,"mindev",0,"mincut",8)
tr=cartsplit(xvar,yvar,type,opt)
totleaves=leafnum(tr,1)
totleaves
plotcarttree(tr)
We can observe that the tree tr with 29 leaves is large.
Figure 4:
Initial regression tree for a 20% sample of the Boston housing data.
The total number of leaves (29) is shown at the right.
It is not so easy to read Figure 4. We can
look at the optimal subtree consisting of 10 leaves by using these commands:
prtr=prunetot(tr,10)
plotcarttree(prtr)
Figure 5 shows the pruned tree for the Boston housing data.
Figure 5:
Subtree consisting of 10 leaves for a 20% sample of the Boston housing data
Let us try to choose the optimal number of leaves with 10-fold cross-validation.
cval=cartcv(xvar,yvar,type,opt,10)
res=cval.lnumber~cval.alfa~cval.cv~cval.cvstd
res=sort(res,1)
res=res[1:12,]
title=" no alfa cv cvstd"
restxt=title|string("%3.0f %6.2f %6.2f %6.2f",
res[,1], res[,2], res[,3], res[,4])
dd=createdisplay(2,2)
show(dd, 1, 1, cval.lnumber~cval.alfa)
setgopt(dd, 1, 1, "title","number obs. vs alpha")
show(dd, 1, 2, cval.lnumber~cval.cv)
setgopt(dd, 1, 2, "title","number obs. vs cv")
show(dd, 2, 1, cval.lnumber~cval.cvstd)
setgopt(dd, 2, 1, "title","number obs. vs cvstd")
show(dd, 2, 2, restxt)
We get the result shown in Figure 6.
Figure 6:
Cross-validation for a 20% sample of the Boston housing data.
The first column gives the number of leaves in the sequence of pruned
subtrees, and the second column gives the corresponding sequence of penalty
parameters $\alpha_k$. The cross-validation estimates of the expected mean
of squared residuals (cv) are in the third column of the above matrix.
The fourth column gives the estimated standard errors (cvstd) of the
corresponding estimators.
We can see that there is a clear minimum among the cross-validation
estimates of the expected mean of squared residuals.
Therefore, it seems reasonable to choose as final estimate the tree for
which this minimum is attained.
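As background, the $\alpha_k$ are the penalty parameters of cost-complexity pruning: in the usual CART formulation, each subtree $T$ in the pruning sequence minimizes a criterion of the form
$$
R_\alpha(T) = R(T) + \alpha\,|\tilde{T}|,
$$
where $R(T)$ is the residual sum of squares of $T$ and $|\tilde{T}|$ is its number of leaves. (This is the standard formulation; the exact normalization of $R(T)$ used by cartcv is an assumption here, not taken from its documentation.) Larger values of $\alpha$ penalize tree size more heavily, so the sequence runs from the full tree down to the root node.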
Let us choose $\alpha = 0.9$ and form the corresponding tree.
fin=prune(tr,0.9)
plotcarttree(fin)
The final estimate is shown in Figure 7.
Figure 7:
Final tree for a 20% sample of the Boston housing data
Let us look at the number of observations and the mean value in each node with the commands
plotcarttree(fin,"nelem")
plotcarttree(fin,"mean")
The results are displayed in Figure 8 and Figure 9, respectively.
Figure 8:
Final tree for a 20% sample of the Boston housing data with the numbers of observations
Figure 9:
Final tree for a 20% sample of the Boston housing data with the mean values
5.3 Density Estimation
- regdat = dentoreg(dendat, binlkm)
- transforms density data to regression data using a variance-stabilizing transform
Instead of writing separate procedures for the estimation
of density functions, we transform the density data to
regression data and use a regression tree to estimate the density
function.
The basic idea is to divide the sample space into bins,
calculate the number of observations in every bin, and
consider these frequencies as the dependent regression variable.
The independent regression variables are the midpoints of the bins.
To be more precise, after we have calculated the frequencies $c_i$ of
the bins, we transform them to $y_i = 2\sqrt{c_i + 3/8}$.
This was suggested by
Anscombe (1948) and
Donoho, Johnstone, Kerkyacharian, and Picard (1995, page 327).
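As an illustration of this density-to-regression idea, the following minimal Python sketch (not XploRe code) bins the data, takes the bin midpoints as independent variables and applies the variance-stabilizing transform to the bin counts. The function name dentoreg_sketch, the equal-width binning over the observed data range and the exact constants are illustrative assumptions, not the actual implementation of dentoreg.
import numpy as np

def dentoreg_sketch(dendat, binlkm):
    # illustrative density-to-regression transform; NOT XploRe's dentoreg
    # dendat : (n, d) array of observations, binlkm : bins per coordinate axis
    dendat = np.asarray(dendat, dtype=float)
    d = dendat.shape[1]
    # equal-width bins over the observed range of each coordinate (assumption)
    lo, hi = dendat.min(axis=0), dendat.max(axis=0)
    edges = [np.linspace(lo[j], hi[j], binlkm + 1) for j in range(d)]
    # frequencies c_i of the binlkm**d bins
    counts, _ = np.histogramdd(dendat, bins=edges)
    # bin midpoints form the independent regression variables
    mids = [0.5 * (e[:-1] + e[1:]) for e in edges]
    grids = np.meshgrid(*mids, indexing="ij")
    ind = np.column_stack([g.ravel() for g in grids])
    # Anscombe-type variance-stabilizing transform of the counts
    dep = 2.0 * np.sqrt(counts.ravel() + 3.0 / 8.0)
    return ind, dep

# usage (hypothetical): ind, dep = dentoreg_sketch(dendat, binlkm=9)
# ind and dep then play the roles of regdat.ind and regdat.dep used below.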
We first use dentoreg to make a histogram estimator
for the density. This estimator has a large number of
equal-size bins and is therefore not a good density estimator by itself,
but we then combine some of these bins in an optimal
way using CART.
The new regression data set consists of as many observations as the
number of bins raised to the power of the number of variables;
for example, 9 bins in each of 3 dimensions give $9^3 = 729$ regression observations.
Given the computing capability available at the moment, 9 is probably the maximum number of
variables for this method.
As an example we analyze data which consist of 6 measurements
on Swiss bank notes. These data are taken from Flury and Riedwyl (1988).
One half of the bank notes are genuine, the other half are forged.
The following variables are in the data:
- 1. length of the note (width)
- 2. height of the note (left)
- 3. height of the note (right)
- 4. distance of the inner frame to the lower border (bottom)
- 5. distance of the inner frame to the upper border (top)
- 6. length of the diagonal of the central picture (diagonal)
The macro
dentoreg
transforms density data to regression data.
Let us choose 9 bins for every coordinate axis and estimate the
density for the last 3 variables in the data.
; load library xclust and plot
library("xclust")
library("plot")
; set random seed
randomize(1)
; read swiss banknote data
dendat=read("bank2")
; select the last three variables
dendat=dendat[,4:6]
; choose 9 bins in each dimension
binlkm=9
; compute density estimate
regdat=dentoreg(dendat,binlkm)
; compute CART and tree
type=matrix(cols(dendat))
opt=cartsplitopt("minsize",50,"mindev",0,"mincut",1)
tr=cartsplit(regdat.ind,regdat.dep,type,opt)
; color the data points according to the node they fall into
g=cartregr(tr, dendat, "node")
{gcat,gind}=groupcol(g, rows(g))
; compute cuts up to level 2 for (X1,X2)
xdat=regdat.ind
gr12=grcart2(xdat, tr, 1, 2, 10, 0)
xdat12=dendat[,1|2]
setmaskp(xdat12, gind)
; compute cuts up to level 2 for (X1,X3)
gr13=grcart2(xdat, tr, 1, 3, 10, 0)
xdat13=dendat[,1|3]
setmaskp(xdat13, gind)
; compute cuts up to level 2 for (X2,X3)
gr23=grcart2(xdat, tr, 2, 3, 10, 0)
xdat23=dendat[,2|3]
setmaskp(xdat23, gind)
; compute tree and its labels
{tree, treelabel}=grcarttree(tr)
; show all projections and the tree in a display
setsize(640, 480)
d=createdisplay(2,2)
show(d, 1,1, xdat12, gr12)
setgopt(d,1,1, "xlabel", "top (X1)", "ylabel", "bottom (X2)")
show(d,2,1, xdat13, gr13)
setgopt(d,2,1, "xlabel", "top (X1)")
setgopt(d,2,1, "ylabel", "diagonal (X3)")
show(d, 2,2, xdat23, gr23)
setgopt(d,2,2, "xlabel", "bottom (X2)")
setgopt(d,2,2, "ylabel", "diagonal (X3)")
axesoff()
show(d, 1,2, tree, treelabel)
axeson()
Figure 10:
The upper left plot gives
the cuts in the bottom-top plane, the lower left plot the cuts
in the bottom-diagonal plane and the lower right plot the cuts in
the top-diagonal plane. The CART tree is shown in the upper right
window.
The result is shown in Figure 10. The upper left plot gives
the cuts in the bottom-top plane, the lower left plot the cuts
in the bottom-diagonal plane and the lower right plot the cuts in
the top-diagonal plane. The CART tree is shown in the upper right
window.
All splits are done in the bottom-diagonal plane. The lower right
plot shows that the CART algorithm essentially only cuts away pieces
from the main bulk of the data. Note the different colors in the left
plots, which show that some cuts are not visible in the top-bottom or
top-diagonal projection.
Since we have chosen to stop splitting when the number of
observations in a node is less than 50 (see the parameters of
cartsplitopt
in
cart09.xpl
), we may also choose a smaller number.
In
cart10.xpl
we have chosen a smaller value (minsize 20), do not
color the data points, and omit the tree labels. The main result
is again that the CART algorithm cuts away the tails of the
distribution and produces several distinct groups of nodes.
; load library xclust and plot
library("xclust")
library("plot")
; set random seed
randomize(1)
; read swiss banknote data
dendat=read("bank2")
; select the last three variables
dendat=dendat[,4:6]
; choose 9 bins in each dimension
binlkm=9
; compute density estimate
regdat=dentoreg(dendat,binlkm)
; compute CART and tree
type=matrix(cols(dendat))
opt=cartsplitopt("minsize",20,"mindev",0,"mincut",1)
tr=cartsplit(regdat.ind,regdat.dep,type,opt)
; compute cuts up to level 2 for (X1,X2)
xdat=regdat.ind
gr12=grcart2(xdat, tr, 1, 2, 10, 0)
xdat12=dendat[,1|2]
; compute cuts up to level 2 for (X1,X3)
gr13=grcart2(xdat, tr, 1, 3, 10, 0)
xdat13=dendat[,1|3]
; compute cuts up to level 2 for (X2,X3)
gr23=grcart2(xdat, tr, 2, 3, 10, 0)
xdat23=dendat[,2|3]
; compute tree and its labels
{tree, treelabel}=grcarttree(tr)
; show all projections and the tree in a display
setsize(640, 480)
d=createdisplay(2,2)
show(d, 1,1, xdat12, gr12)
setgopt(d,1,1, "xlabel", "top (X1)", "ylabel", "bottom (X2)")
show(d, 2,1, xdat13, gr13)
setgopt(d,2,1, "xlabel", "top (X1)", "ylabel", "diagonal (X3)")
show(d, 2,2, xdat23, gr23)
setgopt(d,2,2, "xlabel", "bottom (X2)", "ylabel", "diagonal (X3)")
show(d, 1,2, tree)
setgopt(d,1,2, "xlabel", " ", "ylabel", "log10(1+SSR)")
Figure 11:
The same display as in Figure 10 for the smaller minimal node size (minsize 20):
the upper left plot gives the cuts in the bottom-top plane, the lower left plot
the cuts in the bottom-diagonal plane and the lower right plot the cuts in
the top-diagonal plane. The CART tree is shown in the upper right window.