1. Growing the Tree


cs = cartsplit(x, y, type{, opt})
grows the tree
opt = cartsplitopt(s1{, s2, s3,$ \dots$})
sets the parameters for growing the tree
Growing the tree proceeds sequentially. In the first step, the regression estimator is taken to be a constant over the whole sample space, namely the mean value of the response variable. Thus, when the observed values of the response variable are $ Y_1, \ldots, Y_n$, the regression estimator is given by

$\displaystyle

\hat{f}(x) = \left( \frac{1}{n} \sum_{i=1}^n Y_i \right) I_R(x)

$

where $ R$ is the sample space and $ I_R$ is the indicator function of $ R$. We assume that the sample space $ R$, that is, the space of the values of the regression variables, is a rectangle. In the second step, the sample space is divided into two parts. Some regression variable $ X_j$ is chosen. If $ X_j$ is a continuous random variable, then some real number $ a$ is chosen, and we define

$\displaystyle

R_1 = \{ x \in R: x_j \leq a \}, \, \, \,

R_2 = \{ x \in R: x_j > a \} .

$

If $ X_j$ is a categorical random variable with values $ A_1,\ldots ,A_q$, then some subset $ I \subset \{ A_1,\ldots ,A_q \}$ is chosen, and we define

$\displaystyle

R_1 = \{ x \in R: x_j \in I \}, \, \, \,

R_2 = \{ x \in R: x_j \in \{ A_1,\ldots ,A_q \} \backslash I \} .

$
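The two kinds of split defined above can be illustrated with a minimal Python sketch. This is not the `cartsplit` implementation; the helper names `split_continuous` and `split_categorical` and the sample data are hypothetical, and observations are assumed to be stored as tuples of regressor values.

```python
def split_continuous(rows, j, a):
    """Split rows into (R_1, R_2): x_j <= a goes left, x_j > a goes right."""
    left = [x for x in rows if x[j] <= a]
    right = [x for x in rows if x[j] > a]
    return left, right


def split_categorical(rows, j, subset):
    """Split rows into (R_1, R_2): x_j in subset goes left, all others right."""
    left = [x for x in rows if x[j] in subset]
    right = [x for x in rows if x[j] not in subset]
    return left, right


# Hypothetical observations: one continuous and one categorical regressor.
rows = [(1.0, "a"), (2.5, "b"), (4.0, "c"), (0.5, "b")]

cont_left, cont_right = split_continuous(rows, 0, 2.0)
cat_left, cat_right = split_categorical(rows, 1, {"a", "c"})
print(cont_left, cont_right)
print(cat_left, cat_right)
```

In both cases every observation lands in exactly one of the two parts, so $ R_1$ and $ R_2$ form a partition of $ R$.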

The regression estimator in the second step is

$\displaystyle

\hat{f}(x) = \left( \frac{1}{\vert I_1\vert} \sum_{I_1} Y_i \right) I_{R_1}(x)

+ \left( \frac{1}{\vert I_2\vert} \sum_{I_2} Y_i \right) I_{R_2}(x)

$

where $ I_1 = \{ i: X_i \in R_1 \}$, $ I_2 = \{ i: X_i \in R_2 \}$, and $ \vert I_k\vert$ is the number of elements in $ I_k$, $ k = 1, 2$. The split of $ R$ into $ R_1$ and $ R_2$ is chosen in such a way that the sum of squared residuals of the estimator $ \hat{f}$ is minimized. The sum of squared residuals is defined as

$\displaystyle

\sum_{i=1}^n \left( Y_i - \hat{f}(X_i) \right)^2 .

$
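As a rough illustration of how such a split can be found for a single continuous regressor, the sketch below tries every observed value as the threshold $ a$ and keeps the one with the smallest sum of squared residuals. It is an assumption-laden example, not the search used by `cartsplit`: the function names `ssr` and `best_split` and the data are hypothetical, and only observed values are tried as thresholds, since any $ a$ between two consecutive observed values yields the same partition.

```python
def ssr(values):
    """Sum of squared residuals around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys):
    """Return (a, total_ssr) minimizing SSR over thresholds a, or None."""
    best = None
    # The largest observed value is skipped: splitting there leaves R_2 empty.
    for a in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= a]
        right = [y for x, y in zip(xs, ys) if x > a]
        total = ssr(left) + ssr(right)
        if best is None or total < best[1]:
            best = (a, total)
    return best


# Hypothetical data with an obvious jump between x = 3 and x = 10.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]

a, total = best_split(xs, ys)
print(a, total)
```

With several regressors, the same search would be repeated for each variable and the overall minimizer kept. `best_split` returns `None` when all values of the regressor coincide, because no split is possible then.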

Next, $ R_1$ and $ R_2$ are each split separately in the same way. Splitting is continued in this way until the number of observations in every rectangle is small or the sum of squared residuals is small. The end result is a binary tree: the rectangle $ R$ corresponds to the root node, the rectangle $ R_1$ is its left child node, and the rectangle $ R_2$ is its right child node.
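The recursive procedure can be sketched in Python as follows. This is a minimal illustration for continuous regressors only, not the XploRe implementation: the names `grow` and `predict`, the nested-dictionary tree layout, and the stopping thresholds `min_obs` and `min_ssr` are all assumptions made for this example and do not correspond to the actual parameters of `cartsplitopt`.

```python
def ssr(ys):
    """Sum of squared residuals around the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((v - m) ** 2 for v in ys)


def grow(X, y, min_obs=2, min_ssr=1e-8):
    """Grow a regression tree; each node is a dict holding the node mean."""
    node = {"mean": sum(y) / len(y), "n": len(y)}
    # Stop when the rectangle holds few observations or fits well already.
    if len(y) <= min_obs or ssr(y) <= min_ssr:
        return node
    best = None
    for j in range(len(X[0])):
        for a in sorted({row[j] for row in X})[:-1]:
            left = [i for i, row in enumerate(X) if row[j] <= a]
            right = [i for i, row in enumerate(X) if row[j] > a]
            total = ssr([y[i] for i in left]) + ssr([y[i] for i in right])
            if best is None or total < best[0]:
                best = (total, j, a, left, right)
    if best is None:
        return node  # every regressor is constant here, so no split exists
    _, j, a, left, right = best
    node["var"], node["threshold"] = j, a
    node["left"] = grow([X[i] for i in left], [y[i] for i in left],
                        min_obs, min_ssr)
    node["right"] = grow([X[i] for i in right], [y[i] for i in right],
                         min_obs, min_ssr)
    return node


def predict(node, x):
    """Walk from the root to a leaf and return that leaf's mean."""
    while "var" in node:
        go_left = x[node["var"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["mean"]


# Hypothetical data: the response depends only on the first regressor.
X = [(1.0, 0.0), (2.0, 1.0), (3.0, 0.0), (10.0, 1.0), (11.0, 0.0), (12.0, 1.0)]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]

tree = grow(X, y)
print(tree["var"], tree["threshold"], predict(tree, (2.5, 0.0)))
```

Each internal node stores the chosen variable and threshold, and each leaf stores the mean response of its rectangle, so `predict` evaluates the piecewise constant estimator $ \hat{f}$ at a new point.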

MD*TECH Method and Data Technologies
  http://www.mdtech.de  mdtech@mdtech.de