Decision Trees

Introduction to Tree Methods
Tree methods are a supervised learning method. This means that there is a pre-determined output variable (classification or regression) that we will attempt to predict.

The tree attempts to recursively partition the predictor space to model the relationship between the predictors and the response variable y. There are several advantages to trees over some other methods.

  • Trees are non-parametric, requiring few statistical assumptions
  • Trees can be applied to various data types including ordered, categorical, and continuous variables (or any combination thereof)
  • Trees perform stepwise variable selection, complexity reduction, and interaction handling in an automatic manner
  • Trees are invariant under all monotone transformations of the individual ordered predictors
  • Trees are efficient in handling missing values, and there are methods we can use to determine variable importance for the output
  • THe output is easily understood and interpreted. An individual observation follows a path through the tree that we can interpret well. This is preferable compared to “black box” methods such as Neural Networks or Support Vector Machines
  • There is a tradeoff. Trees are robust, but they are also unstable and depend heavily on the data supplied

This tutorial will cover trees as covered by “CART”, by Breiman, Friedman, Olshen and Stone (1984). This methodology addressed some of the problems of prior tree methods, and will form the foundation of some of the ensemble methods we will explore later.


  • Trees are built from root nodes (top) to leaf or terminal nodes (bottom)
  • A record (observation) first enters the root node. A split (test) is applied to determine which child node it should go to next. This process is repeated until the record arrives at a terminal node
  • The path from the root to the leaf node provides an expression of a rule

CART Methodology
CART, standing for Classification And Regression Trees, prescribes a methdology to build and select a specific tree.

  1. Grow a large initial tree, T_0
  2. Iteratively truncat branches of T_0 to obtain a sequence of optimally pruned (nested) subtrees
  3. Select the best tree size based on validation provided by either a separate test sample (not used to build the tree), or by cross-validation (CV. Will go over later)

Grow a Large Initial Tree
To grow the initial tree, you need 4 elements.

  • A set of binary questions to induce a split
  • AA goodness of split criterion to evaluate a split
  • A stop-splitting rule
  • A rule for assigning every terminal node to a class (0 or 1)

Binary Questions
Binary questions are questions whose answers can be represented as a 0 or 1.

  • For Continuous Variables
    x>5? If yes, output 1, if no, output 0
  • For ordinal variables, we can generally regard them as continuous and treat them in the same way
  • For categorical variables, the binary question can be whether the observation belongs to a specific subset of possible values. If the possible values for x are \{a,b,c,d,e\},
    x \in a,b,d? If yes, output 1, of no, output 0

Goodness of Split Criterion
We will use a node impurity based splitting criterion to evaluate the splits. In general, the impurity i(t) can be defined as a nonnegative function of p(0|t) and p(1|t). These denote the proportions of node t beloing to classes 0,1 respectively.

    \begin{equation*} i(t) = \phi(p(y=1|t)) \end{equation*}

The impurity function is maximized when both classes are equally mixed, and is the smallest when the node contains only one class. Thus it has the following properties.

  1. \phi(p)\geq 0
  2. \phi(p) Attains minimum 0 when p=0 or p=1
  3. \phi(p) Attains its maximum when p=1-p=\frac{1}{2}
  4. \phi(p) = \phi(1-p). IE, \phi(p) is symmetric about p=\frac{1}{2}.
  5. Common choices for \phi include the Minimum Error, the Entropy function, and the Gini Index.
    Minimum or Baye’s Error

        \begin{equation*} \phi(p) = \text{min}(p,1-p) \end{equation*}

    This measure corresponds to misclassification rate when majority vote is used. This error is rarely used in practice due to its inability to sufficiently reward purer nodes.

    Entropy Function

        \begin{equation*} \phi(p) = -p\log(p) - (1-p)\log(1-p) \end{equation*}

    The entropy function is equivalent to using the likelihood ratio \chi^2 statistic for association between the branches and the target categories.

    Gini Index

        \begin{equation*} \phi(p) = p(1-p) \end{equation*}

    This is the method proposed in the original CART Method. It has been observed since that the rule has an undesireable endcut preference problem. It gives preference to the splits thatresult in two child nodes of extremely unbalanced size. To resolve this problem, a method callled the \Delta Splitting Method has been adopted in which the method will not make a cut on the last \Delta percent of the data.

    Goodness of Split Measure
    Generally speaking, the goodness of split measure is the measure of the change of node impurity. Let s e a candidate spkit and suppose s divides node t into t_L and t_R such that the proportions of cases that go into t_L and t_R are p_L and p_R repsecitvley. We can define the reduction in node impurity as

        \begin{equation*} \Delta i(s,t) = i(t)-[p_Li(t_L) + p_Ri(t_R)] \end{equation*}

    Which provides the Goodness of Split measure for s. The best split s* for node t is that which maximizes the impurity reduction.

        \begin{equation*} \Delta i(s*,t) = \max_{s\in S}i(s,t) \end{equation*}

    Then t will be split into t_L and t_R according to split s* and the search procedure will be repeated on t_L and t_R separately. A node becomes a terminal node when pre-specified terminal node conditions are satisfied.

    Tree Pruning
    Going back to the drawbacks of tree methods, trees are robust yet unstable. They heavily depend on the data that was given to them to build the tree, so they adapt to both the pattern of the data (what we want to see) and the noise of the data (what we’re trying to avoid). To obtain a tree model that can generalize well to new data, we trim back the tree iteratively to obtain a nested sequence of optimally pruned subtrees.

    Note that there exists a balance between bias and variance within tree moodels. A well fitted tree has a low bias (adapts to signal) and low variance (does not adapt to noise). For a given dataset, the more complex a tree is, the more it adapts to noise and therefore increases variance. An under-fitted tree has high bias and low variance. An overfitted tree has low bias and high variance.

    Cost Complexity Measure
    In CART, the complexity of a tree is determined by the total number of terminal nodes it has. We will again need to define some terminology:

    • Descendent and Ancestor nodes refer to the relationships between nodes. A node t is called a descendant of a higher node (ancestor node) h if there is a path leading down the tree from h to t
    • Let \tt denote the set of all terminal nodes of T, and T-\tt the set of all internal nodes of T. Furthermore, |.| denotes cardinality, IE the number of elements in the set. |\tt| represents the number of terminal nodes in tree T
    • A tree T_1 is a Subtree of T if T_1 has the same root node and T and for every node in T1, there exists the same node in T. We denote a subtree using the \succeq notation. T_1 \succeq T
    • A tree T_h is called a branch of T if T_h with root node h\in T and all descendants of h\in T are descendants of h\in T_h
    • We Prune a branch T_h off T by cutting off all descendants of the root node h in T. The pruned tree is denoted as T-T_h.
    • The Misclassification Cost of a terminal node t is given as R(t)


    Tree Size Selection
    As the third step in the CART algorithm, we are tasked with choosing the most appropriate tree size to give the finalmodel. A natural choice is to choose the subtree that optimizes an estimate of the performance measure. However, because of the adaptive nature of trees, naively using the same data used to train the tree for testing the tree will result in an overly optimistic estimate for the performance of the model. We have a couple ways to go around this.

    Test Sample Method

    1. Split data randomly into 2 sets: L_1, the training set, and L_2, the test set
      • The ratio at which we split the data is subjective. If we have a huge amount of data, then we can afford to give a large amount (50%) to the sample set. If we have a small amount of data, we may give as little as 25% to the sample set
    2. Using the training sample L_1 only, grow a large initial tree and prune it back to obtain the nested sequence of subtrees, T_0 \succ \ldots \succ T_M
    3. Send the test sample L_2 down each subtree and compute the misclassification cost R^{ts}(T_m) based on the test sample for each subtree T_m, m=0,1,\ldots,M
      • The subtree having the smallest misclassification cost is selected the best subtree T*

            \begin{equation*} 	R^{ts}(T*) = \min_m R^{ts}(T_m) 	\end{equation*}

    4. Once the best subtree is determined, R^{ts}(T*) is used as an estimate of the misclassification cost

    There are advantages and disadvantages to the test sample method. It is very straightforward and simple. However, it does not use all the data. The sequence of subtrees, from which the best tree model is selected, is based on only part of the data.

    Cross-validation arises to correct some of the shortcomings of the test sample method. It allows us to generate reliable out of sample error estimates without wasting data. The method explained here covers V-fold cross-validation, where we make V subsets of the data, train and calculate the out of sample error for each subset, and pick the complexity of the tree that minimizes it.

    V-fold Cross-Validation
    Whole sample L is randomly divided into V subsets. L_v, v=1,\ldots,V. The sample sizes should be approximately equal. The vth learning sample is denoted as

        \begin{equation*} L^{(v)} = L - L_v\hspace{1cm}v = 1,\ldots,V \end{equation*}

    The vth subset L_v is used as the test sample corresponding to L^{(v)}. CART suggests that we set V=10, training on 90% of the data and performing the out of sample test on 10% of the data at a time.

    1. For a fixed v, grow a large initial tree using the training set L^{(v)}. Use the pruning procedure to obtain a nested sequence of pruned subtrees, T_0^{(v)} \succ \ldots \succ T_M^{(v)}.
    2. Grow and prune a tree based on all the data to obtain the nested sequence of best-pruned subtress T_0 \succ T_1 \succ \ldots \succ T_M and the corresponding sequence of complexity parameters 0 = \alpha_0 < \ldots < \alpha_M < \alpha_M+1 = \infty
    3. Find the Complexity Parameter (CP value) associated with the minimum cross-validated misclassification cost
    4. Choose the tree associated with that CP value as the optimal tree

    In the interest of making tree models useful to new data, we may adopt a rule with regard to choosing the optimal complexity parameter. the 1-SE Rule advises us to find the complexity parameter with an associated cross-validated error less than the minimum cross-validated standard error + 1 standard deviation of that error. use of this rule is left up to the researcher. You are best off looking at a plot of tree complexity versus cross-validated error, and making your own choice.

    Additional Information
    The material contained in this section was derived from Dr. Fan’s lecture notes: ctree.pdf. These notes develop the solid mechanical underpinnings of the CART algorithm.
    See also this Youtube Presentation by mathematicalmonk.