\documentclass[nojss]{jss} %\VignetteIndexEntry{partykit: A Toolkit for Recursive Partytioning} %\VignetteDepends{partykit} %\VignetteKeywords{recursive partitioning, regression trees, classification trees, decision trees} %\VignettePackage{partykit} %% packages \usepackage{amstext} \usepackage{amsfonts} \usepackage{amsmath} \usepackage{thumbpdf} \usepackage{rotating} %% need no \usepackage{Sweave} %% additional commands \newcommand{\squote}[1]{`{#1}'} \newcommand{\dquote}[1]{``{#1}''} \newcommand{\fct}[1]{\texttt{#1()}} \newcommand{\class}[1]{\squote{\texttt{#1}}} %% for internal use \newcommand{\fixme}[1]{\emph{\marginpar{FIXME} (#1)}} \newcommand{\readme}[1]{\emph{\marginpar{README} (#1)}} \hyphenation{Qua-dra-tic} \title{\pkg{partykit}: A Toolkit for Recursive Partytioning} \Plaintitle{partykit: A Toolkit for Recursive Partytioning} \author{Achim Zeileis\\Universit\"at Innsbruck \And Torsten Hothorn\\Universit\"at Z\"urich} \Plainauthor{Achim Zeileis, Torsten Hothorn} \Abstract{ The \pkg{partykit} package provides a flexible toolkit with infrastructure for learning, representing, summarizing, and visualizing a wide range of tree-structured regression and classification models. The functionality encompasses: (a)~Basic infrastructure for \emph{representing} trees (inferred by any algorithm) so that unified \code{print}/\code{plot}/\code{predict} methods are available. (b)~Dedicated methods for trees with \emph{constant fits} in the leaves (or terminal nodes) along with suitable coercion functions to create such tree models (e.g., by \pkg{rpart}, \pkg{RWeka}, PMML). (c)~A reimplementation of \emph{conditional inference trees} (\code{ctree}, originally provided in the \pkg{party} package). (d)~An extended reimplementation of \emph{model-based recursive partitioning} (\code{mob}, also originally in \pkg{party}) along with dedicated methods for trees with parametric models in the leaves. This vignette gives a brief overview of the package and discusses in detail the generic infrastructure for representing trees (a). Items~(b)--(d) are discussed in the remaining vignettes in the package. } \Keywords{recursive partitioning, regression trees, classification trees, decision trees} \Address{ Achim Zeileis \\ Department of Statistics \\ Faculty of Economics and Statistics \\ Universit\"at Innsbruck \\ Universit\"atsstr.~15 \\ 6020 Innsbruck, Austria \\ E-mail: \email{Achim.Zeileis@R-project.org} \\ URL: \url{http://eeecon.uibk.ac.at/~zeileis/} \\ Torsten Hothorn\\ Institut f\"ur Epidemiologie, Biostatistik und Pr\"avention \\ Universit\"at Z\"urich \\ Hirschengraben 84\\ CH-8001 Z\"urich, Switzerland \\ E-mail: \email{Torsten.Hothorn@R-project.org}\\ URL: \url{http://user.math.uzh.ch/hothorn/} } \begin{document} \SweaveOpts{eps=FALSE, keep.source=TRUE, eval = TRUE} <>= suppressWarnings(RNGversion("3.5.2")) options(width = 70) library("partykit") set.seed(290875) @ \section{Overview} \label{sec:overview} In the more than fifty years since \cite{Morgan+Sonquist:1963} published their seminal paper on ``automatic interaction detection'', a wide range of methods has been suggested that is usually termed ``recursive partitioning'' or ``decision trees'' or ``tree(-structured) models'' etc. Particularly influential were the algorithms CART \citep[classification and regression trees,][]{Breiman+Friedman+Olshen:1984}, C4.5 \citep{Quinlan:1993}, QUEST/GUIDE \citep{Loh+Shih:1997,Loh:2002}, and CTree \citep{Hothorn+Hornik+Zeileis:2006} among many others \citep[see][for a recent overview]{Loh:2014}. 
Reflecting the heterogeneity of conceptual algorithms, a wide range of computational implementations in various software systems emerged: Typically, the original authors of an algorithm also provide accompanying software but many software systems, e.g., \pkg{Weka} \citep{Witten+Frank:2005} or \proglang{R} \citep{R}, also provide collections of various types of trees. Within \proglang{R}, the list of prominent packages includes \pkg{rpart} \citep[implementing the CART algorithm]{rpart}, \pkg{mvpart} \citep[for multivariate CART]{mvpart}, \pkg{RWeka} \citep[containing interfaces to J4.8, M5', LMT from \pkg{Weka}]{RWeka}, and \pkg{party} \citep[implementing CTree and MOB]{party} among many others. See the CRAN task view ``Machine Learning'' \citep{ctv} for an overview. All of these algorithms and software implementations have to deal with very similar challenges. However, due to the fragmentation of the communities in which the corresponding research is published -- ranging from statistics through machine learning to various applied fields -- many discussions of the algorithms do not reuse established theoretical results and terminology. Similarly, there is no common ``language'' for the software implementations and different solutions are provided by different packages (even within \proglang{R}) with relatively little reuse of code. The \pkg{partykit} package tries to address the latter point and improve the computational situation by providing a common unified infrastructure for recursive partytioning in the \proglang{R} system for statistical computing. In particular, \pkg{partykit} provides tools for representing fitted trees along with printing, plotting, and computing predictions. The design principles are: \begin{itemize} \item One `agnostic' base class (\class{party}) which can encompass an extremely wide range of different types of trees. \item Subclasses for important types of trees, e.g., trees with constant fits (\class{constparty}) or with parametric models (\class{modelparty}) in each terminal node (or leaf). \item Nodes are recursive objects, i.e., a node can contain child nodes. \item Keep the (learning) data out of the recursive node and split structure. \item Basic printing, plotting, and predicting for the raw node structure. \item Customization via suitable panel or panel-generating functions. \item Coercion from existing object classes in \proglang{R} (\code{rpart}, \code{J48}, etc.) to the new class. \item Usage of simple/fast \proglang{S}3 classes and methods. \end{itemize} In addition to all of this generic infrastructure, two specific tree algorithms are implemented in \pkg{partykit} as well: \fct{ctree} for conditional inference trees \citep{Hothorn+Hornik+Zeileis:2006} and \fct{mob} for model-based recursive partitioning \citep{Zeileis+Hothorn+Hornik:2008}. This vignette (\code{"partykit"}) introduces the basic \class{party} class and associated infrastructure while three further vignettes discuss the tools built on top of it: \code{"constparty"} covers the eponymous class for constant-fit trees along with suitable coercion functions, and \code{"ctree"} and \code{"mob"} discuss the new \fct{ctree} and \fct{mob} implementations, respectively. Each of the vignettes can be viewed within \proglang{R} via \code{vignette(}\emph{``name''}\code{, package = "partykit")}.
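For example, the present vignette and the one on constant-fit trees could be opened via the following calls (a brief sketch, not evaluated here):
%
<<view-vignettes, eval = FALSE>>=
## open the introductory vignette and the constant-fit tree vignette
vignette("partykit", package = "partykit")
vignette("constparty", package = "partykit")
@
%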
Normal users reading this vignette will typically be interested only in the motivating example in Section~\ref{sec:intro} while the remaining sections are intended for programmers who want to build infrastructure on top of \pkg{partykit}. \section{Motivating example} \label{sec:intro} \subsection{Data} To illustrate how \pkg{partykit} can be used to represent trees, we employ a simple artificial data set taken from \cite{Witten+Frank:2005}. It concerns the conditions suitable for playing some unspecified game: % <>= data("WeatherPlay", package = "partykit") WeatherPlay @ % To represent the \code{play} decision based on the corresponding weather condition variables, one could use the tree displayed in Figure~\ref{fig:weather-plot}. For now, we ignore how this tree was inferred and simply assume it to be given. \setkeys{Gin}{width=0.8\textwidth} \begin{figure}[t!] \centering <>= py <- party( partynode(1L, split = partysplit(1L, index = 1:3), kids = list( partynode(2L, split = partysplit(3L, breaks = 75), kids = list( partynode(3L, info = "yes"), partynode(4L, info = "no"))), partynode(5L, info = "yes"), partynode(6L, split = partysplit(4L, index = 1:2), kids = list( partynode(7L, info = "yes"), partynode(8L, info = "no"))))), WeatherPlay) plot(py) @ \caption{\label{fig:weather-plot} Decision tree for \code{play} decision based on weather conditions in \code{WeatherPlay} data.} \end{figure} \setkeys{Gin}{width=\textwidth} To represent this tree (or recursive partition) in \pkg{partykit}, two basic building blocks are used: splits of class \class{partysplit} and nodes of class \class{partynode}. The resulting recursive partition can then be associated with a data set in an object of class \class{party}. \subsection{Splits} First, we employ the \fct{partysplit} function to create the three splits in the ``play tree'' from Figure~\ref{fig:weather-plot}. The function takes the following arguments: \begin{Code} partysplit(varid, breaks = NULL, index = NULL, ..., info = NULL) \end{Code} where \code{varid} is an integer id (column number) of the variable used for splitting, e.g., \code{1L} for \code{outlook}, \code{3L} for \code{humidity}, or \code{4L} for \code{windy}. Then, \code{breaks} and \code{index} determine which observations are sent to which of the branches, e.g., \code{breaks = 75} for the humidity split. Apart from further arguments not shown above (and subsumed under `\code{...}'), some arbitrary information can be associated with a \class{partysplit} object by passing it to the \code{info} argument. The three splits from Figure~\ref{fig:weather-plot} can then be created via % <>= sp_o <- partysplit(1L, index = 1:3) sp_h <- partysplit(3L, breaks = 75) sp_w <- partysplit(4L, index = 1:2) @ % For the numeric \code{humidity} variable the \code{breaks} are set, while for the factor variables \code{outlook} and \code{windy} the \code{index} specifies which of the levels should be associated with which of the branches of the tree. \subsection{Nodes} Second, we use these splits in the creation of the whole decision tree. In \pkg{partykit}, a tree is represented by a \class{partynode} object, which is recursive in that it may have ``kids'' that are again \class{partynode} objects. These can be created with the function \begin{Code} partynode(id, split = NULL, kids = NULL, ..., info = NULL) \end{Code} where \code{id} is an integer node identifier, \code{split} is a \class{partysplit} object, and \code{kids} is a list of \class{partynode} objects.
Again, there are further arguments not shown (\code{...}) and arbitrary information can be supplied in \code{info}. The whole tree from Figure~\ref{fig:weather-plot} can then be created via % <>= pn <- partynode(1L, split = sp_o, kids = list( partynode(2L, split = sp_h, kids = list( partynode(3L, info = "yes"), partynode(4L, info = "no"))), partynode(5L, info = "yes"), partynode(6L, split = sp_w, kids = list( partynode(7L, info = "yes"), partynode(8L, info = "no"))))) @ % where the previously created \class{partysplit} objects are used as splits and the nodes are simply numbered (depth first) from~1 to~8. For the terminal nodes of the tree there are no \code{kids} and the corresponding \code{play} decision is stored in the \code{info} argument. Printing the \class{partynode} object reflects the recursive structure stored. % <>= pn @ % However, the displayed information is still rather raw as it has not yet been associated with the \code{WeatherPlay} data set. \subsection{Trees (or recursive partitions)} Therefore, in a third step the recursive tree structure stored in \code{pn} is coupled with the corresponding data in a \class{party} object. % <>= py <- party(pn, WeatherPlay) print(py) @ % Now, Figure~\ref{fig:weather-plot} can easily be created by % <>= plot(py) @ % In addition to \fct{print} and \fct{plot}, the \fct{predict} method can now be applied, yielding the predicted terminal node IDs. % <>= predict(py, head(WeatherPlay)) @ % In addition to the \class{partynode} and the \class{data.frame}, the function \fct{party} takes several further arguments: \begin{Code} party(node, data, fitted = NULL, terms = NULL, ..., info = NULL) \end{Code} i.e., \code{fitted} values, a \code{terms} object, arbitrary additional \code{info}, and again some further arguments collected in \code{...}. \subsection{Methods and other utilities} The main idea behind the \class{party} class is that tedious tasks such as \fct{print}, \fct{plot}, and \fct{predict} do not have to be reimplemented for every new kind of decision tree but can simply be reused. However, in addition to these three basic tasks (as already illustrated above), developers of tree model software also need further basic utilities for working with trees: e.g., functions for querying or subsetting the tree and for customizing printed/plotted output. Below, various utilities provided by the \pkg{partykit} package are introduced. For querying the dimensions of the tree, three basic functions are available: \fct{length} gives the number of kid nodes of the root node, \fct{depth} the depth of the tree, and \fct{width} the number of terminal nodes. % <>= length(py) width(py) depth(py) @ % As decision trees can grow to be rather large, it is often useful to inspect only subtrees. These can be easily extracted using the standard \code{[} or \code{[[} operators: % <>= py[6] @ % The resulting object is again a fully valid \class{party} tree and can hence be printed (as above) or plotted (via \code{plot(py[6])}, see the left panel of Figure~\ref{fig:plot-customization}). Instead of using the integer node IDs for subsetting, node labels can also be used. By default these are just (character versions of) the node IDs but other names can be easily assigned: % <>= py2 <- py names(py2) names(py2) <- LETTERS[1:8] py2 @ % The function \fct{nodeids} queries the integer node IDs belonging to a \class{party} tree. By default all IDs are returned but optionally only the terminal IDs (of the leaves) can be extracted.
% <>= nodeids(py) nodeids(py, terminal = TRUE) @ % Often functions need to be applied to certain nodes of a tree, e.g., for extracting information. This is accommodated by a new generic function \fct{nodeapply} that follows the style of other \proglang{R} functions from the \code{apply} family and has methods for \class{party} and \class{partynode} objects. Furthermore, it needs a set of node IDs (often computed via \fct{nodeids}) and a function \code{FUN} that is applied to each of the requested \class{partynode} objects, typically for extracting/formatting the \code{info} of the node. % <>= nodeapply(py, ids = c(1, 7), FUN = function(n) n$info) nodeapply(py, ids = nodeids(py, terminal = TRUE), FUN = function(n) paste("Play decision:", n$info)) @ % Similar to the functions applied in a \fct{nodeapply}, the \fct{print}, \fct{predict}, and \fct{plot} methods can be customized through panel functions that format certain parts of the tree (such as headers, footers, nodes, etc.). Hence, the same kind of panel function employed above can also be used for predictions: <>= predict(py, FUN = function(n) paste("Play decision:", n$info)) @ As a variation of this approach, an extended formatting with multiple lines can be easily accommodated by supplying a character vector for every node: % <>= print(py, terminal_panel = function(n) c(", then the play decision is:", toupper(n$info))) @ % The same type of approach can also be used in the default \fct{plot} method (with the main difference that the panel function operates on the \code{info} directly rather than on the \class{partynode}). % <>= plot(py, tp_args = list(FUN = function(i) c("Play decision:", toupper(i)))) @ % See the right panel of Figure~\ref{fig:plot-customization} for the resulting graphic. Many more elaborate panel functions are provided in \pkg{partykit}, especially for showing not only text but also statistical graphics in the visualizations. Some of these are briefly illustrated in this and the other package vignettes. Programmers who want to write their own panel functions are advised to inspect the corresponding \proglang{R} source code to see how flexible (but sometimes also complicated) these panel functions are. \setkeys{Gin}{width=0.48\textwidth} \begin{figure}[t!] \centering <>= plot(py[6]) @ <>= <> @ \caption{\label{fig:plot-customization} Visualization of subtree (left) and tree with custom text in terminal nodes (right).} \end{figure} \setkeys{Gin}{width=\textwidth} Finally, an important utility function is \fct{nodeprune}, which can be used to prune \class{party} trees. It takes a vector of node IDs and prunes away all of their kids, i.e., it turns the indicated nodes into terminal nodes. <>= nodeprune(py, 2) nodeprune(py, c(2, 6)) @ Note that for the pruned versions of this particular \class{party} tree, the new terminal nodes are displayed with a \code{*} rather than the play decision. This is because we did not store any play decisions in the \code{info} of the inner nodes of \code{py}. Of course, we could have done so initially, could do so now, or might want to do so automatically. For the latter, we would have to know how predictions should be obtained from the data, and this is briefly discussed at the end of this vignette and in more detail in \code{vignette("constparty", package = "partykit")}. \section{Technical details} \subsection{Design principles} To facilitate reading of the subsequent sections, two design principles employed in the creation of \pkg{partykit} are briefly explained.
% \begin{enumerate} \item Many helper utilities are encapsulated in functions that follow a simple naming convention. To extract/compute some information \emph{foo} from splits, nodes, or trees, \pkg{partykit} provides \emph{foo}\code{_split}, \emph{foo}\code{_node}, and \emph{foo}\code{_party} functions (that are applicable to \class{partysplit}, \class{partynode}, and \class{party} objects, respectively). Examples for the information \emph{foo} are \code{kidids} and \code{info}. Hence, in the printing example above, using \code{info_node(n)} rather than \code{n$info} for a node \code{n} would have been the preferred syntax, at least when programming new functionality on top of \pkg{partykit}. \item As already illustrated above, printing and plotting rely on \emph{panel functions} that visualize and/or format certain aspects of the resulting display, e.g., that of inner nodes, terminal nodes, headers, footers, etc. Furthermore, arguments like \code{terminal_panel} can also take \emph{panel-generating functions}, i.e., functions that produce a panel function when applied to the \class{party} object. \end{enumerate} \subsection{Splits} \label{sec:splits} \subsubsection{Overview} A split is basically a function that maps data -- or more specifically a partitioning variable -- to daughter nodes. Objects of class \class{partysplit} are designed to represent such functions and are set up by the \fct{partysplit} constructor. For example, a binary split in the numeric partitioning variable \code{humidity} (the third variable in \code{WeatherPlay}) at the breakpoint \code{75} can be created (as above) by % <>= sp_h <- partysplit(3L, breaks = 75) class(sp_h) @ % The internal structure of class \class{partysplit} contains information about the partitioning variable, the split points (or cut points or breakpoints), the handling of the split points, the treatment of observations with missing values, and the kid nodes to send observations to: % <>= unclass(sp_h) @ % Here, the splitting rule is \code{humidity} $\le 75$: % <>= character_split(sp_h, data = WeatherPlay) @ % This representation of splits is completely abstract and, most importantly, independent of any data. Now, data comes into play when we actually want to perform splits: % <>= kidids_split(sp_h, data = WeatherPlay) @ % For each observation in \code{WeatherPlay} the split is performed and the number of the kid node to send this observation to is returned. Of course, this is a very complicated way of saying % <>= as.numeric(!(WeatherPlay$humidity <= 75)) + 1 @ \subsubsection{Mathematical notation} To explain the splitting strategy more formally, we employ some mathematical notation. \pkg{partykit} considers a split to represent a function $f$ mapping an element $x = (x_1, \dots, x_p)$ of a $p$-dimensional sample space $\mathcal{X}$ into a set of $k$ daughter nodes $\mathcal{D} = \{d_1, \dots, d_k\}$. This mapping is defined as a composition $f = h \circ g$ of two functions $g: \mathcal{X} \rightarrow \mathcal{I}$ and $h: \mathcal{I} \rightarrow \mathcal{D}$ with index set $\mathcal{I} = \{1, \dots, l\}, l \ge k$. Let $\mu = (-\infty, \mu_1, \dots, \mu_{l - 1}, \infty)$ denote the split points ($(\mu_1, \dots, \mu_{l - 1})$ = \code{breaks}). We are interested in splitting according to the information contained in the $i$-th element of $x$ ($i$ = \code{varid}). For numeric $x_i$, the split points are also numeric. If $x_i$ is a factor with levels $1, \dots, K$, the default split points are $\mu = (-\infty, 1, \dots, K - 1, \infty)$.
The function $g$ essentially determines which of the intervals (defined by $\mu$) the value $x_i$ is contained in ($I$ denotes the indicator function here): \begin{eqnarray*} x \mapsto g(x) = \sum_{j = 1}^l j I_{\mathcal{A}(j)}(x_i) \end{eqnarray*} where $\mathcal{A}(j) = (\mu_{j - 1}, \mu_j]$ for \code{right = TRUE} except $\mathcal{A}(l) = (\mu_{l - 1}, \infty)$. If \code{right = FALSE}, then $\mathcal{A}(j) = [\mu_{j - 1}, \mu_j)$ except $\mathcal{A}(1) = (-\infty, \mu_1)$. Note that for a categorical variable $x_i$ and default split points, $g$ is simply the identity. Now, $h$ maps from the index set $\mathcal{I}$ into the set of daughter nodes: \begin{eqnarray*} f(x) = h(g(x)) = d_{\sigma_{g(x)}} \end{eqnarray*} where $\sigma = (\sigma_1, \dots, \sigma_l) \in \{1, \dots, k\}^l$ (\code{index}). By default, $\sigma = (1, \dots, l)$ and $k = l$. If $x_i$ is missing, then $f(x)$ is randomly drawn with $\mathbb{P}(f(x) = d_j) = \pi_j, j = 1, \dots, k$ for a discrete probability distribution $\pi = (\pi_1, \dots, \pi_k)$ over the $k$ daughter nodes (\code{prob}). In the simplest case of a binary split in a numeric variable $x_i$, there is only one split point $\mu_1$ and, with $\sigma = (1, 2)$, observations with $x_i \le \mu_1$ are sent to daughter node $d_1$ and observations with $x_i > \mu_1$ to $d_2$. However, this representation of splits is general enough to deal with more complicated setups such as surrogate splits (where typically the index needs modification, for example $\sigma = (2, 1)$), categorical splits (i.e., there is one data structure for both ordered and unordered splits), multiway splits, and functional splits. The latter can be implemented by defining a new artificial splitting variable $x_{p + 1}$ as a potentially very complex function of $x$ that is later used for splitting. \subsubsection{Further examples} Consider a split in a categorical variable with three levels where the first two levels go to the left daughter node and the third one to the right daughter node: % <>= sp_o2 <- partysplit(1L, index = c(1L, 1L, 2L)) character_split(sp_o2, data = WeatherPlay) table(kidids_split(sp_o2, data = WeatherPlay), WeatherPlay$outlook) @ % The internal structure of this object contains the \code{index} slot that maps levels to kid nodes. % <>= unclass(sp_o2) @ % This mapping is also useful with splits in ordered variables or when representing multiway splits: % <>= sp_o <- partysplit(1L, index = 1L:3L) character_split(sp_o, data = WeatherPlay) @ % For a split in a numeric variable, the mapping to daughter nodes can also be changed by modifying \code{index}: % <>= sp_t <- partysplit(2L, breaks = c(69.5, 78.8), index = c(1L, 2L, 1L)) character_split(sp_t, data = WeatherPlay) table(kidids_split(sp_t, data = WeatherPlay), cut(WeatherPlay$temperature, breaks = c(-Inf, 69.5, 78.8, Inf))) @ \subsubsection{Further comments} The additional argument \code{prob} can be used to specify a discrete probability distribution over the daughter nodes that is used to map observations with missing values to daughter nodes. Furthermore, the \code{info} argument and slot can take arbitrary objects to be stored with the split (for example split statistics). Currently, no specific structure of the \code{info} is used. Programmers who employ this functionality in their own functions/packages should access the elements of a \class{partysplit} object by the corresponding accessor functions (and not just the \code{$} operator) as the internal structure might be changed/extended in future releases.
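For illustration, the elements of the split \code{sp_h} created above could be queried through such accessors (a brief sketch, assuming the accessor functions documented alongside \fct{partysplit}; not evaluated here):
%
<<split-accessors, eval = FALSE>>=
## accessor functions instead of the $ operator (see ?partysplit)
varid_split(sp_h)   ## id of the splitting variable (here: 3L, humidity)
breaks_split(sp_h)  ## split points (here: 75)
index_split(sp_h)   ## mapping to kid nodes (here: NULL, i.e., the default)
info_split(sp_h)    ## arbitrary additional information (here: NULL)
@
%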
\subsection{Nodes} \label{sec:nodes} \subsubsection{Overview} Inner and terminal nodes are represented by objects of class \class{partynode}. Each node has a unique identifier \code{id}. A node consisting only of such an identifier (and possibly additional information in \code{info}) is a terminal node: % <>= n1 <- partynode(id = 1L) is.terminal(n1) print(n1) @ % Inner nodes must have a primary split \code{split} and at least two daughter nodes. The daughter nodes are themselves objects of class \class{partynode} and thus reflect the recursive nature of this data structure. The daughter nodes are pooled in a list \code{kids}. In addition, a list of \class{partysplit} objects offering surrogate splits can be supplied in argument \code{surrogates}. These are used in case the variable needed for the primary split has missing values in a particular data set. The IDs in a \class{partynode} should be numbered ``depth first'' (sometimes also called ``pre-order traversal''). This simply means that the root node has identifier 1; the first kid node has identifier 2, whose kid (if present) has identifier 3, and so on. If other IDs are desired, then one can simply set \fct{names} (see above) for the tree; however, internally the depth-first numbering needs to be used. Note that the \fct{partynode} constructor also allows creating \class{partynode} objects with other ID schemes, as this is necessary for growing the tree. If one wants to ensure that a given \class{partynode} object has the correct IDs, one can simply apply \fct{as.partynode} once more to establish the right order of IDs. Finally, let us emphasize that \class{partynode} objects are not directly connected to the actual data (only indirectly through the associated \class{partysplit} objects). \subsubsection{Examples} Based on the split \code{sp_o} defined in the previous section, we set up an inner node with three terminal daughter nodes and print this stump (the data is needed because neither splits nor nodes contain information about variable names or levels): % <>= n1 <- partynode(id = 1L, split = sp_o, kids = lapply(2L:4L, partynode)) print(n1, data = WeatherPlay) @ % Now that we have defined this simple tree, we want to assign observations to terminal nodes: % <>= fitted_node(n1, data = WeatherPlay) @ % Here, the \code{id}s of the terminal nodes each observation falls into are returned. Alternatively, we could compute the position of these daughter nodes in the list \code{kids}: % <>= kidids_node(n1, data = WeatherPlay) @ % Furthermore, the \code{info} argument and slot take arbitrary objects to be stored with the node (predictions, for example, but we will handle this issue later). The slots can be extracted by means of the corresponding accessor functions. \subsubsection{Methods} A number of methods are defined for \class{partynode} objects: \fct{is.partynode} checks if the argument is a valid \class{partynode} object. \fct{is.terminal} is \code{TRUE} for terminal nodes and \code{FALSE} for inner nodes. The subset method \code{[} returns the \class{partynode} object corresponding to the \code{i}-th kid. The \fct{as.partynode} and \fct{as.list} methods can be used to convert flat list structures into recursive \class{partynode} objects and vice versa. As pointed out above, \fct{as.partynode} applied to \class{partynode} objects also renumbers the recursive nodes starting with root node identifier \code{from}.
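For instance, the recursive stump \code{n1} from above could be converted to a flat list of nodes, and its IDs could be renumbered via \fct{as.partynode} (a brief sketch, not evaluated here):
%
<<node-coercion, eval = FALSE>>=
## flat list representation of the recursive node structure
as.list(n1)
## renumber the node IDs depth first, starting from root identifier 1
as.partynode(n1, from = 1L)
@
%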
Furthermore, many of the methods defined for the class \class{party} illustrated above also work for plain \class{partynode} objects. For example, \fct{length} gives the number of kid nodes of the root node, \fct{depth} the depth of the tree, and \fct{width} the number of terminal nodes. \subsection{Trees} \label{sec:trees} Although tree structures can be represented by \class{partynode} objects, a tree is more than a number of nodes and splits. More information about (parts of the) corresponding data is necessary for high-level computations on trees. \subsubsection{Trees and data} First, the raw node/split structure needs to be associated with a corresponding data set. % <>= t1 <- party(n1, data = WeatherPlay) t1 @ % Note that the \code{data} may have zero rows (i.e., only contain variable names/classes but not the actual data), and all methods that do not require the presence of any learning data still work fine: % <>= party(n1, data = WeatherPlay[0, ]) @ \subsubsection{Response variables and regression relationships} Second, for decision trees (or regression and classification trees) more information is required: namely, the response variable and its fitted values. Hence, a \class{data.frame} can be supplied in \code{fitted} that has at least one variable \code{(fitted)} containing the terminal node numbers of the data used for fitting the tree. For representing the dependence of the response on the partitioning variables, a \code{terms} object can be provided that is leveraged for appropriately preprocessing new data in predictions. Finally, any additional (currently unstructured) information can be stored in \code{info} again. % <>= t2 <- party(n1, data = WeatherPlay, fitted = data.frame( "(fitted)" = fitted_node(n1, data = WeatherPlay), "(response)" = WeatherPlay$play, check.names = FALSE), terms = terms(play ~ ., data = WeatherPlay)) @ % The information that is now contained in the tree \code{t2} is sufficient for all operations that should typically be performed on constant-fit trees. For this type of tree there is also a dedicated class \class{constparty} that provides some further convenience methods, especially for plotting and predicting. If a suitable \class{party} object like \code{t2} is already available, it just needs to be coerced: % <>= t2 <- as.constparty(t2) t2 @ % \setkeys{Gin}{width=0.6\textwidth} \begin{figure}[t!] \centering <>= plot(t2, tnex = 1.5) @ \caption{\label{fig:constparty-plot} Constant-fit tree for \code{play} decision based on weather conditions in \code{WeatherPlay} data.} \end{figure} \setkeys{Gin}{width=\textwidth} % As pointed out above, \class{constparty} objects have enhanced \fct{plot} and \fct{predict} methods. For example, \code{plot(t2)} now produces stacked bar plots in the leaves (see Figure~\ref{fig:constparty-plot}) as \code{t2} is a classification tree. For regression and survival trees, boxplots and Kaplan-Meier curves are employed automatically, respectively. As there is information about the response variable, the \fct{predict} method can now produce more than just the predicted node IDs. The default is to predict the \code{"response"}, i.e., a factor for a classification tree. In this case, class probabilities (\code{"prob"}) are also available in addition to the majority votes.
% <>= nd <- data.frame(outlook = factor(c("overcast", "sunny"), levels = levels(WeatherPlay$outlook))) predict(t2, newdata = nd, type = "response") predict(t2, newdata = nd, type = "prob") predict(t2, newdata = nd, type = "node") @ More details on how \class{constparty} objects and their methods work can be found in the corresponding \code{vignette("constparty", package = "partykit")}. \section{Summary} This vignette (\code{"partykit"}) introduces the package \pkg{partykit} that provides a toolkit for computing with recursive partytions, especially decision/regression/classification trees. In this vignette, the basic \class{party} class and associated infrastructure are discussed: splits, nodes, and trees with functions for printing, plotting, and predicting. Further vignettes in the package discuss in more detail the tools built on top of it. \bibliography{party} \end{document}