Getting the observations in a rpart's node (i.e.: CART)

I would like to inspect all the observations that reached some node in an rpart decision tree. For example, in the following code:

fit <- rpart(Kyphosis ~ Age + Start, data = kyphosis)
fit

n= 81 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 81 17 absent (0.79012346 0.20987654)  
   2) Start>=8.5 62  6 absent (0.90322581 0.09677419)  
     4) Start>=14.5 29  0 absent (1.00000000 0.00000000) *
     5) Start< 14.5 33  6 absent (0.81818182 0.18181818)  
      10) Age< 55 12  0 absent (1.00000000 0.00000000) *
      11) Age>=55 21  6 absent (0.71428571 0.28571429)  
        22) Age>=111 14  2 absent (0.85714286 0.14285714) *
        23) Age< 111 7  3 present (0.42857143 0.57142857) *
   3) Start< 8.5 19  8 present (0.42105263 0.57894737) *

I would like to see all the observations in node (5) (i.e., the 33 observations for which Start>=8.5 & Start< 14.5). Obviously I could get to them manually, but I would like to have a function, say "get_node_date", so that I could just run get_node_date(5) and get the relevant observations.

Any suggestions on how to go about this?

There seems to be no built-in function that extracts the observations of a specific node. I would solve it as follows: first determine which rules lead to the node you are interested in. You can use path.rpart for that. Then apply those rules one after the other to extract the observations.

This approach as a function:

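get_node_date <- function(tree = fit, node = 5){
  # path.rpart() prints the path to the node and returns its split rules
  rule <- path.rpart(tree, node)
  # split each rule, e.g. "Start>=8.5", into variable, operator and threshold (numeric splits only)
  rule_2 <- sapply(rule[[1]][-1], function(x) strsplit(x, '(?<=[><=])(?=[^><=])|(?<=[^><=])(?=[><=])', perl = TRUE))
  # evaluate every rule on kyphosis and keep the rows that satisfy all of them
  ind <- apply(do.call(cbind, lapply(rule_2, function(x) eval(call(x[2], kyphosis[,x[1]], as.numeric(x[3]))))), 1, all)
  kyphosis[ind,]
  }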

For node 5 you get:

get_node_date()

 node number: 5 
   root
   Start>=8.5
   Start< 14.5
   Kyphosis Age Number Start
2    absent 158      3    14
10  present  59      6    12
11  present  82      5    14
14   absent   1      4    12
18   absent 175      5    13
20   absent  27      4     9
23  present  96      3    12
26   absent   9      5    13
28   absent 100      3    14
32   absent 125      2    11
33   absent 130      5    13
35   absent 140      5    11
37   absent   1      3     9
39   absent  20      6     9
40  present  91      5    12
42   absent  35      3    13
46  present 139      3    10
48   absent 131      5    13
50   absent 177      2    14
51   absent  68      5    10
57   absent   2      3    13
59   absent  51      7     9
60   absent 102      3    13
66   absent  17      4    10
68   absent 159      4    13
69   absent  18      4    11
71   absent 158      5    14
72   absent 127      4    12
74   absent 206      4    10
77  present 157      3    13
78   absent  26      7    13
79   absent 120      2    13
81   absent  36      4    13
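
A variation on the same idea, as a minimal sketch rather than a tested general solution: the rules returned by path.rpart are already valid R expressions for numeric splits, so they can be joined with & and evaluated inside the data. The helper name rows_in_node is made up for illustration, and the sketch assumes numeric splits only (rules for factor splits, such as "Var=a,b", would not parse) and no missing values in the split variables.

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Start, data = kyphosis)

# Hypothetical helper: rows of `data` that satisfy every split rule on the
# path from the root to `node`. Assumes all splits are on numeric variables.
rows_in_node <- function(tree, node, data) {
  # print.it = FALSE suppresses the printed path; [-1] drops the "root" entry
  rules <- path.rpart(tree, node, print.it = FALSE)[[1]][-1]
  if (length(rules) == 0) return(data)  # the root node contains every row
  # e.g. "Start>=8.5 & Start< 14.5", evaluated with the columns of `data`
  keep <- eval(parse(text = paste(rules, collapse = " & ")), data)
  data[keep, , drop = FALSE]
}

node5 <- rows_in_node(fit, 5, kyphosis)
nrow(node5)  # 33, the n shown for node 5 in the printed tree
```

Because the data is passed in explicitly, the same call can be made with a data frame of new observations, under the same assumptions.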


rpart() returns an object of class rpart, which contains the information you need:

require(rpart)
fit2 <- rpart(Kyphosis ~ Age + Start, data = kyphosis)
fit2

get_node_date <- function(nodeId, fit) {
  # the row names of fit$frame are the node numbers; column "n" is the node size
  fit$frame[toString(nodeId), "n"]
}

for (i in c(1, 2, 4, 5, 10, 11, 22, 23, 3))
  cat(get_node_date(i, fit2), "\n")


The partykit package also provides a canned solution for this. You just need to convert the rpart object to the party class in order to use its unified interface for dealing with trees. And then you can use the data_party() function.

Using the fit from the question and having loaded library("partykit") you can first coerce the rpart tree to party:

pfit <- as.party(fit)
plot(pfit)

There are only two small nuisances for extracting the data in the way you want: (1) The model.frame() from the original fit is always dropped in the coercion and needs to be reattached manually. (2) A different numbering scheme is used for the nodes. You want node 4 (rather than 5) now.

pfit$data <- model.frame(fit)
data4 <- data_party(pfit, 4)
dim(data4)
## [1] 33  5
head(data4)
##    Kyphosis Age Start (fitted) (response)
## 2    absent 158    14        7     absent
## 10  present  59    12        8    present
## 11  present  82    14        8    present
## 14   absent   1    12        5     absent
## 18   absent 175    13        7     absent
## 20   absent  27     9        5     absent

Another route is to subset the subtree starting from node 4 and then take the data from that:

pfit4 <- pfit[4]
plot(pfit4)

Then data_party(pfit4) gives you the same as data4 above. And pfit4$data gives you the data without the (fitted) node and the predicted (response).
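
The new node number does not have to be read off the plot. A minimal sketch, assuming that as.party() numbers the nodes in the order of the rows of fit$frame (this is consistent with the 5 -> 4 example above, but it is an assumption, not something confirmed from the partykit documentation); party_id is a hypothetical helper name:

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Start, data = kyphosis)

# Hypothetical helper: position of an rpart node number among the rows of
# fit$frame, assumed here to be the node id used after as.party()
party_id <- function(tree, node) {
  match(as.character(node), rownames(tree$frame))
}

party_id(fit, 5)  # 4
```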


Yet another way: find all of the terminal nodes below a given node and return the matching subset of the data used in the call.

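fit <- rpart(Kyphosis ~ Age + Start, data = kyphosis)

# node numbers on the path from the root down to node x
parent <- function(x) {
  if (x[1] != 1)
    c(Recall(if (x %% 2 == 0L) x / 2 else (x - 1) / 2), x) else x
}

subset.rpart <- function(tree, node = 1L) {
  # re-evaluate the data argument of the original rpart() call
  data <- eval(tree$call$data, parent.frame(1L))
  # root-to-node path for every node in the tree
  wh <- sapply(as.integer(rownames(tree$frame)), parent)
  # nodes on any path through `node`; those >= node are the node and its descendants
  wh <- unique(unlist(wh[sapply(wh, function(x) node %in% x)]))
  # tree$where indexes the rows of tree$frame, whose names are the node numbers
  data[rownames(tree$frame)[tree$where] %in% wh[wh >= node], ]
}

head(subset.rpart(fit, 5))
#    Kyphosis Age Number Start
# 2    absent 158      3    14
# 10  present  59      6    12
# 11  present  82      5    14
# 14   absent   1      4    12
# 18   absent 175      5    13
# 20   absent  27      4     9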
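
The parent() helper relies on rpart's node numbering, in which the children of node k are numbered 2k and 2k + 1, so the parent of any node is its number divided by 2, rounded down. A small loop-based sketch of the same arithmetic (ancestors is a made-up name for illustration, not part of rpart):

```r
# Hypothetical helper: node numbers on the path from the root (1) to `node`,
# using the rule that the parent of node k is k %/% 2
ancestors <- function(node) {
  path <- node
  while (node > 1) {
    node <- node %/% 2
    path <- c(node, path)
  }
  path
}

ancestors(23)  # 1 2 5 11 23
```

Node 5 appears on that path, which is why the observations in terminal node 23 belong to node 5 as well.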


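Two years after the original post, but this may be of use to others. Leaf assignments for the training observations in rpart can be obtained from $where. Note that $where holds, for each observation, the row index in fit$frame of its terminal node, not the node number itself (the node numbers are the row names of fit$frame):

fit <- rpart(Kyphosis ~ Age + Start, data = kyphosis)
fit$where

As a function:

get_node <- function(rpart.object=fit, data=kyphosis, node.number=5) {
  # translate $where (row index in $frame) into the node number of each observation's leaf
  leaf <- as.integer(rownames(rpart.object$frame))[rpart.object$where]
  # a leaf lies under node.number if repeated integer halving of its number reaches it
  under <- sapply(leaf, function(x) { while (x > node.number) x <- x %/% 2; x == node.number })
  data[under,]
}
get_node()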

This works for training observations only though, not for new observations.
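
To check what $where actually contains, here is a short sketch that converts it to node numbers and compares the resulting counts with the n column of fit$frame. The variable names are chosen for illustration only.

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Start, data = kyphosis)

# $where is a row index into fit$frame; the row names there are the node numbers
leaf_node <- as.integer(rownames(fit$frame))[fit$where]

# number of training observations per terminal node
leaf_counts <- table(leaf_node)
leaf_counts

# the same counts as stored by rpart for the terminal nodes
is_leaf <- fit$frame$var == "<leaf>"
fit$frame[is_leaf, "n", drop = FALSE]
```

Only terminal nodes ever appear in $where, so the observations of an internal node such as 5 have to be collected from the leaves below it.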


Comments
  • You don't get the observations through this, but only the number of observations which fall into a category
  • If you used ptree$data <- model.frame(eval(tree$call$data)), the variables not used in the formula wouldn't be dropped
  • True...but only if data contains all variables in the formula which is not necessarily the case. With the model.frame() you also get transformed variables, e.g., log(), Surv() or factor() versions of variables that are often created on the fly.
  • BTW: The as.party() coercion for rpart objects now keeps the data by default! Thus, you can do as.party(fit, data = TRUE) (which is the new default) or as.party(fit, data = FALSE) (which corresponds to the old behavior).