How are feature_importances in RandomForestClassifier determined?


I have a classification task with a time series as the input data, where each attribute (n=23) represents a specific point in time. Besides the classification result itself, I would like to find out which attributes/dates contribute to the result, and to what extent. For this I am using feature_importances_, which works well for me.

However, I would like to know how they are calculated and which measure/algorithm is used. Unfortunately, I could not find any documentation on this topic.

There are indeed several ways to get feature "importances". As is often the case, there is no strict consensus about what this word means.

In scikit-learn, we implement the importance as described in [1] (often cited, but unfortunately rarely read...). It is sometimes called "gini importance" or "mean decrease impurity" and is defined as the total decrease in node impurity, weighted by the probability of reaching that node (approximated by the proportion of samples reaching that node), averaged over all trees of the ensemble.
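The "averaged over all trees" part can be checked directly. The following is a minimal sketch, assuming a forest fitted on the iris data (the dataset and hyperparameters are illustrative choices, not part of the answer above): it averages the per-tree importances and compares the result with the forest's own attribute.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Illustrative data and settings (assumptions, not from the original answer).
X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Each fitted tree exposes its own normalized importances.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Average over the ensemble and compare with the forest-level attribute.
averaged = per_tree.mean(axis=0)
matches_forest = np.allclose(averaged, forest.feature_importances_)
print(averaged, matches_forest)
```

The comparison relies on every tree having at least one split, which should hold for a dataset like this one.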

In the literature or in some other packages, you can also find feature importances implemented as the "mean decrease accuracy". Basically, the idea is to measure the decrease in accuracy on OOB data when you randomly permute the values for that feature. If the decrease is low, then the feature is not important, and vice-versa.
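The permutation idea can be sketched by hand. The example below is an assumption-laden illustration rather than the method of any particular package: it uses a held-out test split instead of true out-of-bag samples, and the helper name accuracy_drop is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative data; a held-out split stands in for OOB samples (assumption).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

rng = np.random.default_rng(0)
baseline = forest.score(X_test, y_test)  # accuracy on unpermuted data


def accuracy_drop(model, X_eval, y_eval, column, n_repeats=10):
    """Mean decrease in accuracy after shuffling one column (hypothetical helper)."""
    drops = []
    for _ in range(n_repeats):
        X_perm = X_eval.copy()
        # Shuffling breaks the link between this feature and the target.
        X_perm[:, column] = rng.permutation(X_perm[:, column])
        drops.append(baseline - model.score(X_perm, y_eval))
    return float(np.mean(drops))


mean_decrease_accuracy = np.array(
    [accuracy_drop(forest, X_test, y_test, j) for j in range(X.shape[1])]
)
print(mean_decrease_accuracy)
```

A value near zero suggests the model barely relies on that feature; larger values indicate a bigger loss in accuracy when the feature is scrambled.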

(Note that both algorithms are available in the randomForest R package.)

[1]: Breiman, Friedman, "Classification and regression trees", 1984.


The usual way to compute the feature importance values of a single tree is as follows:

  1. you initialize an array feature_importances of all zeros with size n_features.

  2. you traverse the tree: for each internal node that splits on feature i you compute the error reduction of that node multiplied by the number of samples that were routed to the node and add this quantity to feature_importances[i].

The error reduction depends on the impurity criterion that you use (e.g. Gini, Entropy, MSE, ...). It is the impurity of the set of examples that gets routed to the internal node minus the sum of the impurities of the two partitions created by the split, each weighted by the fraction of the node's samples it receives.

It is important to note that these values are relative to a specific dataset (both the error reduction and the number of samples are dataset specific), so they cannot be compared between different datasets.
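The two steps above can be sketched in plain Python against a fitted scikit-learn tree. This is an illustrative reimplementation, assuming the iris data and the public tree_ arrays (children_left, children_right, feature, impurity, weighted_n_node_samples); the final normalization mirrors what scikit-learn does by default.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Illustrative data and model (assumptions for the sketch).
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
tree = clf.tree_

# Step 1: start from an array of zeros, one slot per feature.
importances = np.zeros(X.shape[1])

# Step 2: visit every internal node and add its weighted impurity decrease.
for node in range(tree.node_count):
    left = tree.children_left[node]
    right = tree.children_right[node]
    if left == -1:  # -1 marks a leaf in scikit-learn's tree arrays
        continue
    importances[tree.feature[node]] += (
        tree.weighted_n_node_samples[node] * tree.impurity[node]
        - tree.weighted_n_node_samples[left] * tree.impurity[left]
        - tree.weighted_n_node_samples[right] * tree.impurity[right]
    )

# Scale by the number of samples at the root, then normalize to sum to one.
importances /= tree.weighted_n_node_samples[0]
importances /= importances.sum()

print(importances)
print(clf.feature_importances_)
```

Both printed arrays should agree up to floating-point error.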

As far as I know there are alternative ways to compute feature importance values in decision trees. A brief description of the above method can be found in "Elements of Statistical Learning" by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.

feature_importances_ in scikit-learn is based on that logic, but in the case of a random forest the decrease in impurity is averaged over the trees. Pros: fast to compute and easy to retrieve (a single attribute). Cons: it is a biased approach, as it tends to inflate the importance of continuous features and high-cardinality categorical variables.

It's the ratio between the number of samples routed to a decision node involving that feature in any of the trees of the ensemble over the total number of samples in the training set.

Features that are involved in the top level nodes of the decision trees tend to see more samples hence are likely to have more importance.

Edit: this description is only partially correct: Gilles and Peter's answers are the correct answer.

Note that these measures are calculated purely from training data. feature_importances_ returns an array where each index corresponds to the estimated importance of that feature in the training set. No sorting is done internally; the entries correspond one-to-one with the features given to the model during training.

As @GillesLouppe pointed out above, scikit-learn currently implements the "mean decrease impurity" metric for feature importances. I personally find the second metric a bit more interesting, where you randomly permute the values for each of your features one-by-one and see how much worse your out-of-bag performance is.

Since what you're after with feature importance is how much each feature contributes to your overall model's predictive performance, the second metric actually gives you a direct measure of this, whereas the "mean decrease impurity" is just a good proxy.

If you're interested, I wrote a small package that implements the Permutation Importance metric and can be used to calculate the values from an instance of a scikit-learn random forest class:

https://github.com/pjh2011/rf_perm_feat_import

Edit: This works for Python 2.7, not 3
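For Python 3, recent scikit-learn releases (0.22 and later) ship sklearn.inspection.permutation_importance, which may serve as an alternative to the package above. The sketch below assumes a held-out test split rather than out-of-bag samples, and the dataset and settings are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Illustrative data and split (assumptions for the sketch).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle each feature n_repeats times and record the drop in the default score.
result = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0
)

# Report features from most to least important.
for j in result.importances_mean.argsort()[::-1]:
    print(j, result.importances_mean[j], result.importances_std[j])
```

The returned object also keeps the raw per-repeat values in result.importances, which helps judge how stable each estimate is.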

The feature_importances_ property returns the feature importances (the higher, the more important the feature) as an array of shape [n_features]. The values of this array sum to 1, unless all trees are single-node trees consisting of only the root node, in which case it will be an array of zeros.

Let me try to answer the question. Code:

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X = iris.data
y = iris.target
clf = DecisionTreeClassifier()
clf.fit(X, y)

Decision tree plot: (image not included). compute_feature_importances gives [0., 0.01333333, 0.06405596, 0.92261071]. Check the source code:

cpdef compute_feature_importances(self, normalize=True):
    """Computes the importance of each feature (aka variable)."""
    cdef Node* left
    cdef Node* right
    cdef Node* nodes = self.nodes
    cdef Node* node = nodes
    cdef Node* end_node = node + self.node_count

    cdef double normalizer = 0.

    cdef np.ndarray[np.float64_t, ndim=1] importances
    importances = np.zeros((self.n_features,))
    cdef DOUBLE_t* importance_data = <DOUBLE_t*>importances.data

    with nogil:
        while node != end_node:
            if node.left_child != _TREE_LEAF:
                # ... and node.right_child != _TREE_LEAF:
                left = &nodes[node.left_child]
                right = &nodes[node.right_child]

                importance_data[node.feature] += (
                    node.weighted_n_node_samples * node.impurity -
                    left.weighted_n_node_samples * left.impurity -
                    right.weighted_n_node_samples * right.impurity)
            node += 1

    importances /= nodes[0].weighted_n_node_samples

    if normalize:
        normalizer = np.sum(importances)

        if normalizer > 0.0:
            # Avoid dividing by zero (e.g., when root is pure)
            importances /= normalizer

    return importances

Try calculating the feature importance by hand:

print("sepal length (cm)",0)
print("sepal width (cm)",(3*0.444-(0+0)))
print("petal length (cm)",(54* 0.168 - (48*0.041+6*0.444)) +(46*0.043 -(0+3*0.444)) + (3*0.444-(0+0)))
print("petal width (cm)",(150* 0.667 - (0+100*0.5)) +(100*0.5-(54*0.168+46*0.043))+(6*0.444 -(0+3*0.444)) + (48*0.041-(0+0)))

We get feature_importance: np.array([0, 1.332, 6.418, 92.30]). After normalizing, we get array([0., 0.01331334, 0.06414793, 0.92253873]), which matches clf.feature_importances_ up to the rounding of the impurity values read from the plot. Be careful: all classes are assumed to have weight one.
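The normalization step can be reproduced with NumPy. This sketch reuses the rounded sample counts and impurities from the print statements above; dividing by the root's sample count first, as the source code does, would not change the normalized result.

```python
import numpy as np

# Raw (unnormalized) weighted impurity decreases per feature, copied from
# the hand calculation above (values rounded as read from the tree plot).
raw = np.array([
    0.0,  # sepal length (cm)
    3 * 0.444 - (0 + 0),  # sepal width (cm)
    (54 * 0.168 - (48 * 0.041 + 6 * 0.444))
    + (46 * 0.043 - (0 + 3 * 0.444))
    + (3 * 0.444 - (0 + 0)),  # petal length (cm)
    (150 * 0.667 - (0 + 100 * 0.5))
    + (100 * 0.5 - (54 * 0.168 + 46 * 0.043))
    + (6 * 0.444 - (0 + 3 * 0.444))
    + (48 * 0.041 - (0 + 0)),  # petal width (cm)
])

# Normalize so the importances sum to one.
normalized = raw / raw.sum()
print(raw)
print(normalized)
```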


Feature selection using feature importances in random forests: the plot is based on the attribute feature_importances_ of the classifier sklearn.ensemble.RandomForestClassifier.

# Calculate feature importances
importances = model.feature_importances_

# Sort feature importances in descending order
indices = np.argsort(importances)[::-1]

# Rearrange feature names so they match the sorted feature importances
names = [iris.feature_names[i] for i in indices]

# Create plot
plt.figure()


You can simply use the feature_importances_ attribute to select the features with the highest importance score. For example, you could use the following function to select the K best features according to importance:

def selectKImportance(model, X, k=5):
    return X[:, model.feature_importances_.argsort()[::-1][:k]]

Comments
  • Woah, three core devs in one SO thread. That's gotta be some kind of record ^^
  • It could be great if this answer was mentioned in the documentation of the importance attributes/example. Been searching for it for awhile too :)
  • It seems the importance score is in relative value? For example, the sum of the importance scores of all features is always 1 (see the example here scikit-learn.org/stable/auto_examples/ensemble/…)
  • @RNA: Yes, by default variable importances are normalized in scikit-learn, such that they sum to one. You can circumvent this by looping over the individual base estimators and calling tree_.compute_feature_importances(normalize=False).
  • @GillesLouppe Do you use the out of bag samples to measure the reduction in MSE for a forest of decision tree regressors in each tree? Or all training data used on the tree?
  • Two useful resources. (1) blog.datadive.net/… a blog by Ando Saabas implements both "mean decrease impurity" and also "mean decrease in accuracy" as mentioned by Gilles. (2) Download and read Gilles Louppe's thesis.
  • Do you know if there is some paper/documentation about the exact method? eg. Breiman, 2001. It would be great if I had some proper document, which I could cite for the methodology.
  • @ogrisel it would be great if you could clearly mark your response as the explanation for the "weighting". The weighting alone does not determine the feature importance. The "impurity metric" ("gini-importance" or RSS) combined with the weights, averaged over trees determines the overall feature importance. Unfortunately the documentation on scikit-learn here: scikit-learn.org/stable/modules/… is not accurate and incorrectly mentions "depth" as the impurity metric.
  • Hi @Peter when I use your code I get this error: NameError: name 'xrange' is not defined.
  • Hi @Aizzaac. Sorry I'm new to writing packages, so I should've noted I wrote it for Python 2.7. Try def xrange(x): return iter(range(x)) before running it
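The comment above about bypassing normalization can be sketched as follows. This is a minimal illustration, assuming a forest fitted on the iris data; it loops over the base estimators and calls tree_.compute_feature_importances(normalize=False) on each, as the comment suggests.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Illustrative data and settings (assumptions for the sketch).
X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Unnormalized importances of each base estimator.
raw_per_tree = np.array([
    est.tree_.compute_feature_importances(normalize=False)
    for est in forest.estimators_
])

# Average over the trees; this no longer sums to one.
raw_mean = raw_per_tree.mean(axis=0)
print(raw_mean, raw_mean.sum())
```

Note that this average of raw values is not simply a rescaled copy of forest.feature_importances_, because the forest averages per-tree importances after each tree has been normalized.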