How can I improve the accuracy of my prediction from a decision tree model using sklearn?

I have created a decision tree model in Python using sklearn. It takes data from a large public data set that relates human factors (age, BMI, sex, smoking, etc.) to the cost of medical care that insurance companies pay each year. I split the data set with a test size of .2, but the mean absolute error and mean squared error are incredibly high. I tried different splits (.5, .8) but did not get any different results. The model's predictions appear to be quite off in some areas, but I am not sure which part is lacking or what I need to improve. I have attached photos of my output (through an IMGUR link, as I cannot add photos) as well as my code, and I appreciate any guidance in the right direction!

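import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeRegressor

dataset = pd.read_csv('insurance.csv')

# Encode the categorical columns as integers
LE = LabelEncoder()
dataset.sex = LE.fit_transform(dataset.sex)
dataset.smoker = LE.fit_transform(dataset.smoker)
dataset.region = LE.fit_transform(dataset.region)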

print("Data Head")
print(dataset.head())
print("Data Info")
dataset.info()

for i in dataset.columns:
    print('Null Values in {i} :'.format(i = i) , dataset[i].isnull().sum())

X = dataset.drop('charges', axis = 1) 
y = dataset['charges'] 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0)  

regressor = DecisionTreeRegressor()
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test) 

df = pd.DataFrame({'Actual Value': y_test, 'Predicted Values': y_pred})  

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

A few things you can try, if you are not doing them already:

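  1. Use StandardScaler() from scikit-learn on non-categorical columns/features. Note that a single decision tree splits on thresholds, so its predictions are largely unaffected by feature scaling; this step matters more if you move to a scale-sensitive model.
  2. Use GridSearchCV from scikit-learn to search for appropriate hyperparameters instead of tuning them manually, although tuning manually first may give you a sense of which parameter values might work.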
  3. Check the documentation of DecisionTreeRegressor carefully to make sure that your implementation is in agreement with the documentation.
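
As a rough illustration of point 2, here is a minimal sketch of a grid search over two DecisionTreeRegressor hyperparameters. The data is a synthetic stand-in generated with NumPy (an assumption, not the real insurance.csv), and the candidate values in `param_grid` are illustrative rather than recommended settings:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the insurance data (an assumption, not the real file):
# the columns play the role of age, bmi and smoker; y plays the role of charges.
rng = np.random.default_rng(0)
n = 500
age = rng.integers(18, 65, size=n)
bmi = rng.normal(30, 6, size=n)
smoker = rng.integers(0, 2, size=n)
X = np.column_stack([age, bmi, smoker])
y = 250 * age + 300 * bmi + 20000 * smoker + rng.normal(0, 3000, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Candidate values are illustrative; widen or narrow them for your own data.
param_grid = {
    "max_depth": [2, 3, 4, 6, 8, None],
    "min_samples_leaf": [1, 5, 10, 20],
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",  # sklearn maximizes scores, hence the negated MAE
    cv=5,
)
search.fit(X_train, y_train)

# After fitting, the search object predicts with the best estimator refit on all training data.
test_mae = mean_absolute_error(y_test, search.predict(X_test))

print("Best parameters:", search.best_params_)
print("Cross-validated MAE:", -search.best_score_)
print("Test MAE:", test_mae)
```

Limiting `max_depth` or raising `min_samples_leaf` constrains how closely the tree can fit the training data, which is usually the first thing worth checking when an unconstrained tree has high test error.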

See if this helps.

You can use XGBoost, which uses a boosting algorithm.
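
To give a feel for boosting without installing an extra package, the sketch below swaps in scikit-learn's own GradientBoostingRegressor, which is a different implementation of gradient-boosted trees than XGBoost. The data is synthetic (an assumption, not the real insurance.csv), and the settings for `n_estimators`, `learning_rate` and `max_depth` are illustrative starting points only:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the insurance data (an assumption, not the real file).
rng = np.random.default_rng(0)
n = 500
age = rng.integers(18, 65, size=n)
bmi = rng.normal(30, 6, size=n)
smoker = rng.integers(0, 2, size=n)
X = np.column_stack([age, bmi, smoker])
y = 250 * age + 300 * bmi + 20000 * smoker + rng.normal(0, 3000, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Baseline: a single, fully grown decision tree.
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Boosting: many shallow trees, each one fit to the errors of the ensemble so far.
boosted = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.05, max_depth=3, random_state=0
).fit(X_train, y_train)

tree_mae = mean_absolute_error(y_test, tree.predict(X_test))
boosted_mae = mean_absolute_error(y_test, boosted.predict(X_test))

print("MAE (single tree):", tree_mae)
print("MAE (gradient boosting):", boosted_mae)
```

If you do install xgboost, its `XGBRegressor` follows the same fit/predict pattern, so the comparison code should carry over with little change.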

Bootstrap aggregating (bagging) would be an easy way to reduce the variance of your estimator. Little additional code is needed if you are already using an sklearn regressor. Below is an example of how you could use a simple bagged estimator to reduce the variance of your model:

from sklearn.ensemble import BaggingRegressor

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0)

regressor = DecisionTreeRegressor()
b_regressor = BaggingRegressor(regressor, n_estimators=100, max_features=3, max_samples=.5)  # bootstrap aggregation ensemble regressor

# Fit + predict using the regular regressor
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# Fit + predict using bootstrap aggregation
b_regressor.fit(X_train, y_train)
y_b_pred = b_regressor.predict(X_test)

df = pd.DataFrame({'Actual Value': y_test, 'Predicted Values': y_pred, 'Bagging Predicted Values': y_b_pred})  

print('Mean Absolute Error (Regular):', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error (Regular):', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error (Regular):', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

print('Mean Absolute Error (Bagging):', metrics.mean_absolute_error(y_test, y_b_pred))
print('Mean Squared Error (Bagging):', metrics.mean_squared_error(y_test, y_b_pred))
print('Root Mean Squared Error (Bagging):', np.sqrt(metrics.mean_squared_error(y_test, y_b_pred)))
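
A single train/test split can give a noisy comparison between the two models, so it may be worth averaging over several folds before drawing conclusions. The following is a minimal sketch using `cross_val_score` on synthetic data (an assumption, not the real insurance.csv); the bagging settings here are left at their defaults apart from `n_estimators` and are not a recommendation:

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the insurance data (an assumption, not the real file).
rng = np.random.default_rng(0)
n = 500
age = rng.integers(18, 65, size=n)
bmi = rng.normal(30, 6, size=n)
smoker = rng.integers(0, 2, size=n)
X = np.column_stack([age, bmi, smoker])
y = 250 * age + 300 * bmi + 20000 * smoker + rng.normal(0, 3000, size=n)

models = {
    "single tree": DecisionTreeRegressor(random_state=0),
    "bagged trees": BaggingRegressor(DecisionTreeRegressor(), n_estimators=100, random_state=0),
}

# Average the MAE over 5 folds for each model instead of relying on one split.
cv_mae = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, scoring="neg_mean_absolute_error", cv=5)
    cv_mae[name] = -scores.mean()  # scores are negated MAE, so flip the sign
    print(f"Mean CV MAE ({name}): {cv_mae[name]:.2f}")
```

The same loop can be reused to compare different `max_features` or `max_samples` values for the bagged model on your own data.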

  • Thank you! I definitely followed the documentation but I will look into StandardScaler and GridSearchCV.
  • @Cassie Please consider voting up this answer. Thank you for choosing it as the accepted answer.
  • Thank you! I'll look into that now!
  • Hmm, I actually ended up getting a higher MAE with that! I wonder if there's something I'm missing.