How to Overcome Overfitting with Bagged Decision Trees

A Simple Step-by-Step Guide to Bagging in Python and Jupyter Notebook

Sometimes supervised learning algorithms do not perform well on the data, and there can be many reasons for it. The model might be too complex: if a model is too complicated, it can fit the training data too closely. Decision trees are prone to overfitting, and overfitting is difficult to tackle. We could collect more data. We could reduce the number of features to make the model less complex. We could add a regularisation term to the cost function that pushes the coefficients toward low values. Or we could combine the predictions of several decision trees.

In this article, we will cover variance and overfitting, decision trees, bagging ensembles, and the Random Forest classifier, and we will implement them step by step in Python.

Variance measures how spread out the model's predictions are. High variance means that the predictions are very scattered and unstable: they change a lot when the model is trained on slightly different data. Low variance means that the predictions are consistent and not very scattered.

An overfitting model is too complex to generalise to the validation dataset; such a model has a high variance. Decision trees are prone to overfitting the data.

Decision trees are among the most common algorithms in machine learning today. A decision tree is a supervised learning algorithm that we can use for both regression and classification problems. It follows a set of if-else conditions that can be visualised and used to classify the data. The process is as follows:

Figure 1: Decision Tree

Random Forest algorithms are “bagged decision trees”. As mentioned before, a decision tree greedily splits each node to minimise the entropy of the leaf nodes, which is part of the reason it overfits. Random Forest attempts to overcome this problem. It bootstraps the data points in the training dataset, so each tree is trained on a different sample drawn with replacement. It also randomly samples the features that each tree is allowed to split on, so it does not always choose the best feature, even though the greedy algorithm would suggest it. If you wish to read more about bootstrapping, please visit the following article.
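To make the idea of bootstrapping concrete, here is a small illustration (not code from the article) of drawing one bootstrap sample with NumPy:

```python
import numpy as np

# Draw one bootstrap sample: pick row indices with replacement, so some rows
# appear several times and others not at all. Each tree in a bagged ensemble
# is trained on a different sample like this.
rng = np.random.default_rng(0)
X = np.arange(10).reshape(10, 1)                # toy feature matrix with 10 rows
indices = rng.integers(0, len(X), size=len(X))  # indices drawn with replacement
X_bootstrap = X[indices]
```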

First, we will build a Decision Tree classifier. This is our base estimator. To do so, we will import the Decision Tree Classifier library:
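The import itself is shown only as a screenshot in the original; assuming scikit-learn, it would be:

```python
# Import the decision tree class from scikit-learn (assumed library)
from sklearn.tree import DecisionTreeClassifier
```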

Then, we will specify the hyperparameters of the base estimator. Our splitting criterion is entropy. With a random state, we can control the randomness of the estimator.

Then, we will pass the base estimator object to the classifier as a hyperparameter:

Figure 4: Passing base estimator object to the classifier
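The code in these screenshots is not reproduced in the text. A minimal sketch of creating the base estimator with the stated hyperparameters, where the random_state value and the variable name tree_clf are assumptions, could look like this:

```python
# Entropy as the splitting criterion comes from the article;
# the random_state value is an assumed choice.
tree_clf = DecisionTreeClassifier(criterion="entropy", random_state=1)
```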

Then, we will fit the decision tree model to the training data to compare prediction accuracy:

Figure 5: Fitting the decision tree model
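A sketch of the fitting step, assuming the training split is stored in variables named X_train and y_train:

```python
# Fit the decision tree to the training data
# (X_train and y_train are assumed variable names)
tree_clf.fit(X_train, y_train)
```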

Finally, we will calculate the predicted accuracy for the training data and for the test data:

Figure 6: Calculating predicted accuracy for the training data
Figure 7: Calculating predicted accuracy for test data
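A sketch of the accuracy calculation, using scikit-learn's accuracy_score and assumed variable names X_test and y_test for the held-out data:

```python
from sklearn.metrics import accuracy_score

# Accuracy on the training data and on the test data
tree_train_acc = accuracy_score(y_train, tree_clf.predict(X_train))
tree_test_acc = accuracy_score(y_test, tree_clf.predict(X_test))
```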

We will print the predicted accuracy for training and test data:

Figure 8: Predicted accuracy for training and test data for Decision Tree Classifier
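Printing the two values could look like this (the exact formatting in the screenshot may differ):

```python
print(f"Decision tree training accuracy: {tree_train_acc:.2%}")
print(f"Decision tree test accuracy: {tree_test_acc:.2%}")
```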

We received an accuracy of 100 % on the training data. The decision tree predicts all the class labels of the training examples correctly. But the accuracy dropped to 73.33 % on the test data. The lower test accuracy indicates a high variance in the model. High variance means that the decision tree is overfitting to the training data.

First, we will implement generic bagging ensembles for classification tasks. Then, we will implement the Random Forest classifier.

We will import the Bagging Classifier library:

Figure 9: Importing Bagging Classifier library
Figure 10: Hyperparameter of the Bagging Classifier
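A sketch of the import and of instantiating the ensemble, reusing tree_clf as the base estimator; the n_estimators and random_state values are assumptions, not values from the article:

```python
from sklearn.ensemble import BaggingClassifier

# n_estimators and random_state are assumed values; in scikit-learn versions
# before 1.2 the keyword is base_estimator rather than estimator.
bag_clf = BaggingClassifier(estimator=tree_clf, n_estimators=100, random_state=1)
```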

As with the decision tree classifier, we will first pass the base estimator object to the Bagging Classifier. Then, we will fit the model to the training data. Finally, we will calculate the prediction accuracy on the training data and on the test data.

We will print the predicted accuracy for training and test data:

Figure 11: Predicted accuracy for training and test data for Bagging Classifier
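Fitting and evaluating the bagging ensemble follows the same pattern as before (the variable names are assumptions):

```python
# Fit on the training data, then measure accuracy on both splits
bag_clf.fit(X_train, y_train)
bag_train_acc = accuracy_score(y_train, bag_clf.predict(X_train))
bag_test_acc = accuracy_score(y_test, bag_clf.predict(X_test))

print(f"Bagging training accuracy: {bag_train_acc:.2%}")
print(f"Bagging test accuracy: {bag_test_acc:.2%}")
```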

We received an accuracy of 94.98 % on the training data. The Bagging Classifier has a lower accuracy than the decision tree on the training data, but a higher accuracy on the test data: the accuracy dropped only to 85.00 %.

We will import the Random Forest Classifier.

Figure 12: Importing Random Forest Classifier Library

We will specify the hyperparameters and initialize the model. The ensemble consists of 100 decision trees, and we will use entropy as the splitting criterion.

Figure 13: Hyperparameter of the Random Forest Classifier
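A sketch of the import and the instantiation: the entropy criterion and the 100 trees come from the article, while the random_state value and the variable name are assumptions:

```python
from sklearn.ensemble import RandomForestClassifier

# 100 trees with entropy as the splitting criterion, as stated in the article;
# random_state is an assumed value
forest_clf = RandomForestClassifier(n_estimators=100, criterion="entropy",
                                    random_state=1)
```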

We will fit the Random Forest classifier to the training data and calculate the prediction accuracy on the training data and on the test data.
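Fitting and evaluating the Random Forest, again with assumed variable names:

```python
# Fit on the training data, then measure accuracy on both splits
forest_clf.fit(X_train, y_train)
forest_train_acc = accuracy_score(y_train, forest_clf.predict(X_train))
forest_test_acc = accuracy_score(y_test, forest_clf.predict(X_test))

print(f"Random Forest training accuracy: {forest_train_acc:.2%}")
print(f"Random Forest test accuracy: {forest_test_acc:.2%}")
```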

We received an accuracy of 87.45 % on the training data, and the accuracy reached 90.00 % on the test data. The test accuracy of the Random Forest is close to that of the Bagging Classifier, even though the Bagging Classifier has a higher accuracy on the training dataset.

We implemented a generic bagging ensemble and the Random Forest classifier in Python. We showed how the accuracy on the test data improved for bagging ensembles compared to the decision tree classifier.

If you like this article, please clap. If you wish to read similar articles from me, please follow me to receive an email whenever I publish a new article.

