Variable importance with random forests
8/28/2023

In this post, I will present how to use random forests in classification, a prediction technique that consists of generating a set of trees (hence, a forest), bootstrapping the features used in each tree. We do this to obtain trees that do not necessarily use the strongest predictors at the beginning. I will test this technique on the LoanDefaults dataset to predict which customers will default on the payment of a loan in a specific month.

This dataset has two interesting features: the number of positive cases is much smaller than the number of negatives, and it requires some preprocessing of the existing features. I will be using the ranger (RANdom forest GEneRator) package, skimr to get a summary of the data, rpart and rpart.plot to generate an alternative decision tree model, BAdatasets to access the dataset, tidymodels for prediction workflow facilities, and forcats for the variable importance plot.

The dependent variable is default, encoded in zero/one format. Let's examine the number of positives and negatives of each class. Note the tweak in geom_bar and scale_y_continuous to present the percentage of cases rather than the total count.

Note that less than 25% of cases are positive. This is a dataset with class imbalance, where the vast majority of observations belong to one category, usually the negative. Datasets with class imbalance are hard to predict, as algorithms can reach high values of accuracy by predicting most observations as negative. In problems like loan default or fraud prediction we are interested in detecting all positive cases, since the cost of a false negative is much higher than that of a false positive.
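As a sketch of the plotting tweak mentioned above: computing the bar height as a share of all cases inside geom_bar, and formatting the axis as percentages. The data frame and its proportions here are simulated stand-ins, not the actual LoanDefaults data.

```r
library(ggplot2)

# simulated stand-in for the LoanDefaults data: roughly 20% positives (assumption)
set.seed(1111)
loan <- data.frame(default = factor(rbinom(1000, 1, 0.2)))

# after_stat(count) / sum(after_stat(count)) turns bar counts into proportions;
# scale_y_continuous(labels = scales::percent) formats the axis as percentages
ggplot(loan, aes(default)) +
  geom_bar(aes(y = after_stat(count) / sum(after_stat(count)))) +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "percent of cases")
```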
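A minimal sketch of fitting a random forest with ranger and plotting variable importance with forcats, under the assumption of a simulated imbalanced dataset (the predictor names and the data-generating process below are illustrative, not from the original LoanDefaults data):

```r
library(ranger)
library(ggplot2)
library(forcats)

# simulated stand-in for LoanDefaults (assumption): one strong predictor,
# one weak predictor, one pure-noise predictor
set.seed(2222)
n <- 2000
loan <- data.frame(
  payment_delay = rpois(n, 2),
  credit_limit  = runif(n, 1e4, 1e6),
  age           = sample(20:70, n, replace = TRUE)
)
p <- plogis(-2.5 + 0.8 * loan$payment_delay - 1e-6 * loan$credit_limit)
loan$default <- factor(rbinom(n, 1, p))

# each tree in the forest chooses its splits among mtry randomly sampled
# features; importance = "permutation" measures the loss of predictive
# accuracy when a feature's values are shuffled
rf <- ranger(default ~ ., data = loan,
             num.trees = 500, importance = "permutation", seed = 3333)

# variable importance plot, with forcats::fct_reorder sorting the bars
imp <- data.frame(variable   = names(rf$variable.importance),
                  importance = rf$variable.importance)
ggplot(imp, aes(fct_reorder(variable, importance), importance)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, title = "Permutation variable importance")
```

Permutation importance is slower than the default impurity-based importance but is less biased toward predictors with many possible split points.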