
I trained an ExtraTreesClassifier (Gini index) with scikit-learn and it suits my needs fairly well. The accuracy is not great, but with 10-fold cross-validation the AUC is 0.95. I would like to use this classifier in my work. I am quite new to ML, so please forgive me if I'm asking something conceptually wrong.

I plotted some ROC curves, and from them it seems there is a specific threshold at which my classifier starts performing well. I'd like to set this value on the fitted classifier, so that every time I call predict, the classifier uses that threshold and I can trust the FP and TP rates.

I also came across this post (scikit .predict() default threshold), where it's stated that a threshold is not a generic concept for classifiers. But since ExtraTreesClassifier has the method predict_proba, and the ROC curve is also defined in terms of thresholds, it seems to me that I should be able to specify one.

I did not find any parameter, nor any class/interface, to do this. How can I set a threshold for a trained ExtraTreesClassifier (or any other classifier) in scikit-learn?

Many Thanks, Colis

2 Answers


This is what I have done:

from sklearn.metrics import roc_curve

model = SomeSklearnModel()
model.fit(X_train, y_train)
predict = model.predict(X_test)          # uses the default cutoff
# Keep only the probability of the positive class for the ROC curve.
predict_probabilities = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, predict_probabilities)

However, I am annoyed that predict chooses a threshold that yields only about 0.4% true positives (with zero false positives). The ROC curve shows a threshold I like better for my problem, where the true positives are approximately 20% (false positives around 4%). I then scan predict_probabilities to find which probability value corresponds to my favourite ROC point (see the sketch below the confusion matrix for one way to do this). In my case this probability is 0.21. Then I create my own prediction array:

import numpy as np
predict_mine = np.where(predict_probabilities > 0.21, 1, 0)

and there you go:

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, predict_mine)

returns what I wanted:

array([[6927,  309],
       [ 621,  121]])
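
If you would rather not scan the probabilities by eye, note that roc_curve also returns the thresholds it evaluated, one per ROC point, so you can pick the cutoff programmatically. A minimal sketch, assuming fpr, tpr, thresholds and predict_probabilities come from the snippet above, and using the example target rates mentioned earlier (TPR ≈ 0.20, FPR ≈ 0.04):

import numpy as np

# Pick the ROC point closest to the operating region you want.
target_tpr, target_fpr = 0.20, 0.04
idx = np.argmin((tpr - target_tpr) ** 2 + (fpr - target_fpr) ** 2)
chosen_threshold = thresholds[idx]   # ~0.21 in the example above

predict_mine = np.where(predict_probabilities >= chosen_threshold, 1, 0)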
  • Keep in mind that the resulting confusion matrix is not a proper measure of out-of-sample performance, because the threshold was chosen on the test data, which leads to data leakage. The proper way is to split the data into train/validate/test: train the classifier on the train data, choose the threshold with the validation data, and evaluate the final model (threshold included) on the test set (see the sketch after these comments).
    – Philipp
    Oct 30, 2019 at 13:51
  • Yes, you are right; I oversimplified the answer.
    – famargar
    Jan 6, 2022 at 11:02
  • "I then scan the predict_probabilities to find what probability value corresponds to my favourite ROC point." Could you elaborate a little more on this step? How do you know which probability value corresponds to the ROC point? May 21, 2023 at 3:40
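
A sketch of the split the first comment describes, under the assumption that X and y hold the full feature matrix and labels; the split sizes and the Youden-J style rule for picking the threshold are purely illustrative choices, not the only option:

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import confusion_matrix, roc_curve
from sklearn.model_selection import train_test_split

# 60/20/20 train/validation/test split (illustrative).
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, stratify=y_tmp)

model = ExtraTreesClassifier().fit(X_train, y_train)

# Choose the threshold on the validation set only.
val_proba = model.predict_proba(X_val)[:, 1]
fpr, tpr, thresholds = roc_curve(y_val, val_proba)
threshold = thresholds[np.argmax(tpr - fpr)]   # or whatever criterion fits your problem

# Evaluate the frozen (model, threshold) pair on the untouched test set.
test_proba = model.predict_proba(X_test)[:, 1]
print(confusion_matrix(y_test, np.where(test_proba >= threshold, 1, 0)))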

It's difficult to provide an exact answer without specific code examples. If you're already doing cross-validation, you might consider specifying AUC as the metric to optimize:

from sklearn.model_selection import KFold, cross_val_score

shuffle = KFold(n_splits=10, shuffle=True)
scores = cross_val_score(classifier, X_train, y_train, cv=shuffle, scoring='roc_auc')
  • Hi White, thanks for your reply. I optimized it by choosing roc_auc and other metrics that were of interest at the time (I also created a custom scorer to optimize LR+). My main doubt is how to choose one of the thresholds shown by a point on the ROC curve as the threshold used when I call predict(). My question is related to (<github.com/scikit-learn/scikit-learn/issues/4813>). I'm not sure this would be available for trees, as they usually do not use probas. But how to set it for other methods, then?
    – Colis
    Jan 26, 2017 at 14:01
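
Regarding the follow-up comment just above: any classifier that exposes predict_proba can be wrapped so that predict applies a fixed cutoff; nothing here is specific to trees. A minimal, hypothetical sketch (ThresholdedClassifier is not a scikit-learn class, just an illustration; recent scikit-learn releases also ship a FixedThresholdClassifier in sklearn.model_selection that does essentially this, if your version is new enough):

import numpy as np

class ThresholdedClassifier:
    """Hypothetical wrapper: apply a fixed probability cutoff to any
    scikit-learn classifier that implements predict_proba."""

    def __init__(self, estimator, threshold=0.5):
        self.estimator = estimator
        self.threshold = threshold

    def fit(self, X, y):
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        # Positive-class probability, thresholded at the chosen cutoff.
        proba = self.estimator.predict_proba(X)[:, 1]
        return np.where(proba >= self.threshold, 1, 0)

Usage would look like clf = ThresholdedClassifier(ExtraTreesClassifier(), threshold=0.21).fit(X_train, y_train), after which clf.predict(X_test) uses the 0.21 cutoff instead of the default.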
