
I trained an ExtraTreesClassifier (Gini index) with scikit-learn and it suits my needs fairly well. The accuracy is not great, but with 10-fold cross-validation the AUC is 0.95. I would like to use this classifier in my work. I am quite new to ML, so please forgive me if I'm asking something conceptually wrong.

I plotted some ROC curves, and from them it seems there is a specific threshold at which my classifier starts performing well. I'd like to set this value on the fitted classifier, so that every time I call predict, the classifier uses that threshold and I can trust the FP and TP rates.

I also came across this post (scikit .predict() default threshold), where it's stated that a threshold is not a generic concept for classifiers. But since ExtraTreesClassifier has the method predict_proba, and the ROC curve is also defined in terms of thresholds, it seems to me that I should be able to specify one.

I did not find any parameter, nor any class/interface, to do this. How can I set a threshold for a trained ExtraTreesClassifier (or any other classifier) in scikit-learn?

Many Thanks, Colis

2 Answers


This is what I have done:

from sklearn.metrics import roc_curve

model = SomeSklearnModel()
model.fit(X_train, y_train)
predict = model.predict(X_test)          # uses the default cutoff
# Keep only the probability of the positive class for the ROC curve.
predict_probabilities = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, predict_probabilities)

However, I am annoyed that predict chooses a threshold that yields only about 0.4% true positives (with zero false positives). The ROC curve shows a threshold I like better for my problem, where the true positives are approximately 20% (false positives around 4%). I then scan predict_probabilities to find which probability value corresponds to my favourite ROC point (see the sketch below the confusion matrix for one way to do this). In my case this probability is 0.21. Then I create my own prediction array:

import numpy as np
predict_mine = np.where(predict_probabilities > 0.21, 1, 0)

and there you go:

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, predict_mine)

returns what I wanted:

array([[6927,  309],
       [ 621,  121]])
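
If you would rather not scan the probabilities by eye, note that roc_curve also returns the thresholds it evaluated, one per ROC point, so you can pick the cutoff programmatically. A minimal sketch, assuming fpr, tpr, thresholds and predict_probabilities come from the snippet above, and using the example target rates mentioned earlier (TPR ≈ 0.20, FPR ≈ 0.04):

import numpy as np

# Pick the ROC point closest to the operating region you want.
target_tpr, target_fpr = 0.20, 0.04
idx = np.argmin((tpr - target_tpr) ** 2 + (fpr - target_fpr) ** 2)
chosen_threshold = thresholds[idx]   # ~0.21 in the example above

predict_mine = np.where(predict_probabilities >= chosen_threshold, 1, 0)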
  • Keep in mind that the resulting confusion matrix is not a proper measure of out-of-sample performance, because the threshold was chosen on the test data, which leads to data leakage. The proper way is to split the data into train/validate/test: train the classifier on the train data, choose the threshold with the validation data, and evaluate the final model (threshold included) on the test set (see the sketch after these comments).
    – Philipp
    Oct 30, 2019 at 13:51
  • Yes, you are right; I oversimplified the answer.
    – famargar
    Jan 6, 2022 at 11:02
  • "I then scan the predict_probabilities to find what probability value corresponds to my favourite ROC point." Could you elaborate a little more on this step? How do you know which probability value corresponds to the ROC point? May 21, 2023 at 3:40
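
A sketch of the split the first comment describes, under the assumption that X and y hold the full feature matrix and labels; the split sizes and the Youden-J style rule for picking the threshold are purely illustrative choices, not the only option:

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import confusion_matrix, roc_curve
from sklearn.model_selection import train_test_split

# 60/20/20 train/validation/test split (illustrative).
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, stratify=y_tmp)

model = ExtraTreesClassifier().fit(X_train, y_train)

# Choose the threshold on the validation set only.
val_proba = model.predict_proba(X_val)[:, 1]
fpr, tpr, thresholds = roc_curve(y_val, val_proba)
threshold = thresholds[np.argmax(tpr - fpr)]   # or whatever criterion fits your problem

# Evaluate the frozen (model, threshold) pair on the untouched test set.
test_proba = model.predict_proba(X_test)[:, 1]
print(confusion_matrix(y_test, np.where(test_proba >= threshold, 1, 0)))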

It's difficult to provide an exact answer without specific code examples. If you're already doing cross-validation, you might consider specifying AUC as the metric to optimize:

from sklearn.model_selection import KFold, cross_val_score

shuffle = KFold(n_splits=10, shuffle=True)
scores = cross_val_score(classifier, X_train, y_train, cv=shuffle, scoring='roc_auc')
  • Hi White, thanks for your reply. I optimized it by choosing roc_auc and other metrics that were of interest at the time (I also created a custom scorer to optimize LR+). My main doubt is how to choose one of the thresholds shown by a point on the ROC curve as the threshold used when I call predict(). My question is related to (<github.com/scikit-learn/scikit-learn/issues/4813>). I'm not sure this would be available for trees, as they usually do not use probas. But how to set it for other methods, then?
    – Colis
    Jan 26, 2017 at 14:01
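
Regarding the follow-up comment just above: any classifier that exposes predict_proba can be wrapped so that predict applies a fixed cutoff; nothing here is specific to trees. A minimal, hypothetical sketch (ThresholdedClassifier is not a scikit-learn class, just an illustration; recent scikit-learn releases also ship a FixedThresholdClassifier in sklearn.model_selection that does essentially this, if your version is new enough):

import numpy as np

class ThresholdedClassifier:
    """Hypothetical wrapper: apply a fixed probability cutoff to any
    scikit-learn classifier that implements predict_proba."""

    def __init__(self, estimator, threshold=0.5):
        self.estimator = estimator
        self.threshold = threshold

    def fit(self, X, y):
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        # Positive-class probability, thresholded at the chosen cutoff.
        proba = self.estimator.predict_proba(X)[:, 1]
        return np.where(proba >= self.threshold, 1, 0)

Usage would look like clf = ThresholdedClassifier(ExtraTreesClassifier(), threshold=0.21).fit(X_train, y_train), after which clf.predict(X_test) uses the 0.21 cutoff instead of the default.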
