How to explain the ROC AUC score and ROC curve?

The ROC AUC score is a popular metric to evaluate the performance of binary classifiers. To compute it, you must measure the area under the ROC curve, which shows the classifier's performance at varying decision thresholds.

This chapter explains how to plot the ROC curve, compute the ROC AUC and interpret it. We will also showcase it using the open-source Evidently Python library.

TL;DR

  • The ROC curve shows the performance of a binary classifier with different decision thresholds. It plots the True Positive rate (TPR) against the False Positive rate (FPR).
  • The ROC AUC score is the area under the ROC curve. It sums up how well a model can produce relative scores to discriminate between positive and negative instances across all classification thresholds.
  • The ROC AUC score ranges from 0 to 1, where 0.5 indicates random guessing, and 1 indicates perfect performance.

What is a ROC curve?

The ROC curve stands for the Receiver Operating Characteristic curve. It is a graphical representation of the performance of a binary classifier at different classification thresholds. 

The curve plots the possible True Positive rates (TPR) against the False Positive rates (FPR).

Here is how the curve can look:

ROC curve chart

Each point on the curve represents a specific decision threshold with a corresponding True Positive rate and False Positive rate.

What is a ROC AUC score?

ROC AUC stands for Receiver Operating Characteristic Area Under the Curve. 

ROC AUC score is a single number that summarizes the classifier's performance across all possible classification thresholds. To get the score, you must measure the area under the ROC curve.

ROC AUC score

The ROC AUC score shows how well the classifier distinguishes positive and negative classes. It can take values from 0 to 1.

A higher ROC AUC indicates better performance. A perfect model would have an AUC of 1, while a random model would have an AUC of 0.5.

To understand the ROC AUC metric, it helps to understand the ROC curve first.

How does the ROC curve work?

Let’s explain it step by step! We will cover:

  • What TPR and FPR are, and how to calculate them
  • What a classification threshold is
  • How to plot a ROC curve

True vs. False Positive rates

The ROC curve plots the True Positive rate (TPR) against the False Positive rate (FPR) at various classification thresholds. You can derive TPR and FPR from a confusion matrix. 

A confusion matrix summarizes all correct and false predictions generated for a specific dataset. Here is an example of a matrix generated for a spam prediction use case:

Confusion matrix example

Source: example matrix from Confusion Matrix chapter.

You can calculate the True Positive and False Positive rates directly from the matrix.

True positive rate and false positive rate

TPR (True Positive rate, also known as recall) shows the share of detected true positives. For example, the share of emails correctly labeled as spam out of all spam emails in the dataset.

To compute the TPR, you must divide the number of True Positives by the total number of objects of the target class – both identified (True Positives) and missed (False Negatives). 

Recall metric formula

In the example confusion matrix above, TPR = 600 / ( 600 + 300) = 0.67. The model successfully detected 67% of all spam emails.

FPR (False Positive rate) shows the share of objects falsely assigned a positive class out of all objects of the negative class. For example, the proportion of legitimate emails falsely labeled as spam.

You can calculate the FPR by dividing the number of False Positives by the total number of objects of the negative class in the dataset.

You can think of the FPR as a "false alarm rate."

False positive rate formula

In our example, FPR = 100 / (100 + 9000) = 0.01. The model falsely flagged 1% of legitimate emails as spam.
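The two formulas are short enough to check by hand, but a small helper makes them reusable. This is a minimal sketch: the function name `rates` is hypothetical, and the counts are the ones from the spam example above.

```python
def rates(tp, fn, fp, tn):
    """Return (TPR, FPR) from confusion matrix counts."""
    tpr = tp / (tp + fn)  # share of actual positives that were detected
    fpr = fp / (fp + tn)  # share of actual negatives that were falsely flagged
    return tpr, fpr

# Counts from the spam example above.
tpr, fpr = rates(tp=600, fn=300, fp=100, tn=9000)
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")  # TPR = 0.67, FPR = 0.01
```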

To create the ROC curve, you need to plot the TPR values against the FPR values at different decision thresholds.

Classification threshold 

You might ask, what do "different" TPR and FPR values mean? Did we not just calculate them once and for all?

In fact, we calculated the values for a given confusion matrix at a given decision threshold. But for a probabilistic classification model, these TPR and FPR values are not set in stone.

You can vary the decision threshold that defines how to convert the model predictions into labels. This, in turn, can change the number of errors the model makes.

Classification decision threshold

A probabilistic classification model returns a number from 0 to 1 for each object. For example, for each email, it predicts how likely this email is spam. For a given email, it can be 0.1, 0.55, 0.99, or any other number.

You then have to decide at which probability you convert this prediction to a label. For instance, you can label all emails with a predicted probability of over 0.5 as spam. Or, you can only apply this decision when the score is 0.8 or higher.

This choice is what sets the classification threshold. 
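As a rough illustration, here is how the same three predicted scores turn into different labels under two thresholds. The helper `to_labels` is a hypothetical name, and treating a score exactly at the threshold as spam is an assumption made for this sketch.

```python
def to_labels(scores, threshold):
    """Label an email as spam (1) when its score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]

scores = [0.1, 0.55, 0.99]  # the example scores from the text
labels_at_05 = to_labels(scores, 0.5)
labels_at_08 = to_labels(scores, 0.8)
print(labels_at_05)  # [0, 1, 1]
print(labels_at_08)  # [0, 0, 1]
```

The email scored 0.55 is flagged at the 0.5 threshold but not at 0.8, which is exactly how a threshold change moves cases between cells of the confusion matrix.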

To better understand the impact of the decision threshold, explore the Classification Threshold chapter in the guide.

As you change the threshold, you will usually get new combinations of errors of different types (and new confusion matrices)!

Confusion matrices with different classification thresholds

When you set the threshold higher, you make the model "more conservative." It assigns the True label when it is "more confident." But as a consequence, you typically lower recall: you detect fewer examples of the target class overall.

When you set the threshold lower, you make the model "less strict." It assigns the True label more often, even when "less confident." Consequently, you increase recall: you will detect more examples of the target class. However, this may also lead to lower precision, as the model may make more False Positive predictions.

TPR and FPR change in the same direction. The higher the recall (TPR), the higher the rate of false positive errors (FPR). The lower the recall, the fewer false alarms the model gives.

In the example above, the recall (TPR) decreases as we set the decision threshold higher:

- 0.5 threshold: 800/(800+100)=0.89
- 0.8 threshold: 600/(600+300)=0.67
- 0.95 threshold: 200/(200+700)=0.22

The FPR also goes down:

- 0.5 threshold: 500/(500+8600)=0.05
- 0.8 threshold: 100/(100+9000)=0.01
- 0.95 threshold: 10/(10+9090)=0.001

Confusion matrices with different classification thresholds

Plotting the ROC curve

Now, let’s get back to the curve!

The ROC curve illustrates this trade-off between the TPR and FPR we just explored. Unless your model is near-perfect, you have to balance the two. As you try to increase the TPR (i.e., correctly identify more positive cases), the FPR may also increase (i.e., you get more false alarms).

For example, the more spam you want to detect, the more legitimate emails you falsely flag as suspicious.

The ROC curve is a visual representation of this choice. Each point on the curve corresponds to a combination of TPR and FPR values at a specific decision threshold.

To create the curve, you should plot the FPR values as the x-axis and the TPR values as the y-axis.

If we continue with the example above, here is how it can look.

Plotting the ROC curve
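To make the plotting step concrete, the sketch below turns the three example confusion matrices into (FPR, TPR) points and adds the two endpoints. The variable names are illustrative, and the resulting list can be passed to any plotting library.

```python
# (TP, FN, FP, TN) at each threshold, taken from the example above.
matrices = {
    0.5: (800, 100, 500, 8600),
    0.8: (600, 300, 100, 9000),
    0.95: (200, 700, 10, 9090),
}

points = [(0.0, 0.0), (1.0, 1.0)]  # the two endpoints every ROC curve has
for threshold, (tp, fn, fp, tn) in matrices.items():
    points.append((fp / (fp + tn), tp / (tp + fn)))  # (x = FPR, y = TPR)

points.sort()  # order by FPR so the points can be joined left to right
for fpr, tpr in points:
    print(f"FPR = {fpr:.3f}, TPR = {tpr:.3f}")
```

With only three thresholds, this gives a coarse outline of the curve. A real ROC curve uses every distinct score as a candidate threshold.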

Since our imaginary model does fairly well, most values are "crowded" to the left.

The left side of the curve corresponds to the more "confident" thresholds: a higher threshold leads to lower recall and fewer false positive errors. The extreme point is when both recall and FPR are 0. In this case, there are no correct detections but also no false ones.

The right side of the curve represents the "less strict" scenarios when the threshold is low. Both recall and False Positive rates are higher, ultimately reaching 100%. If you put the threshold at 0, the model will always predict a positive class: both the recall and the FPR will be 1.

When you increase the threshold, you move left on the curve. If you decrease the threshold, you move to the right.

ROC curve and decision threshold
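A short threshold sweep shows this movement along the curve. The data below is a hypothetical toy set, and `roc_point` is an illustrative helper that labels a case positive when its score is at or above the threshold.

```python
def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when scores >= threshold are labeled positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives

# Hypothetical toy data: 1 = positive class, 0 = negative class.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.2, 0.3, 0.45, 0.5, 0.6, 0.8, 0.95]

sweep = [(t, roc_point(y_true, scores, t)) for t in (0.0, 0.4, 0.7, 1.01)]
for t, (fpr, tpr) in sweep:
    print(f"threshold {t:<4}: FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```

The lowest threshold lands on the top right corner (1, 1), and a threshold above every score lands on the bottom left corner (0, 0).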

Now, let’s take a look at the perfect scenario.

If our model is correct in all the predictions, all the time, it means that the TPR is always 1.0, and FPR is 0. It finds all the cases and never gives false alarms. 

Here is how the ROC curve would look.

Perfect ROC curve

Now, let’s look at the worst-case scenario. 

Let’s say our model is random. In other words, it cannot distinguish between the two classes, and its predictions are no better than chance.

A genuinely random model will predict the positive and negative classes with equal probability. 

ROC curve for a random model

The ROC curve, in this case, will look like a diagonal line connecting points (0,0) and (1,1). For a random classifier, the TPR is equal to the FPR because, at any threshold value, it flags the same share of actual positives as of actual negatives. As the classification threshold changes, the TPR goes up or down in the same proportion as the FPR.

Most real-world models will fall somewhere between the two extremes. The better the model can distinguish between positive and negative classes, the closer the curve is to the top left corner of the graph.
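You can check the diagonal with a quick simulation. This sketch assumes one particular kind of random model, where scores are drawn uniformly and independently of the labels. The sample size and seed are arbitrary choices.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable
n = 20000
y_true = [random.randint(0, 1) for _ in range(n)]
scores = [random.random() for _ in range(n)]  # scores ignore the label entirely

positives = sum(y_true)
negatives = n - positives

gaps = []
for threshold in (0.2, 0.5, 0.8):
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    tpr, fpr = tp / positives, fp / negatives
    gaps.append(abs(tpr - fpr))  # distance from the diagonal at this threshold
    print(f"threshold {threshold}: TPR = {tpr:.3f}, FPR = {fpr:.3f}")
```

At each threshold, the two rates should come out nearly equal, differing only by sampling noise.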

ROC curve for a real-world model

A ROC curve is a two-dimensional reflection of classifier performance across different thresholds. It is convenient to get a single metric to summarize it.

This is what the ROC AUC score does.

A ROC AUC score is a single metric to summarize the performance of a classifier across different thresholds. To compute the score, you must measure the area under the ROC curve.

ROC AUC score

There are different methods to calculate the ROC AUC score, but a common one is the trapezoidal rule. This involves approximating the area under the ROC curve by splitting it into vertical strips between neighboring FPR values. Each strip is a trapezoid whose parallel sides are the TPR values at its two ends. Then, you compute the area by summing the areas of the trapezoids.
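Here is a minimal sketch of the trapezoidal rule in plain Python. The function name `trapezoid_auc` is hypothetical. The last set of points reuses the three thresholds from the earlier spam example, so its result is only a coarse approximation.

```python
def trapezoid_auc(points):
    """Area under a piecewise-linear curve given (FPR, TPR) points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # strip width times average height
    return area

perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
random_model = [(0.0, 0.0), (1.0, 1.0)]
example = [(0.0, 0.0), (10 / 9100, 200 / 900), (100 / 9100, 600 / 900),
           (500 / 9100, 800 / 900), (1.0, 1.0)]

print(trapezoid_auc(perfect))       # 1.0
print(trapezoid_auc(random_model))  # 0.5
print(round(trapezoid_auc(example), 2))  # about 0.93
```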

You can compute ROC AUC in Python using sklearn.
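For instance, a minimal sketch with scikit-learn could look like this. The labels and scores are hypothetical toy data. `roc_curve` and `roc_auc_score` come from `sklearn.metrics`.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical toy data: 1 = positive class, 0 = negative class.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_score = [0.05, 0.2, 0.3, 0.45, 0.5, 0.6, 0.8, 0.95]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points of the ROC curve
auc = roc_auc_score(y_true, y_score)               # area under that curve
print(f"ROC AUC = {auc:.3f}")
```

Note that both functions take the predicted scores, not the thresholded labels. Passing hard 0/1 predictions would collapse the curve to a single threshold.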

If we return to our extreme "perfect" and "random" examples, computing the ROC AUC score is easy. In the perfect scenario, we measure the square area: ROC AUC is 1. In the random scenario, it is precisely half: ROC AUC is 0.5.

ROC AUC for perfect and random models

What is a good ROC AUC?

The ROC AUC score can range from 0 to 1. A score of 0.5 indicates random guessing, and a score of 1 indicates perfect performance.

A score slightly above 0.5 shows that a model has at least "some" (albeit small) predictive power. This is generally inadequate for any real applications.

As a rule of thumb, a ROC AUC score above 0.8 is considered good, while a score above 0.9 is considered great. 

However, the usefulness of the model depends on the specific problem and use case. There is no standard. You should interpret the ROC AUC score in context, together with other classification quality metrics, such as accuracy, precision, or recall.

How to explain ROC AUC?

The intuition behind ROC AUC is that it measures how well a binary classifier can distinguish or separate between the positive and negative classes.

It reflects the probability that the model will correctly rank a randomly chosen positive instance higher than a random negative one.

For example, this is how the model predictions might look, arranged by the predicted output scores.

Model output score

ROC AUC reflects the likelihood that a random positive (red) instance will be located to the right of a random negative (gray) instance.

It shows how well a model can produce good relative scores and generally assign higher probabilities to positive instances over negative ones.

In the above picture, the classifier is not perfect but "directionally correct." It ranks most negative instances lower than positive ones.

The ideal situation is to have all positive instances ranked higher than all negative instances, resulting in an AUC of 1.0.

Model output score
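The ranking interpretation can be computed directly by comparing every positive with every negative. This is a sketch on hypothetical toy data. The name `pairwise_auc` is illustrative, and counting ties as half a win is a common convention assumed here.

```python
def pairwise_auc(y_true, scores):
    """Share of (positive, negative) pairs where the positive scores higher."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half, a common convention
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.2, 0.3, 0.45, 0.5, 0.6, 0.8, 0.95]
auc = pairwise_auc(y_true, scores)
print(f"{auc:.3f}")  # 0.933
```

Here, 14 of the 15 positive-negative pairs are ordered correctly. The one miss is the positive scored 0.5 against the negative scored 0.6.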

It’s worth noting that even a perfect ROC AUC does not mean the predictions are well-calibrated. A well-calibrated classifier produces predicted probabilities that reflect the actual probabilities of the events. Say, if it predicts that an event has a 70% chance of occurring, it should be correct about 70% of the time. ROC AUC is not a calibration measure.

ROC AUC score, instead, shows how well a model can produce relative scores that help discriminate between positive or negative instances.
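One way to see the difference is to distort the scores without changing their order. In this sketch (hypothetical toy data, illustrative helper `auc`), cubing every score changes the predicted "probabilities" considerably but leaves the ranking, and therefore the ROC AUC, untouched.

```python
def auc(y_true, scores):
    """ROC AUC as the share of correctly ordered (positive, negative) pairs."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.2, 0.3, 0.45, 0.5, 0.6, 0.8, 0.95]
squashed = [s ** 3 for s in scores]  # same order, very different "probabilities"

auc_original = auc(y_true, scores)
auc_squashed = auc(y_true, squashed)
print(auc_original, auc_squashed)  # identical values
print(f"mean score: {sum(scores) / len(scores):.2f} vs {sum(squashed) / len(squashed):.2f}")
```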

ROC AUC pros and cons 

Let’s sum up the important properties of the metric.

Here are some advantages of the ROC AUC score. 

  • A single number. ROC AUC reflects the model quality in one number. It is convenient to use a single metric, especially when comparing multiple models.
  • Does not change with the classification threshold. Unlike precision and recall, ROC AUC stays the same. In fact, it sums up the performance across the different classification thresholds. It is a valuable "overall" quality measure, whereas precision and recall provide a quality "snapshot" at a given decision threshold.
  • It is a suitable evaluation metric for imbalanced data. ROC AUC measures the model's ability to discriminate between the positive and negative classes, regardless of their relative frequencies in the dataset.
  • More tolerant to the drift in class balance. The ROC AUC generally remains more stable if the distribution of classes changes. This often happens in production use, for example, when fraud rates vary month-by-month. If they change significantly, the earlier chosen decision threshold might become inadequate. For example, if fraud becomes more prevalent, the recall of the fraud detection model might drop, as this metric uses the absolute number of actual fraud cases in the denominator. However, ROC AUC might remain stable, indicating that the model can still differentiate between the two classes despite the changes in their relative frequencies.
  • Scale-invariant. ROC AUC measures how well predictions are ranked rather than their absolute values. This way, it helps compare the quality of models that might output "different ranges" of predicted probabilities. It is typically relevant when you experiment with different models during the model training stage.
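The tolerance to class balance can be illustrated with an idealized experiment: keep the score distribution of each class fixed and only change how many negatives there are. The scores and the helper names below are hypothetical. In real drift, the score distributions themselves may also move, which this sketch does not cover.

```python
def auc(pos, neg):
    """ROC AUC as the share of correctly ordered (positive, negative) pairs."""
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

pos_scores = [0.5, 0.8, 0.95]
neg_scores = [0.05, 0.2, 0.3, 0.45, 0.6]

balanced_auc = auc(pos_scores, neg_scores)         # 3 positives vs 5 negatives
imbalanced_auc = auc(pos_scores, neg_scores * 20)  # 3 positives vs 100 negatives
print(f"{balanced_auc:.3f} {imbalanced_auc:.3f}")  # 0.933 0.933
```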

The metric also has a few downsides. As usual, a lot depends on the context!

  • ROC AUC is not intuitive. This metric can be hard to explain to business stakeholders and does not have an immediately interpretable meaning.
  • It does not consider the cost of errors. ROC AUC does not account for different types of errors and their consequences. In many scenarios, false negatives can be more costly than false positives, or vice versa. In this case, working to balance precision and recall and setting the appropriate classification threshold to minimize a certain type of error is often a more suitable approach. ROC AUC is not useful for this type of optimization.
  • It can be misleading if the class imbalance is severe. When the positive class is very small, ROC AUC can give a false impression of high quality. Imagine that a classifier predicts almost all instances as negative. TPR and FPR will be close to 0 because there are few positive predictions. As a result, the ROC curve will appear to "hug" the top left corner of the plot, giving the impression that the classifier is performing well and definitely better than random. However, though it correctly classifies most of the negative instances, it may miss most of the positives, which is likely more important for the model performance. In this case, it may be more appropriate to look at the precision-recall curve and rely on metrics like precision, recall, or F1-score to evaluate ML model quality.

Want to see an example of using ROC AUC? We prepared a tutorial on the employee churn prediction problem "What is your model hiding?". You will train two classification models with similar ROC AUC and explore how to compare them.

When to use ROC AUC

Considering all the above, ROC AUC is useful, but as usual, not a perfect metric. 

  • During model training, it helps compare multiple ML models against each other.
  • ROC AUC is particularly useful when the goal is to rank predictions in order of their confidence level rather than produce well-calibrated probability estimates.
  • Both in training and production evaluation, ROC AUC helps provide a more complete picture of the model performance, giving a single metric that sums up the quality across different thresholds. 

However, there are limitations:

  • ROC AUC is less useful when you care about different costs of error and want to find the optimal threshold to optimize for the cost of a specific error.
  • It can be misleading when the data is heavily imbalanced (which, coincidentally, is often the case when you ultimately care about different costs of errors).
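A bit of arithmetic shows why imbalance matters. The sketch below takes one fixed point on a ROC curve and computes the precision it implies under two class balances. The numbers (TPR of 0.9, FPR of 0.05, and both class counts) are hypothetical.

```python
def precision(tpr, fpr, n_pos, n_neg):
    """Precision implied by a (TPR, FPR) point and the class counts."""
    tp = tpr * n_pos  # expected true positives
    fp = fpr * n_neg  # expected false positives
    return tp / (tp + fp)

# The same ROC point under two hypothetical class balances.
balanced = precision(0.9, 0.05, n_pos=5000, n_neg=5000)
rare = precision(0.9, 0.05, n_pos=100, n_neg=9900)
print(f"balanced: {balanced:.2f}, rare positives: {rare:.2f}")  # balanced: 0.95, rare positives: 0.15
```

The ROC point, and so the curve's shape, is the same in both cases. Yet with rare positives, most flagged cases are false alarms.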

ROC AUC in ML monitoring 

You can use ROC AUC during production model monitoring as long as you have the true labels to compute it.

However, a high ROC AUC score does not communicate all relevant aspects of the model quality. The score evaluates the degree of separability and does not consider the asymmetric costs of false positives and negatives. It captures, in one number, the quality of the model across all possible thresholds.

In many real-world scenarios, this overall performance is not relevant: you need to consider the costs of error and define a specific threshold to make automated decisions. Therefore, the ROC AUC score should be used with other metrics, such as precision and recall. You might also want to monitor precision and recall for specific important segments in your data (such as users in specific locations, premium users, etc.) to capture differences in performance.
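Segment-level monitoring can be as simple as grouping the labeled predictions before computing the metrics. This is a minimal sketch. The segment names and records are hypothetical, and a production setup would read them from your logging or data store.

```python
from collections import defaultdict

# Hypothetical records: (segment, true label, predicted label).
records = [
    ("premium", 1, 1), ("premium", 1, 1), ("premium", 0, 1), ("premium", 0, 0),
    ("standard", 1, 0), ("standard", 1, 1), ("standard", 0, 0), ("standard", 0, 0),
]

counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
for segment, y_true, y_pred in records:
    if y_pred == 1 and y_true == 1:
        counts[segment]["tp"] += 1
    elif y_pred == 1 and y_true == 0:
        counts[segment]["fp"] += 1
    elif y_pred == 0 and y_true == 1:
        counts[segment]["fn"] += 1

metrics = {}
for segment, c in counts.items():
    # Guard against empty denominators in small segments.
    precision = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else float("nan")
    recall = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else float("nan")
    metrics[segment] = (precision, recall)
    print(f"{segment}: precision = {precision:.2f}, recall = {recall:.2f}")
```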

However, having ROC AUC as an additional metric might still be informative. For example, in cases where the shifting balance of classes might negatively impact recall, tracking ROC AUC might communicate whether the model itself remains reasonable.

ROC curve in Python

ROC curve in Evidently

To quickly calculate and visualize the ROC curve and ROC AUC score, as well as other metrics and plots to evaluate the quality of a classification model, you can use Evidently, an open-source Python library to evaluate, test and monitor ML models in production.

You will need to prepare your dataset that includes predicted values for each class and true labels and pass it to the tool. You will instantly get an interactive report that includes ROC AUC, accuracy, precision, recall, F1-score metrics as well as other visualizations. You can also integrate these model quality checks into your production pipelines.
