
P4 metric: a new way to evaluate binary classifiers

Introduction

Binary classifiers accompany us on a daily basis. Tests that detect disease give us a positive/negative answer, spam filters say spam/not spam, and smartphones that authenticate us by a face scan or fingerprint make a known/unknown decision. The question of how to evaluate the efficiency of such a classifier does not seem extremely complicated: just choose the one that predicts the most cases correctly. As many of us have already realized, the actual evaluation of a binary classifier requires somewhat more sophisticated means. But we'll get to that in a moment.

A short story about choosing a classifier

The most straightforward approach is to calculate the ratio of correctly classified samples to all the considered samples. That is what we call accuracy: $$\mathrm{ACC} = \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}$$ In the above equation, \(\mathrm{TP}\) means true positives, \(\mathrm{TN}\) - true negatives, \(\mathrm{FP}\) - false positives, \(\mathrm{FN}\) - false negatives.
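To make the definition concrete, here is a minimal sketch in Python (the function name and example counts are our own illustration, not part of the original text):

```python
# A minimal sketch: accuracy computed from confusion-matrix counts.
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """ACC = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Illustrative counts: 90 + 900 correct out of 1100 samples.
print(accuracy(tp=90, tn=900, fp=100, fn=10))  # 0.9
```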

Now suppose we have a population \(\mathrm{(I)}\) of \(10000\) elements, in which only 0.05% of the elements are positive. We consider two classifiers, \(\mathrm{A}\) and \(\mathrm{B}\). Classifier \(\mathrm{A}\) correctly recognizes 90% of the positive elements and 90% of the negative elements. Classifier \(\mathrm{B}\) gives the answer "negative" every time, regardless of the classified element. The accuracy of classifier \(\mathrm{A}\) is \(0.9\), while for classifier \(\mathrm{B}\) it is \(0.9995\). So if we were guided by accuracy alone when choosing a classifier, we would run the risk of choosing one that common sense would rather dictate we reject. And this is where two quantities come to our rescue: precision and recall, defined as: $$\mathrm{PREC} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$$ $$\mathrm{REC} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$$ When \(\mathrm{TP}\) goes to zero, both of them go to zero; when \(\mathrm{FP}\) or \(\mathrm{FN}\) grows, \(\mathrm{PREC}\) or \(\mathrm{REC}\), respectively, decreases. Combining their values seems to be a great idea, and this is the idea behind the F1 metric (see: The truth of the F-measure [1]). It is defined as the harmonic mean of precision and recall: $$\mathrm{F}_1 = \frac{2}{\frac{1}{\mathrm{PREC}} + \frac{1}{\mathrm{REC}}}$$ For classifier \(\mathrm{A}\) it gives a score of \(0.0079\) and for classifier \(\mathrm{B}\): \(0.0\). It looks like we are in control of the situation now... Unfortunately, we are far from the truth.
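The numbers above can be checked with a short sketch; since 0.9 of 5 positives is not an integer, we round the fractional counts (so classifier \(\mathrm{A}\) gets \(\mathrm{TP}=4\), \(\mathrm{FN}=1\), \(\mathrm{FP}=1000\)), which reproduces the quoted scores up to rounding:

```python
# F1 written as 2*TP / (2*TP + FP + FN): algebraically the harmonic mean
# of precision and recall, but well defined even when TP = 0.
def f1(tp: int, fp: int, fn: int) -> float:
    return 2 * tp / (2 * tp + fp + fn)

# Population (I): 5 positives among 10000 elements, counts rounded to integers.
print(f1(tp=4, fp=1000, fn=1))  # ~0.0079 (classifier A)
print(f1(tp=0, fp=0, fn=5))     # 0.0     (classifier B, always "negative")
```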

Let us consider another example, created by swapping the label names in the previous one, from negatives to positives and vice versa. So now the population \(\mathrm{(II)}\) consists of \(10000\) elements, in which 99.95% of the elements are positive. Classifier \(\mathrm{A}'\) correctly recognizes 90% of the positive elements and 90% of the negative elements. Classifier \(\mathrm{B}'\) gives the answer "positive" every time, regardless of the classified element. And now... our "cheating" classifier \(\mathrm{B}'\) receives the score \(\mathrm{F}_1 = 0.9997\), while the rather solid classifier \(\mathrm{A}'\) only \(0.9473\). Our "golden" solution suffers from an asymmetry problem: F1 is not invariant under swapping the positive and negative labels.
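Relabeling the dataset maps \(\mathrm{TP} \leftrightarrow \mathrm{TN}\) and \(\mathrm{FP} \leftrightarrow \mathrm{FN}\), so the asymmetry can be seen directly in code (a sketch with the same rounded counts as above):

```python
def f1(tp: int, fp: int, fn: int) -> float:
    return 2 * tp / (2 * tp + fp + fn)

# Classifier A on population (I)...
tp, fp, fn, tn = 4, 1000, 1, 8995
print(f1(tp, fp, fn))  # ~0.0079
# ...and the same decisions after swapping labels (classifier A' on (II)).
print(f1(tn, fn, fp))  # ~0.9473
```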

So? What is the solution? Some researchers point to the Youden index (Index for rating diagnostic tests [2]). Others prefer markedness (Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation [3]). None of them is free of problems; one can test them using the interactive confusion matrix in a further section.
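For reference, both quantities have simple closed forms (a sketch using the standard definitions: the Youden index is sensitivity + specificity − 1, markedness is precision + NPV − 1; both range over \([-1,1]\)):

```python
# Youden index: J = sensitivity + specificity - 1.
def youden(tp: int, fp: int, fn: int, tn: int) -> float:
    return tp / (tp + fn) + tn / (tn + fp) - 1

# Markedness: MK = precision (PPV) + NPV - 1.
def markedness(tp: int, fp: int, fn: int, tn: int) -> float:
    return tp / (tp + fp) + tn / (tn + fn) - 1
```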

The last metric we want to mention in this section is \(\mathrm{MCC}\), the Matthews correlation coefficient (see: B. W. Matthews' article [4]). It seems to be immune to the effects mentioned above. It is included in the interactive simulation in a further section.
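A sketch of \(\mathrm{MCC}\) in terms of the confusion-matrix counts (the zero-denominator guard is our own choice):

```python
import math

# Matthews correlation coefficient; its value lies in [-1, 1].
def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den if den else 0.0
```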

P4 and four conditional probabilities

For each classifier applied to the dataset, we can define four essential conditional probabilities: $$\begin{eqnarray} P_A & = & P(+|C+) \\ P_B & = & P(C+|+) \\ P_C & = & P(C-|-) \\ P_D & = & P(-|C-) \end{eqnarray}$$ where \(C+\) and \(C-\) denote the classifier's decision ("classified as positive"/"classified as negative") and \(+\) and \(-\) the true class. Their meaning is as follows: \(P_A\) is the precision (positive predictive value), \(P_B\) the recall (sensitivity), \(P_C\) the specificity, and \(P_D\) the negative predictive value.
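In terms of the confusion-matrix counts, the four probabilities can be estimated as in the following sketch (the function names are ours):

```python
# C+ / C- denote the classifier's decision, + / - the true class.
def p_a(tp, fp): return tp / (tp + fp)  # P(+|C+): precision / PPV
def p_b(tp, fn): return tp / (tp + fn)  # P(C+|+): recall / sensitivity
def p_c(tn, fp): return tn / (tn + fp)  # P(C-|-): specificity
def p_d(tn, fn): return tn / (tn + fn)  # P(-|C-): negative predictive value
```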

Based on these probabilities, the P4 metric is defined as the harmonic mean of all four:

$$\mathrm{P}_4 = \frac{4}{\frac{1}{P_A} + \frac{1}{P_B} + \frac{1}{P_C} + \frac{1}{P_D}}$$ which, expressed in terms of the confusion matrix entries, gives: $$ \mathrm{P}_4 = \frac{4\cdot\mathrm{TP}\cdot\mathrm{TN}}{4\cdot\mathrm{TP}\cdot\mathrm{TN} + (\mathrm{TP} + \mathrm{TN}) \cdot (\mathrm{FP} + \mathrm{FN})} $$
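A sketch of the closed form, checked against two extreme classifiers on population \(\mathrm{(I)}\):

```python
# P4 in closed form; equals the harmonic mean of the four probabilities
# whenever all of them are non-zero.
def p4(tp: int, fp: int, fn: int, tn: int) -> float:
    num = 4 * tp * tn
    return num / (num + (tp + tn) * (fp + fn))

print(p4(tp=5, fp=0, fn=0, tn=9995))  # 1.0 (perfect classifier)
print(p4(tp=0, fp=0, fn=5, tn=9995))  # 0.0 (classifier B, always "negative")
```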

The metric defined this way takes values in the range \([0,1]\) (as opposed to \(\mathrm{MCC}\), the Youden index and markedness, which range over \([-1,1]\)). It is defined in a manner similar to \(\mathrm{F}_1\); however, it covers all four probabilities instead of just the two covered by \(\mathrm{F}_1\). It also does not change its value when the labeling of the dataset is reversed.
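The label-swap invariance follows from the symmetry of the closed form under \(\mathrm{TP} \leftrightarrow \mathrm{TN}\), \(\mathrm{FP} \leftrightarrow \mathrm{FN}\), which a one-line check confirms:

```python
def p4(tp, fp, fn, tn):
    return 4 * tp * tn / (4 * tp * tn + (tp + tn) * (fp + fn))

# Classifier A on population (I) vs. classifier A' on the relabeled (II).
print(p4(4, 1000, 1, 8995) == p4(8995, 1, 1000, 4))  # True
```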

The details of the ideas behind the P4 metric itself can be found in the article Extending F1 metric, probabilistic approach [5].

Interactive simulation

The interactive matrix below allows one to experiment with different scenarios and see how the individual probabilities and metrics change with \(\mathrm{TP}\), \(\mathrm{FP}\), \(\mathrm{TN}\) and \(\mathrm{FN}\). Probabilities and metrics taking values in the interval \([0,1]\) are marked in blue, while those varying in \([-1,1]\) are marked in yellow.
[Interactive confusion matrix: set \(\mathrm{TP}\), \(\mathrm{FP}\), \(\mathrm{FN}\) and \(\mathrm{TN}\) for a population and observe the probabilities \(P(+|C+)\), \(P(C+|+)\), \(P(C-|-)\), \(P(-|C-)\) together with the metrics P4, F1, MCC, Youden index, markedness and accuracy.]
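For readers of the text-only version, here is a small offline analogue of the interactive matrix (a sketch; the formatting choices are ours):

```python
import math

# Print the four probabilities and the metrics discussed above
# for a given confusion matrix.
def report(tp: int, fp: int, fn: int, tn: int) -> None:
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    spec, npv = tn / (tn + fp), tn / (tn + fn)
    p4 = 4 * tp * tn / (4 * tp * tn + (tp + tn) * (fp + fn))
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    acc = (tp + tn) / (tp + fp + fn + tn)
    print(f"P(+|C+)={prec:.4f} P(C+|+)={rec:.4f} "
          f"P(C-|-)={spec:.4f} P(-|C-)={npv:.4f}")
    print(f"P4={p4:.4f} F1={f1:.4f} MCC={mcc:.4f} "
          f"Youden={rec + spec - 1:.4f} Markedness={prec + npv - 1:.4f} "
          f"ACC={acc:.4f}")

report(tp=4, fp=1000, fn=1, tn=8995)  # classifier A on population (I)
```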

Interesting cases

Below we have gathered four interesting cases of the confusion matrix. In each of them, one conditional probability is close to 0, while the remaining ones are close to 1; one such case is sketched after this paragraph.
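Such a case can be constructed by hand (an illustrative choice of counts, not taken from the original figures): recall, specificity and NPV are all close to 1, precision is close to 0, and P4 flags the problem with a low score.

```python
tp, fp, fn, tn = 10, 1000, 0, 1_000_000
prec = tp / (tp + fp)  # ~0.0099
rec = tp / (tp + fn)   # 1.0
spec = tn / (tn + fp)  # ~0.999
npv = tn / (tn + fn)   # 1.0
p4 = 4 / (1 / prec + 1 / rec + 1 / spec + 1 / npv)
print(round(p4, 3))    # ~0.038
```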

References

[1] Sasaki, Y., The truth of the F-measure, Teach Tutor Mater (2007)
[2] Youden, W. J., Index for rating diagnostic tests, Cancer (1950)
[3] Powers, D. M. W., Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation, arXiv:2010.16061 (2010)
[4] Matthews, B. W., Comparison of the predicted and observed secondary structure of T4 phage lysozyme, Biochimica et Biophysica Acta (1975)
[5] Sitarz, M., Extending F1 metric, probabilistic approach, arXiv:2210.11997 (2022)