# P_{4} metric: a new way to evaluate binary classifiers

## Introduction

Binary classifiers accompany us on a daily basis. Tests that detect a disease answer positive/negative, spam filters say spam/not spam, and smartphones that authenticate us by a face scan or a fingerprint make a known/unknown decision. The question of how to evaluate the efficiency of such a classifier does not seem extremely complicated: just choose the one that predicts the most cases correctly. As many of us have already realized, though, the actual evaluation of a binary classifier requires somewhat more sophisticated means. But we'll talk about that in a moment.

## A short story about choosing a classifier

The most straightforward approach is to calculate the ratio of correctly classified samples to all the considered samples. That is what we call *accuracy*:
$$\mathrm{ACC} = \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}$$
In the above equation, \(\mathrm{TP}\) means *true positives*, \(\mathrm{TN}\) - *true negatives*, \(\mathrm{FP}\) - *false positives*, \(\mathrm{FN}\) - *false negatives*.
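As a quick sanity check, *accuracy* can be computed directly from the four confusion-matrix counts. A minimal sketch (the function name is ours):

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Ratio of correctly classified samples to all samples."""
    return (tp + tn) / (tp + tn + fp + fn)

# 90 true positives and 810 true negatives out of 1000 samples:
print(accuracy(tp=90, tn=810, fp=90, fn=10))  # 0.9
```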

Now suppose we have a population \(\mathrm{(I)}\) of \(10000\) elements in which only 0.05% of the elements are positive. We consider two classifiers,
\(\mathrm{A}\) and \(\mathrm{B}\). Classifier \(\mathrm{A}\) correctly recognizes 90% of the positive elements and 90% of the negative elements. Classifier
\(\mathrm{B}\) gives the answer "negative" every time, regardless of the classified element. *Accuracy* for
classifier \(\mathrm{A}\) is \(0.9\), while for classifier \(\mathrm{B}\) it is \(0.9995\). So if we were guided by *accuracy* alone when choosing a classifier, we would run the risk of choosing a
classifier that common sense would rather dictate to reject. And this is where two quantities come to our rescue: *precision* and *recall*, defined as:
$$\mathrm{PREC} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$$
$$\mathrm{REC} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$$
When \(\mathrm{TP}\) goes to zero, both of them go to zero as well.
When \(\mathrm{FP}\) or \(\mathrm{FN}\) grows, \(\mathrm{PREC}\) or \(\mathrm{REC}\), respectively, decreases. Combining their values seems to be a great idea.
And this is the idea behind the \(\mathrm{F}_1\) metric (see [1], *The truth of the F-measure*). It is defined as follows:
$$\mathrm{F}_1 = \frac{2}{\frac{1}{\mathrm{PREC}} + \frac{1}{\mathrm{REC}}}$$
For classifier \(\mathrm{A}\) it gives the score \(0.0079\), and for classifier \(\mathrm{B}\): \(0.0\). It looks like we are in control of the situation now... Unfortunately, we are far from the truth.

Let us consider another example, created by swapping the label names in the previous example, from negatives to positives and vice versa. So now the population \(\mathrm{(II)}\) consists of \(10000\) elements in which 99.95% of the elements are positive. Classifier \(\mathrm{A}'\) correctly recognizes 90% of the positive elements and 90% of the negative elements. Classifier \(\mathrm{B}'\) gives the answer "positive" every time, regardless of the classified element. And now... our "cheating" classifier \(\mathrm{B}'\) receives the score \(\mathrm{F}_1 = 0.9997\), while the rather solid classifier \(\mathrm{A}'\) only \(0.9473\). Our "golden" solution suffers from an asymmetry problem.
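The asymmetry is easy to reproduce numerically. The sketch below scores the always-"positive" classifier \(\mathrm{B}'\) on population \(\mathrm{(II)}\); the counts follow from the scenario above (9995 positives, 5 negatives):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0 by convention when tp == 0."""
    if tp == 0:
        return 0.0
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

# Population (II): 9995 positives, 5 negatives.
# B' answers "positive" every time: tp=9995, fp=5, fn=0.
print(round(f1_score(tp=9995, fp=5, fn=0), 4))  # 0.9997
```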

So what is the solution? Some researchers point to the *Youden index* (see [2], *Index for rating diagnostic tests*).
Others prefer *markedness* (see [3], *Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation*).
Neither of them is free of problems; one can test them using the interactive confusion matrix in a further section.
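Both metrics are simple combinations of conditional probabilities: the Youden index is sensitivity plus specificity minus one, and markedness is precision plus negative predictive value minus one. A minimal sketch (function names are ours):

```python
def youden_index(tp: int, tn: int, fp: int, fn: int) -> float:
    """Sensitivity + specificity - 1; takes values in [-1, 1]."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1

def markedness(tp: int, tn: int, fp: int, fn: int) -> float:
    """Precision + NPV - 1; takes values in [-1, 1]."""
    precision = tp / (tp + fp)
    npv = tn / (tn + fn)
    return precision + npv - 1

# A classifier recognizing 90% of each class (synthetic counts):
print(round(youden_index(tp=90, tn=900, fp=100, fn=10), 2))  # 0.8
```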

The last metric we want to mention in this section is \(\mathrm{MCC}\), the *Matthews correlation coefficient* (see [4], the B.W. Matthews article).
It seems to be immune to the effects mentioned above. It is included in the interactive simulation in a further section.
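For completeness, \(\mathrm{MCC}\) can also be computed straight from the confusion matrix; a minimal sketch (returning 0 by convention when the denominator vanishes):

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; takes values in [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# The "cheating" always-negative classifier on population (I) scores 0:
print(mcc(tp=0, tn=9995, fp=0, fn=5))  # 0.0
```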

## P_{4} and four conditional probabilities

For each classifier applied to the dataset, we can define the four essential probabilities:
$$\begin{eqnarray}
P_A & = & P(+|C+) \\
P_B & = & P(C+|+) \\
P_C & = & P(C-|-) \\
P_D & = & P(-|C-) \\
\end{eqnarray}
$$
where their meaning is as follows:
- \(P(+|C+)\) is the conditional probability of a sample being positive, provided it is classified as positive,
- \(P(C+|+)\) is the conditional probability of a sample being classified as positive, provided it is a positive sample,
- \(P(C-|-)\) is the conditional probability of a sample being classified as negative, provided it is a negative sample,
- \(P(-|C-)\) is the conditional probability of a sample being negative, provided it is classified as negative.
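Each of these probabilities can be estimated from the confusion matrix: \(P_A\) is precision, \(P_B\) recall (sensitivity), \(P_C\) specificity, and \(P_D\) the negative predictive value. A minimal sketch (the function name is ours):

```python
def four_probabilities(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Estimate P_A..P_D from confusion-matrix counts."""
    return {
        "P_A": tp / (tp + fp),  # P(+|C+), precision
        "P_B": tp / (tp + fn),  # P(C+|+), recall / sensitivity
        "P_C": tn / (tn + fp),  # P(C-|-), specificity
        "P_D": tn / (tn + fn),  # P(-|C-), negative predictive value
    }

probs = four_probabilities(tp=90, tn=900, fp=100, fn=10)
print(probs["P_B"])  # 0.9
```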

Based on these probabilities, the P_{4} metric is defined as their harmonic mean:

$$\mathrm{P}_4 = \frac{4}{\frac{1}{P_A} + \frac{1}{P_B} + \frac{1}{P_C} + \frac{1}{P_D}}$$

The metric defined this way belongs to the range \([0,1]\) (as opposed to \(\mathrm{MCC}\), the *Youden index* and *markedness*). It is defined in a
manner similar to \(\mathrm{F}_1\); however, it covers all four probabilities instead of only two of them, as \(\mathrm{F}_1\) does. It also does not change its
value when the labeling of the dataset is reversed.
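Substituting the confusion-matrix estimates of the four probabilities into the harmonic mean and simplifying gives the closed form \(\mathrm{P}_4 = 4\,\mathrm{TP}\cdot\mathrm{TN} \,/\, \big(4\,\mathrm{TP}\cdot\mathrm{TN} + (\mathrm{TP}+\mathrm{TN})(\mathrm{FP}+\mathrm{FN})\big)\). A minimal sketch (the function name is ours) that also makes the label-swap symmetry visible, since exchanging \(\mathrm{TP}\leftrightarrow\mathrm{TN}\) and \(\mathrm{FP}\leftrightarrow\mathrm{FN}\) leaves the formula unchanged:

```python
def p4_metric(tp: int, tn: int, fp: int, fn: int) -> float:
    """Harmonic mean of P_A, P_B, P_C, P_D in closed form."""
    num = 4 * tp * tn
    return num / (num + (tp + tn) * (fp + fn))

# Reversing the labels swaps tp<->tn and fp<->fn;
# both calls print the same value:
print(p4_metric(tp=90, tn=900, fp=100, fn=10))
print(p4_metric(tp=900, tn=90, fp=10, fn=100))
```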

The details of the ideas behind the P_{4} metric itself can be found in the article [5], *Extending F1 metric, probabilistic approach*.

## Interactive simulation

The interactive matrix below allows one to experiment with different scenarios and see how the individual probabilities and metrics change with \(\mathrm{TP}/\mathrm{FP}/\mathrm{TN}/\mathrm{FN}\). The levels of the probabilities and metrics taking values in the interval \([0,1]\) are marked with blue, while those varying in \([-1, 1]\) are marked with yellow.

## Interesting cases

## References

[1] Y. Sasaki, *The truth of the F-measure*, Teach Tutor Mater (2007)

[2] W.J. Youden, *Index for rating diagnostic tests*, Cancer (1950)

[3] D.M.W. Powers, *Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation*, arXiv:2010.16061 (2020)

[4] B.W. Matthews, *Comparison of the predicted and observed secondary structure of T4 phage lysozyme*, Biochimica et Biophysica Acta (1975)

[5] M. Sitarz, *Extending F1 metric, probabilistic approach*, arXiv:2210.11997 (2022)