
P4 metric: a new way to evaluate binary classifiers

Introduction

Binary classifiers accompany us on a daily basis. Tests that detect disease give us a positive/negative answer, spam filters say spam/not spam, and smartphones that authenticate us by a face scan or fingerprint make a known/unknown decision. The question of how to evaluate the efficiency of such a classifier does not seem extremely complicated: just choose the one that predicts the most cases correctly. As many of us have already realized, the actual evaluation of a binary classifier requires somewhat more sophisticated means. But we'll get to that in a moment.

A short story about choosing a classifier

The most straightforward approach is to calculate the ratio of correctly classified samples to all the considered samples. That is what we call accuracy: $$\mathrm{ACC} = \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}$$ In the above equation, \(\mathrm{TP}\) means true positives, \(\mathrm{TN}\) - true negatives, \(\mathrm{FP}\) - false positives, \(\mathrm{FN}\) - false negatives.
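To make the definition concrete, here is a minimal sketch in Python (the function name and example counts are our own illustration, not part of the original text):

```python
# A minimal sketch: accuracy computed from confusion-matrix counts.
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """ACC = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Illustrative counts: 90 + 900 correct out of 1100 samples.
print(accuracy(tp=90, tn=900, fp=100, fn=10))  # 0.9
```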

Now suppose we have a population \(\mathrm{(I)}\) of \(10000\) elements, in which only 0.05% of the elements are positive. We consider two classifiers, \(\mathrm{A}\) and \(\mathrm{B}\). Classifier \(\mathrm{A}\) correctly recognizes 90% of the positive elements and 90% of the negative elements. Classifier \(\mathrm{B}\) gives the answer "negative" every time, regardless of the classified element. The accuracy of classifier \(\mathrm{A}\) is \(0.9\), while for classifier \(\mathrm{B}\) it is \(0.9995\). So if we were guided by accuracy alone when choosing a classifier, we would run the risk of choosing one that common sense would rather dictate we reject. And this is where two quantities come to our rescue: precision and recall, defined as: $$\mathrm{PREC} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$$ $$\mathrm{REC} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$$ When \(\mathrm{TP}\) goes to zero, both of them go to zero; when \(\mathrm{FP}\) or \(\mathrm{FN}\) grows, \(\mathrm{PREC}\) or \(\mathrm{REC}\), respectively, decreases. Combining their values seems to be a great idea, and this is the idea behind the F1 metric (see: The truth of the F-measure [1]). It is defined as the harmonic mean of precision and recall: $$\mathrm{F}_1 = \frac{2}{\frac{1}{\mathrm{PREC}} + \frac{1}{\mathrm{REC}}}$$ For classifier \(\mathrm{A}\) it gives a score of \(0.0079\) and for classifier \(\mathrm{B}\): \(0.0\). It looks like we are in control of the situation now... Unfortunately, we are far from the truth.
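The numbers above can be checked with a short sketch; since 0.9 of 5 positives is not an integer, we round the fractional counts (so classifier \(\mathrm{A}\) gets \(\mathrm{TP}=4\), \(\mathrm{FN}=1\), \(\mathrm{FP}=1000\)), which reproduces the quoted scores up to rounding:

```python
# F1 written as 2*TP / (2*TP + FP + FN): algebraically the harmonic mean
# of precision and recall, but well defined even when TP = 0.
def f1(tp: int, fp: int, fn: int) -> float:
    return 2 * tp / (2 * tp + fp + fn)

# Population (I): 5 positives among 10000 elements, counts rounded to integers.
print(f1(tp=4, fp=1000, fn=1))  # ~0.0079 (classifier A)
print(f1(tp=0, fp=0, fn=5))     # 0.0     (classifier B, always "negative")
```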

Let us consider another example, created by swapping the label names in the previous one, from negatives to positives and vice versa. So now the population \(\mathrm{(II)}\) consists of \(10000\) elements, in which 99.95% of the elements are positive. Classifier \(\mathrm{A}'\) correctly recognizes 90% of the positive elements and 90% of the negative elements. Classifier \(\mathrm{B}'\) gives the answer "positive" every time, regardless of the classified element. And now... our "cheating" classifier \(\mathrm{B}'\) receives the score \(\mathrm{F}_1 = 0.9997\), while the rather solid classifier \(\mathrm{A}'\) only \(0.9473\). Our "golden" solution suffers from an asymmetry problem: F1 is not invariant under swapping the positive and negative labels.
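Relabeling the dataset maps \(\mathrm{TP} \leftrightarrow \mathrm{TN}\) and \(\mathrm{FP} \leftrightarrow \mathrm{FN}\), so the asymmetry can be seen directly in code (a sketch with the same rounded counts as above):

```python
def f1(tp: int, fp: int, fn: int) -> float:
    return 2 * tp / (2 * tp + fp + fn)

# Classifier A on population (I)...
tp, fp, fn, tn = 4, 1000, 1, 8995
print(f1(tp, fp, fn))  # ~0.0079
# ...and the same decisions after swapping labels (classifier A' on (II)).
print(f1(tn, fn, fp))  # ~0.9473
```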

So? What is the solution? Some researchers point to the Youden index (Index for rating diagnostic tests [2]). Others prefer markedness (Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation [3]). None of them is free of problems; one can test them using the interactive confusion matrix in a further section.
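For reference, both quantities have simple closed forms (a sketch using the standard definitions: the Youden index is sensitivity + specificity − 1, markedness is precision + NPV − 1; both range over \([-1,1]\)):

```python
# Youden index: J = sensitivity + specificity - 1.
def youden(tp: int, fp: int, fn: int, tn: int) -> float:
    return tp / (tp + fn) + tn / (tn + fp) - 1

# Markedness: MK = precision (PPV) + NPV - 1.
def markedness(tp: int, fp: int, fn: int, tn: int) -> float:
    return tp / (tp + fp) + tn / (tn + fn) - 1
```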

The last metric we want to mention in this section is \(\mathrm{MCC}\), the Matthews correlation coefficient (see: B. W. Matthews' article [4]). It seems to be immune to the effects mentioned above. It is included in the interactive simulation in a further section.
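A sketch of \(\mathrm{MCC}\) in terms of the confusion-matrix counts (the zero-denominator guard is our own choice):

```python
import math

# Matthews correlation coefficient; its value lies in [-1, 1].
def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den if den else 0.0
```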

P4 and four conditional probabilities

For each classifier applied to the dataset, we can define four essential conditional probabilities: $$\begin{eqnarray} P_A & = & P(+|C+) \\ P_B & = & P(C+|+) \\ P_C & = & P(C-|-) \\ P_D & = & P(-|C-) \end{eqnarray}$$ where \(C+\) and \(C-\) denote the classifier's decision ("classified as positive"/"classified as negative") and \(+\) and \(-\) the true class. Their meaning is as follows: \(P_A\) is the precision (positive predictive value), \(P_B\) the recall (sensitivity), \(P_C\) the specificity, and \(P_D\) the negative predictive value.
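In terms of the confusion-matrix counts, the four probabilities can be estimated as in the following sketch (the function names are ours):

```python
# C+ / C- denote the classifier's decision, + / - the true class.
def p_a(tp, fp): return tp / (tp + fp)  # P(+|C+): precision / PPV
def p_b(tp, fn): return tp / (tp + fn)  # P(C+|+): recall / sensitivity
def p_c(tn, fp): return tn / (tn + fp)  # P(C-|-): specificity
def p_d(tn, fn): return tn / (tn + fn)  # P(-|C-): negative predictive value
```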

Based on these probabilities, the P4 metric is defined as the harmonic mean of all four:

$$\mathrm{P}_4 = \frac{4}{\frac{1}{P_A} + \frac{1}{P_B} + \frac{1}{P_C} + \frac{1}{P_D}}$$ which, expressed in terms of the confusion matrix entries, gives: $$ \mathrm{P}_4 = \frac{4\cdot\mathrm{TP}\cdot\mathrm{TN}}{4\cdot\mathrm{TP}\cdot\mathrm{TN} + (\mathrm{TP} + \mathrm{TN}) \cdot (\mathrm{FP} + \mathrm{FN})} $$
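A sketch of the closed form, checked against two extreme classifiers on population \(\mathrm{(I)}\):

```python
# P4 in closed form; equals the harmonic mean of the four probabilities
# whenever all of them are non-zero.
def p4(tp: int, fp: int, fn: int, tn: int) -> float:
    num = 4 * tp * tn
    return num / (num + (tp + tn) * (fp + fn))

print(p4(tp=5, fp=0, fn=0, tn=9995))  # 1.0 (perfect classifier)
print(p4(tp=0, fp=0, fn=5, tn=9995))  # 0.0 (classifier B, always "negative")
```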

The metric defined this way takes values in the range \([0,1]\) (as opposed to \(\mathrm{MCC}\), the Youden index and markedness, which range over \([-1,1]\)). It is defined in a manner similar to \(\mathrm{F}_1\); however, it covers all four probabilities instead of just the two covered by \(\mathrm{F}_1\). It also does not change its value when the labeling of the dataset is reversed.
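The label-swap invariance follows from the symmetry of the closed form under \(\mathrm{TP} \leftrightarrow \mathrm{TN}\), \(\mathrm{FP} \leftrightarrow \mathrm{FN}\), which a one-line check confirms:

```python
def p4(tp, fp, fn, tn):
    return 4 * tp * tn / (4 * tp * tn + (tp + tn) * (fp + fn))

# Classifier A on population (I) vs. classifier A' on the relabeled (II).
print(p4(4, 1000, 1, 8995) == p4(8995, 1, 1000, 4))  # True
```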

The details of the ideas behind the P4 metric itself can be found in the article Extending F1 metric, probabilistic approach [5].

Interactive simulation

The interactive matrix below allows one to experiment with different scenarios and see how the individual probabilities and metrics change with \(\mathrm{TP}\), \(\mathrm{FP}\), \(\mathrm{TN}\) and \(\mathrm{FN}\). Probabilities and metrics taking values in the interval \([0,1]\) are marked in blue, while those varying in \([-1,1]\) are marked in yellow.
[Interactive confusion matrix: set \(\mathrm{TP}\), \(\mathrm{FP}\), \(\mathrm{FN}\) and \(\mathrm{TN}\) for a population and observe the probabilities \(P(+|C+)\), \(P(C+|+)\), \(P(C-|-)\), \(P(-|C-)\) together with the metrics P4, F1, MCC, Youden index, markedness and accuracy.]
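For readers of the text-only version, here is a small offline analogue of the interactive matrix (a sketch; the formatting choices are ours):

```python
import math

# Print the four probabilities and the metrics discussed above
# for a given confusion matrix.
def report(tp: int, fp: int, fn: int, tn: int) -> None:
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    spec, npv = tn / (tn + fp), tn / (tn + fn)
    p4 = 4 * tp * tn / (4 * tp * tn + (tp + tn) * (fp + fn))
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    acc = (tp + tn) / (tp + fp + fn + tn)
    print(f"P(+|C+)={prec:.4f} P(C+|+)={rec:.4f} "
          f"P(C-|-)={spec:.4f} P(-|C-)={npv:.4f}")
    print(f"P4={p4:.4f} F1={f1:.4f} MCC={mcc:.4f} "
          f"Youden={rec + spec - 1:.4f} Markedness={prec + npv - 1:.4f} "
          f"ACC={acc:.4f}")

report(tp=4, fp=1000, fn=1, tn=8995)  # classifier A on population (I)
```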

Interesting cases

Below we have gathered four interesting cases of the confusion matrix. In each of them, one conditional probability is close to 0, while the remaining ones are close to 1; one such case is sketched after this paragraph.
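Such a case can be constructed by hand (an illustrative choice of counts, not taken from the original figures): recall, specificity and NPV are all close to 1, precision is close to 0, and P4 flags the problem with a low score.

```python
tp, fp, fn, tn = 10, 1000, 0, 1_000_000
prec = tp / (tp + fp)  # ~0.0099
rec = tp / (tp + fn)   # 1.0
spec = tn / (tn + fp)  # ~0.999
npv = tn / (tn + fn)   # 1.0
p4 = 4 / (1 / prec + 1 / rec + 1 / spec + 1 / npv)
print(round(p4, 3))    # ~0.038
```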

References

[1] Sasaki, Y., The truth of the F-measure, Teach Tutor Mater (2007)
[2] Youden, W. J., Index for rating diagnostic tests, Cancer (1950)
[3] Powers, D. M. W., Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation, arXiv:2010.16061 (2010)
[4] Matthews, B. W., Comparison of the predicted and observed secondary structure of T4 phage lysozyme, Biochimica et Biophysica Acta (1975)
[5] Sitarz, M., Extending F1 metric, probabilistic approach, arXiv:2210.11997 (2022)