P4 metric, a new way to evaluate binary classifiers
Introduction
Binary classifiers accompany us on a daily basis. Tests that detect disease give us the answer positive/negative, spam filters say spam/not spam, and smartphones that authenticate us by a face scan or fingerprint make a known/unknown decision. The question of how to evaluate the performance of such a classifier does not seem particularly complicated: just choose the one that predicts the most cases correctly. As many of us have already realized, the actual evaluation of a binary classifier requires somewhat more sophisticated means. But we will talk about that in a moment.
A short story about choosing a classifier
The most straightforward approach is to calculate the ratio of correctly classified samples to all considered samples. That is what we call accuracy: $$\mathrm{ACC} = \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}$$ In the equation above, \(\mathrm{TP}\) denotes true positives, \(\mathrm{TN}\) true negatives, \(\mathrm{FP}\) false positives, and \(\mathrm{FN}\) false negatives.
Now suppose we have a population \(\mathrm{(I)}\) of \(10000\) elements in which only 0.05% of the elements are positive. We consider two classifiers, \(\mathrm{A}\) and \(\mathrm{B}\). Classifier \(\mathrm{A}\) correctly recognizes 90% of positive elements and 90% of negative elements. Classifier \(\mathrm{B}\) gives the answer "negative" every time, regardless of the classified element. Accuracy for classifier \(\mathrm{A}\) is \(0.9\), while for classifier \(\mathrm{B}\) it is \(0.9995\). So if we were guided by accuracy alone when choosing a classifier, we would run the risk of choosing a classifier that common sense would tell us to reject. And this is where two quantities come to our rescue: precision and recall, defined as: $$\mathrm{PREC} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$$ $$\mathrm{REC} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$$ When \(\mathrm{TP}\) goes to zero, both of them go to zero. When \(\mathrm{FP}\)/\(\mathrm{FN}\) grows, \(\mathrm{PREC}\)/\(\mathrm{REC}\) decreases. Combining their values seems to be a great idea, and this is the idea behind the F1 metric (see: [1] The truth of the F-measure). It is defined as follows: $$\mathrm{F}_1 = \frac{2}{\frac{1}{\mathrm{PREC}} + \frac{1}{\mathrm{REC}}}$$ For classifier \(\mathrm{A}\) it gives a score of \(0.0079\), and for classifier \(\mathrm{B}\): \(0.0\). It looks like we have the situation under control now... Unfortunately, we are far from the truth.
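To make the accuracy figures for population \(\mathrm{(I)}\) concrete, here is a minimal Python sketch; the function name and the use of expected (fractional) counts are our own choices for illustration:

```python
# Minimal sketch: accuracy computed from confusion-matrix counts.
# Expected counts for population (I): 10000 samples, 5 of them positive.

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# Classifier A: recognizes 90% of positives and 90% of negatives.
acc_a = accuracy(tp=0.9 * 5, tn=0.9 * 9995, fp=0.1 * 9995, fn=0.1 * 5)

# Classifier B: always answers "negative".
acc_b = accuracy(tp=0, tn=9995, fp=0, fn=5)

print(acc_a, acc_b)  # 0.9 and 0.9995; B "wins" despite being useless
```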
Let us consider another example, created by swapping the label names in the previous one: negatives become positives and vice versa. So now the population \(\mathrm{(II)}\) consists of \(10000\) elements in which 99.95% of the elements are positive. Classifier \(\mathrm{A}'\) correctly recognizes 90% of positive elements and 90% of negative elements. Classifier \(\mathrm{B}'\) gives the answer "positive" every time, regardless of the classified element. And now... our "cheating" classifier \(\mathrm{B}'\) receives a score of \(\mathrm{F}_1 = 0.9997\), while the rather solid classifier \(\mathrm{A}'\) gets only \(0.9473\). Our "golden" solution suffers from an asymmetry problem.
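The failure is easy to verify numerically. The small sketch below, again using expected counts, this time for population \(\mathrm{(II)}\), reproduces the \(\mathrm{F}_1\) scores quoted above:

```python
# Minimal sketch: F1 computed from confusion-matrix counts,
# using the equivalent form F1 = 2*TP / (2*TP + FP + FN).

def f1(tp, tn, fp, fn):
    # tn is kept only for a uniform signature: F1 never looks at TN,
    # which is exactly the source of the asymmetry.
    return 2 * tp / (2 * tp + fp + fn)

# Population (II): 10000 samples, 9995 of them positive.
# Classifier A': recognizes 90% of positives and 90% of negatives.
f1_a = f1(tp=0.9 * 9995, tn=0.9 * 5, fp=0.1 * 5, fn=0.1 * 9995)

# Classifier B': always answers "positive".
f1_b = f1(tp=9995, tn=0, fp=5, fn=0)

print(round(f1_a, 4), round(f1_b, 4))  # 0.9473 vs 0.9997: B' beats the solid A'
```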
So? What is the solution? Some researchers point to the Youden index ([2] Index for rating diagnostic tests). Others prefer markedness ([3] Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation). Neither of them is free of problems; one can test them using the interactive confusion matrix in a later section.
The last metric we want to mention in this section is \(\mathrm{MCC}\), the Matthews correlation coefficient (see: [4] B.W. Matthews' article). It seems to be immune to the effects mentioned above. It is also included in the interactive simulation in a later section.
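As a quick illustration, here is a small sketch of the standard \(\mathrm{MCC}\) formula computed directly from the confusion matrix; the counts are made up and serve only to show that swapping the labels leaves the score unchanged:

```python
from math import sqrt

# Standard MCC formula computed from confusion-matrix counts.
def mcc(tp, tn, fp, fn):
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Arbitrary example counts (hypothetical, just for illustration).
tp, tn, fp, fn = 90, 900, 100, 10

# Swapping positive/negative labels exchanges TP<->TN and FP<->FN,
# yet the score stays the same.
assert mcc(tp, tn, fp, fn) == mcc(tn, tp, fn, fp)
print(mcc(tp, tn, fp, fn))  # about 0.61, on the [-1, 1] scale
```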
P4 and four conditional probabilities
For each classifier applied to the dataset, we can define four essential probabilities: $$\begin{eqnarray} P_A & = & P(+|C+) \\ P_B & = & P(C+|+) \\ P_C & = & P(C-|-) \\ P_D & = & P(-|C-) \\ \end{eqnarray} $$ where their meaning is as follows:
- \(P(+|C+)\) is the conditional probability of a sample being positive, provided it is classified as positive
- \(P(C+|+)\) is the conditional probability of a sample being classified as positive, provided it is a positive sample
- \(P(C-|-)\) is the conditional probability of a sample being classified as negative, provided it is a negative sample
- \(P(-|C-)\) is the conditional probability of a sample being negative, provided it is classified as negative
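Expressed in terms of the confusion-matrix entries, these probabilities are simply precision, recall (sensitivity), specificity and negative predictive value, respectively: $$\begin{eqnarray} P_A & = & \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} \\ P_B & = & \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \\ P_C & = & \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}} \\ P_D & = & \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FN}} \\ \end{eqnarray}$$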
Based on these probabilities, the P4 metric is defined as:
$$\mathrm{P}_4 = \frac{4}{\frac{1}{P_A} + \frac{1}{P_B} + \frac{1}{P_C} + \frac{1}{P_D}}$$ which gives: $$ \mathrm{P}_4 = \frac{4\cdot\mathrm{TP}\cdot\mathrm{TN}}{4\cdot\mathrm{TP}\cdot\mathrm{TN} + (\mathrm{TP} + \mathrm{TN}) \cdot (\mathrm{FP} + \mathrm{FN})} $$ The metric defined this way belongs to the range \([0,1]\) (as opposed to \(\mathrm{MCC}\), the Youden index and markedness). It is defined in a similar manner to \(\mathrm{F}_1\); however, it covers all four probabilities instead of only two, as \(\mathrm{F}_1\) does. It also does not change its value when the labeling of the dataset is reversed.
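As a quick sanity check, here is a minimal Python sketch of the closed-form expression above; the counts are hypothetical and serve only to illustrate the \([0,1]\) range and the label-swap symmetry:

```python
# Minimal sketch: P4 from the closed-form expression above.
def p4(tp, tn, fp, fn):
    denom = 4 * tp * tn + (tp + tn) * (fp + fn)
    return 4 * tp * tn / denom if denom else 0.0

# Arbitrary example counts (hypothetical, just for illustration).
tp, tn, fp, fn = 90, 900, 100, 10

score = p4(tp, tn, fp, fn)
assert 0.0 <= score <= 1.0           # stays within [0, 1]
assert score == p4(tn, tp, fn, fp)   # unchanged when the labels are swapped
print(round(score, 4))
```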
The details and the ideas behind the P4 metric itself can be found in the article: [5] Extending F1, probabilistic approach.