Binary classifiers are accompanying us on a daily basis. Tests that detect disease, give us the answer: positive/negative, spam filters say spam/not spam, smartphones that authenticate us based on a face scan or fingerprint – make a known/unknown decision. The question: how to evaluate the efficiency of such a classifier does not seem extremely complicated. Just choose the one that will predict the most cases correctly. As many of us have already realized – the actual evaluation of a binary classifier requires somewhat more sophisticated means. But we’ll talk about that in a moment.

This article explores the extension of well-known F_{1} score used for assessing the performance of binary classifiers. We propose the new metric using probabilistic interpretation of precision, recall, specifcity, and negative predictive value. We describe its properties and compare it to common metrics. Then we demonstrate its behavior in edge cases of the confusion matrix. Finally, the properties of the metric are tested on binary classifier trained on the real dataset.

**Keywords**: machine learning, binary classifier, F_{1} , MCC, precision, recall

Article can be downloaded at: https://arxiv.org/abs/2210.11997

As you move the slider, you can watch live as the performance metrics and confusion matrix change:

Feel free to send your feedback and comments.

]]>

At the very beginning, we would like to reassure our readers – we will not convince anyone to abandon convolutional neural networks in favor of linear regression. The presented method will only allow us to gain better insight into the simple image recognition process itself. Don’t run away for a moment yet.

For our purposes, we will use the Fashion MNIST collection, published by Zalando (available on github). This is a collection of 70,000 images containing different pieces of clothing such as T-shirt, pants, shoes, tracksuits, etc. (each category has been assigned a numerical value: 0 – T-shirt, 1 – pants, 2 – pullover, …). Each is a grayscale image with a size of 28×28 pixels.

Before we move on to the actual image recognition algorithm, let’s consider a trivial example – a picture containing one grayscale pixel, with values from 0 to 255. For each “picture” in such a set, we assign a value of -1 or 1 (e.g.: -1 is “night” and 1 is “day”). The hypothetical set contains the classified “pictures” with -1/1 values assigned to them. This set will be used first, in order to present the classification method itself. In this case, we will apply the classical linear least squares algorithm and fit a straight line to the given set, as presented in the illustrative figure below:

For more convenient calculations, the values on the x-axis from 0 to 255 have been scaled to the interval (-1, 1). The individual points in the graph represent individual single-pixel “pictures”. Formally, here we are looking for the parameters \(a\) and \(b\) for a straight line with equation:

$$ y = a x + b $$

where \(x, y \in \mathbb{R}\), by minimizing the sum of squares of the distances of all points from the line. For a straight line fitted in this way, we consider the test sample to have a value of “1” when the calculated \(y > 0\), otherwise “-1”. That gives us the decision boundary as a vertical straight line crossing the point where the fitted line intersects the \(x\) axis.

Thus, the two classes of “pictures” are separated simply by a straight line, capturing the decision boundary for the set so defined quite well.

So far, the method is trivial. We have to admit however, that such an application of linear regression is not very practical, after all, we are unlikely to use single-pixel images on a daily basis. So, we will extend the method to work for real images. Let’s return to the actual task and the MNIST fashion set. For each of the images, we will retrieve the numerical values assigned to each pixel:

From the matrix thus formed, we will form a vector by concatenating successive rows of it. So, each 28×28 pixel image, will be represented by a 784-dimensional vector of real numbers. As in the previous case, we want to create a binary classifier – that is, one that gives us a “yes” or “no” answer (1, -1). This time, we want to recognize whether there is a t-shirt (1) or another piece of clothing (-1) in the given picture. So, we are now looking for 785 \(\theta\) parameters describing a 784-dimensional hyperplane:

$$y = \theta_0 + \sum_{i=1}^{784} \theta_i x_i$$

We will not publish a graphic representation of the 784-dimensional hyperplane, please forgive us .

Lets denote now the value of \(y\) calculated for a single sample \(i\) according to the above formula, as \(\hat{y}^{(i)}\) and the actual value as \(y^{(i)}\). The optimization objective will be the loss function, defined as:

$$\mathrm{L}(\theta) = \sum_{i=1}^{m} \left(y^{(i)} – \hat{y}^{(i)} \right)^2$$

where \(m\) is the number of images in the set under study. The goal is to find \(\theta\) parameters for which \(\mathrm{L}\) reaches minimum – classic least squares method, nothing more.

The authors of the Fashion MNIST collection recommend that 60000 images be treated as a training set and 10000 as a test set. In order to simplify the considerations, we will create new sets that consist of 50% images showing “t-shirt” and 50% images with other clothes. After this procedure, our training set contains 12000 images and the test set contains 2000 of them.

Pixel values between 0 and 255 will be normalized using convenient scikit-learn normalizer. We will classify the result as positive “1” (t-shirt) if the calculated \(y\) value is greater than zero. Otherwise, we will consider the result as a negative “-1” (any other garment).

The quality of the classifier, we will measure with the commonly used parameter – accuracy, defined as:

$$A = \frac{\mathrm{TP} + \mathrm {TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}$$

Where subsequent quantities are represented as: \(\mathrm{TP}\) – true positives, \(\mathrm{TN}\) – true negatives, \(\mathrm{FP}\) – false positives, \(\mathrm{FN}\) – false negatives.

Following the commonly accepted methodology, the hyperplane parameters were fitted using the *training set*, while the classifier quality was evaluated on the *test set*. Since both sets contain half of the “positive” images (t-shirts) – the lower bound here will be a random classifier (randomizing the result with probability ½), which would give us an accuracy of 50%.

And here is the result obtained for the linear regression classifier – accuracy parameter for the test set:

$$A_{\%} = 92.5\%$$

When we consider that this machine learning method is one of the simplest, this is an unexpectedly satisfactory result (94.9% on the training set). The following table provides a detailed summary of the experiment:

Overpredicted samples | 89/2000 (4.45%) |

Underpredicted samples | 61/2000 (3.05%) |

Accuracy | 92.5% |

F1 | 92.6% |

The achieved result suggests that a significant part of the used set is linearly separable. Such circumstances as strictly narrowed set of images, their small size and appropriate choice of distribution of both sets also contributed to the final favorable result.

Apparently, as the number of dimensions increases, classical linear regression gains quite interesting possibilities and applications.

The source code used for the calculations: cv_linreg.py

]]>The Artinu application is a simple RPN calculator, written in JavaScript, available online.

Artinu was developed for educational purposes. RPN calculators available on the market, may overwhelm the novice user with an excess of buttons and available functions. Our calculator allows only basic actions to be performed, so that the user can easily grasp the idea of operations on the stack without having to worry about other issues. Detailed usage examples are contained in our documentation.

The use of RPN comes with great educational value – learning operations at the stake is a very good introduction to computer science and programming. An additional motivation for creating this application is the fact that currently in Poland, during the secondary school-leaving exam, young people can only use the simplest calculators. It seems that producing on its basis an inexpensive, physical device with similar properties for school purposes, should not pose any major difficulties.

**Links**:

When we want to know a standard deviation of a big population, we usually take a sample from the whole and than calculate estimator value. However it is not always clear which estimator should we use. Sometimes people argue whenever biased or unbiased standard deviation estimator is better. Below we explore this field and present the result of the numerical simulation.

While estimating expected value usually does not have any related controversy, for standard deviation we have two competing estimators: biased: \(\sigma_n\) and unbiased: \(\sigma_{n-1}\). They are defined as follows:

$$ \sigma_n = \sqrt{ \frac{1}{n} \sum_{i=1}^n (x_i – \bar{x})^2} $$

and:

$$ \sigma_{n-1} = \sqrt{ \frac{1}{n-1} \sum_{i=1}^n (x_i – \bar{x})^2} $$

Detailed description of estimator bias can be read in wikipedia article.

For our experiment we generated population of \(10^6\) numbers with normal distribution (\(\mu=0, \sigma=1\)). Than for the subsequent sample sizes starting from \(n=5\) we randomly choose \(n\) numbers from the population – twice (they are chosen using uniform distribution). The first set is used for calculating \(\sigma_n\), the second for \(\sigma_{n-1}\). For each sample size the procedure is repeated \(r=500\) times and the final result for given sample and given estimator is calculated as the average over all \(r\) achieved values.

The picture below, contains two charts presenting result of simulation.

On the upper chart, each point represents achieved average result for given estimator – biased marked as “+”, unbiased as “x”. Biased and unbiased estimators were calculated on different samples. The blue dashed line represents theoretical value for given distribution. The lower chart shows ratio between biased and unbiased value together with asymptotic theoretical value (dashed line). For all presented sample sizes in given range the unbiased estimator gives better estimate of real standard deviation value.

In contrast to the presented behavior, the result for the higher values of sample sizes appears to be different – see the charts attached below:

As we can observe on the charts for the higher values of sample sizes, the achieved value seems to be uncorrelated with the estimator type.

Although the results presented here were based on normal population distribution, we have also tested uniform and Poisson distribution. They give similar results.

For small sample sizes estimator \(\sigma_{n-1}\) gives definitely better results, closer to the population theoretical value. As the \(n\) grows, the crucial factors are becoming similar:

$$ \sqrt{\frac{1}{n}} \approx \sqrt{\frac{1}{n-1}} \quad (\text{for n being sufficiently big}),$$

thus the difference between estimators becomes irrelevant.

The presented results were computed using numpy. For pseudo-random number generation we used Mersenne-Twister algorithm included in the library. Details of the implementation can be found at stdev_simul.py

]]>Jakiś czas temu, podczas porządkowania szafy wpadły mi w ręce, moje stare szpargały. Notatki z wykładów z mechaniki kwantowej, które to notatki jako student w latach 90-tych skrzętnie prowadziłem. Gdy już się nacieszyłem wspomnieniami zacząłem się zastanawiać czy nie dałoby się nieco poprawić ich wyglądu, oczyścić ze zbędnych elementów. Na każdej stronie widnieje niebiesko-blada kratka, dodatkowo pojawiają się przebitki atramentu z drugiej strony kartki. Widoczne są również otwory na wpięcie do segregatora.

Ponieważ ręczne czyszczenie nie wchodziło w grę, wybór padł na algorytm K-Means – powszechnie stosowany w różnych obszarach *Machine Learning* / *Artificial Intelligence*. Pomyślałem sobie, że skoro radzi sobie dobrze jako pomoc w medycznym obrazowaniu to powinien poradzić sobie i z moimi notatkami.

K-Means bywa używany do automatycznej klasyfikacji zarówno punktów w przestrzeni rzeczywistej, jak i w przestrzeni kolorów RGB. W tym ostatnim przypadku, nierzadko jako algorytm redukujący liczbę kolorów. W naszym przypadku, zamiast redukcji kolorów, pozbędziemy się całkowicie klastrów z przestrzeni RGB do których należą punkty spoza samego tekstu – takich jak niebieska kratka, czy widoczne przebitki z drugiej strony. A ujmując rzecz ściślej – pozostawimy tylko jeden klaster – ten który najlepiej pasuje do obszarów obrazu zawierających pisany tekst.

Do eksperymentu, użyjemy implementacji K-Means z popularnego pakietu Scikit Learn. Parametrami naszej metody będzie liczba klastrów na które rozbijemy obraz, liczba losowych punktów z obrazu poddanych analizie oraz wartość startowa generatora liczb pseudolosowych. Pierwszym etapem będzie identyfikacja poszczególnych klastrów. W drugim etapie wydzielimy pojedynczy klaster i z niego skomponujemy nowy obraz.

Obrazy wyświetlone poniżej przedstawiają wynik rozbicia oryginalnego obrazu na 3 klastry. W celu wizualizacji punkty należące do danego klastra zostały zastąpione kolorem czerwonym. A zatem na pierwszym obrazie kolor czerwony odpowiada klastrowi nr 1, na drugim klastrowi nr 2, na trzecim – klastrowi nr 3.

Pierwszy klaster objął biały kolor tła oraz jasne elementy (napisy przebijające z drugiej strony). Klaster number dwa to różne odcienie niebieskiego – “złapał” część niebieskiej kratki oraz interesującego nas tekstu. Najbardziej obiecująco wygląda trzeci klaster, który poza samym tekstem praktycznie nie obejmuje żadnych innych elementów obrazu. Zatem ostatecznie możemy nasz obraz poddać obróbce – usuwając z niego punkty należące do klastrów o numerach 1 i 2. Poniżej prezentujemy wynik tej operacji:

Osiągnięty wynik można uznać za zadowalający, zwłaszcza jeśli weźmiemy pod uwagę prostotę zastosowanego podejścia. Notatki po obróbce są czytelne (nie licząc charakteru mojego pisma). Znakomita większość niepożądanych elementów została usunięta. Niewątpliwie można by nieco popracować nad zachowaniem oryginalnego koloru tekstu oraz jego wygładzeniem. Dla chętnych do dalszych eksperymentów załączam kod źródłowy użyty przeze mnie podczas obróbki: decompose_image.py.

]]>

Uniform distribution, gives a really nice impressions when looking through the chart showing standard deviation \(\sigma\) vs sample size \(n\):

It behaves as we expected – in accordance to the CLT – it varies like: \(\frac{1}{\sqrt{n}}\).

And now – stars of our show – Cauchy distribution:

…and Petersburg distribution:

In spite of trying many different pseudo-random generator seed values the result is the roughly the same – \(\sigma(n)\) does not seem to converge.

Conclusions:

Since distributions of sample means for the Cauchy and Petersburg distributions do not fulfill the assumptions of CLT – we didn’t expected the same result as in “properly behaving distributions” like uniform distribution. In spite we didn’t discover anything new with our “experimental statistics”, we he had a lot of fun and wanted to share it with you.

]]>

The CLT is formulated in a various ways. We will use the following simplified formulation:

For any distribution \(x\) having finite expected value \(\mu_x\) and variance \(\sigma_{x}^{2}\), the new variable made of:

$$y = \frac{1}{n} \sum_{i=1}^{n} x_i$$

tends to Gaussian distribution as n increases, and has the parameters:

$$\mu_y = \mu_x,$$

$$\sigma_y = \frac{\sigma_x}{\sqrt{n}}$$

What a delighting theorem! Whatever \(x\) distribution we have, the distribution of its averages tends to Gaussian. Works for any \(x\) distribution. Well… almost any distribution. The initial distribution should have finite expected value and variance. We usually leave this fact with the comment: “works for all properly behaving distributions”. Lets check what is the result for “improperly behaving” distributions.

The main actors in our show are as follows:

- Cauchy distribution – often called Breit-Wigner distribution in the world of physics. It has infinite variance and mean in spite of looking very innocent.
- “Petersburg distribution”, comes from so called St. Petersburg Paradox. Like in the previous case: \(\mu\) is infinite, as well as \(\sigma^2\). Its amusing that a rather innocent game (see Wikipedia’s article) leads to such distribution.
- Uniform distribution – nothing unusual, well defined \(\mu\) and \(\sigma^2\) used just for reference

The samples are generated using Mersenne Twister pseudo-random numbers generator. The “Petersburg” series are generated using the following code:

` ````
def single_petersburg_game():
sum_won = 0
n = 1
while True:
d = random.random()
if d > 0.5:
sum_won = n
n *= 2
else:
break
return sum_won
```

The numerical experiments were done by drawing \(10^9\) pseudo-random numbers having given distribution. Then for sampling distribution we used \(10^5\) samples, each of size \(100\).

Lets start with the most common case – uniform distribution. As we see on the figure below, nothing surprising is here:

The top chart shows the original distribution, the bottom one distribution of sample means. As we expected, we observe nice Gaussian here. As we see, CLT works great and the most interesting part is still ahead.

Now have a look at the analogous histograms produced for Cauchy distribution (please note the logarithmic scale on y axis):

The situation is significantly different here. The population size, sample size and the number of samples are the same as in previous example. The only difference is an input distribution.

I does not require any statistical tests to state that the distribution of sample means (SM) is far from being gaussian one. The maximum for SM distribution has the sample position as for

the original distribution. However the area beyond is definitely standing out (please note the range being much shorter on x axis than for population distribution, which is characteristic of SM distributions).

So – not Gauss and not longer Cauchy. Thus we observe the situation outside the scope of central limit theorem.

And now the most interesting part: “Petersburg” distribution – having infinite variance and expected value. Lets see how it looks like (double logarithmic scale):

We use the same population/sample sizes as before, the scale on y axis is logarithmic. We plot \(y=1/x^2\) together with distribution points in order to have an idea how the distribution roughly looks like).

One can also see x-log scale plot which gives us another look on the distribution of sample means. We can observe that values moved to the right from the reference \(1/x^2\) curve for distribution of sample means. As for now it looks that we can not expect observing any other tendency for this distribution.

At the end we would like to summarize this post with the following:

Do not trust the distributions with infinite variance and expected value

In the next part we will consider the behaviour of standard deviation for presented distributions.

]]>