Dec 2, 2020

In search of the “logistic regression”! (part II)

Unfortunately, if you ask a "data scientist" to explain the real machinery and rationale behind logistic regression, the answer can hardly convince anyone, often not even the data scientist (which I doubt)! They repeat the same "ambiguous" explanations about the logit function and log-odds and how much it looks like a line! "Look! It is so simple: \(\ln(\frac{p}{1-p})=\beta_0+\beta_1 x\)". Things get even worse when they want to justify the strange form of the "cost function" (here, for example). In part I, I already justified the cost function in a simple, but mathematically valid, language. In this post, I will explain it from another point of view: maximum likelihood estimation.

Do you remember the dean who was looking for a model to predict whether an application gets accepted [here]? This time, after her earlier success, she wants to create a more complicated model!

Multiple logistic regression

Assume that the dean adds another factor to her study, the language score TCF. She plots the GPA and TCF scores of each applicant in a scatter plot and, to make things clearer, adds a colour to label the dots: green for acceptance and red for rejection.
As trivial as it may seem, higher GPA and TCF scores seem to be related to the decision made on the application. The model predicting the probability of a data point being red (0) or green (1) is similar to the one-dimensional example but in 3D, so it looks like an S-shaped surface:
Obtaining the best surface is similar to the 1D case in the first part: we define the new residuals \(\epsilon_k=-\ln(1-e_k)\) with \(e_k=|g(\eta_k)-y_k|\) for all \(k=1,2,\cdots,m\). Then, we solve \(\min \sum_{k=1}^m \epsilon_k\).
You may remember from our earlier discussion [here] that we should not solve \(\min \|e\|\) because its corresponding cost function is non-convex. To illustrate this, consider the cost function for a very simple example: \[e_1=|1-g(\eta_1)|\qquad \eta_1=\beta_0-0.2 \beta_1 -0.3 \beta_2\\ e_2=|0-g(\eta_2)| \qquad \eta_2=\beta_0+0.1 \beta_1 +0.3 \beta_2\\ e_3=|1-g(\eta_3)| \qquad \eta_3=\beta_0+0.8 \beta_1+1.8 \beta_2\] with \(\beta_0=1.5\). The cost function is clearly non-convex.
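To check this numerically, here is a minimal sketch (my own illustration, using the three data points above and \(\beta_0=1.5\)): a convex function can never exceed, at a midpoint, the average of its values at the two endpoints, and this cost does.

```python
import numpy as np

def g(z):                                   # logistic function
    return 1.0 / (1.0 + np.exp(-z))

# the three data points from the example above, with beta_0 fixed at 1.5
X = np.array([[-0.2, -0.3], [0.1, 0.3], [0.8, 1.8]])
y = np.array([1.0, 0.0, 1.0])

def cost(beta):                             # sum of squared residuals e_k^2
    return np.sum((y - g(1.5 + X @ beta)) ** 2)

# convexity would require cost(midpoint) <= average of the two costs; it fails here
a, b = np.array([0.0, -20.0]), np.array([0.0, 0.0])
print(cost((a + b) / 2), (cost(a) + cost(b)) / 2)   # the midpoint cost is the larger one
```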

Decision boundaries

As we mentioned in the first part, the decision boundary consists of all points where \(\eta=0\), that is: \[\beta_0+\beta_1x^{(1)}+\beta_2 x^{(2)}=0\] where \(x^{(1)}\) is the first variable (Mention) and \(x^{(2)}\) is the second variable (TCF), and the set \(\beta_0,\beta_1,\beta_2\) is the best parameter set we obtained from the minimization problem. This gives the black line in the graph.
For this simple case, the decision boundary is a straight line. But if we need to describe more complicated cases, we can simply add higher-order terms like \((x^{(1)})^2\) or \(x^{(1)} x^{(2)}\), which leads to curved boundaries. For this example, if we add the square of the TCF score to our logistic model, we will obtain:
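Either kind of boundary is easy to trace numerically; here is a sketch (the coefficients are placeholders, not the fitted values, and the TCF range is only indicative):

```python
import numpy as np

# placeholder coefficients; in practice they come from the minimization above
b0, b1, b2, b3 = -25.0, 1.2, 0.05, 0.0004    # b3 multiplies the added TCF^2 term

gpa, tcf = np.meshgrid(np.linspace(10, 20, 200), np.linspace(100, 600, 200))
eta_linear = b0 + b1 * gpa + b2 * tcf                  # straight boundary: eta = 0
eta_curved = b0 + b1 * gpa + b2 * tcf + b3 * tcf**2    # curved boundary: eta = 0

# the decision boundaries are the zero level sets, e.g. with matplotlib:
# import matplotlib.pyplot as plt
# plt.contour(gpa, tcf, eta_linear, levels=[0])
# plt.contour(gpa, tcf, eta_curved, levels=[0])
```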

Who came up with such a strange name, logistic regression?

In 1798 Thomas Malthus published his book “An Essay on the Principle of Population”, warning of “future difficulties, on an interpretation of the population increasing in geometric progression (so as to double every 25 years) while food production increased in an arithmetic progression, which would leave a difference resulting in the want of food and famine, unless birth rates decreased”. It was then in 1838 that Pierre François Verhulst, a Belgian mathematician, modified Malthus’s model and obtained a more realistic function, \(g(z)=\dfrac{1}{1+e^{-z}}\), which predicted the population of countries quite well. He called it the logistic function but did not provide any justification for this name!

Maximum likelihood estimation: the hidden calculation behind the logistic regression

Maximum likelihood estimation, or MLE as we often call it, is a way to find the probability distribution function that fits some data. Here is a good introduction:
For example, if we assume that the Gaussian (normal) distribution \[f_{normal}(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\] describes the data’s distribution, we need to find the best two parameters, the mean \(\mu\) and the standard deviation \(\sigma\). But what if the data (the event or output) is binary (0 and 1, or rejection and acceptance)? We need the binomial distribution.

Consider
  • \(m\): the total number of observations (the total number of acceptances and rejections)
  • \(p\): the (average) acceptance chance of an application
  • \(1-p\): the (average) rejection chance of an application
The probability of having \(x\) accepted applications is \(p^x\) while \(m-x\) rejections occur with the probability of \((1-p)^{m-x}\). So do we multiply these two probabilities to get the probability of getting \(x\) acceptances and \(m-x\) rejections? No! The \(x\) acceptances and \(m-x\) rejections can occur in different orders, that is, there are several different ways of distributing them. The binomial coefficient \(C_{m,x}\) gives the number of possible combinations of having \(x\) acceptances and \(m-x\) rejections, so that the probability writes: \[\boxed{f_{binomial}(x)=C_{m,x}\,p^x\,(1-p)^{m-x}}\] where \(C_{m,x}\) reads \[C_{m,x}=\dfrac{m!}{x!\,(m-x)!}\]
Example (binomial coefficient): If \(m=4\) and \(x=2\) (2 acceptances), then there are 6 different combinations in which these 2 acceptances occur (❌ is the rejection and ✔ is the acceptance):

✔✔❌❌
❌❌✔✔
✔❌❌✔
❌✔✔❌
❌✔❌✔
✔❌✔❌

The binomial coefficient will be then 6 as the formula suggests: \(C_{4,2}=\dfrac{4!}{2!\, 2!}=6\).
Now that we have reviewed the basics of the binomial distribution, we can see how maximum likelihood estimation works for it: it finds the acceptance chance \(p\) which maximises the probability of the observed numbers of acceptances and rejections. For example, assume that 7 out of 10 applications are rejected. Then the probability is \[C_{10,3}\, p^3 \,(1-p)^7\]
MLE looks for the \(p\) which makes this probability, or likelihood, of our result as large as possible. Simple calculations show that \(p=0.3\) does this job! You can see that from the plot as well. Notice, for example, that if we considered \(p=0.6\), which assumes a 60% acceptance chance, it would be quite unlikely (less than 5%) to get such a result (only 3 acceptances).
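A quick numerical check of this claim (a minimal sketch; the grid search over \(p\) is just for illustration):

```python
import numpy as np
from math import comb

p = np.linspace(0.001, 0.999, 999)
likelihood = comb(10, 3) * p**3 * (1 - p)**7   # 3 acceptances out of 10 applications

print(p[np.argmax(likelihood)])                # ~0.3, the maximum likelihood estimate
print(comb(10, 3) * 0.6**3 * 0.4**7)           # ~0.04, i.e. less than 5% if p were 0.6
```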
However, in our problem of predicting the decision based on GPA, things are not as easy as this, because \(p\) is clearly not the same for all applications; \(p\) is not a constant anymore but a function of the independent variables of the model (like GPA). Thus, instead of \(p\), we have \(p_k\) for each application \(k=1,2,\cdots,m\). Ignoring the combinatorial constant, we write the likelihood as \[L:=\prod_{k=1}^m p_k^{y_k}(1-p_k)^{(1-y_k)}\] where \(y_k\) is the decision. And what is \(p_k\)? Similar to our discussion in the first part, we take this function to be the logistic function \(p_k=g(\eta_k)=\dfrac{1}{1+e^{-\eta_k}}\). So, the objective is to find the maximum of this likelihood function \(L\).

As "logarithm of products is a summation of logarithms", people often work with log-likelihood \(\ell\), which is the logarithms of the likelihood, that is \(\ell=\ln(L)\) or \[\boxed{\ell :=\sum_{k=1}^m\Big( y_k \ln(p_k)+(1-y_k) \ln(1-p_k)\Big)}\] Don’t you think that this log-likelihood looks a bit familiar? This is exactly the cost function we wanted to minimize in the first part of this article, but with a negative sign! So, maximising the likelihood function is, in fact, minimizing the cost function! Now, we have a clear idea where that function originated from.

Oct 31, 2020

Discovering the lost "Bendixson's theorem"

Working on my PhD thesis, I had to investigate the "stability" of an "evolutionary system" \(U'(t)=(A+B)\,U\), where \(U=(u_1,u_2,\cdots,u_n)\) is a vector and both \(A\) and \(B\) are square matrices of size \(n\). Such a system describes the dynamics of the vector \(U\) over time. Here, stability asks whether the norm of the solution vector \(U\), denoted by \(\|U\|\), stays bounded or blows up as \(t\to\infty\). It is a well-established result in control theory and differential equations that stability is linked to the "eigenvalues" of the matrix \(A+B\), denoted by \((\lambda_j)_{j=1}^n\). What matters, in fact, is the sign of the real parts of the eigenvalues, that is:
  • if \(\mathrm{Re}(\lambda_j)>0\) for some \(j=1,2,\cdots,n\), the system is "unstable": the norm of \(U\) increases in time without bound, \(\displaystyle\lim_{t\to\infty}\|U(t)\|=\infty\)
  • if \(\mathrm{Re}(\lambda_j)<0\) for all \(j=1,2,\cdots,n\), the system is "stable": the norm of \(U\) remains bounded in time, \(\displaystyle\lim_{t\to\infty}\|U(t)\|<\infty\)
Of course, I knew this famous result. My problem, however, was that I could not really determine the sign of \(\mathrm{Re}(\lambda_j)\) for my matrix \(A+B\), because the entries were parametric! What I had, in addition, was
  • Matrix \(A\) is Hermitian (or symmetric) and negative-definite.
  • Matrix \(B\) is skew-Hermitian (or skew-symmetric).
As matrix \(A\) is Hermitian and negative-definite, its eigenvalues are real and negative, so this part is "stable". So, the question of the stability of \(U'(t)=(A+B)\,U\) boils down to an algebraic question:

Can adding a skew-Hermitian matrix to a stable matrix make the sum \(A+B\) unstable?

\[ \underbrace{A}_{\text{stable}} + \underbrace{B}_\text{skew-Hermitian} \Longrightarrow \text{stable?} \] I was digging through the literature to answer this question. I found tons of inequalities (partially discussed on Terence Tao's blog) giving bounds on the eigenvalues of matrix sums, but nothing could help me!

Then, I came up with the idea of generating lots of random 2-by-2 matrices with the same properties as my \(A\) and \(B\) to see whether \(A+B\) is stable or not. And it was always stable! So, I got really suspicious and more motivated to continue my search. One day, Google took me to Michele Benzi's website outlining the topics of his course "Matrix Analysis", and something caught my attention: Bendixson's theorem!
I was, of course, curious enough to google this theorem. I found another paper by Helmut Wielandt rephrasing the original result of Bendixson:
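For the curious, here is a minimal reconstruction of that experiment (not my original script): draw a random Hermitian negative-definite \(A\) and a random skew-Hermitian \(B\), and record the largest real part among the eigenvalues of \(A+B\).

```python
import numpy as np

rng = np.random.default_rng(1)
worst = -np.inf
for _ in range(10000):
    M = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
    A = -(M @ M.conj().T) - 1e-6 * np.eye(2)   # Hermitian and negative-definite
    S = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
    B = (S - S.conj().T) / 2                   # skew-Hermitian
    worst = max(worst, np.linalg.eigvals(A + B).real.max())

print(worst)   # stays negative in every trial, hinting that A + B is always stable
```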
And this is what I was looking for: it simply says that the real parts of the \(\lambda_j\)'s are bounded by the (real) eigenvalues of the Hermitian matrix \(A\), while the imaginary parts are bounded by the (imaginary parts of the) eigenvalues of the skew-Hermitian matrix \(B\).
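In symbols, and as I read the theorem for our setting (my paraphrase, with \(\mu_k(B)\) denoting the purely imaginary eigenvalues of \(B\)): for every eigenvalue \(\lambda_j\) of \(A+B\), \[\lambda_{\min}(A)\;\le\;\mathrm{Re}\,\lambda_j\;\le\;\lambda_{\max}(A), \qquad \min_k \mathrm{Im}\,\mu_k(B)\;\le\;\mathrm{Im}\,\lambda_j\;\le\;\max_k \mathrm{Im}\,\mu_k(B).\] In particular, since \(A\) is negative-definite, \(\lambda_{\max}(A)<0\), so every \(\mathrm{Re}\,\lambda_j<0\) and \(A+B\) stays stable.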

This result was obtained by Bendixson in 1901 (for real matrices) and then extended to complex matrices in 1902 by Hirsch:
My problem, fortunately, was solved by this interesting theorem and ended up as a journal paper. But I am still surprised by how many mathematicians are unaware of Bendixson's theorem, even the great Terence Tao (!):
To be continued!

Oct 22, 2020

In search of the “logistic regression”! (part I)

Unfortunately, if you ask a "data scientist" to explain the real machinery and rationale behind logistic regression, the answer can hardly convince anyone, often not even the data scientist (which I doubt)! They repeat the same "ambiguous" explanations about the logit function and log-odds and how much it looks like a line! "Look! It is so simple: \(\ln(\frac{p}{1-p})=\beta_0+\beta_1 x\)". Things get even worse when they want to justify the strange form of the "cost function" (here, for example). My first goal in this post is to justify the cost function in a simple, but mathematically valid, language. In the next post, I will explain it from another point of view: maximum likelihood estimation.
The dean of the Department of Economics and Business at a respectable university would like to predict whether an application submitted to their graduate program will be accepted or rejected, based on the applicant’s GPA. She reviews 20 applications from last year and puts her findings in a scatter plot, where 0 and 1 correspond, respectively, to rejection and acceptance of an application. (In French, "Mention" means grade!)


Even though the correlation between GPA and the application decisions is not that strict, she can still say that applicants with higher GPAs “seem more likely” to get accepted. But, rather than such rough judgments, the dean prefers to be more precise and get a model describing this “likeliness” or  “probability”, a model which provides the acceptance probability given a particular GPA. What can she do?

Linear regression? Not a good idea!

One may quickly suggest linear regression (seriously?!). It looks a bit strange; however, let’s give it a chance. Linear regression provides this yellow line, the line with the smallest residuals:
One may say that it is better than nothing, and I agree! But notice that the values the linear model predicts are not confined to the interval \([0,1]\), so one cannot directly interpret them as valid probabilities between 0% and 100%.

Transforming the linear regression? Not a good idea either!

One may think of “transforming” the linear model into a model living in \([0,1]\). A very common and well-known trick is to use the logistic function, which has an "S"-shaped curve (the sigmoid function): \[y = g(z)=\dfrac{1}{1+e^{-z}}\]
In fact, the logistic function transforms the variable \(z\) living in \([-\infty,+\infty]\) to \(y\) which lives in \([0,1]\). This property is exactly what we were looking for!

So, it seems that the problem is solved and we can transform the output of our linear regression from \([-\infty,+\infty]\) to \([0,1]\). To do this, we keep the linear estimator \(\eta=\beta_0+\beta_1 \,x\), but we change the estimated response from \(\widehat{y}=\eta\) to \(\widehat{y}=g(\eta)\).
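In code, this transformation is one line (a sketch; the two coefficients below are placeholders standing in for the least-squares fit):

```python
import numpy as np

def g(z):                              # logistic (sigmoid) function
    return 1.0 / (1.0 + np.exp(-z))

beta0, beta1 = -8.0, 0.5               # placeholder linear-regression coefficients
gpa = np.linspace(10, 20, 100)
eta = beta0 + beta1 * gpa              # linear estimator, lives in (-inf, +inf)
y_hat = g(eta)                         # transformed estimate, lives in (0, 1)
```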

Let’s look at what we get. The red line is the result of applying the logistic function to the yellow line. As we wished, the estimated values are in \([0,1]\), which is great. But does this model make any sense? The yellow line minimizes the residuals, \(\min\|\eta-y\|\), among all linear models. But what can we say about the red line? Does it reach the minimum of \(\min\|g(z)-y\|\)? No! In fact, we cannot find much mathematical sense in this transformed model unless we make a small modification.

Least-square of residuals? We are quite there!

Our persistent dean does not lose her hope and, this time, tries finding the solution of \(\min\|g(z)-y\|\). This is what she gets (the red line):
Now, the values are in \([0,1]\) and she is sure that this logistic model is the best model in the sense that it minimizes the norm of residuals. Seems much better, no? But she should not be so excited yet!

Do you remember that, to find the minimum of a function, we look for the points where its derivative is zero? In fact, this zero-derivative condition is satisfied not only at the global minimum of the function, which is our objective, but also at all local minima and maxima. If the function is “convex”, one can be sure that there is only one minimum, but what if it is “non-convex” (as you see in the picture)?
When you ask your computer to solve the minimization problem, it searches for points which satisfy the zero-derivative condition, starting from the initial guess you feed it with, and it gets back to you as soon as it finds such a point. But can one guarantee that what is found is the global minimum?
Unless the function is convex, this point can even be a local maximum! In fact, for non-convex problems, your computer may get trapped at a point other than the global minimum and report it as the solution you asked for. This is the drawback of \(\min\|g(z)-y\|\): it is non-convex! It would be insightful to check this graphically:


The absolute difference between the estimated response \(\widehat{y}_k\) (for each data point \(k\)) and the true response \(y_k\) is called the residual \(e_k\). Now, consider three residuals: \(e_1\) and \(e_3\) where the response is 1, and \(e_2\) where it is 0: \[e_1=|1-g(\eta_1)| \qquad \eta_1=\beta_0-0.3\,\beta_1\\ e_2=|0-g(\eta_2)| \qquad \eta_2=\beta_0+0.2\,\beta_1\\ e_3=|1-g(\eta_3)| \qquad \eta_3=\beta_0+0.8\,\beta_1\] For simplicity, we fix \(\beta_0=1.5\) and plot each squared residual as a function of \(\beta_1\): \(e_1^2\) (in red), \(e_2^2\) (in blue), and \(e_3^2\) (in green). The sum \(\sum_{k=1}^3 e_k^2\) (the cost function) is also plotted in orange, and it is clearly non-convex!
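Here is a small sketch reproducing that plot numerically (same three data points, \(\beta_0\) fixed at 1.5):

```python
import numpy as np
import matplotlib.pyplot as plt

def g(z):                                    # logistic function
    return 1.0 / (1.0 + np.exp(-z))

beta0 = 1.5
beta1 = np.linspace(-30, 30, 1000)

e1 = np.abs(1 - g(beta0 - 0.3 * beta1))
e2 = np.abs(0 - g(beta0 + 0.2 * beta1))
e3 = np.abs(1 - g(beta0 + 0.8 * beta1))

plt.plot(beta1, e1**2, 'r', beta1, e2**2, 'b', beta1, e3**2, 'g')
plt.plot(beta1, e1**2 + e2**2 + e3**2, color='orange')   # the non-convex cost
plt.xlabel(r'$\beta_1$')
plt.show()
```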

Notice that the minimization problem here is much more difficult than the one in linear regression. There, the model was linear, so we could employ linear algebra (QR decomposition) to simplify the problem. Here, the logistic function makes the problem non-linear, so we have no choice but to work with the zero-derivative condition.

But this is not the whole story. Even if God solved this minimization problem for you (which He does not!), it would not be sensible to define the residuals as in linear regression. Here, there are only two possible outputs: 0 (rejection) or 1 (acceptance). If the response is 0 and the estimate is 1, the residual is 1, but this “1” is really huge; it is the largest error one could make, it is not “only 1”! So, it makes sense to define the residual differently, such that the cost function reflects the “huge” difference between the two possible outputs.

Re-defining the residuals? Finally … the salvation!

Rather than the usual residual \(e_k=|\widehat{y}_k-y_k|\), we want to define a new residual \(\epsilon_k\) which becomes huge when \(e_k=1\). What about this: \[\boxed{\epsilon_k=-\ln(1-e_k)}\] As we can see in the graph,

  • if \(e=0\) (smallest difference), \(\epsilon=0\)
  • If \(e=1\) (largest difference), \(\epsilon=+\infty\)
So, for this new residual, a difference of 1 is huge, which is exactly what we were looking for. Then, instead of minimizing \(\|e\|\), we consider another minimization problem: \(\min \sum_k\epsilon_k\). One can show that the function \(\sum_k\epsilon_k\) is convex, so it has only one local (and global) minimum. A more common way of writing this cost function is as follows: \[\boxed{\ell =\sum_{k=1}^m\Big[-y_k\,\ln(\widehat{y}_k)-(1-y_k)\,\ln(1-\widehat{y}_k)\Big]}\]
  • When \(y_k=0\), the residual is \(-\ln(1-\widehat{y}_k)\)
  • When \(y_k=1\), the residual is \(-\ln(\widehat{y}_k)\)
which gives the same thing as the formula \(\min \sum_k\epsilon_k\).
Considering the same example as before, we compute the new residuals: \(\epsilon_1\) and \(\epsilon_3\) where the response is 1, and \(\epsilon_2\) where it is 0: \[\epsilon_1=-\ln(1-e_1)\\ \epsilon_2=-\ln(1-e_2)\\ \epsilon_3=-\ln(1-e_3)\] So, we plot each residual as a function of \(\beta_1\): \(\epsilon_1\) (in red), \(\epsilon_2\) (in blue), and \(\epsilon_3\) (in green). The sum \(\sum_{k=1}^3 \epsilon_k\) (the cost function) is also plotted in orange, and it is convex!
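The analogous sketch for the new residuals produces the convex orange curve:

```python
import numpy as np
import matplotlib.pyplot as plt

def g(z):                                    # logistic function
    return 1.0 / (1.0 + np.exp(-z))

beta0 = 1.5
beta1 = np.linspace(-30, 30, 1000)

e1 = np.abs(1 - g(beta0 - 0.3 * beta1))
e2 = np.abs(0 - g(beta0 + 0.2 * beta1))
e3 = np.abs(1 - g(beta0 + 0.8 * beta1))

eps1, eps2, eps3 = -np.log(1 - e1), -np.log(1 - e2), -np.log(1 - e3)
plt.plot(beta1, eps1, 'r', beta1, eps2, 'b', beta1, eps3, 'g')
plt.plot(beta1, eps1 + eps2 + eps3, color='orange')      # the convex cost
plt.xlabel(r'$\beta_1$')
plt.show()
```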

Let us solve \(\min \sum_{k=1}^m \epsilon_k\) and obtain the logistic model as this new red line, which has values in \([0,1]\), whose residuals are based on the correct intuition, and which is much easier to compute thanks to convexity.


Having probabilities is nice, but sometimes we want to be stricter; we want to decide: 0 or 1! In such cases, we often apply a threshold probability of 50% to the logistic model: if the probability is more than 50%, the application will be accepted, and rejected otherwise.

Mathematically speaking, this is related to the sign of our linear estimator \(\eta=\beta_0+\beta_1\, x\):
  • If \(\eta<0\), the probability is less than 50%
  • If \(\eta>0\), the probability is more than 50%
  • If \(\eta=0\), the probability is exactly 50%. This gives us the “decision boundary”, which is coloured black in the graph at Mention = 17.29983.
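In a sketch (the coefficients are placeholders chosen to roughly match the boundary in the plot, not the actual fitted values):

```python
beta0, beta1 = -10.38, 0.6        # placeholder coefficients, not the dean's fit
threshold = -beta0 / beta1        # eta = 0  <=>  Mention = -beta0/beta1, ~17.3 here
accept = lambda mention: beta0 + beta1 * mention > 0     # probability above 50%
print(threshold, accept(18), accept(15))
```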

I should stop here as this post has become too long. But we are not finished yet! In the next post, the dean plans to add another variable to her study, so we will face multiple logistic regression. It will be so much fun! We will also travel to the 19th century to discover the origins of the logistic regression!

Aug 16, 2020

Student’s t-test: judge the world with a small sample!

Assume that, for some reason, you weigh a can of tuna on a kitchen scale and notice that the weight is quite a bit smaller than it should be. As you are a persistent person, you keep going and weigh 6 more cans, and the average weight is still far from what is written on the label. So you probably wonder whether the company is cheating and whether you have found trustworthy evidence, no?

Notice that your evidence, the cans’ average weight, is based on a sample of only the 7 cans you could get your hands on. However, what you want to claim is about the average weight of the whole population of tuna cans that the company produces. Are we able to prove this fraud with such a small sample?

This problem is very common in statistics: you need to make a judgement about the population mean based on the sample mean. Such a judgement is extremely important in practice, as the population is often not accessible, or it is too costly and difficult to investigate. This is the main motivation for the t-test!

Let us first introduce some simple notations:

  • \(n\): the sample size, with samples \(X_1, X_2, \ldots, X_n\)
  • \(\bar{X}\) and \(s\): the mean and standard deviation of the sample
  • \(\mu\) and \(\sigma\): the mean and standard deviation of the population

Assume that the population is normal, which means that if you repeat your experiment (on \(n\) samples) many times (\(k\) times) and each time compute and plot the sample mean \(\bar{X}\), you will end up with a Gaussian (bell-shaped) distribution. The following graph is plotted with \(k = 20000\) and an average tuna-can weight of 150 g.

One can prove that \[\bar{X}\sim N\!\left(\mu,\ \frac{\sigma^2}{n}\right)\] which is a mathematical notation for saying that the sample mean \(\bar{X}\) is normally distributed around the population mean \(\mu\) with standard deviation \(\sigma/\sqrt{n}\). One can standardize this using the so-called z-score \[z=\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\] which has a normal distribution with zero mean and unit standard deviation: \(z\sim N(0,1)\).


This seems useful, as the famous 68–95–99.7 rule tells us that there is a 99.7% chance that \[-3 \;\le\; z \;\le\; 3.\] So, after some simple manipulations, we get an estimate for the population mean, \[\bar{X}-3\,\frac{\sigma}{\sqrt{n}} \;\le\; \mu \;\le\; \bar{X}+3\,\frac{\sigma}{\sqrt{n}},\] which is very, very likely to be true (OMG, 99.7%!).


So, based on one small sample of size \(n\), we can estimate \(\mu\); and what if the nominal weight on the label does not satisfy this estimate? The company is, very likely, cheating!

t-score 

Things seem too good to be true, no? Note that this nice calculation based on the z-score requires knowing the standard deviation of the population. Do we have it? Often not!

One may be tempted to replace \(\sigma\) with the sample standard deviation \(s\). This is how we end up with the so-called t-value: \[t=\frac{\bar{X}-\mu}{s/\sqrt{n}}\]
The problem is still not solved! The issue is that for \(z\) we had \(z\sim N(0,1)\), based on which we found the final estimate for \(\mu\); but what can one say about \(t\)?

It was William Sealy Gosset, an English statistician working at the Guinness Brewery in Dublin, who addressed this question. He found out that the t-score, unlike the z-score, does not follow the normal distribution but another distribution, which is what we call the t-distribution with \(\nu=n-1\) degrees of freedom:

In this graph, we have plotted the normal distribution as well as t-distribution with different sample sizes or degrees of freedom: 

As one can see, when the sample size is small, the t-distribution is quite different from the normal distribution, but as the sample gets larger, the t-distribution gets closer and closer to the normal distribution. They are almost the same for samples of size greater than 30.

So, the problem is solved: we know the distribution of the t-score, we can determine the region of 99.7% probability around zero, we do calculations similar to those before, and we obtain an estimate of the population mean using \(\bar{X}\) and \(s\).
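Here is a minimal sketch of that calculation with scipy (the seven can weights are made-up numbers):

```python
import numpy as np
from scipy import stats

weights = np.array([141.0, 138.5, 144.0, 139.5, 143.0, 137.0, 142.5])  # made-up sample, in grams
n = len(weights)
x_bar, s = weights.mean(), weights.std(ddof=1)

t_crit = stats.t.ppf(0.99865, df=n - 1)      # ~99.7% two-sided, like the 3-sigma rule
half_width = t_crit * s / np.sqrt(n)
print(f"mu is estimated to lie in [{x_bar - half_width:.1f}, {x_bar + half_width:.1f}]")
# if the label's nominal weight falls outside this interval, the claim looks suspicious
```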



A bit of history

Gosset published his seminal work, “The probable error of a mean” in 1908, but not under his own name! He, instead, used the pen name “Student” and this is why today we call his distribution Student’s t-distribution. This is, happily, much simpler than the name he had chosen: “frequency distribution of standard deviations of samples drawn from a normal population”!


It is not crystal clear why he used a pen name and why he chose “Student”! Perhaps it was due to Guinness Brewery’s secrecy policy. Yet another interesting explanation is that the famous Karl Pearson, the editor of the journal Biometrika in which the paper was published, did not want it to be known that the paper had been written by a brewer! 


People usually attribute the t-distribution to Gosset. However, this distribution had already been discovered in Germany, around the time Gosset was born (1876)! Friedrich Robert Helmert, in a series of articles (1875-1876), provided correct mathematical proofs of all the elements needed to arrive at the t-distribution. He only needed one simple extra calculation to reach what Gosset reached 32 years after him; but he was not interested! His work was clearly unknown to English statisticians, including Gosset, who had to solve the problem by himself. Lacking Helmert's analytic power, Gosset based his result on two “assumptions”, which were later proved by another great name in statistics, Ronald Fisher!


Also in 1876, the eminent German mathematician Jacob Lüroth obtained the t-distribution in his paper entitled “Vergleichung von zwei Werten des wahrscheinlichen Fehlers“. His work was, however, based on some unexplained assumptions.





Jun 7, 2020

Learning French with zero budget?


You want to learn French (or another language) and you prefer not to pay a cent? So, perhaps the only way is to study on your own. It can be a win-win game for your time and money; sounds quite tempting, no?


But let me tell you honestly: it will be a tough game, where you need to be self-disciplined to succeed. The bad news is that discipline alone is not sufficient! I learnt this the hard way when I moved to Germany. I had been quite successful when it came to self-studying, so I was sure I could handle it this time too, and guess what? I failed, and I failed miserably!

  • I started the Deutsche Welle German course and I gave up

  • I continued with Rosetta Stone and I couldn't keep going

  • I attended a German-speaking gathering; I did not give up, but I wasted 90 minutes/week for 10 weeks!

  • I found Learn with Oliver and enjoyed practising every single day. I learnt lots of useful words, but as I did not have any grammatical basics, I could not use these words in my daily life!

That was not only a concise list of my failures in learning German, but also a list of useful methods one could try with a good outcome. My problem was not self-discipline or the material; it was more about management! I did not choose the right material/method at the right time.


When I wanted to move to France, I did not want to make the same mistake! I registered for a free French course offered by the university, which gave me the essentials of grammar and vocabulary. After that, I was able to continue on my own. In fact, I did not do so, because I was afraid of failing again and losing my motivation, but I probably could have managed it. Speaking from my experience, if you know the basics and you have access to good materials, you can survive this game.


Here, I would like to share a list of free materials and sources that I found useful. This list is also inspired by this post.




DUOLINGO: It is good for the elementary level as you practice new words and expressions. But, it may be annoying to repeat all those stupid sentences it asks you to practice like this:


Also, it does not replace a good course on grammar. Recently, Duolingo has offered lots of short stories in its web version, which are also interesting.


LEARN WITH OLIVER: This website offers flashcards for words and sentences, but also texts. It is really interesting, but if you want the pronunciation, you have to pay a bit. If not, simply use something like Google Translate.


You can see a quick review of this website here:



Learn French With Alexa: This YouTube channel teaches very simple and clear French. Depending on your taste, you may like it or not, but personally, I think it is good enough for a beginner who is not willing to pay for a French course!



Radio France Internationale (RFI): RFI offers TCF practice but also lots of exercises from elementary (A1-A2) to advanced levels, as well as lots of audio/video reports on current topics.



TV5Monde: TV5Monde is a French television network, broadcasting several channels of French-language programming. Like RFI they have lots of amazing audio-visual resources for learning French and it is all free!


innerFrench: innerFrench is a YouTube channel by a handsome, clearly-speaking teacher. The topics are interesting and worth listening to. The only problem is that to enjoy all this, you need to be at the intermediate level!



Ina Société: Ina Société offers amazing old interviews and programs in French. They are clearly for intermediate-advanced level. But take a look there anyway. At least, it may motivate you!

Grammaire en dialogues: This is one of the best books for learning grammar by yourself. You can also find the pdf for free (not legal) or from academia.edu. The audio is available here:



May 31, 2020

Gender pay gap: Is 10% gap large? Not necessarily!


The gender pay gap (GPG) is quite a hot topic (before the era of the coronavirus, of course!) and it is, simply speaking, the difference between the salaries of men and women. To be more precise and to avoid any confusion, one should make a crucial distinction between two kinds of GPG right from the beginning, which is, more often than not, overlooked in the media:


Non-adjusted (uncontrolled) GPG: It compares the mean or median of salaries of men and women (for example in different countries and with different ethnicities), ignoring all or most of their differences (like full/part-time position, years of experience, working hours, different positions in terms of seniority, etc.)





Adjusted (controlled) GPG:
This is the difference in the salary for equal work and when the only difference between the groups is gender.


https://www.payscale.com/data/gender-pay-gap



Both types of GPG are worth studying; however, I should highlight their different applications:

  • Adjusted GPG can be used to study gender discrimination in salary because we compare two groups whose main (only) difference is gender.

  • Non-adjusted GPG, on the other hand, cannot prove any gender discrimination in salary. There are lots of differences between groups, so, there is no way that it can confirm or reject gender discrimination.

  • That being said, the non-adjusted GPG can be used for other purposes, like illustrating that women do not earn as much as men, which can guide us towards understanding the underlying reasons, for example, discrimination against one gender in reaching higher positions or in access to education (which leads to better-paid jobs). Note that it does not necessarily do so.


Also, note that the adjusted GPG is far smaller than the non-adjusted one, which could imply that gender’s direct effect on earnings is smaller than the other factors living in the non-adjusted GPG. One may call them indirect effects.



So, hereinafter, whenever I talk about the GPG, I mean the adjusted gender pay gap, where everything is the same between the groups but the gender!



How much is too much?


If a reliable source reports GPG for a survey as 1%, do you consider it as “small”? What about 5%? What about 10%?


I would assume that 1% is not really important for most of us. We might think that it is just statistics: no one expects a 0% difference, and there can always be a tiny random detail which changes the numbers. With 5%, feminists and women's rights activists will be quite upset, and with 10%, perhaps most of us!


The simple but surprising truth is that none of these values is meaningful per se! In one survey, 1% can indicate obvious discrimination while, in another survey, even 10% can be just due to chance and cannot prove the existence of any gender discrimination!


That was the conclusion I wanted to draw in this post! Now that you know the result, let’s explain why it is so!



Let’s ask the Mann-Whitney-Wilcoxon test for help

The Mann-Whitney-Wilcoxon test is a statistical test telling us whether the difference between two groups is really significant or whether it is just due to randomness or chance; if it is the latter, the survey does not support gender discrimination. Look here for an introduction:




We consider four different cases:

  • Small GPG which is not significant → not surprising (Example A)
  • Small GPG which is significant → surprising (Example B)
  • Big GPG which is not significant → surprising (Example C)
  • Big GPG which is significant → not surprising (Example D)

Examples B and C are more interesting, as they come with a surprise and contradict our intuition.

Now, we investigate these four examples in detail. Each example consists of two groups, men and women, each with 25 samples.
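Each case below was generated and tested along these lines (a minimal sketch; the exact salary distributions used in the plots are not reproduced here):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)
men = rng.normal(loc=55_000, scale=8_000, size=25)     # illustrative salaries
women = rng.normal(loc=54_500, scale=8_000, size=25)   # roughly 1% lower mean

u_stat, p_value = mannwhitneyu(men, women, alternative="two-sided")
print(f"U = {u_stat:.0f}, p = {p_value:.3f}")
# a large p-value means a gap of this size is plausible under pure chance
```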



Small GPG which is not significant


The mean difference between groups is 1% in favour of men. Here, you can see the distribution of the salary for each group as well as its difference:


 
 

Here are the quantiles for each group, which do not show much difference between men and women:



So, quite naturally, one would not expect this small difference to be significant. This guess can be confirmed by the Wilcoxon test:


The main point here is the p-value, which shows that if everything were due to chance, there would be a 69% probability of getting a difference as big as this 1% between the groups. So, we conclude that this 1% is very likely due to chance.



Small GPG which is significant


The mean difference between the groups is again 1% in favour of men. But, as you can see, most men earn quite close to the mean (55 K) and there are a few men who earn more:


 
 

The quantiles also show a difference between the groups, as women's salaries are always below men's:



The Wilcoxon test confirms that this small difference is significant, as the p-value is almost zero, which shows that the chance of getting such an extreme case by pure chance is very small (67 in a trillion!)




Big GPG which is not significant


The mean difference between the groups is 12% in favour of men, but the variation in men's salaries is quite high.


  


The quantiles also confirm that the distribution of men's salaries is quite wide compared to women's:



In fact, this is the reason which makes this big GPG non-significant, as the Wilcoxon test shows:



It says that there is a 20% probability of this happening simply due to chance, and this is too high to be considered a real, significant difference between the two groups! So, even such a high difference may not prove gender discrimination!



Big GPG which is significant


This looks quite expected! The mean difference between groups is 19% in favour of men:


  

Let's look at the quantile, which shows a quite big difference:




Then, the Wilcoxon test confirms that the difference is not likely to have happened by chance, as the p-value is very small: 0.0002!




What do we get from all this?


We showed that a small difference can be significant while a big difference can be non-significant. So, we should not claim or reject a gender pay gap (or similar difference/discrimination-based statements) based only on the absolute value of a difference! Numbers can be misleading, remember!