Dec 2, 2020

In the search of the “logistic regression”! (part II)

Unfortunately, if you ask a "data scientist" to explain the real machinery and rationale behind logistic regression, their answer can hardly convince anyone, perhaps not even themselves! They repeat the same ambiguous explanations about the logit function and log-odds and how much it looks like a line: "Look! It is so simple: \(\ln(\frac{p}{1-p})=\beta_0+\beta_1 x\)". Things get even worse when they try to justify the strange form of the "cost function" (here, for example). In part I, I justified the cost function in simple, but mathematically valid, language. In this post, I will explain it from another point of view: maximum likelihood estimation.

Do you remember the dean who was looking for a model to predict if an application gets accepted [here]? This time, after her earlier success, she wants to create a more complicated model!

Multiple logistic regression

Assume that the dean adds another factor to her study: the language score TCF. She plots the GPA and TCF scores of each applicant in a scatter plot and, to make things clearer, colours the dots: green for acceptance and red for rejection.
As trivial as it may seem, higher GPA and TCF scores appear to be related to the decision made on the application. The model that predicts the likelihood of a data point being red (0) or green (1) is similar to the one-dimensional example but in 3D, so it looks like an S-shaped surface:
Obtaining the best surface is similar to the 1D case in the first part: we define the new residuals \(\epsilon_k=-\ln(1-e_k)\) with \(e_k=|g(\eta_k)-y_k|\) and \(\eta_k=\beta_0+\beta_1 x_k^{(1)}+\beta_2 x_k^{(2)}\) for all \(k=1,2,\cdots,m\). Then, we solve \(\min \sum_{k=1}^m \epsilon_k\).
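To make this concrete, here is a minimal numerical sketch of that minimization. The handful of applicants, the standardisation step, and the use of scipy are my own assumptions for illustration, not the dean's actual data or method:

```python
# Sketch: fit the 3D logistic model by minimizing the cost
# sum_k eps_k, with eps_k = -ln(1 - e_k) and e_k = |g(eta_k) - y_k|.
import numpy as np
from scipy.optimize import minimize

def g(z):
    return 1.0 / (1.0 + np.exp(-z))               # the logistic function

# hypothetical applicants: columns are [GPA, TCF]; y is the dean's decision
X = np.array([[3.2, 350.0], [2.1, 280.0], [3.8, 450.0],
              [2.9, 400.0], [3.0, 430.0], [2.4, 420.0]])
y = np.array([1, 0, 1, 1, 0, 1])

X = (X - X.mean(axis=0)) / X.std(axis=0)          # standardise for stable numerics

def cost(beta):
    eta = beta[0] + X @ beta[1:]                  # eta_k = b0 + b1*x1 + b2*x2
    e = np.abs(g(eta) - y)                        # e_k = |g(eta_k) - y_k|
    return np.sum(-np.log(1.0 - e + 1e-12))       # sum of eps_k = -ln(1 - e_k)

best = minimize(cost, x0=np.zeros(3))
print("beta_0, beta_1, beta_2:", np.round(best.x, 3))
```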
You may remember from our earlier discussion [here] that we should not solve \(\min \|e\|\) because its corresponding cost function is non-convex. To illustrate this, consider the cost function for a very simple example: \[e_1=|1-g(\eta_1)|\qquad \eta_1=\beta_0-0.2 \beta_1 -0.3 \beta_2\\ e_2=|0-g(\eta_2)| \qquad \eta_2=\beta_0+0.1 \beta_1 +0.3 \beta_2\\ e_3=|1-g(\eta_3)| \qquad \eta_3=\beta_0+0.8 \beta_1+1.8 \beta_2\] with \(\beta_0=1.5\). The cost function is clearly non-convex.
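You can also see the non-convexity numerically. The sketch below (my own illustration, not part of the original study) evaluates \(\sum_k |y_k-g(\eta_k)|\) for the three points above over a grid of \((\beta_1,\beta_2)\) with \(\beta_0=1.5\); a contour or surface plot of Z shows the bumpy, plateau-ridden shape that makes gradient-based minimisation unreliable:

```python
# Sketch: evaluate the non-convex cost sum_k |y_k - g(eta_k)| on a grid.
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[-0.2, -0.3], [0.1, 0.3], [0.8, 1.8]])   # the x^(1), x^(2) pairs
y = np.array([1, 0, 1])
b0 = 1.5                                               # beta_0 fixed at 1.5

def abs_cost(b1, b2):
    eta = b0 + b1 * X[:, 0] + b2 * X[:, 1]
    return np.sum(np.abs(y - g(eta)))

b1s, b2s = np.meshgrid(np.linspace(-20, 20, 101), np.linspace(-20, 20, 101))
Z = np.vectorize(abs_cost)(b1s, b2s)                   # cost on the grid
print("cost range over the grid:", Z.min(), "to", Z.max())
```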

Decision boundaries

As we mentioned in the first part, the decision boundary consists of all points where \(\eta=0\), that is: \[\beta_0+\beta_1 x^{(1)}+\beta_2 x^{(2)}=0\] where \(x^{(1)}\) is the first variable (GPA), \(x^{(2)}\) is the second variable (TCF), and \(\beta_0,\beta_1,\beta_2\) is the best parameter set we obtained from the minimization problem. This gives the black line in the graph.
For this simple case, the decision boundary is a straight line. But if we need to describe more complicated cases, we can simply add higher-order terms like \((x^{(1)})^2\) or \(x^{(1)}x^{(2)}\), which leads to curved boundaries. For this example, if we add the square of the TCF score to our logistic model, we obtain:
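To see how such a quadratic term bends the boundary, here is a rough, self-contained sketch. The made-up data, the invented decision rule, and the use of scikit-learn are all assumptions of mine for illustration:

```python
# Sketch: add the square of the (rescaled) TCF score as an extra column,
# fit a logistic model, and read off the resulting curved boundary.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
gpa = rng.uniform(1.5, 4.0, 200)
tcf = rng.uniform(250, 600, 200)
# an invented decision rule with some noise, just to obtain two classes
y = (0.8 * gpa + 0.004 * tcf + rng.normal(0, 0.3, 200) > 4.0).astype(int)

t = tcf / 100.0                                   # rescaled TCF, for conditioning
X = np.column_stack([gpa, t, t ** 2])             # add the (x^(2))^2 term
model = LogisticRegression(max_iter=1000).fit(X, y)

b0 = model.intercept_[0]
b1, b2, b3 = model.coef_[0]
# eta = 0 now reads b0 + b1*gpa + b2*t + b3*t^2 = 0, so the critical GPA
# is a quadratic function of the TCF score: a curved decision boundary.
t_grid = np.linspace(2.5, 6.0, 5)
print(np.round(-(b0 + b2 * t_grid + b3 * t_grid ** 2) / b1, 2))
```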

Who came up with such a strange name, logistic regression?

In 1798 Thomas Malthus published his book “An Essay on the Principle of Population”, warning of “future difficulties, on an interpretation of the population increasing in geometric progression (so as to double every 25 years) while food production increased in an arithmetic progression, which would leave a difference resulting in the want of food and famine, unless birth rates decreased”. It was then, in 1838, that Pierre François Verhulst, a Belgian mathematician, modified Malthus’s model and obtained a more realistic function of the form \(g(z)=\dfrac{1}{1+e^{-z}}\), which predicted the populations of several countries quite well. He called it the logistic function but did not provide any justification for the name!

Maximum likelihood estimation: the hidden calculation behind the logistic regression

Maximum likelihood estimation, or MLE as we often call it, is a way to find the parameters of a probability distribution that best fit some data. Here is a good introduction:
For example, if we assume that the Gaussian (normal) distribution \[f_{normal}(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\] describes the data, we need to find the two parameters, the mean \(\mu\) and the standard deviation \(\sigma\), that describe the data best. But what if the data (the event or output) is binary (0 and 1, or rejection and acceptance)? Then we need the binomial distribution.
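Before moving on to the binomial case, here is a quick numerical illustration of the Gaussian example above. The sample is made up and the use of scipy is my own choice; the point is simply that the \(\mu\) and \(\sigma\) maximising the likelihood turn out to be the sample mean and (population) standard deviation:

```python
# Sketch: MLE for the normal distribution by minimising the negative log-likelihood.
import numpy as np
from scipy.optimize import minimize

data = np.array([2.1, 2.9, 3.4, 3.1, 2.6, 3.8, 2.2, 3.0])   # made-up sample

def neg_log_likelihood(params):
    mu, sigma = params
    # log of the normal density, summed over the sample, with a minus sign
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                   - (data - mu) ** 2 / (2 * sigma ** 2))

fit = minimize(neg_log_likelihood, x0=[1.0, 1.0],
               bounds=[(None, None), (1e-6, None)])
print("MLE estimate:  ", np.round(fit.x, 4))
print("closed form:   ", round(data.mean(), 4), round(data.std(), 4))
```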

Consider
  • \(m\), the total number of observations (the total number of acceptances and rejections);
  • \(p\), the (average) acceptance chance of an application;
  • \(1-p\), the (average) rejection chance of an application.
The probability of having \(x\) accepted applications is \(p^x\), while \(m-x\) rejections occur with probability \((1-p)^{m-x}\). So, is the product \(p^x(1-p)^{m-x}\) the probability of getting \(x\) acceptances and \(m-x\) rejections? Not quite! That product is the probability of one particular ordering, but the \(x\) acceptances and \(m-x\) rejections can occur in many different orders. The binomial coefficient \(C_{m,x}\) counts the number of possible arrangements of \(x\) acceptances and \(m-x\) rejections, so the probability reads: \[\boxed{f_{binomial}(x)=C_{m,x}\,p^x\,(1-p)^{m-x}}\] where \(C_{m,x}\) is given by \[C_{m,x}=\dfrac{m!}{x!\,(m-x)!}\]
Example (binomial coefficient): If \(m=4\) and \(x=2\) (2 acceptances), then there are 6 different combinations in which these 2 acceptances occur (❌ is the rejection and ✔ is the acceptance):

✔✔❌❌
❌❌✔✔
✔❌❌✔
❌✔✔❌
❌✔❌✔
✔❌✔❌

The binomial coefficient is then 6, as the formula suggests: \(C_{4,2}=\dfrac{4!}{2!\, 2!}=6\).
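Just to double-check this with a couple of lines of Python (my own aside): the code below lists every ordering of 2 acceptances among 4 applications and compares the count with \(C_{4,2}\).

```python
# Sketch: enumerate the orderings of 2 acceptances among 4 applications.
from itertools import combinations
from math import comb

m, x = 4, 2
orderings = list(combinations(range(m), x))   # positions of the acceptances
print(len(orderings), comb(m, x))             # both print 6
```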
Now that we have reviewed the basics of the binomial distribution, we can see how maximum likelihood estimation works for it: it finds the acceptance chance \(p\) which maximises the probability of the observed numbers of acceptances and rejections. For example, assume that 7 out of 10 applications are rejected (so 3 are accepted). The probability of this outcome is then \[C_{10,3}\, p^3 \,(1-p)^7\]
MLE looks for the \(p\) which makes this probability, or likelihood, of our result as large as possible. A simple calculation (setting the derivative of \(3\ln p+7\ln(1-p)\) with respect to \(p\) to zero gives \(p=3/10\)) shows that \(p=0.3\) does the job! You can see that from the plot as well. Notice, for example, that if we considered \(p=0.6\), which assumes a 60% acceptance chance, it would be quite unlikely (less than 5%) that we got such a result (only 3 acceptances).
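Here is a short numerical check of that claim (a sketch of mine, not part of the original calculation): it evaluates the likelihood on a grid of \(p\) and locates the maximum.

```python
# Sketch: evaluate C(10,3) * p^3 * (1-p)^7 over a grid of p.
import numpy as np
from math import comb

p = np.linspace(0.01, 0.99, 981)
likelihood = comb(10, 3) * p ** 3 * (1 - p) ** 7
print("best p ~", p[np.argmax(likelihood)])                       # ~ 0.3
print("likelihood at p = 0.6:", comb(10, 3) * 0.6**3 * 0.4**7)    # ~ 0.042
```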
However, in our problem of predicting the decision based on GPA, things are not as easy as this because \(p\) is clearly not the same for all applications; \(p\) is not a constant anymore but a function of the independent variables of the model (like GPA). Thus, instead of \(p\), we have \(p_k\) for each application \(k=1,2,\cdots,m\). Ignoring the constant \(C_{m,x}\), we write the likelihood as \[L:=\prod_{k=1}^m p_k^{y_k}(1-p_k)^{(1-y_k)}\] where \(y_k\) is the decision. And what is \(p_k\)? Similar to our discussion in the first part, we take this function to be the logistic function \(p_k=g(\eta_k)=\dfrac{1}{1+e^{-\eta_k}}\). So, the objective is to find the maximum of this likelihood function \(L\).

As "logarithm of products is a summation of logarithms", people often work with log-likelihood \(\ell\), which is the logarithms of the likelihood, that is \(\ell=\ln(L)\) or \[\boxed{\ell :=\sum_{k=1}^m\Big( y_k \ln(p_k)+(1-y_k) \ln(1-p_k)\Big)}\] Don’t you think that this log-likelihood looks a bit familiar? This is exactly the cost function we wanted to minimize in the first part of this article, but with a negative sign! So, maximising the likelihood function is, in fact, minimizing the cost function! Now, we have a clear idea where that function originated from.