Unfortunately, if you ask a "data scientist" to explain the real machinery and rationale behind logistic regression, the answer can hardly convince anyone, not even himself (which I doubt)! They repeat the same "ambiguous" explanations about the logit function and log-odds and how much it looks like a line! "Look! It is so simple: \(\ln(\frac{p}{1-p})=\beta_0+\beta_1 x\)". Things get even worse when they want to justify the strange form of the "cost function" (here for example). My first goal in this post is to justify the cost function in simple, but mathematically valid, language. In the next post, I will explain it from another point of view: maximum likelihood estimation.
The dean of the Department of Economics and Business at a respectable university would like to predict whether an application submitted for their graduate program will be accepted or rejected, based on the applicant’s GPA. She reviews 20 applications from last year and puts her findings in a scatter plot, where 0 and 1 correspond, respectively, to rejection and acceptance of an application. (In French, "Mention" means grade!)
Even though the correlation between GPA and the application decisions is not that strict, she can still say that applicants with higher GPAs “seem more likely” to get accepted. But, rather than such rough judgments, the dean prefers to be more precise and get a model describing this “likeliness” or “probability”, a model which provides the acceptance probability given a particular GPA. What can she do?
Linear regression? Not a good idea!
One may quickly suggest linear regression (seriously?!). It looks a bit strange; however, let’s give it a chance. Linear regression provides this yellow line, which is the line with the smallest sum of squared residuals:
One may say that it is better than nothing, and I agree! But notice that the values the linear model predicts are not confined to the interval \([0,1]\), so one cannot read them directly as valid probabilities between 0% and 100%.
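To see this on numbers, here is a minimal sketch. The 20 data points are made up for illustration (they are not the dean's real applications), and the names `gpa` and `accepted` are my own. It fits an ordinary least-squares line with NumPy and checks whether the fitted values leave \([0,1]\).

```python
import numpy as np

# Hypothetical data, invented for illustration: grade out of 20 and decision (0/1).
gpa = np.array([11, 12, 12.5, 13, 14, 14.5, 15, 15.5, 16, 16.5,
                17, 17, 17.5, 18, 18, 18.5, 19, 19, 19.5, 20], dtype=float)
accepted = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
                     0, 1, 1, 0, 1, 1, 1, 0, 1, 1], dtype=float)

# Least-squares line: np.polyfit returns the slope first, then the intercept.
beta1, beta0 = np.polyfit(gpa, accepted, deg=1)

# Fitted values of the linear model at the observed grades.
fitted = beta0 + beta1 * gpa

print("smallest fitted value:", fitted.min())
print("largest fitted value: ", fitted.max())
print("any value outside [0, 1]?", bool(((fitted < 0) | (fitted > 1)).any()))
```

With this made-up sample the lowest grade already gets a negative "probability", which is the problem described above.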
Transforming the linear regression? Not a good idea either!
One may think of “transforming” the linear model into a model living in \([0,1]\). A very common and well-known trick is to use the logistic function, which has an "S"-shaped curve (sigmoid function):
\[y = g(z)=\dfrac{1}{1+e^{-z}}\]
In fact, the logistic function transforms the variable \(z\) living in \((-\infty,+\infty)\) to \(y\), which lives in \((0,1)\). This property is exactly what we were looking for!
So, it seems that the problem is solved and we can transform the output of our linear regression from \((-\infty,+\infty)\) to \((0,1)\).
To do this, we still need the linear estimator \(\eta=\beta_0+\beta_1 \,x\) but what we change is the estimated response \(\widehat{y}=\eta\), which should be modified as \(\widehat{y}=g(\eta)\).
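As a small sketch of this transformation (the coefficients below are hypothetical, picked only to have something to plug in), the linear estimator \(\eta\) is computed first and then passed through \(g\):

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, for illustration only.
beta0, beta1 = -1.4, 0.115

x = np.array([11.0, 15.0, 20.0])   # a few grades
eta = beta0 + beta1 * x            # linear estimator: any real number
y_hat = logistic(eta)              # transformed response: strictly between 0 and 1

print(eta)
print(y_hat)
```

Whatever the sign or size of \(\eta\), the transformed value stays between 0 and 1, and \(g(0)=0.5\).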
Let’s look at what we get. The red line is the result of applying the logistic function to the yellow line. As we wished, the estimated values are in \([0,1]\), which is great. But does this model make any sense? The yellow line minimizes the residuals \(\|\eta-y\|\) among all linear models. But what can we say about the red line? Does it reach the minimum of \(\|g(\eta)-y\|\)? No! In fact, we cannot find much mathematical sense in this transformed model unless we make a small modification.
Least-square of residuals? We are almost there!
Our persistent dean does not lose hope and, this time, tries to find the solution of \(\min\|g(\eta)-y\|\). This is what she gets (the red line):
Now, the values are in \([0,1]\) and she is sure that this logistic model is the best model in the sense that it minimizes the norm of residuals. Seems much better, no? But she should not be so excited yet!
Do you remember that to find the minimum of a function, we look for the points where its derivative is zero? In fact, this zero-derivative condition is satisfied not only by the global minimum of the function, which is our objective, but also by all local minima and maxima. If the function is “convex”, one can be sure that any such point is a global minimum, but what if it is “non-convex” (as you see in the picture)?
When you ask your computer to solve the minimization problem, it searches for points which satisfy the zero-derivative condition, starting from the initial guess you feed it, and it gets back to you as soon as it finds such a point. But can one guarantee that what is found is the global minimum?
Unless the function is convex, this point can even be a global maximum! In fact, for non-convex problems, your computer may get trapped at a point other than the global minimum and consider it the solution you asked for. This is the drawback of \(\min\|g(\eta)-y\|\); it is non-convex! It would be insightful to check this graphically:
The absolute difference between the estimated response \(\widehat{y}_k\) (for every data point \(k\)) and the true response \(y_k\) is called the residual \(e_k\). Now, consider three residuals, \(e_1\) and \(e_3\) when the response is 1 and \(e_2\) when it is zero:
\[e_1=|1-g(\eta_1)| \qquad \eta_1=\beta_0-0.3\,\beta_1\\
e_2=|0-g(\eta_2)| \qquad \eta_2=\beta_0+0.2\,\beta_1\\
e_3=|1-g(\eta_3)| \qquad \eta_3=\beta_0+0.8\,\beta_1\]
For simplicity, we fix \(\beta_0=1.5\). So, we plot each squared residual as a function of \(\beta_1\): \(e_1^2\) (in red), \(e_2^2\) (in blue), and \(e_3^2\) (in green). The sum \(\sum_{k=1}^3 e_k^2\) (cost function) is also plotted in orange, which is clearly non-convex!
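The same check can be done numerically. The sketch below evaluates the three residuals defined above on a grid of \(\beta_1\) values (the grid range and step are my own choice) and looks at the discrete second difference of the cost. A convex function never has a negative second difference.

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

beta0 = 1.5
beta1 = np.linspace(-20.0, 20.0, 401)   # assumed grid, step 0.1

# The three residuals from the example, as functions of beta1.
e1 = np.abs(1 - logistic(beta0 - 0.3 * beta1))
e2 = np.abs(0 - logistic(beta0 + 0.2 * beta1))
e3 = np.abs(1 - logistic(beta0 + 0.8 * beta1))

# Least-squares cost: sum of squared residuals.
cost = e1**2 + e2**2 + e3**2

# Discrete second difference f(b-h) - 2 f(b) + f(b+h) on the grid.
second_diff = cost[:-2] - 2 * cost[1:-1] + cost[2:]

print("most negative second difference:", second_diff.min())
print("non-convex on this grid?", bool(second_diff.min() < 0))
```

There is also a shortcut argument: every \(e_k^2\) stays between 0 and 1, so the cost is bounded, and a bounded non-constant function on the whole real line cannot be convex.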
Notice that the minimization problem here is much more difficult than the one in linear regression. There, the regression was linear, so we could employ linear algebra (QR decomposition) to simplify the problem. Here, the logistic function makes the problem non-linear. So, we do not have any other choice but to use the zero-derivative condition.
But this is not the whole story. Even if God solves this minimization problem for you (which he does not!), it is not plausible to define the residuals as in linear regression. Here, there are only two possible outputs: 0 (rejection) or 1 (acceptance). If the response is 0 and the estimate is 1, the residual is 1, but this “1” is really huge, it is the largest error one could make, it is not “only 1”! So, it makes sense to define the residual differently, such that the cost function shows the “huge” difference between the two possible outputs.
Re-defining the residuals? Finally … the salvation!
Rather than the usual residual \(e_k=|\widehat{y}_k-y_k|\), we want to define a new residual \(\epsilon_k\) that becomes huge when \(e_k=1\). What about this:
\[\boxed{\epsilon_k=-\ln(1-e_k)}\]
As we can see in the graph:
- If \(e=0\) (smallest difference), \(\epsilon=0\)
- If \(e\to 1\) (largest difference), \(\epsilon\to+\infty\)
So, for this new residual, a difference of 1 is huge, which is exactly what we were looking for.
Then, instead of minimizing \(\|e\|\), we consider another minimization problem, \(\min \sum_k\epsilon_k\). One can show that the function \(\sum_k\epsilon_k\) is convex in \(\beta_0\) and \(\beta_1\), so any local minimum is also the global one.
The more common form of writing such a cost function is as follows:
\[\boxed{\ell =\sum_{k=1}^m\Big[-y_k\,\ln(\widehat{y}_k)-(1-y_k)\,\ln(1-\widehat{y}_k)\Big]}\]
- When \(y_k=0\), the residual will be \(-\ln(1-\widehat{y}_k)\)
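- When \(y_k=1\), the residual will be \(-\ln(\widehat{y}_k)\)
which gives the same thing as the other formula \(\sum_k\epsilon_k\): since \(e_k=|\widehat{y}_k-y_k|\), we have \(1-e_k=1-\widehat{y}_k\) when \(y_k=0\) and \(1-e_k=\widehat{y}_k\) when \(y_k=1\).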
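A quick numerical sanity check of this equivalence, as a sketch: the responses and estimates below are arbitrary made-up numbers. It computes the new residual \(-\ln(1-e_k)\) and the terms of the common form side by side.

```python
import numpy as np

# Made-up responses and estimated probabilities, for illustration only.
y = np.array([0.0, 1.0, 1.0, 0.0])
y_hat = np.array([0.2, 0.9, 0.4, 0.7])

# New residual built from the usual residual e_k = |y_hat_k - y_k|.
e = np.abs(y_hat - y)
eps = -np.log(1 - e)

# Terms of the common form of the cost function.
common_form = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

print(eps)
print(common_form)
print("same values?", bool(np.allclose(eps, common_form)))
```

The two columns agree term by term, so summing either one gives the same cost.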
Considering the same example as before, we compute the new residuals: \(\epsilon_1\) and \(\epsilon_3\) when the response is 1 and \(\epsilon_2\) when it is zero:
\[\epsilon_1=-\ln(1-e_1)\\
\epsilon_2=-\ln(1-e_2)\\
\epsilon_3=-\ln(1-e_3)\]
So, we plot each residual as a function of \(\beta_1\): \(\epsilon_1\) (in red), \(\epsilon_2\) (in blue), and \(\epsilon_3\) (in green). The sum \(\sum_{k=1}^m \epsilon_k\) (cost function) is also plotted in orange, which is convex!
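The convexity can be checked numerically with a sketch on the same three points and \(\beta_0=1.5\); the grid of \(\beta_1\) values is my own choice. It uses the identities \(-\ln g(\eta)=\ln(1+e^{-\eta})\) and \(1-g(\eta)=g(-\eta)\) so that the logarithm is computed stably with `np.logaddexp`.

```python
import numpy as np

beta0 = 1.5
beta1 = np.linspace(-20.0, 20.0, 401)   # assumed grid, step 0.1

def neg_log_logistic(eta):
    """-ln(g(eta)) = ln(1 + exp(-eta)), evaluated in a numerically stable way."""
    return np.logaddexp(0.0, -eta)

# Response 1: 1 - e = g(eta), so eps = -ln g(eta).
# Response 0: 1 - e = 1 - g(eta) = g(-eta), so eps = -ln g(-eta).
eps1 = neg_log_logistic(beta0 - 0.3 * beta1)
eps2 = neg_log_logistic(-(beta0 + 0.2 * beta1))
eps3 = neg_log_logistic(beta0 + 0.8 * beta1)

cost = eps1 + eps2 + eps3

# Discrete second difference; convexity means it is never negative
# (up to floating-point rounding).
second_diff = cost[:-2] - 2 * cost[1:-1] + cost[2:]

print("smallest second difference:", second_diff.min())
```

On this grid the second difference never drops below zero beyond rounding noise, which is consistent with the convex orange curve.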
Let us solve \(\min \sum_{k=1}^m \epsilon_k\) and obtain the logistic model as this new red line, which has values in \([0,1]\), whose residuals are based on a correct intuition, and which is much easier to compute thanks to convexity.
Having probabilities is nice, but sometimes we want to be stricter: we want to decide, 0 or 1! In such cases, we often use a threshold probability of 50% based on the logistic model: if the probability is more than 50%, the application is accepted; otherwise it is rejected.
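Here is a minimal sketch of how such a minimization could be carried out. It assumes made-up data (not the dean's sample) and plain gradient descent on the mean of the new residuals, with the grade standardized first so that a fixed step size works; this is one possible way to do it, not necessarily how the red line in the figure was computed.

```python
import numpy as np

def logistic(t):
    """Logistic (sigmoid) function g(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical data, invented for illustration: grade out of 20 and decision (0/1).
gpa = np.array([11, 12, 12.5, 13, 14, 14.5, 15, 15.5, 16, 16.5,
                17, 17, 17.5, 18, 18, 18.5, 19, 19, 19.5, 20], dtype=float)
accepted = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
                     0, 1, 1, 0, 1, 1, 1, 0, 1, 1], dtype=float)

# Standardize the grade (zero mean, unit variance).
mu, sigma = gpa.mean(), gpa.std()
z = (gpa - mu) / sigma

# Gradient descent on the mean of the new residuals (cross-entropy).
# Its gradient with respect to the linear estimator is simply (y_hat - y).
b0, b1 = 0.0, 0.0
step = 1.0                      # assumed step size, safe for standardized input
for _ in range(5000):
    p = logistic(b0 + b1 * z)
    b0 -= step * np.mean(p - accepted)
    b1 -= step * np.mean((p - accepted) * z)

# Convert the coefficients back to the original grade scale.
beta1 = b1 / sigma
beta0 = b0 - b1 * mu / sigma

print("beta0 =", beta0, "beta1 =", beta1)
print("grade where the probability is 50%:", -beta0 / beta1)
```

Because the cost is convex, the starting point (here zero for both coefficients) does not change which minimum is reached.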
Mathematically speaking, this is related to the sign of our linear estimator \(\eta=\beta_0+\beta_1\, x\)
- If \(\eta<0\), the probability will be less than 50%
- If \(\eta>0\), the probability will be more than 50%
- If \(\eta=0\), the probability is 50%. This gives us the “decision boundary” which is coloured black in the graph for Mention = 17.29983.
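The three cases above can be sketched in a few lines. The coefficients here are hypothetical round numbers (they do not reproduce the boundary 17.29983 from the graph); the point is only that thresholding the probability at 50% and checking the sign of \(\eta\) give the same decision.

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, for illustration only.
beta0, beta1 = -8.0, 0.5

gpa = np.array([12.0, 16.0, 19.0])
eta = beta0 + beta1 * gpa
prob = logistic(eta)

accept_by_probability = prob > 0.5   # threshold on the probability
accept_by_sign = eta > 0             # equivalent test on the linear estimator

# The decision boundary is the grade where eta = 0.
boundary = -beta0 / beta1

print(prob)
print(accept_by_probability, accept_by_sign)
print("decision boundary:", boundary)
```

With these numbers the middle grade sits exactly on the boundary, where the probability is 50%.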
I should stop here as this post took too long. But we are not finished yet! In the next post, the dean plans to add another variable to her study, so we will face multiple logistic regression. It will be so much fun! We will also travel to the 19th century to discover the origins of logistic regression!