Nov 28, 2021

The salvation: when I finally understood "dynamic programming"!



For someone without a classical education in computer science, like me, dynamic programming interview problems can be a nightmare! Finally, I found my way to understanding these kinds of problems, and this is what I am going to talk about in this post.



Acknowledgements: I use hilite to make the Python code blocks for this post, so I owe them a sincere thank-you! Also, the thumbnail is from Avik Das' post on dynamic programming. I found it when searching for a thumbnail, but the post is truly interesting and well-written. I recommend reading it!

During an interview process, I received an assignment asking for something like "the minimum number of times one needs to do an experiment to guarantee some value". I approached the problem like a mathematician (which I am, roughly speaking!), worked hard on it for a week, and got my rejection notice a couple of days later! I asked "But why? What was wrong there?" And the answer was "... the correct answer is X, but your answer is Y. One possible solution is using dynamic programming." Oh! That was it. I don't know about you, but this was the first time I had heard the term "dynamic programming". I wanted to know what this so-called dynamic programming was that cost me the job I really liked. I tried different approaches: the "Greedy Algorithms, Minimum Spanning Trees, and Dynamic Programming" course on Coursera, geeksforgeeks exercises, and HackerRank. They were great but not helpful for me; I still did not understand what I should do with a DP problem.

One night, out of the blue, I found "Dynamic programming for coding interviews", and I immediately read the first 6 chapters, staying up until midnight! (Note: the book is not voluminous, around 130 pages.)
Afterwards, I understood the DP approach much better than before; it was like a miracle for me! The reason might be that the book approaches each example in three different (but related) ways:
  • recursive approach: handle one step and delegate the rest back to the function
  • memoization approach: recursive approach + save/cache the result to avoid computing it again in the future
  • dynamic programming approach: keep the recursive soul but without recursion, build the solution from bottom to top
Comparing the code of these three approaches clarified DP better than anything else I had ever read!

For example, assume that you are asked to compute the \(n\)-th term of the famous Fibonacci sequence \(1,1,2,3,5,8,\dotsc\) with the recursive rule of \[\boxed{f_n = f_{n-1} + f_{n-2}}\] In the recursive approach, we define a function which simply breaks the main problem of computing fib(n) to two smaller problems, by calling itself for computing \((n-1)\)-th and \((n-2)\)-th terms:

def fib(n):
    if (n==1 or n==2):
        return 1
    else:
        return fib(n-1) + fib(n-2)

This code calls the function fib() several times, which makes it slow, with exponential complexity, since lots of these calls are repeated. For example, if the goal is to compute fib(5), we call fib(4) and fib(3). The former calls fib(3) and fib(2), while the latter calls fib(2) and fib(1). This is illustrated in the following picture (from the book).
So, just in the second "step" of this recursion, fib(3) has already been called twice. One can imagine that the number of these repetitions gets larger and larger for the later terms of the sequence. This is what makes this code impractical and inefficient.

What if we save the values that we have already computed somewhere, so that we can just read them from memory rather than paying the overhead of recursion again? This is what is called "memoization". Here is how the code looks in this case:

memo = [0] * 100
def fib(n):
    if memo[n]:
        return memo[n]
    if (n==1 or n==2):
        memo[n] = 1
    else:
        memo[n] = fib(n-1) + fib(n-2)
    return memo[n]

We still have the same recursion, but if memo is already filled in, we just read from memory. This improves the efficiency a lot, and it is usually easy to do. But we can still do better: the "dynamic programming approach". For DP, we adopt the bottom-up approach, that is, we start from smaller problems in order to solve larger ones.
In this example, it means that we start by computing fib(3), fib(4), ... until we reach the term we would like to compute. We combine caching the already-computed results with the recursive rule to get this very fast DP-based code:

def fib(n):
    memo = [0] * 100
    memo[0] = 1
    memo[1] = 1
    for i in range(2, n):
        memo[i] = memo[i-1] + memo[i-2]
    return memo[n-1]

There is no recursive function call anymore; it is simply reading from memory! I do not want to give a very accurate benchmark here, but this is what I got on my machine for \(n=35\). You can see that the difference is huge!

Approach      Execution time (ms)
Recursive     11364
Memoization   0.08
DP            0.03
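
If you want to reproduce this kind of comparison yourself, a rough sketch with Python's standard timeit module could look like this (the exact numbers will of course differ from machine to machine):

import timeit

def fib(n):
    # the plain recursive version from above
    return 1 if n <= 2 else fib(n - 1) + fib(n - 2)

# one run is enough to feel the difference; swap in the memoized/DP versions to compare
print(timeit.timeit("fib(35)", globals=globals(), number=1) * 1000, "ms")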

This example is certainly too simple; one could even think of the DP method directly. But, for better or worse, it is not always like this. There are cases where it is more difficult to imagine the structure of the DP solution right away. For example, what do you think about this problem from the above-mentioned book?
Hint: Imagine what would happen if you put the first tile vertically/horizontally, and start from the recursive solution.

The rope problem

Now, I can get back to the problem I was working on for my interview assignment: the rope problem.

You are given \(M\geq 1\) identical ropes, and you are asked to find out the strength of the rope by attaching weights between \(1\) kg and \(N\) kg. If a rope fails during a test with load \(w\), it means that its strength is less than \(w\). The strength is not bigger than \(N\). You need to determine the minimum number of tests that are necessary to find this strength.

The tricky point is that the number of ropes is limited. If we had infinitely many ropes, binary search would be the best approach. On the other hand, with only one rope, we could not afford a single failure, so we would have to do a "linear search": load the rope with weights \(1,2,3,4,\dotsc\) in order, until it breaks.

At that time, the way I approached this problem was mathematical rather than algorithmic. I tried to find a clever choice of weights to be tested in order to minimize the number of tests needed. I still believe that my answer was interesting, and it could reduce the exponential complexity of a brute-force search to polynomial. However, it was not what they were looking for!

Let's start solving this problem with the recursive approach. Given a rope with a certain strength bound (\(n\) in the beginning), we do a search over all weights. As we need to consider the worst case, we take the maximum of two cases:
  • The rope fails under weight \(w\); then we call the function back with \(w-1\) as the strength bound, and reduce the number of available ropes.
  • The rope does not fail; then we call the function back with \(n-w\) as the strength bound. Notice that, in fact, the strength would be between \(w+1\) and \(n\), but what really matters is the number of weights remaining to test, not the upper and lower bounds.
We then take the minimum number of tests over all the different weights we just searched over.

def testRope(m, n):
    minTest = n
    if m == 1:
        return minTest
    if n <= 0:
        return 0
    if n == 1 or n == 2:
        return n
    for w in range(1,n+1):
        minTest = min(minTest, 1 + max(testRope(m-1,w-1), testRope(m,n-w)))
    return minTest

The code works, nice! But it does not give me any result for \(n>25\), even after minutes of waiting! So, let's think about the DP approach. We keep more or less the same recursion, but without calling a function over and over. We adopt the same methodology as in the Fibonacci example: we save the results of smaller problems in a matrix, considering that:
  • With only one rope we are forced into the linear search, so testRope(1,n) = n
  • If the strength bound is 1, a single test determines the strength, so testRope(m,1) = 1
  • If the strength bound is 2, we clearly need 2 tests, so testRope(m,2) = 2

These three simple rules are essentially lines 5-7 of the code below, where memo is our two-dimensional list saving the computed results. There is just a small tweak in the indices, due to the zero-based indexing in Python.
def testRope(m, n):
    memo = [[0 for i in range(n)] for j in range(m)]
    for i in range(m):
        for j in range(n):
            memo[0][j] = j + 1
            memo[i][0] = 1
            memo[i][1] = 2

    for i in range(1,m):
        for j in range(2,n):
            for w in range(j+1):
                if w == 0:
                    memo[i][j] = min(j+1, 1 + memo[i][j-1])
                elif w == j:
                    memo[i][j] = min(memo[i][j], 1 + memo[i-1][j-1])
                else:
                    memo[i][j] = min(memo[i][j], 1 + max(memo[i-1][w-1], memo[i][j-w-1]))
    return memo[m-1][n-1]

The next step is to solve more complicated problems, namely with \(n=3\) and 2 ropes. What do we need to solve this? Exactly the simpler problems that we just solved! We test a weight: if the rope fails, we are left with a linear search; if it does not fail, we are left with a search over a range of length 1 or 2, whose solution is already known to us! This is exactly what we have in line 17, and we put it in a loop. So, we build the final solution from these sub-problems.

What about timing here? Let's take a look at the case \(m=2\) and \(n=25\):

Approach      Execution time (ms)
Recursive     25700
DP            0.27

OK! I am finished with this post. If you were not familiar with DP, I hope this post gave you a better idea of how to approach it. And don't forget to check out the book; it explains the material much better than I did :)

Aug 1, 2021

An intro to NLP (not the one that you think!)



In the data science era, the first thing that comes to your mind when you hear "NLP" is probably Natural Language Processing, isn't it? I have studied natural language processing to some extent, and no doubt it is a really interesting topic to dive into! But there is still another (much older) NLP: Non-Linear Programming: a non-linear optimization problem with non-linear (equality or inequality) constraints. In this post, I am going to talk about this NLP, but quite briefly. A more in-depth discussion requires writing a book, not a blog post!

If you have taken an elementary machine learning or optimization course, you have probably learned the Gradient Descent (GD) method, which is very famous in deep learning these days. But the GD method (without any modification) is not well-suited for constrained optimization problems, where you need to minimize a function while satisfying a set of other equalities and/or inequalities. Let me give you an example:
  • minimize \(f(x_1,x_2) = -x_1\,x_2\)
  • subject to
    • \(h(x_1,x_2) = 20x_1+15x_2-30=0\)
    • \(g(x_1,x_2) = (x_1/2)^2 + x_2^2-1\leq 0\)
    • \(0\leq x_1,x_2\leq 3\)
In this example, we have three types of constraints: the equality constraint \(h(x_1,x_2)=0\), the inequality constraint \(g(x_1,x_2)\leq 0\), and bound constraints on the design (or state) variables \(x_1,x_2\in [0,3]\). So, the main question is how to bring these constraints into the minimization problem.

Lagrange multiplier method

The Lagrange multiplier method is well-known for equality-constrained optimization. Simply speaking, we minimize \(f(\mathbf{x})-\lambda h(\mathbf{x})\) over \(x\) and \(\lambda\). I am not going to discuss the details, so if you feel like you need more explanation, do not hesitate to watch this video:
If we did not have the inequality constraint, the problem would be almost solved with this Lagrange multiplier method. But how do we bring that constraint into play? This is where a very interesting concept of NLP appears: the slack variable, whose role is to transform the inequality constraint into an equality constraint. We define a new variable \(z\) such that \[g(x_1,x_2)+z^2 = (x_1/2)^2 + x_2^2-1+z^2= 0\] and now we have an equality constraint!

We can then apply the Lagrange multiplier method with this additional slack variable \(z\) and find the Lagrangian as: \[ \mathcal{L}(x_1,x_2,z,\lambda,\beta) = f(x_1,x_2) - \lambda\,h(x_1,x_2) -\beta\, \left(g(x_1,x_2)+z^2\right) \] We use \(\lambda\) for the Lagrange multiplier of the equality constraint and \(\beta\leq 0\) for the inequality constraint (for a maximization problem, \(\beta\geq 0\)).

So, the first-order optimality conditions are: \[ \partial_{x_1}\mathcal{L} = \partial_{x_1}f-\lambda\,\partial_{x_1}h-\beta\,\partial_{x_1}g=0\\ \partial_{x_2}\mathcal{L} = \partial_{x_2}f-\lambda\,\partial_{x_2}h-\beta\,\partial_{x_2}g=0\\ \partial_{z}\mathcal{L} =-2\beta\,z=0\\ \partial_{\lambda}\mathcal{L} = h(x_1,x_2)=0\\ \partial_{\beta}\mathcal{L} = g(x_1,x_2)+z^2=0 \] Note that the slack variable can easily be removed from the system (using the third and last equations), and we are left with: \[ \partial_{x_1}f-\lambda\,\partial_{x_1}h-\beta\,\partial_{x_1}g=0\\ \partial_{x_2}f-\lambda\,\partial_{x_2}h-\beta\,\partial_{x_2}g=0\\ \beta\,g=0\\ h = 0 \] Now we can see that this system is equivalent to what one gets by applying the Lagrange multiplier method with \(\mathcal{L} = f - \lambda\,h -\beta\, g\), subject to \(h=0\) and \(\beta\, g = 0\), as if the slack variable had never existed!


The equation \(\beta\,g=0\) brings up two cases:
  • \(\beta = 0\) and \(g< 0\) (non-binding or inactive constraint)
  • \(\beta\neq 0\) and \(g=0\) (binding or active constraint)

Non-binding constraint case

For the non-binding constraint case, the constraint \(g\leq 0\) is removed completely from the system and we should solve \[ \partial_{x_1}f-\lambda\,\partial_{x_1}h=0\\ \partial_{x_2}f-\lambda\,\partial_{x_2}h=0\\ h = 0 \] or \[ -x_2-20\lambda = 0\\ -x_1-15\lambda = 0\\ 20x_1+15x_2=30 \] which gives the result \[\boxed{x_1=0.75, \qquad x_2=1, \qquad \lambda = -0.05}\] 
Notice that this is the same solution one could have found naively by substituting \(x_1\) or \(x_2\) into the function \(f\) using the equality constraint (which is a line). However, as one can see in the following picture, the solution is not feasible, as it does not respect the inequality constraint:

Binding constraint case

For the binding constraint case, one gets a non-linear system of equations to solve: \[ -x_2-20\lambda-0.5\,\beta\,x_1 = 0\\ -x_1-15\lambda-2\,\beta\,x_2 = 0\\ 20x_1+15x_2=30\\ 0.25x_1^2+x_2^2=1 \] which can be solved, for example, using the fsolve function in scipy.optimize, resulting in \[\boxed{ x_1 = 0.81511541, \qquad x_2=0.91317946, \qquad \lambda = -0.04391383, \qquad \beta = -0.08563926}\] which satisfies all the constraints, including \(\beta<0\).
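
For completeness, here is a minimal sketch of how one might feed this system to fsolve; the initial guess is my own arbitrary choice, and any starting point close enough to the solution should do:

import numpy as np
from scipy.optimize import fsolve

def optimality_system(v):
    # unknowns: x1, x2, lambda (lam), beta
    x1, x2, lam, beta = v
    return [
        -x2 - 20*lam - 0.5*beta*x1,   # dL/dx1 = 0
        -x1 - 15*lam - 2.0*beta*x2,   # dL/dx2 = 0
        20*x1 + 15*x2 - 30,           # equality constraint h = 0
        0.25*x1**2 + x2**2 - 1,       # binding inequality g = 0
    ]

print(fsolve(optimality_system, [0.8, 0.9, -0.05, -0.1]))
# ~ [ 0.8151  0.9132 -0.0439 -0.0856]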

Closing remarks

  • A very similar procedure applies in the general case with several equality and inequality constraints, under the name of the Karush–Kuhn–Tucker (KKT) conditions.
  • One can also take the design variable bounds into the Lagrangian, for example check out the Interior-point or barrier method.
  • For numerical purposes, rather than handling all the different binding and non-binding cases separately, one would use, for example, the Exterior Penalty Function method or the Augmented Lagrange Multiplier method, which adds a penalty term to the main minimization problem to enforce the constraints, like: \[ \alpha_h \Big[\sum_k h_k(\mathbf{x})^2\Big]+\alpha_g\Big[\max\left(0,g(\mathbf{x})\right)^2\Big] \] for some positive constants \(\alpha_h\) and \(\alpha_g\), which penalizes non-zero values of the equality constraints \(h_k\) (indexed by \(k\)) and positive values of \(g\). Then, an unconstrained optimization method (such as GD) can be used: this is, for example, the case for the minimize function in scipy.optimize. In the jupyter notebook you can see that the result is the same as the "analytical" one. (A rough sketch of this idea follows below.)
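
As a rough illustration of the penalty idea on our example (a sketch, not the notebook code; the penalty weights \(\alpha_h=\alpha_g=10^4\) are my own arbitrary choice):

import numpy as np
from scipy.optimize import minimize

f = lambda x: -x[0] * x[1]
h = lambda x: 20*x[0] + 15*x[1] - 30
g = lambda x: (x[0]/2)**2 + x[1]**2 - 1

def penalized(x, alpha_h=1e4, alpha_g=1e4):
    # the penalty terms push h toward 0 and g toward non-positive values
    return f(x) + alpha_h * h(x)**2 + alpha_g * max(0.0, g(x))**2

res = minimize(penalized, x0=[1.0, 1.0], bounds=[(0, 3), (0, 3)])
print(res.x)  # ~ [0.815, 0.913], matching the "analytical" solution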

Jul 26, 2021

On the "Gershgorin circle theorem": How to bound the eigenvalues beautifully!

The Gershgorin circle theorem is really simple despite its fancy name. I got to know this theorem during my PhD, when I was working on the solvability of numerical methods and I needed to show that the spectrum of the system I had to solve did not contain the origin. Here, I give a very brief introduction to this theorem.

Let's start from the end, the theorem! The Gershgorin circle theorem says that, given a square matrix \(A\), no eigenvalue can be located outside a finite number of discs.

More precisely, for every row \(i\), we consider the disc \(D(a_{ii},R_i)\) with
  • center \(a_{ii}\) (the diagonal entry of \(i\)-th row)
  • radius \(R_i:=\sum_{j\neq i }|a_{ij}|\) (the sum of the absolute values of the off-diagonal entries of the \(i\)-th row)
Then, every eigenvalue lies within at least one of the discs \(D(a_{ii},R_i)\).

Consider a very simple case of a \(3\times 3\) matrix: \[A = \begin{bmatrix} 2 & 0.5 & 0.5\\ 0 & 5 & 1\\ 1 & 1 & 8\\ \end{bmatrix}\] We can easily see that there are three discs: \[D(2,1), \quad D(5,1), \quad D(8,2)\] which contain the eigenvalues \(1.95, 7.76, 5.3\): (you can find the jupyter notebook here)
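A few lines of numpy are enough to check this (a small sketch, independent of the notebook):

import numpy as np

A = np.array([[2.0, 0.5, 0.5],
              [0.0, 5.0, 1.0],
              [1.0, 1.0, 8.0]])

# Gershgorin discs: center = diagonal entry, radius = row sum of |off-diagonal|
centers = np.diag(A)
radii = np.abs(A).sum(axis=1) - np.abs(centers)
print(list(zip(centers, radii)))  # [(2.0, 1.0), (5.0, 1.0), (8.0, 2.0)]
print(np.linalg.eigvals(A))       # ~ 1.95, 7.76, 5.30 (in some order), each inside a disc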
The proof of this theorem is quite easy. One just needs to use the definition of an eigenvalue, \(Ax=\lambda x\), and the triangle inequality to show that \(\boxed{|\lambda-a_{ii}|\leq R_i}\). The way Semyon Aronovich Gershgorin proved it in 1931 was a bit different, though similar in spirit.
He first started from another ansatz, which was an extension of the work of Levy:
Then, based on this, he stated his own beautiful result:
Note that this bound gets better as the off-diagonal part gets smaller (smaller radii). The limiting case is a diagonal matrix!

Brainteaser:

Now, can you answer why diagonally dominant matrices are so desirable in numerical analysis? ;)

Mar 4, 2021

PCA or SVD? Are they really different?

I have been educated as an applied mathematician, where SVD was in our DNA! So, quite naturally, when I first heard people, mostly in data science, talking about PCA, I got hurt! "Oh! It is SVD, do not call it PCA!" As a matter of fact, I was right, strictly speaking; however, the whole story is more interesting than just a name. Here, I talk about SVD, PCA, their origins, and more!

What is SVD (Singular Value Decomposition)?

SVD is a matrix factorization in which a rectangular matrix \(X\in\mathbb{R}^{m\times n}\) is decomposed as the product of two square orthogonal (unitary) matrices \(U\in\mathbb{R}^{m\times m}\) and \(V\in\mathbb{R}^{n\times n}\) and a diagonal matrix \(\Sigma\in\mathbb{R}^{m\times n}\): \[\boxed{X = U\,\Sigma\,V^t}\] The singular values are really important, as their magnitude indicates the importance of the corresponding columns of \(U\) and \(V\). Notice that the matrix \(X\) can be seen as a linear transformation from \(\mathbb{R}^n\) to \(\mathbb{R}^m\). So, in the matrix-vector multiplication \(X\xi\), the role of each factor can be interpreted as below (the short numpy sketch after the list makes these shapes concrete):
  • Columns of \(V\) give an orthonormal basis for \(\mathbb{R}^n\). So \(V^t\xi\) is a change of basis from the standard basis to the basis given by \(V\).
  • \(\Sigma\) has non-zero entries only on its diagonal, whose role is to scale each of the components of \(\xi\), already projected onto the new basis. It also adapts the dimension, for example by chopping off extra entries of \(V^t\xi\) (when \(m\leq n\)) or padding it with zeros (when \(m\geq n\)).
  • \(U\) plays a similar role to \(V\), but in \(\mathbb{R}^m\): it maps the scaled coordinates back to the standard basis of \(\mathbb{R}^m\).
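Here is that sketch; the matrix is random, just to illustrate the shapes:

import numpy as np

X = np.random.rand(4, 3)  # m = 4, n = 3

# full_matrices=True gives square U (m x m) and Vt (n x n);
# numpy returns the singular values as a 1-D array s
U, s, Vt = np.linalg.svd(X, full_matrices=True)

# rebuild the m x n diagonal Sigma and check the factorization
Sigma = np.zeros(X.shape)
Sigma[:len(s), :len(s)] = np.diag(s)
print(np.allclose(X, U @ Sigma @ Vt))  # True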
If you would like a more detailed explanation, you can consult this great course by Gilbert Strang:
SVD was derived by Beltrami (1873) and Jordan (1874), for bilinear forms.

What is PCA (Principal Component Analysis)?

It was Karl Pearson who, in 1901, founded PCA in a paper titled "On Lines and Planes of Closest Fit to Systems of Points in Space". In fact, he was trying to answer a very interesting question: how to represent "a system of points" in a high-dimensional space \(\mathbb{R}^d\) by the "best-fitting" straight line or plane? He defined the "best" line as the line whose mean squared distance to all the points is the smallest. Using the notion of correlation, he transformed the problem into finding the ellipsoid (or "ellipsoid of residuals") with the largest axis:
So, he found one direction which can best explain the data, and another direction (perpendicular to the first one) which is the worst direction for explaining the data. Note that these directions (or lines) are not necessarily along the Cartesian axes.

This method is often used to find the "best" low-dimensional approximation of high-dimensional data: to ignore the least important part and keep only the part which really matters.

So, how do they relate to each other?

Feb 7, 2021

What is an Asymptotic Preserving (AP) scheme? A conceptual journey!

The term Asymptotic Preserving was coined by Shi Jin, a well-known Chinese mathematician. However, the idea has its roots in the USSR (like lots of other things in mathematics). In this post, I am going to review AP schemes in a simplified language.

Let's start right away with an example. Assume that you want to solve this simplified advection-diffusion equation: \[\nu\,u''+ u'=0\qquad \text{ over } [0,1] \quad \text{such that }\quad u(0)=0, u(1)=1,\] with two Dirichlet (fixed-value) boundary conditions. One can simply confirm (believe me, it's really simple!) that the function \(u(x)=\dfrac{1-e^{-x/\nu}}{1-e^{-1/\nu}}\) satisfies this equation. If we plot this exact solution for different diffusivity constants \(\nu\), we get this:
As one can see, the solution develops a so-called "boundary layer" near the left boundary, which gets sharper and sharper as \(\nu\to 0\). So far so good, no? But what happens if you want to obtain this solution numerically? This is what Il'in tried to do in 1969, in a paper titled "Differencing scheme for a differential equation with a small parameter affecting the highest derivative", and he found out that the answer is not that easy!
First, he tried a very simple finite difference discretization of the second- and first-order terms, and he got the numerical solution \(u^h_k=\dfrac{1-\left(\frac{2\nu-h}{2\nu+h}\right)^k}{1-\left(\frac{2\nu-h}{2\nu+h}\right)^N}\), which gives the following plot with \(h \approx 0.01\):
The solution does not make any sense when \(\nu\ll h\). It is not crystal clear to me which method he used here (old Soviet-style paper-writing!), but one may guess that a second-order central difference for the advection term would produce a similar effect, which you can confirm in my jupyter notebook.

Investigating this issue, he then succeeded in finding another discretization (probably an upwind-type method) which provided a reasonable solution \(u^h_k=\dfrac{1-\left(\frac{\nu}{\nu+h}\right)^k}{1-\left(\frac{\nu}{\nu+h}\right)^N}\):
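If you want to play with this yourself, here is a small sketch of the two discretizations (my own reconstruction, not Il'in's scheme verbatim): a central difference for the advection term reproduces the first, badly-behaved formula, while a one-sided (upwind) difference reproduces the well-behaved one:

import numpy as np

def solve_bvp(nu, N=100, advection="central"):
    # Finite differences for nu*u'' + u' = 0 with u(0) = 0, u(1) = 1,
    # on the interior nodes x_1 .. x_{N-1}, grid spacing h = 1/N.
    h = 1.0 / N
    A = np.zeros((N - 1, N - 1))
    b = np.zeros(N - 1)
    for i in range(N - 1):
        lower = nu / h**2            # coefficient of u_{i-1}
        diag = -2.0 * nu / h**2      # coefficient of u_i
        upper = nu / h**2            # coefficient of u_{i+1}
        if advection == "central":   # u' ~ (u_{i+1} - u_{i-1}) / (2h)
            lower -= 1.0 / (2.0 * h)
            upper += 1.0 / (2.0 * h)
        else:                        # one-sided: u' ~ (u_{i+1} - u_i) / h
            diag -= 1.0 / h
            upper += 1.0 / h
        A[i, i] = diag
        if i > 0:
            A[i, i - 1] = lower      # left neighbour; u_0 = 0 contributes nothing
        if i < N - 2:
            A[i, i + 1] = upper
        else:
            b[i] -= upper            # right boundary: move upper * u(1) to the RHS
    return np.linalg.solve(A, b)

# with nu much smaller than h, the central scheme oscillates wildly,
# while the one-sided scheme stays qualitatively correct:
u_bad = solve_bvp(1e-4, advection="central")
u_good = solve_bvp(1e-4, advection="one-sided")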
In summary: depending on the discretization one uses for each term, the numerical solution may become inaccurate (or even unstable) if \(\nu\) gets much smaller than your grid size! In other words, the solution may look good for \(\nu=1\), but it is not uniformly accurate as \(\nu\to 0\).

This paper, as well as a couple of others published in USSR, founded the methods that are known today as AP or Asymptotic Preserving methods. As the name implies, AP methods are defined to "preserve the asymptotic". To understand it better, consider the famous AP diagram:
\(\mathcal{M}^\varepsilon\) is the \(\varepsilon\)-dependent equation we want to solve, like the advection-diffusion example (where the parameter is \(\nu\)). We know that as \(\varepsilon\to 0\), this equation converges to its asymptotic limit \(\mathcal{M}^0\). Now, consider \(\Delta\) as the discretization parameter (like the grid size \(h\) in our example). Then, the discrete numerical model \(\mathcal{M}^\varepsilon_\Delta\):
  • Converges to the continuous equation \(\mathcal{M}^\varepsilon\), for a fixed \(\varepsilon\) when the grid gets refined \(\Delta\to 0\). 
  • Converges to the numerical asymptotic solution \(\mathcal{M}^0_\Delta\), for a fixed discretization when the parameter vanishes \(\varepsilon \to 0\)
As we observed in the advection-diffusion example, the \(\mathcal{M}^\varepsilon_\Delta\to \mathcal{M}^0_\Delta\) convergence does not hold unless the grid gets refined as \(\varepsilon\to 0\). This poses a big issue in practice, as refining the grid increases the computational cost of the method. Sometimes this convergence holds, but \(\mathcal{M}^0_\Delta\) is not consistent with the continuous asymptotic limit \(\mathcal{M}^0\), perhaps because of an artificial \(\mathcal{O}(1/\varepsilon)\) diffusion term which does not make the asymptotic method unstable but makes it inconsistent with the real model. One can still find lots of other interesting examples to explain AP schemes. I gave a brief introduction, with the additional example of the van der Pol system, in my SlideShare:

Jan 11, 2021

12 Rules for Life (A summary of Jordan Peterson's book)

Jordan Peterson is a difficult person to talk about, as he has lots of haters and lovers. Here, for better or worse, I do not plan to talk about his ideas on topics like gender inequality. My aim is so much more humble!




I heard of Peterson about 2 years ago for the first time and my first reaction was to search for his videos on YouTube. I found lots of debates, lots of videos of his courses in the university, and so on and so forth. I found his way of talking charming and beautiful in terms of the language he was employing (note that I am NOT a native English speaker!). I believe that writing speaks better than speaking; it transfers the message more clearly while giving more weight to the content than the language. You may fool people by talking in beautiful English but not that easily in writing. So, I searched for his books and started with “12 Rules for Life”.

To be honest, I loved the book! That does not mean that I endorse 100% of its contents. It does not mean that I do not find it strange that he only blames Mao and Stalin, but not Lyndon B. Johnson for all he did in Vietnam. I felt close to his 12 rules, and I liked the way he tries to justify them based on literature, the Bible, and sometimes science. A friend of mine told me recently that this book is pseudo-science and does not offer anything better than the cheap "road to success" books on the market. I agree that it is not science, strictly speaking, but this is not the point! If you give a rule for life, the relevant question to ask is whether the rules work, whether by applying them you would live a better and happier life. If your answer is yes, I believe that almost no one cares whether the rules are scientifically proven!

Anyway, with this small introduction, I will make a summary of each chapter here, in this document (updated continuously). My goal is to collect all the important points together. I will review the chapters one by one and share my summaries here with you, and you can follow me on this journey!



Dec 2, 2020

In the search of the “logistic regression”! (part II)

Unfortunately, if you ask a "data scientist" to explain the real machinery and rationale behind logistic regression, the answer can hardly convince anyone, maybe not even themselves (which I doubt)! They repeat the same "ambiguous" explanations about the logit function and log-odds and how much it looks like a line! "Look! It is so simple: \(\ln(\frac{p}{1-p})=\beta_0+\beta_1 x\)". Things get even worse when they want to justify the strange form of the "cost function" (here, for example). In part I, I already justified the cost function in a simple, but mathematically valid, language. In this post, I will explain it from another point of view: maximum likelihood estimation.

Do you remember the dean who was looking for a model to predict if an application gets accepted [here]? This time, after her earlier success, she wants to create a more complicated model!

Multiple logistic regression

Assume that the dean adds another factor to her study, the language score TCF. She plots the GPA and TCF scores of each applicant in a scatter plot and to make things more clear, adds a colour to label the dots: green for acceptance, and red for rejection.
As trivial as it may seem, higher GPA and TCF scores seem to be related to the decision made on the application. The model to predict the likelihood of a data point being red (0) or green (1) is similar to the one-dimensional example but in 3D, so it looks like an S-shaped surface:
Obtaining the best surface is similar to the 1D case in the first part: we define the new residuals \(\epsilon_k=-\ln(1-e_k)\) with \(e_k=|g(\eta_k)-y_k|\) for all \(k=1,2,\cdots,m\). Then, we solve \(\min \sum_{k=1}^m \epsilon_k\).
You may remember from our earlier discussion [here] that we should not solve \(\min \|e\|\), because its corresponding cost function is non-convex. To illustrate this, consider the cost function for a very simple example: \[e_1=|1-g(\eta_1)|\qquad \eta_1=\beta_0-0.2 \beta_1 -0.3 \beta_2\\ e_2=|0-g(\eta_2)| \qquad \eta_2=\beta_0+0.1 \beta_1 +0.3 \beta_2\\ e_3=|1-g(\eta_3)| \qquad \eta_3=\beta_0+0.8 \beta_1+1.8 \beta_2\] with \(\beta_0=1.5\). The cost function is clearly non-convex.

Decision boundaries

As we mentioned in the first part, the decision boundary consists of all points where \(\eta=0\), that is: \[\beta_0+\beta_1x^{(1)}+\beta_2 x^{(2)}=0\] where \(x^{(1)}\) is the first variable (Mention) and \(x^{(2)}\) is the second variable (TCF), and the set \(\beta_0,\beta_1,\beta_2\) is the best parameter set we obtained from the minimization problem. This gives the black line in the graph.
For this simple case, the decision boundary is a straight line. But if we need to describe more complicated cases, we can simply add higher-order terms like \((x^{(1)})^2\) or \(x^{(1)}x^{(2)}\), which leads to curved boundaries. For this example, if we add the square of the TCF score to our logistic model, we will obtain:

Who came up with such a strange name, logistic regression?

In 1798, Thomas Malthus published his book "An Essay on the Principle of Population", warning of "future difficulties, on an interpretation of the population increasing in geometric progression (so as to double every 25 years) while food production increased in an arithmetic progression, which would leave a difference resulting in the want of food and famine, unless birth rates decreased". It was then, in 1838, that Pierre François Verhulst, a Belgian mathematician, modified Malthus's model and obtained a more realistic function, \(g(z)=\dfrac{1}{1+e^{-z}}\), which predicted the population of countries quite well. He called it the logistic function but did not provide any justification for this name!

Maximum likelihood estimation: the hidden calculation behind the logistic regression

Maximum likelihood estimation or MLE, as we often call it, is a way to find the probability distribution function fitted to some data. Here is a good introduction:
For example, if we assume that the Gaussian normal distribution \[f_{normal}(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\] describes the data's distribution, we need to find the two best parameters, the mean \(\mu\) and the standard deviation \(\sigma\), to describe the data. But what if the data (the event or output) is binomial (0 and 1, or rejection and acceptance)? We need the binomial distribution.

Consider
  • \(m\): the total number of observations (the total number of acceptances and rejections)
  • \(p\): the (average) acceptance chance of an application
  • \(1-p\): the (average) rejection chance of an application
The probability of having \(x\) accepted applications is \(p^x\) while \(m-x\) rejections occur with the probability of \((1-p)^{m-x}\). So do we multiply these two probabilities to get the probability of getting \(x\) acceptances and \(m-x\) rejections? No! The \(x\) acceptances and \(m-x\) rejections can occur in different orders, that is, there are several different ways of distributing them. The binomial coefficient \(C_{m,x}\) gives the number of possible combinations of having \(x\) acceptances and \(m-x\) rejections, so that the probability writes: \[\boxed{f_{binomial}(x)=C_{m,x}\,p^x\,(1-p)^{m-x}}\] where \(C_{m,x}\) reads \[C_{m,x}=\dfrac{m!}{x!\,(m-x)!}\]
Example (binomial coefficient): If \(m=4\) and \(x=2\) (2 acceptances), then there are 6 different combinations in which these 2 acceptances occur (❌ is the rejection and ✔ is the acceptance):

✔✔❌❌
❌❌✔✔
✔❌❌✔
❌✔✔❌
❌✔❌✔
✔❌✔❌

The binomial coefficient will be then 6 as the formula suggests: \(C_{4,2}=\dfrac{4!}{2!\, 2!}=6\).
Now that we have reviewed the basics of the binomial distribution, we can see how maximum likelihood estimation works for it: it finds the acceptance chance \(p\) which maximizes the probability of having the observed numbers of acceptances and rejections. For example, assume that 7 out of 10 applications are rejected. Then the probability is \[C_{10,3}\, p^3 \,(1-p)^7\]
MLE looks for the \(p\) which makes this probability, or likelihood, of our result as large as possible. Simple calculations show that it is \(p=0.3\) which does this job! You can see that from the plot as well. Notice, for example, that if we considered \(p=0.6\), which assumes a 60% acceptance chance, it would be quite unlikely (less than a 5% chance) that we got such a result (only 3 acceptances).
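
A tiny numerical check of this claim (a sketch that simply scans \(p\) over a grid):

import numpy as np
from math import comb

# likelihood of exactly 3 acceptances out of 10, as a function of p
p = np.linspace(0.01, 0.99, 99)
likelihood = comb(10, 3) * p**3 * (1 - p)**7
print(p[np.argmax(likelihood)])  # 0.3, as claimed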
However, in our problem of predicting the decision based on GPA, things are not as easy as this, because \(p\) is clearly not the same for all applications; \(p\) is not a constant anymore but a function of the independent variables of the model (like GPA). Thus, instead of \(p\), we have \(p_k\) for each application \(k=1,2,\cdots,m\). Ignoring the constant \(C_{m,x}\), we write the likelihood as the product \[L:=\prod_{k=1}^m p_k^{y_k}(1-p_k)^{(1-y_k)}\] where \(y_k\) is the decision. And what is \(p_k\)? Similar to our discussion in the first part, we take this function to be the logistic function \(p_k=g(\eta_k)=\dfrac{1}{1+e^{-\eta_k}}\). So, the objective is to find the maximum of this likelihood function \(L\).

As "logarithm of products is a summation of logarithms", people often work with log-likelihood \(\ell\), which is the logarithms of the likelihood, that is \(\ell=\ln(L)\) or \[\boxed{\ell :=\sum_{k=1}^m\Big( y_k \ln(p_k)+(1-y_k) \ln(1-p_k)\Big)}\] Don’t you think that this log-likelihood looks a bit familiar? This is exactly the cost function we wanted to minimize in the first part of this article, but with a negative sign! So, maximising the likelihood function is, in fact, minimizing the cost function! Now, we have a clear idea where that function originated from.

Oct 31, 2020

Discovering the lost "Bendixson's theorem"

Working on my PhD thesis, I had to investigate the "stability" of an "evolutionary system" \(U'(t)=(A+B)\,U\), where \(U=(u_1,u_2,\cdots,u_n)\) is a vector and both \(A\) and \(B\) are square matrices of size \(n\). Such a system describes the dynamics of the vector \(U\) over time. Here, stability asks whether the norm of the solution vector \(U\), denoted by \(\|U\|\), stays bounded or goes to infinity as \(t\to\infty\). It is a well-established result in control theory and differential equations that stability is linked to the "eigenvalues" of the matrix \(A+B\), denoted by \((\lambda_j)_{j=1}^n\). What is important, in fact, is the sign of the real parts of the eigenvalues, that is:
  • if \(\mathrm{Re}(\lambda_j)>0\) for some \(j\), the system is "unstable": the norm of \(U\) increases in time without any bound, \(\displaystyle\lim_{t\to\infty}\|U(t)\|=\infty\)
  • if \(\mathrm{Re}(\lambda_j)<0\) for all \(j=1,2,\cdots,n\), the system is "stable": the norm of \(U\) is bounded in time, \(\displaystyle\lim_{t\to\infty}\|U(t)\|<\infty\)
Of course, I knew this famous result. My problem, however, was that I could not really determine the sign of \(\mathrm{Re}(\lambda_j)\) for my matrix \(A+B\), because the matrices were parametric! What I had, in addition, was:
  • Matrix \(A\) is Hermitian (or symmetric) and negative-definite.
  • Matrix \(B\) is skew-Hermitian (or skew-symmetric).
As matrix \(A\) is Hermitian and negative-definite, we can conclude that its eigenvalues are in fact negative, so this part is "stable". So, the question of the stability of \(U'(t)=(A+B)\,U\) boils down to an algebraic question:

Can adding a skew-Hermitian matrix to a stable matrix make the sum \(A+B\) unstable?

\[ \underbrace{A}_{\text{stable}} + \underbrace{B}_\text{skew-Hermitian} \Longrightarrow \text{stable?} \] I was digging through the literature to answer this question. I found tons of inequalities (partially discussed in Terence Tao's blog) giving bounds on the eigenvalues of matrix sums, but nothing could help me!

Then, I came up with the idea of generating lots of random 2-by-2 matrices with the same properties as my \(A\) and \(B\), to see whether \(A+B\) stays stable. And it was always stable (a reconstruction of this little experiment appears a bit further below)! So, I got really suspicious and even more motivated to continue my search. One day, Google took me to Michele Benzi's website outlining the topics of the course "Matrix Analysis", and something caught my attention: Bendixson's theorem!
I was, of course, curious enough to google for this. I found another paper, by Helmut Wielandt, rephrasing the original result of Bendixson:
And this is what I was looking for: it simply says that the real parts of the \(\lambda_j\)'s are bounded by the (real) eigenvalues of the Hermitian matrix \(A\), while the imaginary parts are bounded by the (purely imaginary) eigenvalues of the skew-Hermitian matrix \(B\).
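
Here, by the way, is the kind of random experiment that made me suspicious in the first place (a numpy reconstruction, not my original script):

import numpy as np

rng = np.random.default_rng(0)

def random_sum(n=2):
    # A: symmetric negative-definite (all eigenvalues strictly negative)
    M = rng.standard_normal((n, n))
    A = -(M @ M.T) - 0.1 * np.eye(n)
    # B: skew-symmetric
    K = rng.standard_normal((n, n))
    B = K - K.T
    return A + B

# the largest real part over many trials stays negative, as Bendixson predicts
worst = max(np.linalg.eigvals(random_sum()).real.max() for _ in range(10000))
print(worst)  # < 0 in every trial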

This result was obtained by Bendixson in 1901 (for real matrices) and was then extended to complex matrices in 1902 by Hirsch:
My problem, fortunately, got solved by this interesting theorem and ended up in a journal paper. But I am still surprised by how unaware mathematicians are of Bendixson's theorem, even the great Terence Tao (!):
To be continued!

Oct 22, 2020

In the search of the “logistic regression”! (part I)

Unfortunately, if you ask a "data scientist" to explain the real machinery and rationale behind logistic regression, the answer can hardly convince anyone, maybe not even themselves (which I doubt)! They repeat the same "ambiguous" explanations about the logit function and log-odds and how much it looks like a line! "Look! It is so simple: \(\ln(\frac{p}{1-p})=\beta_0+\beta_1 x\)". Things get even worse when they want to justify the strange form of the "cost function" (here, for example). My first goal in this post is to justify the cost function in a simple, but mathematically valid, language. In the next post, I will explain it from another point of view: maximum likelihood estimation.
The dean of the Department of Economics and Business at a respectable university would like to predict whether an application submitted to their graduate program will be accepted or rejected, based on the applicant's GPA. She reviews 20 applications from last year, and she puts her findings in a scatter plot, where 0 and 1 correspond, respectively, to the rejection or acceptance of an application. (In French, Mention is the grade!)


Even though the correlation between GPA and the application decisions is not that strict, she can still say that applicants with higher GPAs "seem more likely" to get accepted. But rather than such rough judgments, the dean prefers to be more precise and get a model describing this "likeliness" or "probability", a model which provides the acceptance probability given a particular GPA. What can she do?

Linear regression? Not a good idea!

One may rapidly suggest linear regression (seriously?!). It looks a bit strange; however, let's give it a chance. Linear regression provides the yellow line below, the line with the smallest residuals:
One may say that it is better than nothing, and I agree! But notice that the values the linear model predicts are not in the interval \([0,1]\), so one cannot directly read them as a valid probability between 0% and 100%.

Transforming the linear regression? Not a good idea either!

One may think of "transforming" the linear model into a model living in \([0,1]\). A very common and well-known trick is to use the logistic function, which has an "S"-shaped curve (a sigmoid function): \[y = g(z)=\dfrac{1}{1+e^{-z}}\]
In fact, the logistic function transforms the variable \(z\), living in \((-\infty,+\infty)\), to \(y\), which lives in \([0,1]\). This property is exactly what we were looking for!

So, it seems that the problem is solved and we can transform the output of our linear regression from \((-\infty,+\infty)\) to \([0,1]\). To do this, we still use the linear estimator \(\eta=\beta_0+\beta_1 \,x\), but what we change is the estimated response: instead of \(\widehat{y}=\eta\), we take \(\widehat{y}=g(\eta)\).

Let's look at what we get. The red line is the result of applying the logistic function to the yellow line. As we wished, the estimated values are in \([0,1]\), which is great. But does this model make any sense? The yellow line minimizes the residuals, \(\min\|\eta-y\|\), among all linear models. But what can we say about the red line? Does it reach the minimum of \(\min\|g(z)-y\|\)? No! In fact, we cannot find much mathematical sense in this transformed model unless we make a small modification.

Least-square of residuals? We are quite there!

Our persistent dean does not lose hope and, this time, tries finding the solution of \(\min\|g(z)-y\|\). This is what she gets (the red line):
Now the values are in \([0,1]\), and she is sure that this logistic model is the best model in the sense that it minimizes the norm of the residuals. Seems much better, no? But she should not get too excited yet!

Do you remember that to find the minimum of a function, we look for the points where its derivative is zero? In fact, this zero-derivative condition is satisfied not only at the global minimum of the function, which is our objective, but also at all local minima and maxima. If the function is "convex", one can be sure that there is only one minimum, but what if it is "non-convex" (as you see in the picture)?
When you ask your computer to solve the minimization problem, it searches for points which satisfy the zero-derivative condition, starting from the initial guess you feed it, and it gets back to you as soon as it finds such a point. But can one guarantee that what is found is the global minimum?
Unless the function is convex, this point can even be a global maximum! In fact, for non-convex problems, your computer may get trapped at a point other than the global minimum and consider it the solution you asked for. This is the drawback of \(\min\|g(z)-y\|\); it is non-convex! It is insightful to check this graphically:


The absolute difference between the estimated response \(\widehat{y}_k\) (for every data point \(k\)) and the true response \(y_k\) is called the residual \(e_k\). Now, consider three residuals: \(e_1\) and \(e_3\) where the response is 1, and \(e_2\) where it is 0: \[e_1=|1-g(\eta_1)| \qquad \eta_1=\beta_0-0.3\,\beta_1\\ e_2=|0-g(\eta_2)| \qquad \eta_2=\beta_0+0.2\,\beta_1\\ e_3=|1-g(\eta_3)| \qquad \eta_3=\beta_0+0.8\,\beta_1\] For simplicity, we fix \(\beta_0=1.5\). So, we plot each squared residual as a function of \(\beta_1\): \(e_1^2\) (in red), \(e_2^2\) (in blue), and \(e_3^2\) (in green). The sum \(\sum_{k=1}^3 e_k^2\) (the cost function) is also plotted in orange, and it is clearly non-convex!

Notice that the minimization problem here is much more difficult than the one in linear regression. There, the regression was linear, so we could employ linear algebra (QR decomposition) to simplify the problem. Here, the logistic function makes the problem non-linear, so we have no choice but to use the zero-derivative condition.

But this is not the whole story. Even if God solved this minimization problem for you (which He does not!), it is not plausible to define the residuals as in linear regression. Here, there are only two possible outputs: 0 (rejection) or 1 (acceptance). If the response is 0 and the estimate is 1, the residual is 1, but this "1" is really huge; it is the largest error one could make, it is not "only 1"! So, it makes sense to define the residual differently, such that the cost function reflects the "huge" difference between the two possible outputs.

Re-defining the residuals? Finally … the salvation!

Rather than the usual residual \(e_k=|\widehat{y}_k-y_k|\), we want to define a new residual \(\epsilon_k\) which is huge when \(e_k=1\). What about this: \[\boxed{\epsilon_k=-\ln(1-e_k)}\] As we can see in the graph:

  • if \(e=0\) (smallest difference), \(\epsilon=0\)
  • If \(e=1\) (largest difference), \(\epsilon=+\infty\)
So, for this new residual, having a difference of 1 is infinitely costly, which is exactly what we were looking for. Then, instead of minimizing \(\|e\|\), we consider another minimization problem, \(\min \sum_k\epsilon_k\). One can show that the function \(\sum_k\epsilon_k\) is convex, so it has only one local (and global) minimum. The more common form of writing this cost function is as follows: \[\boxed{\ell =\sum_{k=1}^m\Big[-y_k\,\ln(\widehat{y}_k)-(1-y_k)\,\ln(1-\widehat{y}_k)\Big]}\]
  • When \(y_k=0\), the residual will be \(-\ln(1-\widehat{y}_k)\)
  • When \(y_k=1\), the residual will be \(-\ln(\widehat{y}_k)\)
which gives the same thing as the other formula, \(\min \sum_k\epsilon_k\).
Considering the same example as before, we compute the new residuals, \(\epsilon_1\) and \(\epsilon_3\) where the response is 1 and \(\epsilon_2\) where it is 0: \[\epsilon_1=-\ln(1-e_1)\\ \epsilon_2=-\ln(1-e_2)\\ \epsilon_3=-\ln(1-e_3)\] So, we plot each residual as a function of \(\beta_1\): \(\epsilon_1\) (in red), \(\epsilon_2\) (in blue), and \(\epsilon_3\) (in green). The sum \(\sum_{k=1}^3 \epsilon_k\) (the cost function) is also plotted in orange, and it is convex!

Let us solve \(\min \sum_{k=1}^m \epsilon_k\) and obtain the logistic model as this new red line, which has values in \([0,1]\), whose residuals are based on a correct intuition, and which is much easier to compute thanks to convexity.
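
If you want to try this at home, here is a minimal sketch with scipy; the data points are made up, just mimicking the dean's example:

import numpy as np
from scipy.optimize import minimize

def cost(beta, x, y):
    # eta = beta0 + beta1 * x, y_hat = g(eta)
    y_hat = 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * x)))
    return np.sum(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))

# hypothetical grades (Mention) and 0/1 decisions
x = np.array([11.0, 12.5, 14.0, 15.5, 16.0, 17.5, 18.0, 19.0])
y = np.array([0, 0, 1, 0, 0, 1, 1, 1])

beta0, beta1 = minimize(cost, x0=[0.0, 0.0], args=(x, y)).x
print(beta0, beta1, -beta0 / beta1)  # the last value is the decision boundary (eta = 0)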


Having probabilities is nice, but sometimes we want to be stricter; we want to decide: 0 or 1! In such cases, we often consider a threshold probability of 50% based on the logistic model: if the probability is more than 50%, the application will be accepted, and rejected otherwise.

Mathematically speaking, this is related to the sign of our linear estimator \(\eta=\beta_0+\beta_1\, x\):
  • If \(\eta>0\), the probability will be more than 50%
  • If \(\eta<0\), the probability will be less than 50%
  • If \(\eta=0\), the probability is exactly 50%. This gives us the "decision boundary", which is coloured black in the graph, at Mention = 17.29983.

I should stop here, as this post has already gotten too long. But we are not finished yet! In the next post, the dean plans to add another variable to her study, so we will face multiple logistic regression. It will be so much fun! We will also travel to the 19th century to discover the origins of logistic regression!

Aug 16, 2020

Student’s t-test: judge the world with a small sample!

Assume that, for some reason, you weigh a can of tuna on a kitchen scale and notice that the weight is quite a bit smaller than it should be. As you are a persistent person, you weigh 6 more cans, and the average weight is still far from what is mentioned on the label. So you probably wonder whether the company is cheating and whether you have found trustworthy evidence, no?

Notice that your evidence, the cans' average weight, is based on a sample of only the 7 cans you could get your hands on. However, what you want to claim is about the average weight of the whole population of tuna cans that the company produces. Are we able to prove this fraud with such a small sample?

This problem is very common in statistics: you need to make a judgement about the population mean based on the sample mean. Such a judgement is extremely important in practice, as the whole population is often not accessible, or investigating it is costly and difficult. This is the main motivation for the t-test!

Let us first introduce some simple notations:

  • \(n\): the number of samples \(X_1, X_2, \dots, X_n\)
  • \(\bar{X}\) and \(s\): the mean and standard deviation of the sample
  • \(\mu\) and \(\sigma\): the mean and standard deviation of the population

Assume that the population is normal, which means that if you repeat your experiment (on \(n\) samples) lots of times (\(k\) times), and you compute and plot the sample mean \(\bar{X}\) each time, you will end up with the Gaussian (bell-shaped) distribution. The following graph is plotted with \(k = 20000\), taking the average weight of a tuna can to be 150 g.

One can prove that \[\bar{X}\sim N\!\left(\mu,\ \frac{\sigma}{\sqrt{n}}\right)\] which is mathematical notation for saying that the sample mean \(\bar{X}\) is normally distributed around the population mean \(\mu\) with standard deviation \(\sigma/\sqrt{n}\). One can standardize this using the so-called z-score \[z=\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\] which has a normal distribution with zero mean and unit standard deviation: \(z\sim N(0,1)\).


This seems useful, as the famous 68–95–99.7 rule tells us that there is a 99.7% chance that \[-3\ \leq\ z\ \leq\ 3.\] So, after some simple manipulations, we get an estimate for the population mean: \[\bar{X}-3\,\frac{\sigma}{\sqrt{n}}\ \leq\ \mu\ \leq\ \bar{X}+3\,\frac{\sigma}{\sqrt{n}}\] which is very, very likely to be true (OMG, 99.7%!).


So, based on one small sample of size \(n\), we can estimate \(\mu\). And what if the nominal weight on the label does not satisfy this estimate? The company is, very likely, cheating!

t-score 

Things seem too good to be true, no? Note that this interesting calculation based on the z-score requires knowing the standard deviation of the population. Do we have it? Often not!

One may be tempted to simply replace \(\sigma\) with the sample standard deviation \(s\). This is how we end up with the so-called t-value: \[t=\frac{\bar{X}-\mu}{s/\sqrt{n}}\]
The problem is still not solved! The issue is that for \(z\), we had \(z\sim N(0,1)\), based on which we found the final estimate for \(\mu\); but what can one say about \(t\)?

It was William Sealy Gosset, an Irish statistician working at the Guinness Brewery, who addressed this question. He found that the t-score, unlike the z-score, does not follow the normal distribution but another distribution, which is what we now call the t-distribution with \(\nu=n-1\) degrees of freedom:

In this graph, we have plotted the normal distribution as well as the t-distribution for different sample sizes, or degrees of freedom:

As one can see, when the sample size is small, the t-distribution is quite different from the normal distribution, but as the sample gets larger, the t-distribution gets closer and closer to the normal one. They are almost the same for samples of size larger than 30.

So, the problem is solved: we know the distribution of the t-score, we can determine the region of 99.7% probability around zero, we do calculations similar to those before, and we obtain an estimate of the population mean using \(\bar{X}\) and \(s\).
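
In practice, this whole procedure is one function call away; here is a small sketch for the tuna example (the weights are made up):

import numpy as np
from scipy import stats

# hypothetical weights (in grams) of our 7 cans; the label claims 150 g
weights = np.array([141.0, 144.5, 143.2, 146.1, 140.8, 145.3, 142.7])

# one-sample t-test of H0: the population mean is 150 g
t_stat, p_value = stats.ttest_1samp(weights, popmean=150.0)
print(t_stat, p_value)  # a tiny p-value makes the label very suspicious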



A bit of history

Gosset published his seminal work, "The probable error of a mean", in 1908, but not under his own name! He instead used the pen name "Student", and this is why today we call his distribution Student's t-distribution. This is, happily, much simpler than the name he had chosen: "frequency distribution of standard deviations of samples drawn from a normal population"!


It is not crystal clear why he used a pen name, and why he chose "Student"! Perhaps it was due to the Guinness Brewery's secrecy policy. Yet another interesting explanation is that the famous Karl Pearson, the editor of the journal Biometrika in which the paper was published, did not want it to be known that the paper had been written by a brewer!


Usually, people attribute the t-distribution to Gosset. However, this distribution had been discovered in Germany around the same time that Gosset was born (1876)! Friedrich Robert Helmert, in a series of articles (1875-1876), provided correct mathematical proofs of all the elements necessary for finding the t-distribution. He only needed a simple calculation to reach what Gosset reached 32 years after him; but he was not interested! Clearly, his work was unknown to English statisticians, including Gosset, who had to solve the problem by himself. Lacking Helmert's analytic power, Gosset based his result on two "assumptions" which were later proved by another great name in statistics, Ronald Fisher!


Also in 1876, the eminent German mathematician Jacob Lüroth obtained the t-distribution in his paper entitled "Vergleichung von zwei Werten des wahrscheinlichen Fehlers". His work was, however, based on some unexplained assumptions.