Nov 28, 2021

The salvation: when I finally understood "dynamic programming"!



For someone without a classical education in computer science, like me, dynamic programming interview problems can be a nightmare! I finally found my way to understanding these kinds of problems, and that is what I am going to talk about in this post.



Acknowledgements: I used hilite to create the Python code blocks for this post, so I owe them a sincere thank-you! Also, the thumbnail is from Avik Das' post on dynamic programming. I found it while searching for a thumbnail, but the post itself is truly interesting and well-written. Recommended reading!

During an interview process, I received an assignment asking for something like "the minimum number of times one needs to do an experiment to guarantee some value". I approached the problem like a mathematician (which I am, roughly speaking!), worked hard on it for a week, and got my rejection notice a couple of days later! I asked, "But why? What was wrong there?" And the answer was "... the correct answer is X, but your answer is Y. One possible solution is using dynamic programming." Oh! That was it. I don't know about you, but this was the first time I had heard the term "dynamic programming". I wanted to know what this so-called dynamic programming was that had cost me the job I really liked. I tried different approaches: the "Greedy Algorithms, Minimum Spanning Trees, and Dynamic Programming" course on Coursera, geeksforgeeks exercises, and HackerRank. They were great but not helpful for me; I still did not understand what I should do with a DP problem.

One night, out of the blue, I found "Dynamic programming for coding interviews", and I immediately read the first 6 chapters until midnight! (Note: the book is not voluminous, around 130 pages.)
Afterwards, I understood the DP approach much better than before; it was like a miracle! The reason might be that the book often approaches each example in three different (but related) ways:
  • recursive approach: handle one step and delegate the rest back to the function
  • memoization approach: recursive approach + save/cache the results to avoid computing them again in the future
  • dynamic programming approach: keep the recursive soul, but without recursion; build the solution from the bottom up
Comparing the code of these three approaches clarified DP better than anything else I had ever read!

For example, assume that you are asked to compute the \(n\)-th term of the famous Fibonacci sequence \(1,1,2,3,5,8,\dotsc\) with the recursive rule \[\boxed{f_n = f_{n-1} + f_{n-2}}\] In the recursive approach, we define a function which simply breaks the main problem of computing fib(n) into two smaller problems, by calling itself to compute the \((n-1)\)-th and \((n-2)\)-th terms:

def fib(n):
    if n == 1 or n == 2:    # base cases: the first two terms are 1
        return 1
    else:
        return fib(n-1) + fib(n-2)

This code calls the function fib() many times, which makes it slow, with exponential complexity, while lots of these calls are repetitive. For example, if the goal is to compute fib(5), we call fib(4) and fib(3). The former calls fib(3) and fib(2), while the latter calls fib(2) and fib(1). This is illustrated in the following picture (from the book).
So, just in the second "step" of this recursion, fib(3) is already called twice. One can imagine that the number of these repetitions gets larger and larger for the later terms of the sequence. This is what makes this code impractical and inefficient.
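To see this blow-up concretely, one can simply count the calls. Here is a quick sketch (my own instrumentation, not from the book):

from collections import Counter

calls = Counter()                  # calls[k] = number of times fib(k) is invoked

def fib(n):
    calls[n] += 1
    if n == 1 or n == 2:
        return 1
    return fib(n - 1) + fib(n - 2)

fib(5)
print(dict(calls))                 # {5: 1, 4: 1, 3: 2, 2: 3, 1: 2}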

What if we saved the values we have already computed somewhere, so that we could just read them from memory rather than recomputing them through recursive calls? This is what is called "memoization". Here is what the code looks like in this case:

memo = [0] * 100   # cache: memo[n] holds fib(n) once computed (works for n < 100)
def fib(n):
    if memo[n]:                        # already computed: just read it back
        return memo[n]
    if n == 1 or n == 2:
        memo[n] = 1
    else:
        memo[n] = fib(n-1) + fib(n-2)
    return memo[n]

We still have the same recursion, but if memo is already filled in, we just read from memory. This improves the efficiency a lot, and it is usually easy to do. But we can still do better: the "dynamic programming approach". For DP, we adopt the bottom-up approach, that is, we start from smaller problems in order to solve larger ones.
In this example, it means that we start by computing fib(3), fib(4), ... until we reach the term we want to compute. We combine caching the already-computed results with the recursive rule to get this very fast DP-based code:

def fib(n):
    memo = [0] * max(n, 2)   # one slot per term (at least two for the base cases)
    memo[0] = 1              # zero-based: memo[i] holds the (i+1)-th term
    memo[1] = 1
    for i in range(2, n):
        memo[i] = memo[i-1] + memo[i-2]
    return memo[n-1]

There is no recursive function call anymore; it is simply reading from memory! I do not claim this is a very accurate benchmark, but this is what I got on my machine for \(n=35\). You see that the difference is huge!

Approach        Execution time (ms)
Recursive       11364
Memoization     0.08
DP              0.03
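If you want to reproduce rough numbers on your own machine, a minimal timing sketch (again, not rigorous benchmarking) could look like this, applied to whichever version of fib you have just defined:

import time

def time_ms(f, *args):
    # return the runtime of f(*args) in milliseconds
    start = time.perf_counter()
    f(*args)
    return (time.perf_counter() - start) * 1000

print(f"fib(35) took {time_ms(fib, 35):.2f} ms")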

This example is certainly too simple; one could even think of the DP method directly. But, for better or worse, it is not always like this. There are cases where it is more difficult to see the structure of the DP solution right away. For example, what do you think about this problem from the above-mentioned book?
Hint: Imagine what would happen if you put the first tile vertically/horizontally, and start from the recursive solution.

The rope problem

Now, I can get back to the problem I was working on for my interview assignment: the rope problem.

You are given \(M\geq 1\) identical ropes, and you are asked to find the strength of the rope by attaching weights between \(1\) kg and \(N\) kg. If a rope fails during a test with load \(w\), it means that its strength is less than \(w\). The strength is at most \(N\). You need to determine the minimum number of tests that are necessary to find this strength.

The tricky point is that the number of ropes is limited. If we had infinitely many ropes, binary search would be the best approach. On the other hand, with only one rope, we would have no chance to miss, so we would have to do a "linear search": load the rope with weights \(1,2,3,4,\dotsc\), in order, until it breaks.

At that time, the way I approached this problem was mathematical rather than algorithmic. I tried to find a clever choice of the weights to test, to minimize the number of tests needed. I still believe that my answer was interesting, and it could reduce the exponential complexity of a brute-force search to polynomial. However, it was not what they were looking for!

Let's start to solve this problem with the recursive approach: given a rope with a certain strength bound (\(n\) in the beginning), we do a search over all weights. As we need to consider the worst case, we take the maximum of two cases:
  • The rope fails under weight \(w\): we call the function back with \(w-1\) as the strength bound, and reduce the number of available ropes.
  • The rope does not fail: we call the function back with \((n-w)\) as the strength bound. Notice that, in fact, the strength would be between \((w+1)\) and \(n\), but what really matters is the number of weights remaining to test, not the upper and lower bounds.
We then take the minimum number of tests over all the different weights we just searched over.

def testRope(m, n):
    minTest = n                        # worst case: plain linear search
    if m == 1:                         # one rope left: we cannot afford to miss
        return minTest
    if n <= 0:                         # nothing left to determine
        return 0
    if n == 1 or n == 2:
        return n
    for w in range(1, n+1):            # try every weight w
        minTest = min(minTest, 1 + max(testRope(m-1, w-1),   # the rope fails
                                       testRope(m, n-w)))    # the rope survives
    return minTest

The code works, nice! But it does not give me any result for \(n>25\), even after minutes of waiting! So, let's think about the DP approach. We keep more or less the same recursion, but without calling a function over and over. We adopt the same methodology as in the Fibonacci example: we save the results of smaller problems in a matrix, considering that:
  • If there is only one rope, we have to do a linear search, so testRope(1,n) = n
  • If the strength bound is 1, a single test settles it, so testRope(m,1) = 1
  • If the strength bound is 2, one can easily check that two tests are needed, so testRope(m,2) = 2

These three simple rules are essentially the initialization loops at the top of the code below, where memo is our two-dimensional list saving the computed results. There is just a small tweak in the indices, due to the zero-based indexing in Python: memo[i][j] stores the result of testRope(i+1, j+1).
def testRope(m, n):
    memo = [[0] * n for _ in range(m)]   # memo[i][j] = testRope(i+1, j+1)
    for j in range(n):
        memo[0][j] = j + 1               # one rope: linear search
    for i in range(m):
        memo[i][0] = 1                   # strength bound 1: one test
        memo[i][1] = 2                   # strength bound 2: two tests

    for i in range(1, m):
        for j in range(2, n):
            for w in range(j + 1):       # try every weight w+1
                if w == 0:
                    memo[i][j] = min(j + 1, 1 + memo[i][j-1])
                elif w == j:
                    memo[i][j] = min(memo[i][j], 1 + memo[i-1][j-1])
                else:
                    memo[i][j] = min(memo[i][j], 1 + max(memo[i-1][w-1], memo[i][j-w-1]))
    return memo[m-1][n-1]

The next step is to solve more complicated problems, namely with \(n=3\) and 2 ropes. What do we need to solve this? Exactly the simpler problems that we just solved! We test a weight: if the rope fails, we are left with a linear search; if it does not fail, we are left with a usual search of length 1 or 2, whose solution is already known to us! This is exactly the update in the innermost loop of the code above, and we put it in a loop. So, we build the final solution from these sub-problems.

What about timing here? Let's take a look at the case \(m=2\) and \(n=25\):
Approach        Execution time (ms)
Recursive       25700
DP              0.27

OK! I am done with this post. I hope that if you were not familiar with DP, this post has given you a better idea of how to approach it. And don't forget to check out the book; it explains the material much better than I did :)

Aug 1, 2021

An intro to NLP (not the one that you think!)



In the data science era, the first thing that may come to your mind when you hear "NLP" is probably Natural Language Processing, isn't it? I have studied natural language processing to some extent, and no doubt it is a really interesting topic to dive into! But there is another (much older) NLP: Non-Linear Programming, i.e., a non-linear optimization problem with non-linear (equality or inequality) constraints. In this post, I am going to talk about this NLP, but quite briefly. A more in-depth discussion would require writing a book, not a blog post!

If you have taken an elementary machine learning or optimization course, you have probably learned about the Gradient Descent (GD) method, which is very famous in deep learning these days. But the GD method (without any modification) is not well-suited for constrained optimization problems, where you need to minimize a function while satisfying a set of other equalities and/or inequalities. Let me give you an example:
  • minimize \(f(x_1,x_2) = -x_1\,x_2\)
  • subject to
    • \(h(x_1,x_2) = 20x_1+15x_2-30=0\)
    • \(g(x_1,x_2) = (x_1/2)^2 + x_2^2-1\leq 0\)
    • \(0\leq x_1,x_2\leq 3\)
In this example, we have three types of constraints: the equality constraint \(h(x_1,x_2)=0\), the inequality constraint \(g(x_1,x_2)\leq 0\), and the bound constraints on the design (or state) variables \(x_1,x_2\in [0,3]\). So, the main question is how to take these constraints into the minimization problem.

Lagrange multiplier method

The Lagrange multiplier method is well-known for equality-constrained optimization. Simply speaking, we minimize \(f(\mathbf{x})-\lambda h(\mathbf{x})\) over \(x\) and \(\lambda\). I am not going to discuss the details, so if you feel like you need more explanation, do not hesitate to watch this video:
If we did not have the inequality constraint, the problem would be almost solved by the Lagrange multiplier method. But how do we bring that constraint into play? This is where a very interesting concept of NLP appears: the slack variable, whose role is to transform the inequality constraint into an equality constraint. We define a new variable \(z\) such that \[g(x_1,x_2)+z^2 = (x_1/2)^2 + x_2^2-1+z^2= 0\] and now we have an equality constraint!

We can then apply the Lagrange multiplier method with this additional slack variable \(z\) and find the Lagrangian as: \[ \mathcal{L}(x_1,x_2,z,\lambda,\beta) = f(x_1,x_2) - \lambda\,h(x_1,x_2) -\beta\, \left(g(x_1,x_2)+z^2\right) \] We use \(\lambda\) for the Lagrange multiplier of the equality constraint and \(\beta\leq 0\) for the inequality constraint (for a maximization problem, \(\beta\geq 0\)).

So, the first-order optimality conditions are: \[ \partial_{x_1}\mathcal{L} = \partial_{x_1}f-\lambda\,\partial_{x_1}h-\beta\,\partial_{x_1}g=0\\ \partial_{x_2}\mathcal{L} = \partial_{x_2}f-\lambda\,\partial_{x_2}h-\beta\,\partial_{x_2}g=0\\ \partial_{z}\mathcal{L} =-2\beta\,z=0\\ \partial_{\lambda}\mathcal{L} = -h(x_1,x_2)=0\\ \partial_{\beta}\mathcal{L} = -\left(g(x_1,x_2)+z^2\right)=0 \] Note that the slack variable can easily be removed from the system (using the third and last equations), and we are left with: \[ \partial_{x_1}f-\lambda\,\partial_{x_1}h-\beta\,\partial_{x_1}g=0\\ \partial_{x_2}f-\lambda\,\partial_{x_2}h-\beta\,\partial_{x_2}g=0\\ \beta\,g=0\\ h = 0 \] Now we can see that this system is equivalent to applying the Lagrange multiplier method to \(\mathcal{L} = f - \lambda\,h -\beta\, g\) subject to \(h=0\) and \(\beta\, g = 0\), as if the slack variable had never existed!


The equation \(\beta\,g=0\) brings up two cases:
  • \(\beta = 0\) and \(g< 0\) (non-binding or inactive constraint)
  • \(\beta\neq 0\) and \(g=0\) (binding or active constraint)

Non-binding constraint case

For the non-binding constraint case, the constraint \(g\leq 0\) is removed completely from the system and we should solve \[ \partial_{x_1}f-\lambda\,\partial_{x_1}h=0\\ \partial_{x_2}f-\lambda\,\partial_{x_2}h=0\\ h = 0 \] or \[ -x_2-20\lambda = 0\\ -x_1-15\lambda = 0\\ 20x_1+15x_2=30 \] which gives the result \[\boxed{x_1=0.75, \qquad x_2=1, \qquad \lambda = -0.05}\] 
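Since this system is linear, one can also verify the result with a few lines of numpy (a small sketch of my own):

import numpy as np

# unknowns ordered as (x1, x2, lambda)
A = np.array([[  0.0, -1.0, -20.0],    # -x2 - 20*lambda = 0
              [ -1.0,  0.0, -15.0],    # -x1 - 15*lambda = 0
              [ 20.0, 15.0,   0.0]])   # 20*x1 + 15*x2 = 30
b = np.array([0.0, 0.0, 30.0])
x1, x2, lam = np.linalg.solve(A, b)
print(x1, x2, lam)                     # 0.75  1.0  -0.05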
Notice that it is the same solution that one could have found naively by substituting \(x_1\) or \(x_2\) in the function \(f\) using the equality constraint (which is a line). However, as one can see in the following picture, the solution is not a feasible solution as it does not respect the inequality constraint:

Binding constraint case

For the binding constraint case, one gets a non-linear system of equations to solve: \[ -x_2-20\lambda-0.5\,\beta\,x_1 = 0\\ -x_1-15\lambda-2\,\beta\,x_2 = 0\\ 20x_1+15x_2=30\\ 0.25x_1^2+x_2^2=1 \] which can be solved, for example, using the fsolve function in scipy.optimize, resulting in \[\boxed{ x_1 = 0.81511541, \qquad x_2=0.91317946, \qquad \lambda = -0.04391383, \qquad \beta = -0.08563926}\] which satisfies all the constraints, and also \(\beta<0\), as required.
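In case you want to try it yourself, the fsolve call looks roughly like this (the initial guess is my own choice; other guesses may converge to other roots):

from scipy.optimize import fsolve

def kkt_system(v):
    x1, x2, lam, beta = v
    return [-x2 - 20*lam - 0.5*beta*x1,     # stationarity in x1
            -x1 - 15*lam - 2*beta*x2,       # stationarity in x2
            20*x1 + 15*x2 - 30,             # equality constraint h = 0
            0.25*x1**2 + x2**2 - 1]         # binding constraint g = 0

x1, x2, lam, beta = fsolve(kkt_system, [1.0, 1.0, -0.1, -0.1])
print(x1, x2, lam, beta)   # approx 0.815, 0.913, -0.044, -0.086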

Closing remarks

  • A very similar procedure applies in the general case with several equality and inequality constraints, under the name of the Karush–Kuhn–Tucker (KKT) conditions.
  • One can also bring the bounds on the design variables into the Lagrangian; for example, check out interior-point (or barrier) methods.
  • For numerical purposes, rather than handling all the different binding and non-binding cases separately, one would use, for example, the Exterior Penalty Function method or the Augmented Lagrange Multiplier method, which adds a penalty term to the main minimization problem to enforce the constraints, like: \[ \alpha_h \Big[\sum_k h_k(\mathbf{x})^2\Big]+\alpha_g\Big[\max\left(0,g(\mathbf{x})\right)^2\Big] \] for some positive constants \(\alpha_h\) and \(\alpha_g\), which penalizes non-zero values of the equality constraints \(h_k\) (indexed by \(k\)) and positive values of \(g\). Then, an unconstrained optimization method (such as GD) can be used; this is, for example, the case with the minimize function in scipy.optimize. In the jupyter notebook you can see that the result is the same as the "analytical" one.
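As a rough sketch of the penalty idea (the penalty weights below are arbitrary choices of mine, not tuned values):

import numpy as np
from scipy.optimize import minimize

def penalized(x, a_h=1e3, a_g=1e3):
    x1, x2 = x
    f = -x1 * x2                            # objective
    h = 20*x1 + 15*x2 - 30                  # equality constraint
    g = (x1/2)**2 + x2**2 - 1               # inequality constraint
    return f + a_h*h**2 + a_g*max(0.0, g)**2

res = minimize(penalized, x0=[1.0, 1.0], bounds=[(0, 3), (0, 3)])
print(res.x)   # should be close to (0.815, 0.913)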

Jul 26, 2021

On the "Gershgorin circle theorem": How to bound the eigenvalues beautifully!

The Gershgorin circle theorem is really simple despite its fancy name. I got to know this theorem during my PhD, when I was working on the solvability of numerical methods and needed to show that the spectrum of the system I had to solve did not contain the origin. Here, I give a very brief introduction to this theorem.

Let's start from the end: the theorem! The Gershgorin circle theorem says that, given a square matrix \(A\), no eigenvalue can be located outside a finite number of discs.

More precisely, for every row \(i\), we consider the disc \(D(a_{ii},R_i)\) with
  • center \(a_{ii}\) (the diagonal entry of \(i\)-th row)
  • radius \(R_i:=\sum_{j\neq i }|a_{ij}|\) (the sum of the absolute values of the off-diagonal entries of row \(i\))
Then, every eigenvalue lies within at least one of discs \(D(a_{ii},R_i)\).

Consider a very simple case of a \(3\times 3\) matrix: \[A = \begin{bmatrix} 2 & 0.5 & 0.5\\ 0 & 5 & 1\\ 1 & 1 & 8\\ \end{bmatrix}\] We can easily see that there are three discs: \[D(2,1), \quad D(5,1), \quad D(8,2)\] which contain the eigenvalues \(1.95, 7.76, 5.3\): (you can find the jupyter notebook here)
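If you want to check this without opening the notebook, a few lines of numpy suffice (a quick sketch):

import numpy as np

A = np.array([[2.0, 0.5, 0.5],
              [0.0, 5.0, 1.0],
              [1.0, 1.0, 8.0]])

centers = np.diag(A)                             # disc centers a_ii
radii = np.abs(A).sum(axis=1) - np.abs(centers)  # R_i: row sums of |off-diagonal| entries
print(centers, radii)                            # [2. 5. 8.] and [1. 1. 2.]
print(np.sort(np.linalg.eigvals(A)))             # approx [1.95, 5.30, 7.76]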
The proof of this theorem is quite easy. One just needs the definition of an eigenpair, \(Ax=\lambda x\), and the triangle inequality: pick the index \(i\) where \(|x_i|\) is largest; the \(i\)-th row of \(Ax=\lambda x\) gives \((\lambda-a_{ii})\,x_i=\sum_{j\neq i}a_{ij}x_j\), and dividing by \(x_i\) and applying the triangle inequality yields \(\boxed{|\lambda-a_{ii}|\leq R_i}\). The way Semyon Aronovich Gershgorin proved it in 1931 is a bit different, though similar.
He first started from another ansatz which was an extension of the work of Levy:
Then, based on this, he stated his own beautiful result:
Note that this bound gets better as the off-diagonal part gets smaller (so smaller radius). The limit case is a diagonal matrix!

Brainteaser:

Now, can you answer why diagonally dominant matrices are really desired in numerical analysis? ;)

Mar 4, 2021

PCA or SVD? Are they really different?

I have been educated as an applied mathematician, where SVD was in our DNA! So, quite naturally, when I first heard people, mostly in data science, talking about PCA, I got hurt! "Oh! It is SVD, do not call it PCA!" As a matter of fact, I was right, strictly speaking. However, the whole story is more interesting than just a name. Here, I talk about SVD, PCA, their origins, and more!

What is SVD (Singular Value Decomposition)?

SVD is a matrix factorization in which a rectangular matrix \(X\in\mathbb{R}^{m\times n}\) is decomposed as the product of two square unitary matrices \(U\in\mathbb{R}^{m\times m}\) and \(V\in\mathbb{R}^{n\times n}\) and the diagonal matrix \(\Sigma\in\mathbb{R}^{m\times n}\): \[\boxed{X = U\,\Sigma\,V^t}\] The singular values are really important, as their magnitude indicates the importance of the corresponding columns of \(U\) and \(V\). Notice that the matrix \(X\) can be seen as a linear transformation from \(\mathbb{R}^n\) to \(\mathbb{R}^m\). So, in the matrix-vector multiplication \(X\xi\), the role of each factor can be interpreted as below:
  • Columns of \(V\) give an orthonormal basis for \(\mathbb{R}^n\). So \(V^t\xi\) is a change of basis from the standard basis to the basis given by \(V\).
  • \(\Sigma\) has non-zero entries only on its diagonal, whose role is to scale each component of \(\xi\), already expressed in the new basis. It also adapts the dimension: when \(m<n\) the last \(n-m\) components are dropped, and when \(m>n\) the result is padded with zeros.
  • \(U\) performs a similar change of basis as \(V\), but in \(\mathbb{R}^m\).
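To make this concrete, here is a minimal numpy check of the factorization (my own toy example):

import numpy as np

X = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])          # a 2x3 matrix: a map from R^3 to R^2
U, s, Vt = np.linalg.svd(X)               # U is 2x2, Vt is 3x3, s holds singular values
Sigma = np.zeros(X.shape)
np.fill_diagonal(Sigma, s)                # rebuild the rectangular diagonal matrix
print(np.allclose(X, U @ Sigma @ Vt))     # True: X = U Sigma V^t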
If you would like a more detailed explanation, you can simply consult this great course by Gilbert Strang:
SVD was derived by Beltrami (1873) and Jordan (1874) for bilinear forms.

What is PCA (Principal Component Analysis)?

It was Karl Pearson who founded PCA in 1901, in a paper titled "On Lines and Planes of Closest Fit to Systems of Points in Space". In fact, he was trying to answer a very interesting question: how to represent "a system of points" in a high-dimensional space \(\mathbb{R}^d\) by the "best-fitting" straight line or plane? He defined the "best" line as the line whose mean squared distance to all the points is the smallest. Using the notion of correlation, he transformed the problem into finding the ellipsoid (or "ellipsoid of residuals") with the largest axis:
So, he found one direction which can best explain the data, and another direction (perpendicular to the first one) which is the worst direction for explaining the data. Note that these directions (or lines) are not necessarily along the Cartesian axes.

This method is often used to find the "best" low-dimensional approximation of high-dimensional data: to ignore the least important components and keep only the data which really matter.
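A minimal sketch of this idea, computing a low-dimensional (rank-\(k\)) approximation of a centered point cloud via SVD (random data, just for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))             # 100 points in R^5
mean = X.mean(axis=0)
Xc = X - mean                             # center the cloud of points
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2                                     # keep the two most important directions
X_k = (U[:, :k] * s[:k]) @ Vt[:k] + mean  # best rank-k approximation (plus the mean)
print(Vt[:k])                             # the k principal directions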

So, how do they relate to each other?

Feb 7, 2021

What is an Asymptotic Preserving (AP) scheme? A conceptual journey!

The term "Asymptotic Preserving" was coined by Shi Jin, a well-known Chinese mathematician. However, the idea has its roots in the USSR (like lots of other things in mathematics). In this post, I am going to review AP schemes in a simplified language.

Let's start right away with an example. Assume that you want to solve this simplified advection-diffusion equation: \[\nu\,u''+ u'=0\qquad \text{ over } [0,1] \quad \text{such that }\quad u(0)=0, u(1)=1,\] with two Dirichlet (fixed-value) boundary conditions. One can simply confirm (believe me, it's really simple!) that the function \(u(x)=\dfrac{1-e^{-x/\nu}}{1-e^{-1/\nu}}\) satisfies this equation. If we plot this exact solution for different diffusivity constants \(\nu\), we get this:
As one can see, the solution develops a so-called "boundary layer" near the left boundary, which gets sharper and sharper as \(\nu\to 0\). So far so good, hein? But what happens if you want to obtain this solution numerically? This is what Il'in tried to do in 1969, in a paper titled "Differencing scheme for a differential equation with a small parameter affecting the highest derivative", and he found out that the answer is not that easy!
First, he tried a very simple finite-difference discretization of the second- and first-order terms, and he obtained the numerical solution \(u^h_k=\dfrac{1-\left(\frac{2\nu-h}{2\nu+h}\right)^k}{1-\left(\frac{2\nu-h}{2\nu+h}\right)^N}\), which gives the following plot with \(h \approx 0.01\):
The solution does not make any sense when \(\nu\ll h\). The exact method he used here is not crystal clear to me (old Soviet-style paper writing!), but one may guess that a second-order central difference for the advection term would produce a similar effect, which you can confirm in my jupyter notebook.
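For the record, this is the kind of experiment I mean: a sketch of mine comparing a central difference with a one-sided (upwind-style) difference for the advection term, on a uniform grid:

import numpy as np

def solve_bvp(nu, N, central=True):
    # Solve nu*u'' + u' = 0 on [0,1], u(0)=0, u(1)=1, with a 3-point stencil
    h = 1.0 / N
    n = N - 1                            # interior unknowns u_1 .. u_{N-1}
    if central:                          # central difference for u'
        lo, di, up = nu/h**2 - 1/(2*h), -2*nu/h**2, nu/h**2 + 1/(2*h)
    else:                                # one-sided (upwind-style) difference for u'
        lo, di, up = nu/h**2, -2*nu/h**2 - 1/h, nu/h**2 + 1/h
    A = np.diag([di]*n) + np.diag([lo]*(n-1), -1) + np.diag([up]*(n-1), 1)
    b = np.zeros(n)
    b[-1] = -up                          # move the boundary value u(1)=1 to the RHS
    return np.linalg.solve(A, b)

# With nu much smaller than h, the central scheme oscillates,
# while the one-sided scheme stays monotone:
print(solve_bvp(0.001, 100, central=True)[:4])
print(solve_bvp(0.001, 100, central=False)[:4])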

Investigating this issue, he then succeeded in finding another discretization (probably an upwind method) which provided a reasonable solution \(u^h_k=\dfrac{1-\left(\frac{\nu}{\nu+h}\right)^k}{1-\left(\frac{\nu}{\nu+h}\right)^N}\):
In summary: depending on the discretization one uses for each term, the numerical solution may become inaccurate (or even unstable) if \(\nu\) gets much smaller than the grid size! In other words, the solution may look good for \(\nu=1\), but it is not uniformly accurate as \(\nu\to 0\).

This paper, as well as a couple of others published in USSR, founded the methods that are known today as AP or Asymptotic Preserving methods. As the name implies, AP methods are defined to "preserve the asymptotic". To understand it better, consider the famous AP diagram:
\(\mathcal{M}^\varepsilon\) is the \(\varepsilon\)-dependent equation we want to solve, like the advection-diffusion example (where \(\nu\) plays the role of \(\varepsilon\)). We know that as \(\varepsilon\to 0\), this equation converges to its asymptotic limit \(\mathcal{M}^0\). Now, consider \(\Delta\) as the discretization parameter (like the grid size \(h\) in our example). Then, the discrete numerical model \(\mathcal{M}^\varepsilon_\Delta\):
  • Converges to the continuous equation \(\mathcal{M}^\varepsilon\), for a fixed \(\varepsilon\) when the grid gets refined \(\Delta\to 0\). 
  • Converges to the numerical asymptotic solution \(\mathcal{M}^0_\Delta\), for a fixed discretization when the parameter vanishes \(\varepsilon \to 0\)
As we observed in the advection-diffusion example, the \(\mathcal{M}^\varepsilon_\Delta\to \mathcal{M}^0_\Delta\) convergence does not hold unless the grid gets refined as \(\varepsilon\to 0\). This poses a big issue in practice, as refining the grid increases the computational cost of the method. Sometimes this convergence holds, but \(\mathcal{M}^0_\Delta\) is not consistent with the continuous asymptotic limit \(\mathcal{M}^0\), perhaps because of an artificial \(\mathcal{O}(1/\varepsilon)\) diffusion term which does not make the method unstable, but makes it inconsistent with the true limit model. One can find lots of other interesting examples explaining AP schemes. I gave a brief introduction, with another example (the van der Pol system), in my SlideShare:

Jan 11, 2021

12 Rules for Life (A summary of Jordan Peterson's book)

Jordan Peterson is a difficult person to talk about, as he has lots of haters and lots of lovers. Here, for better or worse, I do not plan to talk about his ideas on topics like gender inequality. My aim is so much more humble!




I first heard of Peterson about two years ago, and my first reaction was to search for his videos on YouTube. I found lots of debates, lots of videos of his university courses, and so on and so forth. I found his way of talking charming and beautiful in terms of the language he employs (note that I am NOT a native English speaker!). I believe that writing speaks better than speaking; it transfers the message more clearly while giving more weight to the content than to the language. You may fool people by talking in beautiful English, but not so easily in writing. So, I searched for his books and started with "12 Rules for Life".

To be honest, I loved the book! That does not mean I endorse 100% of its contents. It does not mean that I do not find it strange that he only blames Mao and Stalin, but not Lyndon B. Johnson for all he did in Vietnam. I felt close to his 12 rules, and I liked the way he tries to justify them based on literature, the Bible, and sometimes science. A friend of mine told me recently that this book is pseudo-science and does not offer anything better than the cheap "road to success" books on the market. I agree that it is not science, strictly speaking, but that is not the point! If you give a rule for life, the relevant question to ask is whether the rules work, whether by applying them you would live a better and happier life. If your answer is yes, I believe that almost no one cares whether the rules are scientifically proven!

Anyway, with this small introduction, I will summarize each chapter here, in this document (updated continuously). My goal is to collect all the important points in one place. I will review the chapters one by one and share my summaries here with you, and you can follow me on this journey!