Overview: Generally speaking, Variational Inference (VI) is a statistical inference method that looks for the best tractable distribution to stand in for a target distribution whose density is intractable. In this way, an inference problem is converted into an optimization problem, and the best tractable distribution is the solution to that optimization problem. What tractable and intractable mean is explained later, in the section on Variational Inference. To thoroughly understand VI, you need to understand the following concepts one by one: 1. Entropy 2. Cross-Entropy 3. KL-Divergence (also called Relative Entropy). Now, let’s go through some basic questions to build up to Variational Inference.
Guess Ball Colors: Before we dive into the math, let’s play a simple game. A box contains balls in four colors: red, green, blue, and black. One ball is randomly selected from the box, and you need to work out its color by asking yes/no questions. What is your best strategy for finding the correct color with the fewest questions?
Scenario 1: Assume you have been told that the four colors appear in equal proportions. What is your guessing strategy?
My strategy is very simple. I only need to ask two questions, as shown in the following picture, to find the correct color. It turns out that this is one of the best strategies: with the given prior information, the minimal number of questions is two.
What if we use another strategy as follows?
The first strategy needs a fixed number of questions. No matter what the color is, we always ask two questions to get the correct color. In the second strategy, however, the number of questions depends on the selected color. Specifically, if the selected color is red, we need only one question to figure it out, but if the selected color is black, we need three. This raises a question: how do we compare the two strategies when the number of questions is not fixed for both of them?
In statistics, we can use the expected value to measure the average number of questions a strategy needs. Comparing the first strategy with the second therefore amounts to comparing their expected values.
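Here is a minimal sketch of that comparison in Python. The text above only fixes the second strategy's cost for red (one question) and black (three questions). The two questions for green and three for blue are my assumption, chosen to be consistent with the averages quoted in this article, and the names `STRATEGY_1`, `STRATEGY_2` and `expected_questions` are made up for illustration.

```python
# Number of questions each strategy needs per color.
# The green and blue counts of strategy 2 are assumptions (see the note above).
STRATEGY_1 = {"red": 2, "green": 2, "blue": 2, "black": 2}
STRATEGY_2 = {"red": 1, "green": 2, "blue": 3, "black": 3}


def expected_questions(color_probs, questions):
    """Average number of questions, weighted by how likely each color is."""
    return sum(color_probs[c] * questions[c] for c in color_probs)


uniform = {"red": 0.25, "green": 0.25, "blue": 0.25, "black": 0.25}
e1 = expected_questions(uniform, STRATEGY_1)
e2 = expected_questions(uniform, STRATEGY_2)
print(e1, e2)  # prints "2.0 2.25"
```

With equal proportions the first strategy averages 2 questions and the second 2.25, so the fixed two-question strategy wins here.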
Clearly, the first strategy is better than the second because its expected number of questions is smaller. This raises a new question: how can we be sure a strategy is optimal, and is there a way to work out the minimal number of questions needed to find the color?
The answer is yes! The key to this question is Entropy!
Q1. What is Entropy?
Simply speaking, the Entropy is the expected number of questions we need under the best strategy. Note that this means the best strategy, not a non-optimal one such as the second strategy above. (Strictly, the match is exact when every probability is a power of 1/2, as in this game; in general the Entropy is a lower bound on the expected number of yes/no questions.) The formula of Entropy is H(p) = -Σ_i p_i log2(p_i), where p_i is the probability of color i. With it, we can work out the minimal number of questions as long as the color distribution is given. However, what is the definition of Entropy and what is it used for?
Entropy, also known as Information Entropy, is a concept from Information Theory. It represents the uncertainty of a system, such as the uncertainty about the selected color in this game. The bigger the Entropy, the more uncertain the system is. The smallest possible Entropy is zero, which means the system is completely certain. For instance, if all balls in the box are red, the minimal number of questions you need to ask is zero: you don’t even need to guess, because any selected ball will be red. Conversely, as the number of equally likely colors in the box increases, the uncertainty about the selected color increases, and so does the Entropy of the game. From another angle, the Entropy also represents the minimum effort needed to eliminate the uncertainty of a system. In our case, the effort is the number of questions we ask. That’s why the Entropy is equal to the minimal number of questions needed to win the game above. Entropy can be measured in different units depending on the base of the logarithm. With a base-2 logarithm, the most common choice, Entropy is measured in bits; with the natural logarithm (base e) it is measured in nats; and with base 10 it is measured in dits.
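To make these properties concrete, here is a small sketch that computes the Entropy H(p) = Σ_i p_i log(1/p_i) for a few boxes. The function name `entropy` and the example boxes are my own choices for illustration.

```python
import math


def entropy(probs, base=2):
    """Shannon entropy; colors with zero probability contribute nothing."""
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)


four_colors = entropy([0.25, 0.25, 0.25, 0.25])   # about 2 bits
all_red = entropy([1.0, 0.0, 0.0, 0.0])           # 0: nothing to guess
eight_colors = entropy([0.125] * 8)               # about 3 bits: more colors, more uncertainty
in_nats = entropy([0.25] * 4, base=math.e)        # same game measured in nats
print(four_colors, all_red, eight_colors, in_nats)
```

Changing the base only rescales the value: the four-color game is about 2 bits, or about 1.386 nats.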
Q2. What is Cross-Entropy?
So far we know that the Entropy is the expected number of questions (effort) needed to find the selected color with the best strategy. A closely related term, Cross-Entropy, represents the expected number of questions (effort) needed to find the selected color with an arbitrary strategy.
Cross-Entropy is determined by two distributions. One is the distribution of colors in the box, denoted p_i; the other is the distribution implied by the strategy, denoted q_i, where a color that takes n questions to identify corresponds to q_i = 2^(-n). The formula is H(p, q) = -Σ_i p_i log2(q_i). The smaller the Cross-Entropy, the better your strategy is. The smallest Cross-Entropy is attained when your strategy distribution is exactly the same as the color distribution, that is, p_i = q_i, so the smallest Cross-Entropy is equal to the Entropy. This suggests a very intuitive idea: is there a name for the value of Cross-Entropy minus Entropy? Since this value is always nonnegative, we can think of it as a sort of distance between the two distributions. Does this idea make sense? Yes, this nonnegative quantity is exactly the KL-Divergence!
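A rough sketch of Cross-Entropy for the two strategies from Scenario 1 follows. It assumes that a strategy needing n questions for a color corresponds to q = 2^(-n) for that color, and that the second strategy needs 2 questions for green and 3 for blue (the article only states 1 for red and 3 for black). The helper names are hypothetical.

```python
import math


def cross_entropy(p, q):
    """H(p, q) in bits: average questions when colors follow p but the strategy matches q."""
    return sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)


def strategy_distribution(question_counts):
    """Assumed mapping: a color that takes n questions gets q = 2**-n."""
    return [2.0 ** -n for n in question_counts]


p_uniform = [0.25, 0.25, 0.25, 0.25]             # red, green, blue, black
q_first = strategy_distribution([2, 2, 2, 2])    # always two questions
q_second = strategy_distribution([1, 2, 3, 3])   # green/blue counts are assumed

h_first = cross_entropy(p_uniform, q_first)      # equals the Entropy of p_uniform
h_second = cross_entropy(p_uniform, q_second)    # larger: q does not match p
print(h_first, h_second)  # prints "2.0 2.25"
```

Under this mapping the Cross-Entropy is exactly the expected number of questions, which is why the strategy whose q matches p gives the smallest value.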
Q3. What is KL-Divergence?
Reconsider the ball color guessing game. The color distribution is also called the real distribution, denoted p. The distribution implied by the number of questions we need is also called the strategy distribution, denoted q. The formal mathematical definition of KL-Divergence, also known as Relative Entropy, is as follows: KL(p||q) = H(p, q) - H(p) = Σ_i p_i log2(p_i / q_i), and likewise KL(q||p) = H(q, p) - H(q) = Σ_i q_i log2(q_i / p_i).
KL(p||q) is the difference between the cross-entropy H(p, q) and the entropy H(p), assuming p is the real distribution (the color distribution in this case). Conversely, KL(q||p) is the difference between the cross-entropy H(q, p) and the entropy H(q), assuming q is the real distribution. The two are generally not equal, so KL-Divergence is not a true distance. With the above equations it is not difficult to compute KL(p||q) and KL(q||p). However, what is the best strategy when q is the color distribution? The next scenario gives the answer!
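As a quick numerical check, the sketch below computes both directions of the KL-Divergence directly from Σ_i p_i log2(p_i / q_i). The strategy distribution (1/2, 1/4, 1/8, 1/8) is the one I assume for the second strategy, and the two-outcome pair at the end is an extra made-up example.

```python
import math


def kl_divergence(p, q):
    """KL(p||q) in bits, computed as sum_i p_i * log2(p_i / q_i)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.25, 0.25, 0.25, 0.25]     # uniform color distribution
q = [0.5, 0.25, 0.125, 0.125]    # distribution assumed for the second strategy

kl_pq = kl_divergence(p, q)      # 2.25 - 2.00 = 0.25 bits
kl_qp = kl_divergence(q, p)      # 2.00 - 1.75 = 0.25 bits
print(kl_pq, kl_qp)

# A made-up pair where the two directions clearly differ.
a, b = [0.5, 0.5], [0.9, 0.1]
kl_ab = kl_divergence(a, b)
kl_ba = kl_divergence(b, a)
print(kl_ab, kl_ba)
```

For this particular game both directions happen to come out at 0.25 bits, which is a coincidence of the numbers. The second pair shows that KL(a||b) and KL(b||a) differ in general.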
Scenario 2: Assume you have been told that the proportions of the colors are Red=1/2, Green=1/4, Blue=1/8, Black=1/8. What is your best guessing strategy?
Let’s reuse the two strategies above and work out their expected numbers of questions. Notice that in this scenario the color distribution is 1/2:1/4:1/8:1/8 instead of 1/4:1/4:1/4:1/4.
Amazing! In this scenario, the first strategy is no longer the best; it is not even as good as the second strategy. The second strategy needs only 1.75 questions on average to find the selected color, while the first strategy still needs 2. This raises a good question: is the second strategy the best? The answer is also amazing: yes!
As mentioned in the Entropy section above, the minimal expected number of questions needed to find the selected color is equal to the Entropy. Thus, once we compute the Entropy, we will know whether the second strategy is optimal.
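The numbers for this scenario can be checked with a short sketch. As before, the second strategy's question counts for green (2) and blue (3) are my assumption, picked to agree with the 1.75 average quoted above.

```python
import math

p = {"red": 1 / 2, "green": 1 / 4, "blue": 1 / 8, "black": 1 / 8}
first = {"red": 2, "green": 2, "blue": 2, "black": 2}
second = {"red": 1, "green": 2, "blue": 3, "black": 3}  # green/blue counts assumed

# Entropy in bits: the lower bound on the average number of yes/no questions.
entropy_bits = sum(prob * math.log2(1.0 / prob) for prob in p.values())

# Average number of questions each strategy actually needs.
avg_first = sum(p[c] * first[c] for c in p)
avg_second = sum(p[c] * second[c] for c in p)
print(entropy_bits, avg_first, avg_second)  # prints "1.75 2.0 1.75"
```

The second strategy's average equals the Entropy, so no strategy can do better on this box.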
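Clearly, the second strategy already attains the minimal number of questions, 1.75! Now let’s draw a conclusion for this chapter: Entropy is the expected number of questions under the best strategy, Cross-Entropy is the expected number of questions under an arbitrary strategy, and KL-Divergence is the gap between the two.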
Q4. What is Variational Inference?
Finally, it is time to talk about Variational Inference. In Bayesian statistics, inference in probabilistic models is often intractable, meaning the required quantities cannot be computed exactly in a reasonable amount of time. We can use a family of tractable distributions to approximate the intractable distribution. By searching for the member of the family that is most similar, or closest, to the intractable distribution, we convert the inference problem into an optimization problem. This process is called variational inference. The measure of distance or dissimilarity between the two distributions is the KL-divergence.
Usually, in a Bayesian model, we have the observed variable X and the latent variable Z. The posterior is as follows: P(Z|X) = P(X, Z) / P(X) = P(X|Z) P(Z) / ∫ P(X, Z) dZ.
However, the marginalization over Z in the denominator is typically intractable, so the posterior is also intractable. VI uses a distribution q(Z|v) to replace P(Z|X), searching for the parameter v of q that minimizes the KL-divergence: v* = argmin_v KL(q(Z|v) || P(Z|X)).
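To see why the denominator is the hard part, here is a toy sketch that computes an exact posterior by brute force. Everything in it is hypothetical and chosen only for illustration: the noisy-OR model, the `prior` and `noise` values, and the function names. The point is that the evidence P(X) sums over every configuration of Z, and that count doubles with each extra binary latent variable.

```python
import itertools
import math


def joint(x, z, prior=0.3, noise=0.1):
    """Toy joint P(X=x, Z=z): each z_i ~ Bernoulli(prior); x is a noisy OR of z."""
    p_z = math.prod(prior if zi else 1.0 - prior for zi in z)
    or_z = int(any(z))
    p_x_given_z = 1.0 - noise if x == or_z else noise
    return p_x_given_z * p_z


def posterior(x, n_latent):
    """Exact P(Z | X=x): the evidence sums over all 2**n_latent states of Z."""
    states = list(itertools.product([0, 1], repeat=n_latent))
    evidence = sum(joint(x, z) for z in states)  # the denominator P(X)
    return {z: joint(x, z) / evidence for z in states}


post = posterior(x=1, n_latent=3)
print(len(post), sum(post.values()))  # 8 states; probabilities sum to about 1
```

Three latent variables need only 8 terms, but 50 would need 2**50, which is why an approximation such as q(Z|v) becomes attractive.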
The following picture intuitively shows the above optimization process.
Q5. How to find the best distribution to replace the posterior?
Based on the above picture, the optimization problem is simply to find the solution with the smallest KL-divergence. However, this KL-divergence still contains the intractable term p(z|x), so it is difficult to solve the optimization problem directly. Fortunately, we can convert minimizing the KL-divergence into maximizing the Evidence Lower Bound (ELBO), which does not contain any intractable term, because of the following formula: log p(x) = ELBO(v) + KL(q(z|v) || p(z|x)), where ELBO(v) = E_q[log p(x, z)] - E_q[log q(z|v)]. Since log p(x) does not depend on v, maximizing the ELBO is equivalent to minimizing the KL-divergence.
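The relationship log p(x) = ELBO(v) + KL(q(z|v) || p(z|x)) can be checked numerically on a model small enough to solve exactly. The sketch below is only an illustration under made-up assumptions: a single binary latent z, a hypothetical joint table `P_JOINT` for one fixed observation x, a Bernoulli family q(z=1|v) = v, and a simple grid search in place of a real optimizer.

```python
import math

# Hypothetical joint p(x, z) for one fixed observation x and a binary latent z.
P_JOINT = {0: 0.08, 1: 0.32}            # p(x, z=0), p(x, z=1)
EVIDENCE = sum(P_JOINT.values())        # p(x); easy here, intractable in real models
POSTERIOR = {z: pj / EVIDENCE for z, pj in P_JOINT.items()}


def q(v):
    """Variational family: Bernoulli with q(z=1 | v) = v."""
    return {0: 1.0 - v, 1: v}


def elbo(v):
    """E_q[log p(x, z)] - E_q[log q(z | v)]; uses only the joint, never the posterior."""
    return sum(qz * (math.log(P_JOINT[z]) - math.log(qz)) for z, qz in q(v).items())


def kl_to_posterior(v):
    """KL(q(z | v) || p(z | x)) in nats; needs the posterior, used here only as a check."""
    return sum(qz * math.log(qz / POSTERIOR[z]) for z, qz in q(v).items())


grid = [i / 100 for i in range(1, 100)]  # v in (0, 1), avoiding log(0)
best_by_elbo = max(grid, key=elbo)
best_by_kl = min(grid, key=kl_to_posterior)
print(best_by_elbo, best_by_kl)          # both searches pick v = 0.8
```

Both searches land on the same v, which here matches the exact posterior p(z=1|x) = 0.8, and `elbo(v) + kl_to_posterior(v)` equals `math.log(EVIDENCE)` for every v on the grid. In a real model only `elbo` would be computable, which is exactly why it is the quantity that gets optimized.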