1. Introduction

In this tutorial, we’ll explain the mathematics and intuition behind the family of Beta distributions in statistics and analyze their shapes.

2. Intuition

Let’s say we flip a fair coin 10 times and bet on tails with our friend. Since heads and tails are equally likely each time, our win probability in each toss is 1/2.

Then, the number of tails in 10 flips follows the binomial distribution centered at (1/2) * 10 = 5:

Binomial distribution

From there, we can derive how much we can expect to win in this game and decide how much money to bet.
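To make this concrete, here's a minimal sketch with SciPy (the library choice is an assumption, not part of the original setup): the count of tails in 10 fair flips has mean 5, which is also its most likely value.

```python
# A minimal sketch with SciPy: the number of tails in 10 fair flips
# follows Binomial(10, 1/2); its mean and most likely count are both 5.
from scipy.stats import binom

n, p = 10, 0.5
expected_tails = float(binom.mean(n, p))  # n * p
most_likely = max(range(n + 1), key=lambda k: binom.pmf(k, n, p))
print(expected_tails, most_likely)  # 5.0 5
```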

However, what if the question is reversed? Say we flip the coin 10 times and win eight of the bets. We may be delighted with the financial gain, but our friend, not so much, so we get accused of using a biased coin.

To resolve the dispute, we must determine the coin’s inherent probability of landing tails in a random toss. This is precisely what Beta distributions can model.

The Beta distribution with parameters \boldsymbol{a} and \boldsymbol{b} shows how much each \boldsymbol{x \in [0, 1]} is likely as the success probability, given that there were \boldsymbol{a-1} successful and \boldsymbol{b-1} unsuccessful trials.
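In the coin dispute above, eight wins and two losses correspond to a = 9 and b = 3. A quick SciPy sketch confirms that the resulting density peaks at the observed success rate of 8/10:

```python
# Sketch: Beta(9, 3) models the success probability after 8 successful
# and 2 unsuccessful trials; its density peaks at the observed rate 0.8.
import numpy as np
from scipy.stats import beta

a, b = 9, 3
x = np.linspace(0.001, 0.999, 999)
peak = x[np.argmax(beta.pdf(x, a, b))]
print(round(peak, 2))  # 0.8
```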

3. Density

Beta-distributed random variables are defined over [0, 1] and have the following density:

    [f(x; a, b) = C x^{a-1} (1-x)^{b-1} \quad 0 \leq x \leq 1 \text{ and } a,b > 0]

The constant C ensures that the density integrates to 1 over [0, 1], i.e., that the cumulative distribution function equals 1 at x=1:

    [1 = \int_{0}^{1}f(x; a, b)dx = C \int_{0}^{1} x^{a-1} (1-x)^{b-1} dx = C \cdot B(a, b) \implies C = \frac{1}{B(a, b)}]

where B(a, b) is the beta function:

    [B(a, b) = \frac{\Gamma(a) \Gamma(b)}{\Gamma(a + b)} \qquad \Gamma(u) = \int_{0}^{\infty}t^{u-1}e^{-t}dt]

Therefore, the density is:

    [f(x; a, b) = \frac{1}{B(a, b)}x^{a-1}(1-x)^{b-1}]
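As a sanity check, we can compare this closed form against SciPy's implementation for a few parameter choices (a quick verification sketch, not part of the derivation):

```python
# Check: the closed-form density x^(a-1) (1-x)^(b-1) / B(a, b)
# matches SciPy's beta.pdf for a few (x, a, b) combinations.
import math
from scipy.stats import beta as beta_dist
from scipy.special import beta as beta_fn

def beta_pdf(x, a, b):
    """Density from the formula above."""
    return x ** (a - 1) * (1 - x) ** (b - 1) / beta_fn(a, b)

checks = [(0.3, 2, 5), (0.5, 0.5, 0.5), (0.8, 9, 3)]
ok = all(math.isclose(beta_pdf(x, a, b), beta_dist.pdf(x, a, b))
         for x, a, b in checks)
print(ok)  # True
```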

3.1. Why Is There -1?

Essentially, the -1 in the exponents comes from the -1 in the integrand of the Gamma function.

We can try to find some intuition in it using measure theory.

The CDF of the Beta distribution with parameters a and b is:

    [F(t; a, b) = \int_{0}^{t}f(x; a, b)dx = \int_{0}^{t}\frac{1}{B(a, b)}x^{a-1}(1-x)^{b-1}dx]

Let’s rewrite the -1 in the exponents by moving a factor of x and 1-x to the denominator:

    [F(t; a, b) = \int_{0}^{t}\frac{1}{B(a, b)}\frac{x^{a}}{x}\frac{(1-x)^{b}}{1-x}dx = \int_{0}^{t}\frac{1}{B(a, b)}x^{a}(1-x)^{b}\frac{dx}{x(1-x)}]

Now, we have:

    [\frac{dx}{x(1-x)} = d\left( \log \frac{x}{1-x} \right)  = d\mu(x) \text{ for } \mu(x) = \log\frac{x}{1-x}]

As a result, we can transform the CDF to:

    [\int_{0}^{t}\frac{1}{B(a, b)}g(x; a, b)d\mu(x) \quad g(x; a, b)=x^{a}(1-x)^{b}]

where the density g doesn’t have -1 in the exponents, and a and b are the numbers of successful and unsuccessful trials.

Intuitively, if we weigh each probability x with the logarithm of the corresponding odds ratio, we can use this interpretation of a and b. In more technical terms, the density g is defined with respect to the measure \mu.
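We can verify this change of measure numerically (a quick SciPy sketch): integrating g against \mu over (0, 1) indeed recovers B(a, b).

```python
# Numerical check of the change of measure: integrating
# g(x) = x^a (1-x)^b against dmu(x) = dx / (x(1-x)) yields B(a, b).
from scipy.integrate import quad
from scipy.special import beta as beta_fn

a, b = 3, 2
integral, _ = quad(lambda x: x**a * (1 - x)**b / (x * (1 - x)), 0, 1)
print(abs(integral - beta_fn(a, b)) < 1e-9)  # True
```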

3.2. Non-Integer Parameters

The parameters a and b can be non-integers. However, the intuitive explanation was that a-1 and b-1 denote the numbers of successful and unsuccessful trials (or a and b if we use the measure \mu). How do we interpret a fractional a or b?

Sometimes, the boundary between success and failure is clear-cut. For example, an experiment (a trial) can have several goals. Achieving some while failing at others constitutes partial success. To allow for this nuanced approach to evaluation, we use non-integers a and b.

4. Properties

Let’s now check some properties of this distribution family.

4.1. Mean

The mean of a Beta distribution with parameters a and b is:

    [\int_{0}^{1}xf(x; a, b) dx= \frac{1}{B(a, b)}\int_{0}^{1}x^{a}(1-x)^{b-1}dx = \frac{B(a+1, b)}{B(a, b)}]

To simplify the expression, we’ll write B(a, b) using the Gamma function \Gamma and note that \Gamma(u+1) = u \Gamma(u):

    [\frac{B(a+1, b)}{B(a, b)} = \frac{\frac{\Gamma(a+1) \Gamma(b)}{\Gamma(a+b+1)}}{\frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}} = \frac{\frac{a \Gamma(a)}{(a+b)\Gamma(a+b)}}{\frac{\Gamma(a)}{\Gamma(a+b)}} = \frac{a}{a+b}]

If \boldsymbol{a=b}, the mean is 1/2. If \boldsymbol{a>b}, the distribution’s center is shifted to the right, and if \boldsymbol{a<b}, to the left.

This has an intuitive explanation. If there are many successful outcomes, it makes sense to believe that the probability of success is higher and vice versa.
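A quick check with SciPy confirms the closed-form mean for several parameter choices (a sketch for verification only):

```python
# Check: the closed-form mean a / (a + b) agrees with SciPy's beta mean.
import math
from scipy.stats import beta

ok = all(math.isclose(float(beta.mean(a, b)), a / (a + b))
         for a, b in [(2, 5), (9, 3), (0.5, 0.5)])
print(ok)  # True
```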

4.2. Variance

We can similarly compute the variance:

    [\frac{ab}{(a+b)^2 (a+b+1)}]

The larger a and b, the smaller the variance. That is also intuitive. The more experiments we conduct, the more we know about the success probability, so the distribution we use as its model should be less variable.
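We can verify both the formula and this shrinking behavior numerically (holding the mean at 1/2 by keeping a = b):

```python
# Check: variance ab / ((a+b)^2 (a+b+1)) matches SciPy, and it
# shrinks as a and b grow with the mean fixed at 1/2.
import math
from scipy.stats import beta

def beta_var(a, b):
    return a * b / ((a + b) ** 2 * (a + b + 1))

ok = math.isclose(float(beta.var(4, 6)), beta_var(4, 6))
shrinks = beta_var(2, 2) > beta_var(10, 10) > beta_var(50, 50)
print(ok, shrinks)  # True True
```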

4.3. Skewness

The skewness of a distribution quantifies its deviation from symmetry. In the case of a beta distribution with shape parameters a and b, the skewness is:

    [\frac{2(b-a)\sqrt{a+b+1}}{(a+b+2)\sqrt{ab}}]

So, for a=b, the distribution is symmetric; it’s right-skewed for b>a and left-skewed for a>b.

This also has an intuitive explanation. If the number of successful trials equals the number of unsuccessful ones, there are no grounds to believe the true success probability is more likely to be > 1/2 than < 1/2. A symmetric distribution fits this assertion.

By the same logic, if a>b, successful trials are a majority, so it’s reasonable to believe that the true success probability is > 1/2. The right model for this assertion is a distribution centered around a value > 1/2. However, the remaining tail stretching to 0 makes the distribution left-skewed. The converse holds for a<b.
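We can check both the formula and the sign pattern against SciPy's third standardized moment (a verification sketch):

```python
# Check: the skewness formula vs SciPy, and its sign:
# positive (right-skew) for b > a, negative (left-skew) for a > b.
import math
from scipy.stats import beta

def beta_skew(a, b):
    return 2 * (b - a) * math.sqrt(a + b + 1) / ((a + b + 2) * math.sqrt(a * b))

ok = math.isclose(float(beta.stats(2, 5, moments="s")), beta_skew(2, 5))
signs = beta_skew(2, 5) > 0 and beta_skew(5, 2) < 0 and beta_skew(3, 3) == 0
print(ok, signs)  # True True
```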

4.4. Kurtosis

The formula for the excess kurtosis is a bit more complex:

    [\frac{6\left((a-b)^2(a+b+1) - ab(a+b+2) \right)}{ab(a+b+2)(a+b+3)}]

Negative values indicate tails lighter than those of the normal distribution, and positive values indicate heavier tails. The exact effect on the shape depends on the values of other moments (that are, in turn, defined by a and b).
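For instance, the symmetric Beta(2, 2) has excess kurtosis -6/7, i.e., tails lighter than the normal's. A short SciPy sketch confirms the formula:

```python
# Check: the excess kurtosis formula vs SciPy's fourth standardized
# moment; for Beta(2, 2) it is negative (lighter tails than normal).
import math
from scipy.stats import beta

def beta_excess_kurtosis(a, b):
    num = 6 * ((a - b) ** 2 * (a + b + 1) - a * b * (a + b + 2))
    return num / (a * b * (a + b + 2) * (a + b + 3))

k = float(beta.stats(2, 2, moments="k"))
ok = math.isclose(k, beta_excess_kurtosis(2, 2))
print(ok, beta_excess_kurtosis(2, 2))  # True -0.857...
```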

4.5. Mode

The mode of a distribution is its most likely value, i.e., the value with the highest density.

So, to compute it, we need to find the x \in [0, 1] that maximizes the density f(x; a, b). Setting the first derivative of f(x; a, b) to zero and solving for x, we get that the mode (for a, b > 1, where the critical point is a maximum) is:

    [\frac{a-1}{a+b-2}]

For a symmetric distribution, a=b, and the mode is equal to the mean:

    [\frac{a-1}{a+a-2}=\frac{a-1}{2(a-1)}=\frac{1}{2}]
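A numerical sketch confirms the closed-form mode for a, b > 1 by maximizing the density on a grid:

```python
# Check: for a, b > 1, the density's maximizer on a fine grid
# agrees with the closed form (a - 1) / (a + b - 2).
import numpy as np
from scipy.stats import beta

a, b = 3, 5
x = np.linspace(0.001, 0.999, 999)
numeric_mode = x[np.argmax(beta.pdf(x, a, b))]
closed_form = (a - 1) / (a + b - 2)  # = 1/3
print(abs(numeric_mode - closed_form) < 1e-2)  # True
```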

5. Shapes

Depending on the values of a and b, the Beta density can take many shapes.

5.1. Symmetric Shapes

Symmetric shapes have a=b, and we differentiate between three cases:

Symmetric Beta distributions

The special case a=b=1 corresponds to the uniform distribution.

If a, b < 1, the distribution is U-shaped, and if a, b > 1, it’s bell-shaped and approaches the normal distribution as a and b increase:

Approaching normality

There will be two inflection points if a, b > 2.
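We can sketch this convergence numerically: the largest pointwise gap between the CDF of Beta(a, a) and that of a normal with the same mean and variance shrinks as a grows (the specific parameter values are an arbitrary illustration):

```python
# Sketch: Beta(a, a) approaches the normal with matching mean and
# variance as a grows; the maximum CDF gap shrinks.
import math
import numpy as np
from scipy.stats import beta, norm

def max_cdf_gap(a):
    x = np.linspace(0, 1, 1001)
    sd = math.sqrt(a * a / ((2 * a) ** 2 * (2 * a + 1)))  # variance formula above
    return float(np.max(np.abs(beta.cdf(x, a, a) - norm.cdf(x, 0.5, sd))))

gaps = [max_cdf_gap(a) for a in (2, 10, 50)]
print(gaps[0] > gaps[1] > gaps[2])  # True
```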

5.2. Asymmetric Shapes

For asymmetric shapes, b>a corresponds to right-skewed, and a>b to left-skewed distributions.

If both a, b < 1, the distribution will be convex, approaching an L-shape (reversed or not) as the larger parameter approaches 1:

a < b <= 1 or b < a <= 1

If a, b > 1, the distribution will be unimodal, and the tail heaviness will decrease as the parameters’ difference grows. There will be one inflection point if one parameter is >2 and two inflection points if both are >2:

a, b > 1

If a < 1 and b > 1 or if a > 1 and b < 1, the shape will be convex or with one inflection point:

(a<1 and b > 1) or (a>1 and b<1)

There will be an inflection point if the parameter greater than 1 is less than 2; otherwise, the shape stays convex.

The last remaining cases are a=1, b > 1 and a>1, b=1:

(a=1 and b>1) or (a>1 and b=1)

We have a straight line if the larger parameter equals two, a concave curve if it’s <2, and a convex one if it’s >2.
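The straight-line case is easy to verify with SciPy: for a = 1 and b = 2, the density reduces to f(x) = 2(1-x), since B(1, 2) = 1/2.

```python
# Check: Beta(1, 2) has density x^0 (1-x)^1 / B(1, 2) = 2(1 - x),
# a straight line from 2 down to 0.
import math
from scipy.stats import beta

ok = all(math.isclose(beta.pdf(x, 1, 2), 2 * (1 - x))
         for x in [0.1, 0.25, 0.5, 0.9])
print(ok)  # True
```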

6. Conclusion

In this article, we discussed the family of Beta distributions in statistics. These distributions are defined over [0, 1] and can take many shapes, making them suitable for modeling normalized quantities (such as probabilities).

