Maximum Likelihood Estimation (MLE) in Python
In this tutorial, we will learn about an important topic in statistics: Maximum Likelihood Estimation. First, we will explore the theory, and then we will apply it in Python.
What is MLE?
Maximum Likelihood Estimation (MLE) is a method for estimating the parameters of a probability distribution. For example, take the normal distribution: its parameters are the mean and the standard deviation, and the location and width of the distribution depend on them. Given a dataset of observations drawn from such a distribution, how do you calculate these parameters? That is where estimation comes in.
Commonly, simple estimators such as the sample mean and the sample standard deviation are used, which compute these quantities directly from the dataset. The problem is that such formulas are specific to particular parameters and distributions: for other distributions or more complicated models there is no obvious formula to plug the data into, and an ad hoc estimator comes with no general guarantee of being the best one available. So, here comes our hero, MLE. It works on a simple principle: choose the parameter values under which the observed dataset is most probable. To measure how probable the given data is under different parameter values, the likelihood function is used.
If you still have trouble interpreting this, consider an example: take a coin and toss it 10 times. We don't know whether the coin is biased or unbiased, so we don't know the probability of getting a head or a tail. Suppose heads come up 6 times out of 10. The likelihood function tells us how likely it is to observe this outcome for different values of the probability of getting heads, and MLE picks the value for which the outcome is most likely.
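To make this concrete, here is a minimal illustrative sketch in Python (the grid of candidate probabilities and the use of scipy.stats.binom are choices made for this example): it evaluates the likelihood of seeing 6 heads in 10 tosses for a range of head probabilities and picks the value that maximizes it.

import numpy as np
from scipy.stats import binom

# Observed data: 6 heads out of 10 tosses
n_tosses, n_heads = 10, 6

# Candidate values for the probability of getting a head
p_grid = np.linspace(0.01, 0.99, 99)

# Likelihood of the observed outcome under each candidate probability
likelihoods = binom.pmf(n_heads, n_tosses, p_grid)

# The candidate probability that maximizes the likelihood
best_p = p_grid[np.argmax(likelihoods)]
print(f"Most likely probability of heads: {best_p:.2f}")  # about 0.60

As expected, the likelihood is highest when the probability of heads equals the observed proportion, 0.6.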
Mathematics
Suppose you have a dataset, X = {x1, x2, x3, …, xn}, consisting of n independent and identically distributed observations from some probability distribution with parameter Θ. The likelihood function L(Θ|X) is defined as

L(Θ|X) = P(x1|Θ) · P(x2|Θ) · … · P(xn|Θ) = ∏i P(xi|Θ),
where P(xi|Θ) is the probability of observing xi given the parameter Θ. For continuous probability distributions, this probability is replaced by the probability density function (pdf), and the product becomes a product of densities. In practice, we work with the log-likelihood function

log L(Θ|X) = log P(x1|Θ) + log P(x2|Θ) + … + log P(xn|Θ) = Σi log P(xi|Θ),

because the product of many probabilities (or densities) becomes extremely small and can underflow to zero in floating-point arithmetic, whereas the sum of logarithms is numerically stable and easier to differentiate.
Now, since the logarithm is a strictly increasing function, the parameter values that maximize the log-likelihood are the same as those that maximize the likelihood itself. We can therefore simply maximize the log-likelihood function with respect to Θ and obtain the same estimated parameters.
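To see why working on the log scale matters numerically, here is a small sketch (the sample size of 5000 and the standard normal data are arbitrary choices for demonstration): the raw product of densities underflows to zero, while the sum of log densities remains a well-behaved number.

import numpy as np
from scipy.stats import norm

np.random.seed(0)
x = np.random.normal(0, 1, 5000)  # 5000 standard normal observations

# Raw likelihood: the product of 5000 densities underflows to 0.0
raw_likelihood = np.prod(norm.pdf(x, 0, 1))

# Log-likelihood: the sum of log densities stays finite
log_likelihood = np.sum(norm.logpdf(x, 0, 1))

print(raw_likelihood)   # 0.0 because of floating-point underflow
print(log_likelihood)   # a finite negative number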
Take the example of the normal probability distribution function:
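As a sketch of the standard derivation (written here in LaTeX notation), the normal pdf, the log-likelihood of n observations, and the estimates obtained by setting the partial derivatives with respect to \mu and \sigma to zero are:

f(x \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)

\log L(\mu, \sigma \mid X) = -\frac{n}{2} \log(2\pi) - n \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2

\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2

In other words, for the normal distribution the maximum likelihood estimates are the sample mean and the (1/n) sample variance.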
Python code: Maximum Likelihood Estimation
Let's estimate the parameters of a normal distribution. First, we will generate data from a normal distribution with known parameters, and then we will estimate those parameters back from the data.
Make sure you have the scipy library installed on your system. If not, install it with the command below:
pip install scipy
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Generating sample data according to the normal pdf
np.random.seed(0)
true_mu = 0
true_sigma = 1
data = np.random.normal(true_mu, true_sigma, 1000)

# Defining the negative log-likelihood function
def neg_log_likelihood(params):
    mu, sigma = params[0], abs(params[1])  # abs() keeps sigma positive during the search
    likelihood = -np.sum(norm.logpdf(data, mu, sigma))
    return likelihood

# Initial guesses for mu and sigma
initial_guess = [0, 1]

# Minimize the negative log-likelihood
result = minimize(neg_log_likelihood, initial_guess, method='L-BFGS-B')
estimated_mu, estimated_sigma = result.x

print(f"Estimated mu: {estimated_mu}")
print(f"Estimated sigma: {estimated_sigma}")
Output
Estimated mu: -0.04525564977107271
Estimated sigma: 0.9870333422579031
Code Explanation
Since the scipy library does not provide a function to maximize an objective directly, we negate the log-likelihood and minimize it instead; minimizing the negative log-likelihood is equivalent to maximizing the log-likelihood. We use 0 and 1 as the initial guesses for the mean and standard deviation, respectively.
L-BFGS-B is a quasi-Newton optimization method for minimizing smooth non-linear functions, and it also supports simple bound constraints on the parameters (the "B" in its name).
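As a quick sanity check, here is a small sketch (reusing the same seed and data-generation step as the script above) that compares the optimizer's output with the closed-form maximum likelihood estimates for the normal distribution, namely the sample mean and the 1/n (biased) sample standard deviation. The two sets of values should agree up to the optimizer's tolerance.

import numpy as np

# Same seed and data-generation step as in the script above
np.random.seed(0)
data = np.random.normal(0, 1, 1000)

# Closed-form MLE for the normal distribution
closed_form_mu = np.mean(data)
closed_form_sigma = np.std(data)  # np.std uses 1/n by default, matching the MLE

print(f"Closed-form mu: {closed_form_mu}")
print(f"Closed-form sigma: {closed_form_sigma}")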