Statistics Module in Python with Examples

Hello everyone. In this tutorial, we’ll learn about the statistics module in Python, which provides functions for statistical operations on real-valued numerical data, such as finding the mean, median, mode, variance, and standard deviation. As this module is part of the standard library, we don’t need to install it. Let us start this tutorial by importing the required modules.

Statistics Module in Python

Our first step is to import the module so that we can work with it.

Importing Modules

For statistical operations, we use the statistics module. To work with fractions, we also import the Fraction class from the fractions module.

import statistics
from fractions import Fraction as F

Calculating Mean using Statistics Module

In this section, we’ll see how to calculate various means of our data, including the arithmetic mean, the harmonic mean, and the geometric mean. Let us look at each of them.

Arithmetic Mean using mean() function

It is the mean, or average, that we generally calculate by dividing the sum of all data points by the total number of data points. For example, for 3 data points a, b, c the arithmetic mean is

A.M. = (a + b + c)/3

int_list = [54,24,36.09,55.37,92] # int and float types
f_list = [F(1,2),F(3,4),F(5,7)] # fraction values (Num.,Den.)

print("A.M. of int_list is: ",statistics.mean(int_list))
print("A.M. of f_list is: ",statistics.mean(f_list))

The output of the above code is

A.M. of int_list is:  52.292
A.M. of f_list is:  55/84
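As a quick check of the formula above, the arithmetic mean can also be computed by hand and compared with mean(). This is only a minimal sketch: the helper manual_mean and the sample list are illustrative names, not part of the statistics module.

```python
import statistics
from fractions import Fraction

def manual_mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)

sample = [2, 4, 9]  # hypothetical data chosen so the mean is a whole number
fraction_sample = [Fraction(1, 2), Fraction(3, 4), Fraction(5, 7)]

# Both approaches agree on plain numbers (the mean is 5).
print(manual_mean(sample), statistics.mean(sample))

# With Fraction input the result stays an exact Fraction (55/84).
print(manual_mean(fraction_sample), statistics.mean(fraction_sample))
```

Because Fraction arithmetic is exact, no rounding error is introduced when the input consists only of fractions.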

Harmonic Mean using harmonic_mean()

It is the reciprocal of the arithmetic mean of the reciprocals of the data. For example, for 3 data points a, b, c the harmonic mean is

H.M. = 3/(1/a + 1/b + 1/c)

The harmonic means of the two lists used in the arithmetic mean section are calculated as follows.

print("H.M. of int_list is: ",statistics.harmonic_mean(int_list))
print("H.M. of f_list is: ",statistics.harmonic_mean(f_list))

H.M. of int_list is:  42.799579237355836
H.M. of f_list is:  45/71
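The H.M. formula above can be written out directly and compared with harmonic_mean(). The sketch below uses a hypothetical helper, manual_harmonic_mean, and also includes the classic average-speed case, where the harmonic mean is the appropriate average for rates.

```python
import math
import statistics
from fractions import Fraction

def manual_harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    return len(values) / sum(1 / v for v in values)

fraction_data = [Fraction(1, 2), Fraction(3, 4), Fraction(5, 7)]

# Exact arithmetic: 3 / (2 + 4/3 + 7/5) = 45/71
print(manual_harmonic_mean(fraction_data))
print(statistics.harmonic_mean(fraction_data))

# Driving the same distance at 40 km/h and then at 60 km/h gives an
# average speed of 48 km/h, not the arithmetic mean of 50 km/h.
speeds = [40, 60]
print(statistics.harmonic_mean(speeds))
```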

Geometric Mean using geometric_mean()

This type of mean shows the central tendency of the data points using their product rather than their sum: it is the n-th root of the product of the n data points. For example, for 3 data points a, b, c, the formula for the geometric mean is ³√(a * b * c). The function converts the data to floats, so the result is a float even for Fraction input.

print("G.M. of int_list is: ",statistics.geometric_mean(int_list))
print("G.M. of f_list is: ",statistics.geometric_mean(f_list))

Note: This function is available from Python version 3.8 onwards.
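To make the n-th root formula concrete, here is a small sketch that computes the geometric mean by hand and compares it with geometric_mean(). The helper manual_geometric_mean and the sample data are illustrative assumptions; math.prod, like geometric_mean(), requires Python 3.8 or later.

```python
import math
import statistics

def manual_geometric_mean(values):
    """n-th root of the product of the n values."""
    return math.prod(values) ** (1 / len(values))

sample = [2, 4, 8]  # product is 64, and the cube root of 64 is 4

manual = manual_geometric_mean(sample)
library = statistics.geometric_mean(sample)

# Both values are approximately 4.0; compare with a tolerance because
# floating-point roots are not always exact.
print(manual, library)
print(math.isclose(manual, library))
```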

Floating-Point Arithmetic Mean using fmean()

It is similar to mean() but faster and always returns the output in a floating-point type.

print("fmean() of int_list is: ",statistics.fmean(int_list))
print("fmean() of f_list is: ",statistics.fmean(f_list))

Note: This function is available from Python version 3.8 onwards.
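The difference in return type between mean() and fmean() is easiest to see with Fraction input. The following sketch, which reuses the fraction values from earlier under the illustrative name fraction_data, prints the type of each result.

```python
import math
import statistics
from fractions import Fraction

fraction_data = [Fraction(1, 2), Fraction(3, 4), Fraction(5, 7)]

exact = statistics.mean(fraction_data)    # keeps the Fraction type
approx = statistics.fmean(fraction_data)  # always a float

print(type(exact).__name__, exact)
print(type(approx).__name__, approx)

# The float is the nearest floating-point value to the exact result.
print(math.isclose(approx, float(exact)))
```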

Calculating Median using Statistics Module

In this section, we’ll see how to calculate the median of our data, that is, the middle value of the data points when they are sorted.

Actual Median using median()

This function gives us the actual median of the data points. By actual, we mean that the returned value may or may not be one of the data points: when the number of data points is even, it is the average of the two middle values.

list_1 = [10,20,30,40,50]
list_2 = [10,50]

print("median of list_1 is: ",statistics.median(list_1))
print("median of list_2 is: ",statistics.median(list_2))

The output of the above code will be

median of list_1 is:  30
median of list_2 is:  30.0

Note: list_2 has only two values, so both of them are middle values and the median is their average, (10 + 50) / 2 = 30.0, which is not itself in the list.
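The rule behind median() can be sketched in a few lines: sort the data, then take the middle value or the average of the two middle values. The helper manual_median and the two sample lists below are hypothetical and only serve as an illustration.

```python
import statistics

def manual_median(values):
    """Middle value of the sorted data; average of the two middle
    values when the number of data points is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_data = [50, 10, 40, 20, 30]   # unsorted on purpose
even_data = [10, 20, 30, 40]

print(manual_median(odd_data), statistics.median(odd_data))    # 30 30
print(manual_median(even_data), statistics.median(even_data))  # 25.0 25.0
```

Note that the input does not need to be sorted beforehand; median() sorts a copy of the data itself.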

Low Median and High Median

The low median and the high median are always actual values from the data, never interpolated ones. When the number of data points is odd, both are equal to the middle value, i.e. the actual median. When the number of data points is even, the low median is the smaller of the two middle values and the high median is the larger.

Calculating Low median using median_low()

print("Low median of list_1 is: ",statistics.median_low(list_1))
print("Low median of list_2 is: ",statistics.median_low(list_2))

Running the above code gives the following output.

Low median of list_1 is:  30
Low median of list_2 is:  10

Calculating High median using median_high()

print("High median of list_1 is: ",statistics.median_high(list_1))
print("High median of list_2 is: ",statistics.median_high(list_2))

This code will generate the following output.

High median of list_1 is:  30
High median of list_2 is:  50
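A list with four elements shows the three median functions side by side. This is a small sketch with made-up data (even_data), where the two middle values are 20 and 30.

```python
import statistics

even_data = [10, 20, 30, 40]

low = statistics.median_low(even_data)    # smaller middle value
high = statistics.median_high(even_data)  # larger middle value
mid = statistics.median(even_data)        # average of the two

print(low, high, mid)  # 20 30 25.0
```

The low and high medians are useful when the result must be a value that actually occurs in the data, for example with discrete measurements.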

Calculating Mode using Statistics Module

The mode is the most common element in discrete or nominal (non-numeric) data. If two or more elements share the highest frequency, the first one encountered in the data is returned as the mode (Python 3.8 and later).

Mode using mode()

This function takes the data and returns a single value, the mode. In Python versions before 3.8, this function raises StatisticsError if the data contains more than one mode; from 3.8 onwards it returns the first mode encountered. It also raises StatisticsError if the data is empty. See the example below.

s_mode = [0,2,2,4,1,5,5,5,0] 
print(statistics.mode(s_mode))

The output of the above code will be

5
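To illustrate the tie-breaking rule and the error case, here is a short sketch that assumes Python 3.8 or later. The list tied_colors is hypothetical data in which two values share the highest frequency.

```python
import statistics

# 'red' and 'blue' both occur twice; 'red' is encountered first.
tied_colors = ['red', 'blue', 'blue', 'red', 'green']
first_mode = statistics.mode(tied_colors)
print(first_mode)  # red (on Python 3.8+)

# Empty data has no mode, so StatisticsError is raised.
try:
    statistics.mode([])
    raised = False
except statistics.StatisticsError as exc:
    raised = True
    print("Empty data:", exc)
```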

MultiModes using multimode()

This function returns a list of all the modes in the data, in the order they were first encountered, unlike mode(), which returns only a single mode. This function is new in Python version 3.8. Try running the code below, where we find the modes of a nominal list.

lst_mode = ['a','b','c','b','c','c','b','a','d','z']
print(statistics.multimode(lst_mode))
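In the list above, 'b' and 'c' each occur three times, so both are returned. The sketch below repeats that example and adds two edge cases; the variable names are illustrative only.

```python
import statistics

lst_mode = ['a', 'b', 'c', 'b', 'c', 'c', 'b', 'a', 'd', 'z']
modes = statistics.multimode(lst_mode)
print(modes)  # ['b', 'c']

# When every value occurs once, every value is a mode.
all_unique = statistics.multimode([3, 1, 2])
print(all_unique)  # [3, 1, 2]

# Empty data gives an empty list instead of raising an error.
no_modes = statistics.multimode([])
print(no_modes)  # []
```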

Calculating Measure of Spread using Statistics Module

These functions calculate a measure of how much the population or sample tends to deviate from the average value.

Variance using variance() and pvariance()

Also known as the second moment about the mean, variance measures the spread of the data: a small variance indicates that the data points are clustered closely around the mean, while a large variance indicates that they are spread far from it. variance() returns the sample variance of the data while pvariance() returns the population variance of the data.

data = [0.1, 0.2, 0.2, 0.4, 0.3, 0.3, 1.8, 1.2, 1.0] 
data_mean = statistics.mean(data)
print("pvariance of data is: ",statistics.pvariance(data,data_mean))
print("variance of data is: ",statistics.variance(data,data_mean))

The optional second parameter of these functions is the mean of the data (named mu in pvariance() and xbar in variance()). It defaults to None, in which case the mean is calculated automatically.

pvariance of data is:  0.3054320987654321
variance of data is:  0.3436111111111111
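The only difference between the two functions is the divisor: the population variance divides the sum of squared deviations by n, while the sample variance divides it by n - 1. The following sketch spells this out with two hypothetical helpers and made-up data whose mean is exactly 5.

```python
import math
import statistics

def manual_pvariance(values):
    """Population variance: squared deviations divided by n."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / len(values)

def manual_variance(values):
    """Sample variance: squared deviations divided by n - 1."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - 1)

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # sum of squared deviations is 32

# Population variance: 32 / 8 = 4
print(manual_pvariance(sample), statistics.pvariance(sample))

# Sample variance: 32 / 7, roughly 4.571
print(manual_variance(sample), statistics.variance(sample))
```

The sample variance is always slightly larger than the population variance for the same data, which matches the outputs shown above.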

Standard Deviation using stdev() and pstdev()

These functions return the standard deviation of the data. stdev() returns the sample standard deviation (the square root of the sample variance) while pstdev() returns the population standard deviation (the square root of the population variance). Let us see an example using the data we used while finding the variance.

print("pstdev of data is: ",statistics.pstdev(data))
print("stdev of data is: ",statistics.stdev(data))

pstdev of data is:  0.5526591162420394
stdev of data is:  0.5861835131689658
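The square-root relationship between standard deviation and variance can be checked directly. This sketch uses made-up data (sample) and compares with a tolerance, since floating-point square roots may differ in the last digit.

```python
import math
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # population variance is 4

pop_sd = statistics.pstdev(sample)
sample_sd = statistics.stdev(sample)

# Each standard deviation is the square root of the matching variance.
print(math.isclose(pop_sd, math.sqrt(statistics.pvariance(sample))))
print(math.isclose(sample_sd, math.sqrt(statistics.variance(sample))))

print(pop_sd)     # approximately 2.0
print(sample_sd)  # approximately 2.138
```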

We hope you liked this tutorial. If you have any questions, feel free to leave a comment below.