Association Rule Mining in Python

Hello everyone! In this tutorial, we'll be learning about Association Rule Mining (ARM) in Python and doing some hands-on practice on a dataset. We will use the Apriori algorithm and look at its components. Let us start this tutorial with a brief introduction to association rule mining.

What is Association Rule Mining and what are its benefits?

Association Rule Mining is a process that uses machine learning to analyze data for patterns, co-occurrences, and relationships between different attributes or items of a dataset. In the real world, association rule mining is useful, in Python as well as in other programming languages, for item clustering, store layout, and market basket analysis.

Association rules have two parts, an antecedent (if) and a consequent (then), forming an if-then association that occurs frequently in the dataset.

For example, {Bread} => {Milk} can be an association in a supermarket. This relation implies that if (antecedent) a person buys Bread, then (consequent) the customer will most probably buy Milk as well. There can be many such relations between itemsets, and they can be used to plan the layout of the store so that customers do not have to go far to look for every product. To increase sales, these products can be given combined discounts, and there are many other ways these associations are helpful.

For this tutorial, we'll be using a dataset that contains a list of 20 orders, including the names of the order items. You can download the dataset by clicking here. The dataset will look like this.

[Image: CSV dataset]

There are many algorithms that mine association rules, such as AIS, SETM, and Apriori. The Apriori algorithm is the most widely used of these, and it is the one we will use in our code. Now let us import the necessary modules and modify our dataset to make it usable.

Importing and Modifying the Dataset

Here we are going to understand association rule mining with the help of the apyori Python library. So let's continue reading…

Install the apyori library from the command line by running the following pip command, then import the required modules.

pip install apyori

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from apyori import apriori

Now, let us import the data and apply some modifications to the data. Go through the code below.

# Read the orders file; header=None tells pandas the file has no header row.
data = pd.read_csv(r"D:\datasets(june)\order_data.csv", delimiter=" ", header=None)
data.head()

The parameter delimiter=" " splits the entries of a row wherever whitespace is encountered, and header=None prevents pandas from taking the first row as the header, so default integer column labels are used instead. After this, our data frame will look like this.

[Image: modified data frame]
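As a quick optional sanity check, you can confirm the shape of the loaded data and the default integer column labels produced by header=None. The expected values in the comments below assume the file really holds 20 orders spread across up to 9 item columns, as the rest of this tutorial assumes.

# Optional sanity check on the loaded orders.
print(data.shape)              # expected (20, 9): 20 orders, up to 9 items each
print(data.columns.tolist())   # default integer labels [0, 1, ..., 8] because header=None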

Let us look at some components of the Apriori algorithm that are necessary to understand in order to build a good model.

Components of the Apriori Algorithm

There are three main components of the Apriori algorithm, which are as follows:

  • Support – It is a measure of the popularity of an itemset, i.e., in how many transactions an item appears out of the total number of transactions. It is simply the probability that a customer will buy an item. The mathematical formula for the support of item X is
Support({X}) = (Number of transactions in which X appears)/(Total number of transactions)
Calculating the support value for {Bread} in our dataset (these hand calculations are reproduced in a short code sketch after the Lift example below)

No. of transactions in which Bread appears = 11

No. of total transactions = 20

Support({Bread}) = 11/20 = 0.55

    • Minimum Support Value – It is a threshold value of support; only items above it are considered to have a meaningful effect on the profit.
  • Confidence – It tells us the impact of one product on another, i.e., the probability that if a person buys product X, then he/she will also buy product Y. Its representation in mathematical terms is
Confidence({X} => {Y}) = (Transactions containing both X and Y)/(Transactions containing X)
Calculating the Confidence ({Bread} => {Milk}) in our dataset

It measures the likelihood of buying Milk if Bread is already bought.

No. of transactions in which both Bread and Milk appear = 5

No. of transactions containing Bread = 11

Confidence ({Bread} => {Milk}) = 5/11 = 0.4545

A major drawback of confidence is that it only considers the popularity of item X and not of Y. If Y is popular in general, the confidence value will be high regardless of whether there is any real association between the two products, so it can be misleading. To overcome this drawback we have another measure, known as Lift.

  • Lift – Overcoming the limitation of the confidence measure, lift calculates the confidence while taking into account the popularity of both items. The representation of lift in mathematical terms is
Lift({X} => {Y}) = Confidence({X} => {Y}) / Support({Y})

If the lift is greater than 1, Y is likely to be bought with X, while a value less than 1 indicates that Y is unlikely to be bought with X. A lift value near 1 indicates that the two itemsets appear together about as often as would be expected by chance, i.e., there is no real association between them.

Calculating the Lift({Bread} => {Milk}) in our dataset

Confidence ({Bread} => {Milk})  = 0.4545

Support (Milk) = 9/20 = 0.45

Lift({Bread} => {Milk}) = 0.4545/0.45 = 1.01
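The three hand calculations above can be verified directly from the transactions. The sketch below is only illustrative: it assumes the data frame data loaded earlier, and that the item labels in the file are exactly 'Bread' and 'Milk' (adjust the strings if your file uses a different casing, e.g. 'bread').

# Illustrative check of Support, Confidence and Lift for {Bread} => {Milk}.
# Assumes `data` is the DataFrame loaded above; adjust the item strings to match your CSV.
n = len(data)
has_bread = (data == 'Bread').any(axis=1)
has_milk = (data == 'Milk').any(axis=1)

support_bread = has_bread.sum() / n                            # 11/20 = 0.55
confidence = (has_bread & has_milk).sum() / has_bread.sum()    # 5/11 ≈ 0.4545
lift = confidence / (has_milk.sum() / n)                       # ≈ 0.4545/0.45 ≈ 1.01

print(support_bread, confidence, lift)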

Practical Implementation of the Apriori Algorithm

Using the dataset that we downloaded in the previous section, let us write some code and calculate the values of the Apriori algorithm measures. To use the apriori method, the whole transactional dataset has to be converted into a single list of lists, where each row of the data frame becomes one inner list.

# Convert each of the 20 orders into a list of item strings (one inner list per order);
# empty cells, if any, become the string 'nan'.
data_list = []
for row in range(0, 20):
    data_list.append([str(data.values[row, column]) for column in range(0, 9)])

# Mine the rules; min_support=0.25 means an itemset must appear in at least 5 of the 20 orders.
algo = apriori(data_list, min_support=0.25, min_confidence=0.2, min_lift=2, min_length=2)
results = list(algo)

We have created a list of lists, then used the apriori method from the apyori module, and finally converted the result from a generator into a list saved in a variable named results. To make proper decisions and to keep the Apriori algorithm fast, the apriori method takes several arguments, which are as follows –

  1. data – The first parameter, which takes the list containing the transactional data as inner lists.
  2. min_support – The threshold support value for the items that should be taken into account. Suppose we want to include only those items that appear in at least 5 transactions out of the total, i.e., a support value of 5/20 = 0.25.
  3. min_confidence – The threshold confidence value that every rule between itemsets must reach. We have taken a confidence value of 0.2.
  4. min_lift – The minimum lift value for the rules that are selected. Generally, we take a lift value of 2 or more to keep only those itemsets that show a stronger association (a small experiment after this list shows the effect of these thresholds).
  5. min_length – The number of items to be considered in the rules.
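To get a feel for these thresholds, here is a small, purely illustrative experiment using the same data_list as above: it loosens min_lift and compares how many rules come back. The looser run should return at least as many rules as the stricter one.

# Hypothetical experiment: compare rule counts under a looser and a stricter lift threshold.
loose_rules = list(apriori(data_list, min_support=0.25, min_confidence=0.2, min_lift=1.0))
strict_rules = list(apriori(data_list, min_support=0.25, min_confidence=0.2, min_lift=2))
print(len(loose_rules), len(strict_rules))   # the looser run returns at least as many rules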

Let us see the output of the above program by printing the first 3 rules that we have obtained.

for i in range(0,3):
    print(f"Required Association No. {i+1} is: {results[i]}")
    print('-'*25)
Required Association No. 1 is: RelationRecord(items=frozenset({'toothpaste', 'brush'}), support=0.25, ordered_statistics=[OrderedStatistic(items_base=frozenset({'brush'}), 
items_add=frozenset({'toothpaste'}), confidence=1.0, lift=2.5), OrderedStatistic(items_base=frozenset({'toothpaste'}), items_add=frozenset({'brush'}), confidence=0.625, lift=2.5)])
-------------------------
Required Association No. 2 is: RelationRecord(items=frozenset({'mouthwash', 'toothpaste'}), support=0.3, ordered_statistics=[OrderedStatistic(items_base=frozenset({'mouthwash'}), 
items_add=frozenset({'toothpaste'}), confidence=0.8571428571428572, lift=2.142857142857143), OrderedStatistic(items_base=frozenset({'toothpaste'}), items_add=frozenset({'mouthwash'}), confidence=0.7499999999999999, lift=2.142857142857143)])
-------------------------
Required Association No. 3 is: RelationRecord(items=frozenset({'honey', 'bread', 'butter'}), support=0.25, ordered_statistics=[OrderedStatistic(items_base=frozenset({'butter'}), 
items_add=frozenset({'honey', 'bread'}), confidence=0.625, lift=2.0833333333333335), OrderedStatistic(items_base=frozenset({'honey', 'bread'}), items_add=frozenset({'butter'}), confidence=0.8333333333333334, lift=2.0833333333333335)])
-------------------------

Understanding the Output

Considering association no. 1 from the above output: first, we have an association of toothpaste and brush, which tells us that these items are frequently bought together. Then the support value is given, which is 0.25, and we get confidence and lift values for the itemset with each ordering of the items in turn. For example, the confidence and lift measures for the likelihood of buying toothpaste if a brush is purchased are 1.0 and 2.5 respectively; after changing the order (buying a brush if toothpaste is purchased), the confidence and lift are 0.625 and 2.5 respectively.
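If you prefer a flatter view, the RelationRecord objects can be unpacked into a small table. The sketch below is just one possible way to do it, using the fields visible in the output above (items_base, items_add, confidence, lift):

# Flatten every RelationRecord into one row per ordered statistic for easier reading.
rows = []
for record in results:
    for stat in record.ordered_statistics:
        rows.append({
            'antecedent': ', '.join(stat.items_base),
            'consequent': ', '.join(stat.items_add),
            'support': record.support,
            'confidence': stat.confidence,
            'lift': stat.lift,
        })

rules_df = pd.DataFrame(rows)
print(rules_df.sort_values('lift', ascending=False).head())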

Try changing the different parameters and see how the results change.

We hope you like this tutorial and if you have any doubts, feel free to ask in the comment section.
