High-Performance and Reliable Market Basket Analysis using Item-to-Item Collaborative Filtering and Multiple Evaluation Metrics
1) Motivation
During my first job at a startup, working as both a Data Scientist and a Machine Learning Engineer, my very first project was to develop a bundling recommendation feature. Does bundling sound familiar? When you browse products on Amazon or other well-known online marketplaces, you will see sections like “Frequently Bought Together” or “Customers Who Bought This Also Bought This”, and many offline retailers offer similar bundling discounts.
Back then I worked through the well-known algorithms for Market Basket Analysis, such as Apriori, Eclat, and the high-performance FP-Growth, using the available Python libraries. I was not satisfied with Apriori and Eclat because of their poor performance on large datasets. FP-Growth is a good, higher-performance replacement for those two. However, I was still not satisfied with how it handles item relationship scores: I wanted to score the relationship of each product with every other product using a metric of my own choosing, and that is hard to compute and visualise with the off-the-shelf implementations.
We should also keep in mind that as the number of products grows, the number of relationships/affinities grows faster than linearly, which makes a large catalogue hard to compute efficiently. Writing an efficient algorithm significantly affects how many products we can handle.
Here, I use a Python library I wrote myself to determine the relationship score between each pair of products, and I try to explain how the procedure can be simplified.
2) Market Basket Analysis Overview — Support, Confidence, and Lift Metrics
Market Basket Analysis is one way to predict the affinity between two products, i.e. which products are good candidates to be bought together. Well-known applications in retail include bundling discounts, product placement in supermarkets, and the product recommendations of major online marketplaces. A retailer may carry thousands of products, and deciding which products should sit next to which is excruciating if done by gut feeling alone.
In market basket analysis we use association rules to determine the affinity score between two products or items. You should therefore be familiar with the association rule metrics called support, confidence, and lift, which, in the count-based form used throughout this article, are defined as
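support(A) = number of transactions containing A
support(A, B) = number of transactions containing both A and B
confidence(A -> B) = support(A, B) / support(A)
lift(A -> B) = (support(A, B) * N) / (support(A) * support(B)), where N is the total number of transactions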
By considering these three metrics, we can rank and bundle our products with our own rules. However, these metrics have flaws when it comes to accounting for how frequently the items are bought.
If product A has been bought 10 times, product B has been bought 20 times, and both have been bought together 5 times, we have four metric values to consider:
1. support(A) = 10
2. support(B) = 20
3. support(A, B) = 5
4. confidence(A -> B) = 5 / 10 = 0.5
Compare this with another case: product C has been bought 1000 times, product D has been bought 2000 times, and both have been bought together 500 times. The metrics to consider would be
1. support(C) = 1000
2. support(D) = 2000
3. support(C, D) = 500
4. confidence(C -> D) = 500 / 1000 = 0.5
Using confidence as the sole ranking metric would fail here, since intuitively C and D have a stronger affinity than A and B, yet both pairs get the same confidence of 0.5. Sorting by confidence first and by support second might solve this particular case, but in many cases confidence alone will still produce poor rankings because it ignores how frequently the items are bought.
Later in this article, I will introduce a metric called information gain, a ranking metric that summarises the four values above into a single number, so that we can tell which affinity is higher and which is lower using one metric.
3) Algorithm Challenge
In market basket analysis, we try to connect every product to every other product and define a relationship/affinity score describing how likely two items are to be bought together. With 2 products we have 2 affinity scores to calculate (one in each direction). With N products, the number of affinity scores we need to compute is N(N-1). This quadratic growth means that a naive algorithm will not suffice for millions of distinct products (and it gets worse for bundles of 3, 4, or even 5 products). The depiction is shown below.
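For a rough sense of scale, a quick calculation of N(N-1) for increasing catalogue sizes:

# number of ordered product pairs grows quadratically with the catalogue size
for n in [100, 1_000, 10_000, 100_000]:
    print(f"{n} products -> {n * (n - 1)} pairwise affinity scores")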
Here you can use the pyasrule library I have written to solve this algorithmic problem.
4) Dataset
I looked for datasets in several places and ended up with a bakery transaction dataset from Kaggle (source: https://www.kaggle.com/sulmansarwar/transactions-from-a-bakery/version/1#BreadBasket_DMS.csv ). You need to download the data to your machine. The dataset is transaction data consisting of Date, Time, Transaction, and Item columns. We load the data with the lines below:
import pandas as pd

transaction_df = pd.read_csv("BreadBasket_DMS.csv")
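As a quick sanity check (not part of pyasrule), we can peek at the data and group the items of each transaction into a basket; the column names Transaction and Item come from the dataset description above:

# inspect the first rows, then collect the items of each transaction into a set
print(transaction_df.head())
baskets = transaction_df.groupby("Transaction")["Item"].apply(set)
print(baskets.head())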
5) `pyasrule` library
In order to solve this problem, I wrote a library called pyasrule to generate several relationship metrics such as Jaccard similarity, cosine similarity, and information gain. To install the library on your machine, you can use pip:
pip install pyasrule
We can then use the package by importing it:
import pyasrule as rule
6) Complete Notebook
7) Evaluation Metrics
In the notebook above, we showed several metric evaluations and how each metric ranks the recommendations for Coffee.
7.1) Confidence
Described in Section 2.
7.2) Lift
Described in Section 2.
7.3) Jaccard Similarity
The formulation of Jaccard similarity in association rule terms can be written as
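jaccard(a, b) = support(a, b) / (support(a) + support(b) - support(a, b))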
7.4) Cosine Similarity
The formulation of cosine similarity in association rule terms can be written as
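cosine(a, b) = support(a, b) / sqrt(support(a) * support(b))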
From the equation above, we can see that cosine similarity is the co-occurrence count normalised by the geometric mean of the two items’ individual counts.
7.5) Lift
Lift was already described in Section 2; its explicit formulation in count terms is given below:
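lift(a, b) = (support(a, b) * number_of_total_transactions) / (support(a) * support(b))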
7.6) Information Gain
The formulation is given below:
information_gain(a, b) = entropy([a_true_b_true + a_true_b_false, a_false_b_false + a_false_b_true], base=2)
+ entropy([a_true_b_true + a_false_b_true, a_false_b_false + a_true_b_false], base=2)
- entropy([a_true_b_true, a_true_b_false, a_false_b_false, a_false_b_true], base=2)
With
a_true_b_true = support(a, b)
a_true_b_false = support(a) - support(a, b)
a_false_b_true = support(b) - support(a, b)
a_false_b_false = number_of_total_transactions - support(a) - support(b) + support(a, b)
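This is the mutual information between the two items: the entropy of a plus the entropy of b minus their joint entropy. A minimal sketch of the computation using scipy.stats.entropy (for illustration only; this is not the pyasrule implementation):

from scipy.stats import entropy

def information_gain(support_a, support_b, support_ab, number_of_total_transactions):
    # 2x2 contingency-table cell counts derived from the support counts
    a_true_b_true = support_ab
    a_true_b_false = support_a - support_ab
    a_false_b_true = support_b - support_ab
    a_false_b_false = number_of_total_transactions - support_a - support_b + support_ab
    # entropy() normalises the counts into probabilities internally
    h_a = entropy([a_true_b_true + a_true_b_false, a_false_b_true + a_false_b_false], base=2)
    h_b = entropy([a_true_b_true + a_false_b_true, a_true_b_false + a_false_b_false], base=2)
    h_ab = entropy([a_true_b_true, a_true_b_false, a_false_b_true, a_false_b_false], base=2)
    return h_a + h_b - h_ab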
8) Discussion
After evaluating the metrics above, information gain provides a single score that combines all of the underlying quantities: support(a), support(b), support(a, b), and the total number of transactions. Compared to the other metrics, it reduces the possible ranking faults in two cases (comparing the two relationships (a, b) and (c, d)):
1. The case where the conditional probabilities are the same: information gain can still decide that (c, d) has a higher affinity score than (a, b), for example:
- support(a) = 100, support(b) = 200, support(a, b) = 50
- support(c) = 1000, support(d) = 2000, support(c, d) = 500
2. The case where a very frequently purchased item would otherwise dominate the top recommendations.
Information gain is good enough for choosing the highest affinities on its own. One of the best approaches I have used is to let information gain select the 5 or 10 best affinity scores and then rearrange that shortlist using confidence as the final metric.
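A minimal sketch of that two-stage ranking with pandas, assuming a DataFrame of pairwise scores with hypothetical columns antecedent, consequent, information_gain, and confidence (this is not pyasrule's actual output format):

import pandas as pd

def recommend(scores: pd.DataFrame, product: str, k: int = 10) -> pd.DataFrame:
    # shortlist the k partners with the highest information gain for this product
    candidates = scores[scores["antecedent"] == product]
    shortlist = candidates.nlargest(k, "information_gain")
    # reorder the shortlist by confidence to produce the final ranking
    return shortlist.sort_values("confidence", ascending=False)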
References: