High-Performance and Reliable Market Basket Analysis using Item-to-Item Collaborative Filtering and Multiple Evaluation Metrics

Norma Dani Risdiandita
6 min read · Jul 4, 2019

1) Motivation

During my first job at a startup, as both a Data Scientist and a Machine Learning Engineer, my first ever project was developing a bundling recommendation feature. Are you familiar with bundling? When you browse products on Amazon or other well-known online marketplaces, you will see sections like "Frequently Bought Together" or "Customers Who Bought This Also Bought", and offline retailers offer similar bundling discounts.

Amazon's "Frequently Bought Together" suggestions (image from Google; not exactly like this, I suppose)

Back then I stumbled through the well-known Market Basket Analysis algorithms such as Apriori, Eclat, and the high-performance FP-Growth, using the available Python libraries. Apriori and Eclat did not satisfy me because of their poor performance on big data. FP-Growth is a good, higher-performance replacement for those two. Still, I was not satisfied with how it handles the relationship score between items: I wanted the relationship score of each product with every other product under a metric of my own choice, and that is hard to extract and visualise from the existing implementations.

We must also recognise that as the number of products grows, the number of relationships/affinities grows quadratically rather than linearly, which makes a large catalogue hard to compute efficiently. Writing an efficient algorithm significantly affects how many products we can handle.

Here, I use a Python library I wrote myself to determine the relationship score between each pair of products, and I try to explain the work that can be done to simplify the procedure.

2) Market Basket Analysis Overview — Support, Confidence, and Lift Metrics

Market Basket Analysis is one way to predict the affinity between two products, that is, to define which products are good to buy together. Famous retail applications include bundling discounts, product placement in supermarkets, and the product recommendations of well-known online marketplaces. In retail we may have thousands of products, and deciding which products should be placed next to which others is excruciating if done by gut feeling alone.

In market basket analysis, it is standard to use association rules to determine the affinity score between two products or items. You should therefore be familiar with the association-rule metrics called support, confidence, and lift, defined as follows:

Support(a) is the number of transactions purchasing product a. When a customer buys product a, support(a) increases by one.
Support(a, b) is the number of transactions purchasing both a and b. When a customer buys coffee and bread in the same transaction, support(coffee, bread) increases by one.
Confidence(a -> b) is the probability that a transaction purchasing a also purchases b: the ratio of the two metrics above, confidence(a -> b) = support(a, b) / support(a).
Lift(a -> b) compares that probability with how often b is bought overall: lift(a -> b) = confidence(a -> b) / (support(b) / N), where N is the total number of transactions. A lift above 1 means a and b are bought together more often than chance would suggest.
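To make these definitions concrete, here is a small self-contained Python illustration (a toy example, independent of any library) that computes all three metrics over five made-up baskets:

baskets = [
    {"coffee", "bread"},
    {"coffee", "bread", "milk"},
    {"coffee"},
    {"bread"},
    {"milk"},
]
n = len(baskets)  # total number of transactions
support_coffee = sum("coffee" in b for b in baskets)           # 3
support_bread = sum("bread" in b for b in baskets)             # 3
support_both = sum({"coffee", "bread"} <= b for b in baskets)  # 2
confidence = support_both / support_coffee                     # 2/3
lift = confidence / (support_bread / n)                        # (2/3) / (3/5) ≈ 1.11
print(confidence, lift)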

By considering these metrics, we can rank and bundle products with rules of our own. But these metrics have flaws, because they do not fully account for how frequently the items are purchased.

If product A has been bought 10 times, product B has been bought 20 times, and both have been bought together 5 times, we have 4 metric values to consider:

1. support(A) = 10

2. support(B) = 20

3. support(A, B) = 5

4. confidence(A -> B) = 5 / 10 = 0.5

Compare this to another case: product C has been bought 1000 times, product D has been bought 2000 times, and both have been bought together 500 times. The metrics to consider would be

1. support(C) = 1000

2. support(D) = 2000

3. support(C, D) = 500

4. confidence(C -> D) = 500 / 1000 = 0.5

Using confidence as the sole ranking metric would fail here, since C and D clearly deserve a better affinity score than A and B. Ranking by confidence first and then by the supports might solve this particular case, but in many other cases confidence will still fail as a ranking metric because it ignores frequency.

In this article, I will therefore introduce a metric called information gain: a ranking metric that summarises the four values above into one single score, so we can tell directly which affinity is higher and which is lower.

3) Algorithm Challenge

In market basket analysis, we connect each product to every other product and define a relationship/affinity score describing how likely the two items are to be bought together. If we have 2 products, we have 2 affinity scores to calculate. If we have many products, say N, the number of affinity scores to find is N(N-1). This quadratic growth in the number of products means a naive algorithm will never suffice for millions of different products (and it only gets worse with bundles of 3, 4, or even 5 products). The depiction is shown below.

Left: with 2 products there are 2 possible relationships; if a customer bought product a, how probable is it that they will also buy product b, and vice versa? Right: with 3 products there are 6 affinities to consider. With N products, the number of relationships is N(N-1).
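To see the quadratic growth concretely, here is a tiny illustration (not part of pyasrule) that enumerates every ordered pair needing its own affinity score:

from itertools import permutations

products = ["a", "b", "c"]
pairs = list(permutations(products, 2))  # every ordered pair (x -> y)
print(len(pairs))  # 6 = 3 * (3 - 1); N products give N * (N - 1) pairs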

Here, you can use the pyasrule library I have written to solve this algorithmic problem.

4) Dataset

I searched for datasets in several places and eventually found a bakery transaction dataset on Kaggle (source: https://www.kaggle.com/sulmansarwar/transactions-from-a-bakery/version/1#BreadBasket_DMS.csv ). You need to download the data to your machine. The dataset is transaction data consisting of Date, Time, Transaction, and Item columns. We load the data with the lines below:

import pandas as pd

transaction_df = pd.read_csv("BreadBasket_DMS.csv")
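As a sanity check, the support counts from section 2 can be computed directly with pandas before involving any library (a minimal sketch; the Transaction and Item column names follow the Kaggle file, and Coffee and Bread are items that appear in it):

baskets = transaction_df.groupby("Transaction")["Item"].apply(set)
n_transactions = len(baskets)

def support(*items):
    # Number of transactions containing every given item.
    wanted = set(items)
    return int(baskets.apply(wanted.issubset).sum())

print(support("Coffee"), support("Bread"), support("Coffee", "Bread"))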

5) `pyasrule` library

To solve this problem, I wrote a library called pyasrule that generates several relationship metrics such as Jaccard similarity, cosine similarity, and information gain. You can install the library on your machine using pip:

pip install pyasrule

We can then import the package:

import pyasrule as rule

6) Complete Notebook

7) Evaluation Metrics

In the notebook above, we showed several metric evaluations and how each metric ranks the recommendations for Coffee.

7.1) Confidence

Described in section 2.

7.2) Lift

Described in section 2.

7.3) Jaccard Similarity

The formulation of Jaccard similarity in terms of association-rule metrics can be written as

jaccard(a, b) = support(a, b) / (support(a) + support(b) - support(a, b))

7.4) Cosine Similarity

The formulation of cosine similarity in terms of association-rule metrics can be written as

cosine(a, b) = support(a, b) / sqrt(support(a) * support(b))

From the equation above, we can define cosine similarity equivalently as the geometric mean of the two confidences: cosine(a, b) = sqrt(confidence(a -> b) * confidence(b -> a)).
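As a quick illustration, here are direct Python translations of the two similarity formulas above (a sketch written from the definitions, not necessarily how pyasrule implements them):

from math import sqrt

def jaccard(support_a, support_b, support_ab):
    # Shared transactions divided by transactions containing a or b.
    return support_ab / (support_a + support_b - support_ab)

def cosine(support_a, support_b, support_ab):
    # Equivalently sqrt(confidence(a -> b) * confidence(b -> a)).
    return support_ab / sqrt(support_a * support_b)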


7.5) Information Gain

The formulation is given below:

information_gain(a, b) = entropy([a_true_b_true + a_true_b_false, a_false_b_false + a_false_b_true], base=2)
    + entropy([a_true_b_true + a_false_b_true, a_false_b_false + a_true_b_false], base=2)
    - entropy([a_true_b_true, a_true_b_false, a_false_b_false, a_false_b_true], base=2)

With

a_true_b_true = support(a, b)

a_true_b_false = support(a) - support(a, b)

a_false_b_true = support(b) - support(a, b)

a_false_b_false = number_of_total_transactions - support(a) - support(b) + support(a, b)
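This quantity is exactly the mutual information H(A) + H(B) - H(A, B) between the events "a was bought" and "b was bought". It translates directly into Python with scipy.stats.entropy, which normalises raw counts internally; the sketch below mirrors the definition above and is not necessarily pyasrule's internal code:

from scipy.stats import entropy

def information_gain(support_a, support_b, support_ab, n_transactions):
    # The four cells of the a/b purchase contingency table.
    a_true_b_true = support_ab
    a_true_b_false = support_a - support_ab
    a_false_b_true = support_b - support_ab
    a_false_b_false = n_transactions - support_a - support_b + support_ab
    h_a = entropy([a_true_b_true + a_true_b_false,
                   a_false_b_true + a_false_b_false], base=2)
    h_b = entropy([a_true_b_true + a_false_b_true,
                   a_true_b_false + a_false_b_false], base=2)
    h_ab = entropy([a_true_b_true, a_true_b_false,
                    a_false_b_true, a_false_b_false], base=2)
    return h_a + h_b - h_ab

print(information_gain(1000, 2000, 500, 10000))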

8) Discussion

After evaluating the metrics above, information gain provides a single score that combines all of them: support(a), support(b), support(a, b), the total number of transactions, and so on. Compared with the other metrics, it reduces possible ranking faults in two cases (comparing two relationships, (a to b) and (c to d)):

1. When the confidences are equal, information gain can still decide that (c, d) has a higher affinity score than (a, b), for example:
  • support(a) = 100, support(b) = 200, support(a, b) = 50
  • support(c) = 1000, support(d) = 2000, support(c, d) = 500

2. When an item is purchased very often, confidence lets it dominate the top recommendations; information gain dampens this effect.

Information gain is good enough for choosing the highest affinities. One of the best approaches I have used is to take the 5 or 10 best affinity scores by information gain and then rearrange that shortlist using confidence as the final ranking metric, as sketched below.
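A minimal sketch of that two-stage ranking (the layout of the scores dictionary here is hypothetical, purely for illustration):

def recommend(scores, k=10):
    # scores: {partner_item: {"information_gain": ..., "confidence": ...}}
    # Stage 1: shortlist the k partners with the highest information gain.
    shortlist = sorted(scores, key=lambda p: scores[p]["information_gain"], reverse=True)[:k]
    # Stage 2: re-rank the shortlist by confidence.
    return sorted(shortlist, key=lambda p: scores[p]["confidence"], reverse=True)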
