
The challenge

One of our e-commerce clients wanted to develop a substitutability algorithm that automatically scores pairs of products on how strongly they substitute for each other. The aim was to use this scoring to automatically recommend substitutes to customers who ordered an item that had gone out of stock. Originally this was done through the product hierarchy and manual mapping, which was extremely time-consuming and inefficient. We were tasked with developing an algorithm whose output is an automated score between 0 and 1 that quantifies the strength of substitutability for a pair of products.

Step 01.

Discovery

In our discovery phase we first met the internal team to understand what data existed within their platform that could help with the algorithm. At this stage we observed that there was a great deal of back-of-pack content in the database that was not being used at all. We also observed that they had a database of historical substitute acceptance rates for pairs of products over the last four years. We collected all this information in one place and decided to build the algorithm on a chosen category. The category was defined broadly enough to rule out substitutes coming from a different category; for example, frozen pizzas and fresh pizzas were mapped into a single broad category even though they sat in different categories in the existing hierarchy. The output of our discovery stage was this custom-built broad category, chosen so that we could safely assume product pairs outside it had a negligible chance of being substitutes.

Step 02.

Proof of Concept

Now, to build our algorithm end to end we developed its building blocks step by step, as follows.

Step 1 - Complementarity Score

One way of looking at substitute pairs is that highly substitutable products should have very similar complementary products. Based on this idea we first developed a module to calculate a complementarity score using a Bayesian Apriori-style association measure. Essentially, for every pair of products we calculated the probability of the two products appearing in the same basket and divided it by the product of their individual basket penetrations. A score much greater than 1 implies that the two products appear in the same basket far more often than would be expected if they were bought independently.
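As an illustration, here is a minimal sketch of this complementarity (lift-style) score, assuming the transaction data is available as one row per basket line in a pandas DataFrame; the column names and toy data are hypothetical, not the client's schema.

```python
import pandas as pd

# Hypothetical input: one row per (basket_id, product_id) purchase line.
baskets = pd.DataFrame({
    "basket_id": [1, 1, 2, 2, 3, 3, 3],
    "product_id": ["A", "B", "A", "C", "A", "B", "C"],
})

n_baskets = baskets["basket_id"].nunique()

# Individual basket penetration: share of baskets containing each product.
penetration = baskets.groupby("product_id")["basket_id"].nunique() / n_baskets


def complementarity_score(prod_a: str, prod_b: str) -> float:
    """Lift-style score: P(A and B in same basket) / (P(A) * P(B))."""
    baskets_a = set(baskets.loc[baskets["product_id"] == prod_a, "basket_id"])
    baskets_b = set(baskets.loc[baskets["product_id"] == prod_b, "basket_id"])
    joint = len(baskets_a & baskets_b) / n_baskets
    return joint / (penetration[prod_a] * penetration[prod_b])


print(complementarity_score("A", "B"))  # >> 1 suggests complementarity
```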

Step 2 - Apriori Substitutability Score

For each product we thus obtained p − 1 complementarity scores (where p is the total number of products within the search category). We then defined the Apriori substitutability score for a pair of products as the cosine similarity between their complementarity-score vectors. We also performed some exploratory analysis to validate the results and make sure the scoring was not influenced by outlier effects.
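A minimal sketch of this cosine-similarity step; the vectors below are toy values standing in for two products' complementarity scores against the same set of other products in the category.

```python
import numpy as np


def apriori_substitutability(vec_a: np.ndarray, vec_b: np.ndarray) -> float:
    """Cosine similarity between two complementarity-score vectors."""
    denom = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    return float(np.dot(vec_a, vec_b) / denom) if denom else 0.0


# Toy complementarity vectors for two products (same ordering of "other" products).
product_x = np.array([2.4, 0.3, 1.8, 0.9])
product_y = np.array([2.1, 0.4, 1.6, 1.1])
print(apriori_substitutability(product_x, product_y))
```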

Step 3 - Product name similarity Score

In this step we computed the cosine similarity between the product names of a pair of products. To do that we first removed stop words, then calculated TF-IDF representations based on the universe of all product names within the database, and finally computed the cosine similarity between those TF-IDF vectors.
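A sketch of this name-similarity scoring using scikit-learn's TfidfVectorizer; the product names below are invented for illustration, whereas in practice the vectorizer is fitted on every product name in the database.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative product names; the real universe is every name in the database.
product_names = [
    "stone baked margherita frozen pizza 300g",
    "thin crust margherita pizza 330g",
    "garlic baguette twin pack",
]

# Fit TF-IDF on the whole universe of names, with English stop words removed.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(product_names)

# Name-similarity score for the first pair of products.
score = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
print(round(score, 3))
```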

Step 4 - Back-Of-Pack Description Similarity Score

The back-of-pack data for each product contained several sentences of information about the contents and ingredients of the product, the brand name, general advertising copy and more. Our objective here was again to measure the similarity between these descriptions for any pair of products. Because the descriptions consist of lengthy sentences, we could not rely on a basic TF-IDF representation. Instead we used a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model and extracted embeddings for our back-of-pack descriptions. Note that BERT is a language model developed by Google (best known for its use in Google Search), but pre-trained BERT models can be applied to any standard English text to extract embeddings (features) that act as a numerical representation of a paragraph. We obtained this high-dimensional feature vector for each product and then calculated the BERT substitutability score as the cosine similarity between the vectors of a pair of products.
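A minimal sketch of extracting such embeddings and scoring one pair of descriptions, assuming the Hugging Face transformers library and a generic bert-base-uncased checkpoint with mean pooling; the exact pre-trained model and pooling used on the project may differ, and the descriptions are invented.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.metrics.pairwise import cosine_similarity

# Generic pre-trained BERT; the project checkpoint may differ.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()


def bop_embedding(description: str) -> torch.Tensor:
    """Mean-pool BERT's last hidden states into one vector per description."""
    inputs = tokenizer(description, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state      # (1, tokens, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)        # (1, tokens, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (1, 768)


desc_a = "Stone baked pizza with mozzarella and tomato. Suitable for vegetarians."
desc_b = "Thin crust pizza topped with tomato sauce and mozzarella cheese."

emb_a, emb_b = bop_embedding(desc_a), bop_embedding(desc_b)
print(cosine_similarity(emb_a.numpy(), emb_b.numpy())[0, 0])
```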

Step 5 - Classification Model

Lastly, we trained a classifier on all the scores above to predict the likelihood that the acceptance rate for a pair of products exceeds 70%. The training used an XGBoost classifier tuned with Hyperopt to find optimal hyper-parameters. Training was only possible on products that had historical acceptance rates, but the coverage of this data was good and our training AUC metrics came out strong. To calculate the substitutability score for all pairs of products, we first computed each of the scores described in the previous steps and then fed them into the classifier fitted on the acceptance data; the resulting probability is our measure of the strength of substitutability for the pair.
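A simplified sketch of this training step, assuming the pairwise scores have already been assembled into a feature matrix; the data is synthetic and the search space is illustrative rather than the tuning grid used on the project.

```python
import numpy as np
from hyperopt import Trials, fmin, hp, tpe
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Hypothetical feature matrix: one row per product pair with the scores from
# Steps 1-4; y = 1 if the pair's historical acceptance rate exceeded 70%.
rng = np.random.default_rng(0)
X = rng.random((500, 3))          # [apriori score, name score, BERT score]
y = (rng.random(500) > 0.5).astype(int)

space = {
    "max_depth": hp.quniform("max_depth", 3, 10, 1),
    "learning_rate": hp.loguniform("learning_rate", np.log(0.01), np.log(0.3)),
    "n_estimators": hp.quniform("n_estimators", 100, 500, 50),
}


def objective(params):
    model = XGBClassifier(
        max_depth=int(params["max_depth"]),
        learning_rate=params["learning_rate"],
        n_estimators=int(params["n_estimators"]),
    )
    auc = cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()
    return -auc  # Hyperopt minimises, so negate the AUC.


best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=25, trials=Trials())
print(best)
```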

Step 03.

Scale

The results from this approach proved far better than what existed, so we were given the green light to scale it up to larger categories. We ran several iterations over all the categories and finally created a modular function that scores every product pair on its strength of substitutability, as sketched below. The final prediction module was automated end to end on Google Cloud Platform, where every quarter all un-scored products are re-scored through the model.
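For illustration, the per-pair scoring function can be thought of along these lines; the field names, feature ordering and classifier interface are hypothetical stand-ins for the production module.

```python
from dataclasses import dataclass


@dataclass
class PairScores:
    """Intermediate scores for one product pair (field names are illustrative)."""
    apriori_substitutability: float
    name_similarity: float
    bop_similarity: float


def score_pair(scores: PairScores, classifier) -> float:
    """Return the fitted classifier's probability that the pair is a strong
    substitute, i.e. that its acceptance rate would exceed 70%."""
    features = [[
        scores.apriori_substitutability,
        scores.name_similarity,
        scores.bop_similarity,
    ]]
    return float(classifier.predict_proba(features)[0, 1])
```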

Step 04.

Empower

As with all of our projects, we automated the modelling pipelines end to end and provided detailed documentation on each of the steps and the Python code. The solution was then deployed and handed over to the client's internal data science team for ongoing maintenance.

Step 05.

Support

For this solution there was no requirement for ongoing support beyond a half-yearly model parameter refresh, which we happily contracted with our client to do, charging just 10 days of work for a full year.

Conclusion

Our deep expertise across domains such as natural language processing and predictive modelling, together with our modular coding practices, enabled us to deliver this project end to end in a record time of just eight weeks. The entire module, from the data through to the end-to-end pipelines, was developed and deployed inside our client's cloud platform within a couple of months. As a result, we are again getting a lot of traction from our client to work on another project requiring an innovative solution.