In an era where customer experience and product discovery innovations are a top priority for businesses, we are witnessing a shift from traditional recommendation systems towards session-based recommendation systems in the digital commerce sector. Why? Because they produce personalized recommendations by taking into account a user’s most recent and real-time interactions with a brand, as opposed to historical behavior only.
Traditional personalized recommender systems tend to use all user-to-item interactions to learn the preferences of each user. However, not all historical events are equally important to the current recommendation scenario. User intent may vary significantly between sessions, and even long-term preferences usually shift over time, making the recommendations obsolete. Session-based recommender systems take the products in a user session as input and generate recommendations that reflect the current user intent. Such systems also address the problem of user data sparsity, which makes it very challenging to build reliable user profiles.
Our forward-looking Fortune 500 customer engaged Grid Dynamics to enhance the quality of their personalized recommendations with a session-based recommender. The customer was eager to try out novel approaches employing neural networks and to decide whether the financial gains outweigh the complications of applying them in a production environment. They required a system that works with implicit feedback consisting of positive-only events like views and purchases, rather than explicit user preferences like star ratings. The customer had recently moved to Google Cloud infrastructure, so the new solution also had to be cloud-native.
The ecommerce website of our customer has many recommendation zones, and they require different types of recommendations. For example, it was observed that users prefer recommendations on the product details page that are heavily biased towards the anchor product. On the user homepage, however, there is no anchor product, so recommendations are based only on the user history. Initially, our customer wanted to update recommendations in the most impactful zone on the product details page, so we concentrated on this scenario.
It is extremely important to keep the right balance between the different metrics that the new recommender system tries to optimize. On the one hand, our customer wanted to increase the click-through rate for zones with recommendations. On the other hand, it was necessary to keep the conversion rate in check so that revenue per customer grows. The recommendations should also be diverse, explore the long tail of products, and to some extent surprise the users.
Another challenge was that the full catalog of our customer contains hundreds of thousands of products, and because the user data was sparse, it was extremely difficult for a single model to cover the entire catalog. Further, the new recommender system needed a model fast enough for real-time inference, running in the cloud with a latency of no more than several hundred milliseconds. Within this time budget, the system had to fetch the recent user events, run model inference with these events as input, and apply egress business rules.
The first step we took was to review the literature and decide on the base model. We were looking for a session-based neural network model with low enough computational complexity to run the training pipeline daily and achieve fast real-time inference. Because the initial goal was to deploy the new recommender system on the product details page, we also wanted a model that pays special attention to the anchor product. After carefully reviewing the approaches, we selected STAMP (Short-Term Attention/Memory Priority Model for Session-based Recommendation) as the model to try. This model has an attention layer and accepts a sequence of products as input while keeping a separate path for the anchor product. STAMP has been applied successfully by Home Depot and Zalando, as reflected in papers published by these companies.
At its core, the STAMP model (see Fig. 1) has an attention layer that learns to transform the trainable product embeddings making up a user session into a compressed session representation $(m_a)$. The anchor product $(x_t)$ takes a separate path through the model, so recommended products are biased towards it, which is beneficial in many scenarios such as recommendations on the product details page. The goal of model training is to learn which of the candidate products $(V)$ will be the next event, given a user session and an anchor product.
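To make the data flow concrete, here is a minimal, dependency-free sketch of the scoring path as we read it in the published STAMP paper: attention over the session embeddings produces $m_a$, the anchor $x_t$ goes through its own small layer, and each candidate is scored by a trilinear product of the two hidden vectors with the candidate embedding. The weight names (`W1`, `W2`, `W3`, `w0`, `Ws`, `Wt`), the random initialization, the four-dimensional embeddings, and the product names are assumptions made for illustration; this is not the production model and includes no training.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def stamp_probabilities(session, anchor, candidates, p):
    """Return a probability per candidate for one session (forward pass only)."""
    d = len(anchor)
    # m_s: average of the session embeddings (general interest).
    m_s = [sum(x[k] for x in session) / len(session) for k in range(d)]
    # Part of the attention input that is shared by all session items.
    context = [a + b + c for a, b, c in
               zip(matvec(p["W2"], anchor), matvec(p["W3"], m_s), p["b_a"])]
    # Attention weight per session item: w0 . sigmoid(W1 x_i + W2 x_t + W3 m_s + b_a).
    weights = []
    for x in session:
        hidden = [sigmoid(u + c) for u, c in zip(matvec(p["W1"], x), context)]
        weights.append(sum(w * h for w, h in zip(p["w0"], hidden)))
    # m_a: attention-weighted sum of session embeddings.
    m_a = [sum(a * x[k] for a, x in zip(weights, session)) for k in range(d)]
    # Separate paths for the session summary and for the anchor product.
    h_s = [math.tanh(v + b) for v, b in zip(matvec(p["Ws"], m_a), p["b_s"])]
    h_t = [math.tanh(v + b) for v, b in zip(matvec(p["Wt"], anchor), p["b_t"])]
    # Trilinear score per candidate, then softmax over all candidates.
    z = [sigmoid(sum(s * t * c for s, t, c in zip(h_s, h_t, cand)))
         for cand in candidates]
    z_max = max(z)
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# Toy setup: hypothetical products with random 4-dimensional embeddings.
rng = random.Random(0)
d = 4
item_emb = {name: [rng.uniform(-1, 1) for _ in range(d)]
            for name in ["drill", "bits", "saw", "glue"]}
params = {name: random_matrix(d, d, rng) for name in ["W1", "W2", "W3", "Ws", "Wt"]}
params["w0"] = [rng.uniform(-0.5, 0.5) for _ in range(d)]
params.update({name: [0.0] * d for name in ["b_a", "b_s", "b_t"]})

session = [item_emb["saw"], item_emb["drill"]]   # the last item acts as the anchor
probs = stamp_probabilities(session, session[-1], list(item_emb.values()), params)
```

In a trained model the embeddings and weights are learned jointly, and the top-scoring candidates become the recommendations for the anchor product.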
The system we built consists of two parts implemented using the Google Cloud Platform: the training pipeline and the serving application.
The training pipeline runs in Kubeflow and consists of the following standard steps:
Because our customer has hundreds of gigabytes of clickstream data generated monthly, to speed up the pipeline, we offloaded most of the data preparation logic to Google BigQuery and used GPUs for model training.
The serving application was implemented as a cloud-based application with auto-scaling, so new instances are created automatically to handle the load elastically. To produce personalized recommendations, STAMP needs the most recent user events at serving time. Therefore, all real-time user events are captured in Google Cloud Bigtable, which is shared by multiple recommenders. For each incoming request, the serving application issues a call to Bigtable and extracts the real-time events for inference.
To address the issue of data sparsity, we clustered the product catalog and trained a separate STAMP model for each cluster. The serving application loads multiple STAMP models and selects which one to use based on the anchor product in the request.
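The model selection step can be sketched as a simple lookup from anchor product to cluster to model. Everything named here is hypothetical: the `ModelRouter` class, the product and cluster identifiers, the stub models, and in particular the fallback cluster for anchors missing from the mapping, which the description above does not cover.

```python
from typing import Callable, Dict, List

# A "model" here is any callable: (session, anchor, k) -> list of product ids.
Recommender = Callable[[List[str], str, int], List[str]]

class ModelRouter:
    """Pick the per-cluster model based on the anchor product of the request."""

    def __init__(self, cluster_of_product: Dict[str, str],
                 models: Dict[str, Recommender], fallback_cluster: str):
        self.cluster_of_product = cluster_of_product
        self.models = models
        # Assumption: unknown anchors are routed to a designated fallback cluster.
        self.fallback_cluster = fallback_cluster

    def recommend(self, session: List[str], anchor: str, k: int) -> List[str]:
        cluster = self.cluster_of_product.get(anchor, self.fallback_cluster)
        return self.models[cluster](session, anchor, k)

# Stub models standing in for trained per-cluster STAMP models.
def tools_model(session, anchor, k):
    return ["drill_bits", "safety_glasses", "work_gloves"][:k]

def paint_model(session, anchor, k):
    return ["paint_roller", "masking_tape", "drop_cloth"][:k]

router = ModelRouter(
    cluster_of_product={"cordless_drill": "tools", "wall_paint": "paint"},
    models={"tools": tools_model, "paint": paint_model},
    fallback_cluster="tools",
)
tools_recs = router.recommend(["hammer", "cordless_drill"], "cordless_drill", 2)
paint_recs = router.recommend(["wall_paint"], "wall_paint", 3)
```

Keeping the mapping and the models in memory means the routing itself adds only a dictionary lookup to the request latency.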
The clustering logic was based on product attribute groups. We trained a single model on the full catalog and then applied hierarchical clustering, with the distance between attribute groups derived from the number of cross recommendations between them.
The clustering based on product attribute groups was implemented in several steps. First, the distance matrix was calculated from the number of cross recommendations between product attribute groups. Then the distance matrix was symmetrized, and only the large product attribute groups were clustered using hierarchical clustering. Clustering only large groups with many products ensures a cleaner structure of the resulting clusters. Finally, the large product attribute groups were frozen in their clusters and all other product attribute groups were clustered.
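The steps above can be sketched in plain Python. Several details are assumptions of this sketch rather than facts from the project: the conversion of cross-recommendation counts to a distance (`1 / (1 + count)`), symmetrization by averaging the two directions, average linkage, stopping at a fixed number of clusters, and attaching each small group to the nearest frozen cluster. The group names, sizes, and counts are invented toy data.

```python
from itertools import combinations

def symmetric_distance(counts, groups):
    """Average both directions of cross recommendations, then map to a distance."""
    dist = {}
    for a, b in combinations(groups, 2):
        cross = (counts.get((a, b), 0) + counts.get((b, a), 0)) / 2.0
        dist[frozenset((a, b))] = 1.0 / (1.0 + cross)  # more cross recs -> closer
    return dist

def avg_distance(dist, xs, ys):
    """Average-linkage distance between two sets of groups."""
    return sum(dist[frozenset((x, y))] for x in xs for y in ys) / (len(xs) * len(ys))

def cluster_groups(counts, group_sizes, min_size, n_clusters):
    groups = sorted(group_sizes)
    dist = symmetric_distance(counts, groups)
    large = [g for g in groups if group_sizes[g] >= min_size]
    small = [g for g in groups if group_sizes[g] < min_size]

    # Agglomerative clustering of the large groups only.
    clusters = [[g] for g in large]
    while len(clusters) > n_clusters:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: avg_distance(dist, clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j

    # Freeze the large groups, then attach every small group to its nearest cluster.
    frozen = [list(c) for c in clusters]
    for g in small:
        best = min(range(len(frozen)), key=lambda c: avg_distance(dist, [g], frozen[c]))
        clusters[best].append(g)
    return [sorted(c) for c in clusters]

# Toy data: (source group, target group) -> number of cross recommendations.
group_sizes = {"drills": 900, "drill_bits": 800, "paint": 1000,
               "brushes": 700, "sandpaper": 40}
counts = {("drills", "drill_bits"): 120, ("drill_bits", "drills"): 80,
          ("paint", "brushes"): 150, ("brushes", "paint"): 90,
          ("sandpaper", "paint"): 30, ("drills", "paint"): 2}

result = cluster_groups(counts, group_sizes, min_size=500, n_clusters=2)
# result == [["brushes", "paint", "sandpaper"], ["drill_bits", "drills"]]
```

The small `sandpaper` group does not influence how the large groups merge; it is only attached afterwards, which is the point of freezing the large groups first.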
To tailor recommendations to different scenarios, we implemented a customizable set of real-time and non-real-time business rules. Non-real-time business rules, such as restricting recommendations to products that are currently available, are enforced in the training pipeline. Real-time business rules are applied on a per-request basis, so one set of rules can be applied in one zone and another set of rules in another zone. An example of a real-time rule is keeping only a single product from any given product collection in the list of recommendations.
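As an illustration of a real-time rule, the sketch below keeps only the highest-ranked product from each collection and applies a different rule set per zone. The zone names, the `ZONE_RULES` registry, the product identifiers, and the choice to leave products without a collection untouched are all assumptions made for this example.

```python
from functools import partial

def one_per_collection(ranked, collection_of):
    """Keep the first (highest-ranked) product of each collection."""
    seen = set()
    kept = []
    for product_id in ranked:
        collection = collection_of.get(product_id)
        if collection is None:
            kept.append(product_id)  # assumption: uncollected products always pass
        elif collection not in seen:
            seen.add(collection)
            kept.append(product_id)
    return kept

# Hypothetical product -> collection mapping.
COLLECTION_OF = {"sofa_red": "sofas_2024", "sofa_blue": "sofas_2024",
                 "sofa_green": "sofas_2024", "lamp": "lighting"}

# Hypothetical per-zone rule sets; each rule maps a ranked list to a ranked list.
ZONE_RULES = {
    "pdp_zone": [partial(one_per_collection, collection_of=COLLECTION_OF)],
    "home_zone": [],
}

def apply_rules(ranked, zone, k):
    """Apply the rules of the requested zone in order, then truncate to k items."""
    for rule in ZONE_RULES.get(zone, []):
        ranked = rule(ranked)
    return ranked[:k]

model_output = ["sofa_red", "sofa_blue", "lamp", "sofa_green", "rug"]
pdp_recs = apply_rules(model_output, "pdp_zone", 3)
home_recs = apply_rules(model_output, "home_zone", 3)
```

Because rules run after inference and before truncation, the model should return more candidates than the zone displays so that filtered slots can be backfilled.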
The developed STAMP recommender system was A/B tested on the product details page against the current production recommender system and showed the following lift in metrics:
Later, STAMP was tested against Google Recommendations AI and demonstrated better or comparable production metrics in different scenarios. STAMP showed a good balance between optimizing the click-through rate and the conversion rate, which resulted in increased revenue per customer.
The project was developed in three successive steps:
Production A/B tests may take several weeks to complete, and it is very costly to test every model modification this way, so a library for offline testing of personalized recommenders was developed:
The developed serving application is fast: the peak load served in production by the auto-scaled application was more than 1000 requests per second. Many parameters of the model inference, such as the number of user events to use and the business rules to apply, are passed as part of the request to the serving application. As a result, we can train once and then deploy the recommender system to various zones by issuing requests with different parameters.
Now, the STAMP recommender system serves 100% of traffic in one of the zones on the product details page. We are working on further enhancements to the model to address the product cold-start problem and to optimize it for specific business metrics.
Interested in learning more about session-based recommendations? Get in touch with us to start a discussion.