With an online catalog of over 1M SKUs and a multi-billion dollar e-commerce store, things were going well for the online retailer. However, their search engine’s relevancy was not ideal, hurting their shoppers’ ability to discover the right products and lowering conversion rates.
The core problem was imperfect product attribution of catalog data. The attributes came from the e-commerce system in the form of text fields that contained a sizable number of errors and omissions. This time, the product management team decided to try something new: use product images as a data source and extract attributes from visual product features, such as color, style, height, shape, and more.
At the same time, a different idea took shape: perhaps it was also possible to use visual similarity between product images to offer customers an entirely new and exciting way to discover products similar to those they “like”. Upon further examination, it turned out that the underlying technology was nearly identical in both cases. The product management team brought in Grid Dynamics, as they wanted our ML experience to help them build an entirely new type of search engine based on visual product similarity.
In this blog post, we’ll share how such an engine can be designed and trained to address visual similarity search and automated image-based catalog attribution using computer vision and machine learning.
Not all shoppers know exactly what they want when they shop. Many of them have a rough idea of what they are looking for, but are also interested in comparing items. In many domains, such as apparel or home decor, product look is a major factor in making a purchasing decision. Therefore, an accurate way of determining visual similarity is of great value to online retailers.
For product discovery, the main feature involves the ability to search for a similar-looking item in a catalog given a sample or an “anchor” item:
Here, you can see how the model can pick up the main characteristics of the sample dress and find similar-looking short striped dresses from the catalog. This feature can also be termed “visual recommendation” or “show me more like this”.
Once you have determined the visual similarity of your products, you can also unlock several other powerful features, such as “sort by similar”. This allows the customer to sort the category or the search result not only by common sorting criteria such as price or rating, but also by similarity to a particular “anchor” product, which can be explicitly selected by the customer with a “like” button on the UI:
As seen above, the category can then be re-sorted by visual similarity to the anchor product, while still respecting the applied keywords and filters. In a more sophisticated case, the “anchor” product can be implicitly derived from customer behavior, such as shopping cart content, session view history, add-to-bag events, or order history.
Another interesting feature is that similarity can be sorted against multiple anchor products at once, essentially finding matches that have features of both items:
You can play with a sandbox demo that features a catalog of dresses, toys, laptops, and TVs.
Now that we’re clear on the main use cases, let's take a look under the hood and see how it all works.
As much as we like to think that our computers can “think” and “understand” us, in reality, that's not how they work. They are just exceptionally good at remembering things and crunching numbers very, very quickly. For a computer, a picture of a dress is just a set of neatly arranged bytes. Needless to say, searching for other images with similar bytes is very unlikely to yield similar-looking images.
Instead, we need to somehow convert those images into a different numerical representation that captures the essence of the image’s content and style. This conversion, known as vectorization, should work in such a way that similar-looking images are mapped to similar vectors. Once images are converted into vectors, a wealth of algorithms can be applied to search for similar vectors, and thus for similar-looking images.
This approach to similarity search is known as a vector space model and is traditionally applied to text-based similarity. However, nothing prevents us from applying the same principles to other objects, as long as we can find a proper vectorization model.
The vector space model approach is simple and powerful, but applied naively to millions of products in a catalog it can take a long time to find the items most similar to an anchor image. The main reason for this is the Curse of Dimensionality: all nearest neighbor search algorithms eventually need to calculate distances between vectors, and each distance calculation becomes slower as vector dimensionality grows. Since any meaningful representation of an image requires hundreds or even thousands of dimensions, the vector space model approach needs to be combined with additional techniques to perform well in most practical use cases.
One common technique to speed up nearest neighbor search is dimensionality reduction. Algorithms like Principal Component Analysis can reduce the dimensionality of the data while minimizing the loss of information from dropped dimensions. This makes it possible to reduce vector sizes from thousands to hundreds without a noticeable loss in search quality.
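As a concrete illustration, here is a minimal sketch of this step using scikit-learn’s PCA; the `image_vectors` array and its dimensions are stand-ins rather than real catalog data:

```python
# Minimal PCA sketch: shrink high-dimensional image embeddings while
# keeping most of the variance. The data below is random stand-in data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
image_vectors = rng.normal(size=(5_000, 2048)).astype("float32")  # placeholder embeddings

# Keep enough components to explain ~95% of the variance, typically
# reducing vectors from thousands to hundreds of dimensions.
pca = PCA(n_components=0.95)
reduced_vectors = pca.fit_transform(image_vectors)

print(image_vectors.shape, "->", reduced_vectors.shape)
```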
Another way to speed up the process is to use a traditional search engine to pre-select candidates. Using an inverted index, a search engine can very quickly find good candidate matches for similar products: if the anchor image is a dress, only other dresses need to be considered. This spares the visual similarity search from being performed on each of the millions of products in the catalog; instead, it only runs on the candidates returned by the search engine. This approach of combining a high-quality, but slower, similarity ranking with a lower-quality, but faster, candidate retrieval is called search results re-ranking, sketched below.
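The sketch below assumes the search engine has already returned a list of candidate product ids and that embeddings are available in a dictionary; both names are illustrative, not part of a real system:

```python
# Simplified re-ranking sketch: re-order search engine candidates
# by visual distance to an anchor product's embedding.
import numpy as np

def rerank_by_visual_similarity(anchor_id, candidate_ids, vectors, top_k=24):
    """vectors: dict mapping product id -> embedding (NumPy array)."""
    anchor = vectors[anchor_id]
    # Euclidean distance between the anchor and each candidate embedding.
    distances = {
        pid: float(np.linalg.norm(vectors[pid] - anchor))
        for pid in candidate_ids
        if pid != anchor_id
    }
    # Smaller distance = more visually similar; keep the top K candidates.
    return sorted(distances, key=distances.get)[:top_k]
```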
At a high level, the solution architecture is as follows:
The most important part of the visual similarity feature is the vectorization model. Since their breakthrough in the 2012 ImageNet competition, Convolutional Neural Networks (CNNs) have taken the visual recognition industry by storm, continually pushing the envelope on image classification and object detection tasks. There are many different CNN architectures available, created by a host of different research groups. A description of CNN internals is beyond the scope of this post, but there are many excellent explanations available. This 3D demo gives a good introduction to the internals of a CNN.
The paper arXiv:1605.07678 provides a nice comparison of popular CNN architectures.
For our visual similarity project, we wanted to use Convolutional Neural Networks to extract latent features of images, i.e. features that are very important for capturing the meaning of an image and can be used to compare images, yet are impossible to map to well-known “human” categories like length, color, or shape.
CNNs internally do exactly that - in the training process they “learn” which features are important for the task at hand, encoding this information into trained network weights. All we need to do is to forward propagate an image through the trained network and take neuron activations on some hidden layer as a vector encoding the image. The exact choice of the hidden layer or hidden layers to extract features from depends on the application.
The deeper we go in the hierarchy of layers from the input image, the more information about the content of the image we obtain, i.e. our vectors encode shapes and their arrangement. Shallower layers can be used to obtain an encoding of the style of the image, e.g. texture, palette, and other spatially distributed features.
The simplest way to get started with extracting image features is Transfer Learning, i.e. using pre-trained network weights that are publicly available. Those weights are typically obtained by training a CNN on an image classification task over a large public dataset such as ImageNet. If your images are very different from the images available in ImageNet, or you are the lucky owner of a large and well-attributed image catalog, you can re-train the model from scratch to obtain network weights finely tuned for your kind of images. Still, it is always a good idea to start with pre-trained weights to have a solid performance baseline to compare to.
In order to make an informed choice of the vector embedding approach, we used a simple search quality assessment method based on gathering a “golden set” of anchor images along with human assessments of how similar other catalog images are to each anchor. Once such a dataset is available, it is easy to score every vectorization implementation using a standard search quality scoring technique like Discounted Cumulative Gain.
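For illustration, here is a minimal sketch of DCG-based scoring; `relevance` holds human similarity judgments for one anchor’s ranked results, and the 0-3 grading scale is an assumption for the example:

```python
# Score one vectorization approach against the "golden set" judgments
# using (normalized) Discounted Cumulative Gain.
import math

def dcg(relevance):
    # Gains are discounted logarithmically by rank position (rank starts at 0).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevance))

def ndcg(relevance):
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal > 0 else 0.0

# Example: judged relevance (0-3) of the top-5 results one embedding returned.
print(ndcg([3, 2, 3, 0, 1]))  # closer to 1.0 means better ranking quality
```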
For the purposes of apparel search, the shape and arrangement of objects in the image are more important, so we chose the last fully connected “bottleneck” layer as the source of image vectors. For other applications, such as fine art search, more style-biased embeddings are generally used.
We evaluated a number of possible embeddings and found that Inception V3 performed the best for our case. Even though Inception V4 is a more recent model, it didn’t perform as well as V3. One plausible explanation is that V3 produces larger embeddings, and is able to capture more latent features.
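As an illustration of the transfer learning approach, here is a minimal sketch of extracting bottleneck embeddings with a pre-trained Inception V3 in Keras; the image file name is a placeholder:

```python
# Extract a 2048-dimensional embedding from the final pooled ("bottleneck")
# layer of Inception V3 pre-trained on ImageNet.
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image

# include_top=False drops the classification head; pooling='avg' yields one
# 2048-dimensional vector per image instead of a spatial feature map.
model = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def vectorize(path):
    img = image.load_img(path, target_size=(299, 299))   # Inception V3 input size
    batch = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return model.predict(batch)[0]                        # shape: (2048,)

vec = vectorize("dress_12345.jpg")  # placeholder file name
```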
Once we optimized the vectorization process and obtained vectors for every image in the catalog, the next step was to select a similarity search algorithm. In other words, given a vector, we should be able to find the K “closest” vectors in a multidimensional vector space. To define exactly what “closest vectors” means, we have to decide on a formula to calculate the distance between vectors. There are many different ways to define vector distance, called metrics. One of the most well-known is Euclidean distance; other popular choices are the cosine and Manhattan metrics, which measure the angle between vectors and the per-coordinate distance, respectively. We used Euclidean distance as a baseline.
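Below is a sketch of this brute-force baseline with the three metrics, using SciPy on randomly generated stand-in vectors:

```python
# Brute-force nearest neighbor search under different distance metrics.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
catalog_vectors = rng.normal(size=(100_000, 256)).astype("float32")  # stand-in catalog
anchor = catalog_vectors[0]

for metric in ("euclidean", "cosine", "cityblock"):   # cityblock == Manhattan
    distances = cdist(anchor[None, :], catalog_vectors, metric=metric)[0]
    top_k = np.argsort(distances)[1:11]                # skip the anchor itself
    print(metric, top_k[:5])
```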
Once the vector distance is defined, we can use a brute force algorithm to calculate the distance between the anchor vector and every other vector in the catalog, obtaining the top K closest neighbors in time that is roughly linear in the size of the catalog. However, the cost of each individual distance calculation grows with the dimensionality of the data, which makes brute force search expensive for large, high-dimensional catalogs.
There are a number of algorithms (kd-tree and ball-tree) that try to address this problem by cleverly partitioning the multi-dimensional space and representing it as a tree structure at indexing time, thus avoiding the need to calculate distances to all of the vectors in the catalog at query time. The advantage of those algorithms is that they still produce exact results, i.e. functionally identical to the brute force approach. However, the efficiency of such data structures quickly diminishes as the number of dimensions in the image vectors grows, so for very large embeddings they become little more than fancy wrappers over brute force vector space search.
We ended up using NMSLIB, which makes a tradeoff by returning approximate nearest neighbors, i.e. it can occasionally miss some actual nearest neighbors, sacrificing recall. Still, the performance gain is tremendous, and it was an acceptable trade-off for our application. You can always evaluate the level of error introduced by the approximate nearest neighbor search by comparing its results with brute force results, keeping the algorithm’s recall under control.
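Here is a minimal sketch of building and querying an HNSW index with NMSLIB; the index parameters shown are illustrative defaults, not the exact settings used in the project:

```python
# Approximate nearest neighbor search with NMSLIB's HNSW index.
import numpy as np
import nmslib

rng = np.random.default_rng(0)
catalog_vectors = rng.normal(size=(100_000, 256)).astype("float32")  # stand-in catalog

index = nmslib.init(method="hnsw", space="l2")          # l2 == Euclidean distance
index.addDataPointBatch(catalog_vectors)
index.createIndex({"M": 16, "efConstruction": 200}, print_progress=True)

# ids and distances of the 10 approximate nearest neighbors of an anchor vector.
ids, distances = index.knnQuery(catalog_vectors[0], k=10)

# Recall can be estimated by comparing `ids` against brute-force results.
```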
As we can see, CNN-based vectorization empowers us to tackle visual similarity of products. But why stop there? We can also look back at the product attributes and textual product information and see if we can vectorize this data as well. Attribute data, being mostly categorical, can be vectorized using one-hot encoding, i.e. all possible values of an attribute are listed as vector dimensions and marked as 1 if a given product has that attribute value.
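A small sketch of this encoding with scikit-learn, using made-up attribute values for illustration:

```python
# One-hot encode categorical product attributes (category, color, length).
from sklearn.preprocessing import OneHotEncoder

products = [
    ["dress", "red",  "midi"],
    ["dress", "blue", "mini"],
    ["skirt", "red",  "maxi"],
]

# handle_unknown="ignore" maps unseen attribute values to an all-zero block
# instead of raising an error at query time.
encoder = OneHotEncoder(handle_unknown="ignore")
attribute_vectors = encoder.fit_transform(products).toarray()
print(attribute_vectors.shape)   # one dimension per known attribute value
```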
For textual fields, we can use semantic encodings available from word2vec or fastText models. Again, transfer learning can be used: word embeddings are obtained by training those networks on very large corpora, such as Wikipedia. Attribute vectors may play an important role in product comparisons and in situations where visual similarity does not mean much, such as electronics.
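As an illustration, here is a sketch of vectorizing a textual description by averaging pre-trained word embeddings; the particular gensim model name is just one available option, not necessarily what a production system would use:

```python
# Turn a textual product description into a vector by averaging
# pre-trained fastText word embeddings (trained on Wikipedia + news).
import numpy as np
import gensim.downloader as api

word_vectors = api.load("fasttext-wiki-news-subwords-300")  # downloads on first use

def vectorize_text(description):
    tokens = [t for t in description.lower().split() if t in word_vectors]
    if not tokens:
        return np.zeros(word_vectors.vector_size, dtype="float32")
    # The mean of the word embeddings is a simple but effective text embedding.
    return np.mean([word_vectors[t] for t in tokens], axis=0)

text_vec = vectorize_text("short striped cotton summer dress")
```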
The resulting vectors can be used separately or concatenated to obtain a compound vector embedding of the product based on all the data available about it. This allows the similarity search to tap into every facet of the product: latent features from product images, the textual description, and explicit product attributes.
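A sketch of assembling such a compound embedding, assuming the image, text, and attribute vectors from the snippets above; the per-modality normalization is an illustrative choice, not a prescribed recipe:

```python
# Concatenate image, text, and attribute vectors into one product embedding.
import numpy as np

def compound_embedding(image_vec, text_vec, attribute_vec):
    # L2-normalize each part so no single modality dominates the distance,
    # then concatenate into one vector for the similarity index.
    parts = [v / (np.linalg.norm(v) + 1e-9) for v in (image_vec, text_vec, attribute_vec)]
    return np.concatenate(parts)
```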
Product image and product data encoding powered by deep learning models unlocks many new customer-facing features, including both the “more like this” search and the “sort by similar” feature. Product vectors encoding visual features can boost product recommendations and be used to automatically improve catalog attribution. All this makes image recognition capability an indispensable part of the e-commerce enterprise ecosystem.
If you are one of the many e-commerce companies with a large product catalog and are engaged in a never-ending fight for better product attribution, or if you are looking to give your customers new and exciting ways to discover your products, give us a call - we would be happy to help!
Eugene Steinberg, Timofey Durakov