“It takes months to find a customer, but only seconds to lose one.”
Maintaining a high-quality customer service experience while minimizing costs is high on the list of any e-Commerce enterprise. An AI-based question-answering system can do just that. But how would one approach building such a system? Recent advancements in deep learning and natural language processing (NLP) hold much promise in this field.
Products may contain a collection of structured and unstructured attributes such as catalog descriptions, specifications, or customer reviews. This information can be used to find any query’s relevant information in the form of a text snippet. This task is known as Question Answering (or QA) within the domain of natural language processing.
Much research has already been done in this field. Some datasets have been developed, such as WikiQA, TriviaQA and others. This Papers With Code site contains a review of datasets. Models based on the state-of-the-art Transformer architecture like BERT, GPT-2, XLNet, or SpanBERT show impressive performance. The best results are achieved by ensembling these models with models of other architectures.
Question Answering requires large datasets for training. One of the most popular datasets for training is SQuAD (Stanford Question Answering Dataset), a dataset developed at Stanford University. SQuAD contains more than one hundred thousand questions and answers in the form of text snippets from articles derived from Wikipedia. All questions and answers in the dataset were selected and formed by humans. There are two versions of SQuAD. Version 1.1 derives answers for all the questions directly from the text snippets. Version 2.0 allows answers to be inferred, but not directly answered, from the text snippets.
Transfer learning is a machine learning technique initially developed in computer vision. The main idea of transfer learning is to reuse data learned on large-scal generic tasks for different, often more specific, tasks. Transfer learning allows us to successfully apply a deep learning approach to the domains and problems where dataset sizes are relatively limited. Both BERT (Bidirectional Encoder Representations from Transformers) and XLNet utilize transfer learning, with BERT producing more accurate results.
In deep learning, every consecutive layer of the neural network relies on the representation learned by the previous layer in the effort to achieve a training goal. The starting layers of the network learn somewhat abstract and generic data representations, while final layers are learning representations specifically for the task at hand and therefore not very useful for the different tasks. This technique, called fine-tuning, is a common practice in transfer learning to retrain the final, task-specific layers of the network for a new task while keep starting, generic layers intact.
A deep learning approach to natural language processing relies on a language model based on the principle, “you shall know a word by the company it keeps.” This principle states that, in human language, it is possible to predict a word by looking in its context, and while doing that it is possible to learn the semantics of words.
Language models (LM) are trained on a large-scale unannotated dataset, such as Wikipedia, to capture statistical relations between words, character and word n-grams, or even sentences. In NLP, fine-tuning is done by retraining the generic language model to a data of a particular domain of interest, thus modifying the learned distribution of the sequences of words. For example, the word “mouse” has different semantic, and therefore context, in Wikipedia and on a computer store site. Further fine-tuning of a language model occurs when training a new model in the context of a domain-specific task, such as QA.
Modern NLP architectures, such as BERT and XLNet, employ a variety of tricks to train the language model better. When predicting a word based on its context, it is essential to avoid self-interference. The words should not be able to “see themselves.”
In the case of BERT, some words in the input are masked, and other words are conditioned to predict them. A unique token or a random word is used as a mask. Since not only relations between words are important, but also the relations between sentences, BERT is also trained to predict if a sentence is the next sentence for another one. The paper “Pre-training of Deep Bidirectional Transformers for Language Understanding” and the article “The Illustrated BERT, ELMo, and co.” contain more detailed explanations of BERT. Open source BERT implementations and pre-trained models are available both for TensorFlow and PyTorch.
Instead of predicting masked words, XLNet maximizes the expected log likelihood of a sequence for possible word permutations so that the model “sees” both left and right contexts. While BERT suffers from the pre-train fine-tune discrepancy (masks are not present in fine-tuning data), XLNet does not. XLNet implementation and pre-trained models are available on GitHub.
In the case of QA, the dataset typically consists of the question, document text, and starting and ending indices within the document text related to the correct answer to the question. The trained QA model, therefore, predicts where to cut a continuous snippet from the document based on the question asked.
So, how to apply this theory to practice? Let’s dive into a concrete example.
In our example implementation, we use the DeepPavlov library, an open-source NLP library developed at the Moscow Institute of Physics and Technology that contains many pre-trained NLP models with a common API, including some Question Answering models. There are two models available for the SQuAD 1.1 task, BERT and R-Net. The DeepPavlov team reported BERT provided more accurate answers than R-Net.
We collected a dataset of users' reviews, questions and answers about laptops from the online catalog and labeled it according to SQuAD 2.0 standards. We started by testing some pre-trained models on it: one from the DeepPavlov library and one described in the Xu et al. 2019 paper. We proceeded by fine-tuning XLNet on SQuAD 2.0, as no fine-tuned for Question Answering XLNet weights are publicly available.
The Xu et al.’s model (called BERT for RRC, or Review Reading Comprehension) was trained first on SQuAD 1.1 and then on a dataset of users reviews for laptops.
All three models make reasonable predictions used as pre-moderated answers to customers' questions. For example, to the question “How long is the battery life?” a model outputs “Battery life is strong.” To the question “What version of Windows comes with this laptop?” a model outputs, “it comes with Windows 10 Home” (words in bold are model predictions, and the rest is their context. Including context into the answer makes it easier to understand).
The table shows more examples of outputs from the three models:
Question | DeepPavlov’s BERT answer | BERT-for-RRC answer | XLNet answer | User answer |
---|---|---|---|---|
Is this model backlit? | It doesn't have a backlit keyboard | It doesn't have a backlit keyboard | It doesn't have a backlit keyboard | No |
Can I game on this? | Kudos to Dell on this one.* | It is not recommended for gaming or editing of any kind. | It is not recommended for gaming or editing of any kind. | You can't 'game' much at all... this is slower than a pixel phone. |
Does this model have a fingerprint sensor? | There is a problem that it doesn't have a separate numeric pad | laptops / asus - vivobook - s 15-S510UN / features / ) , it does NOT include a fingerprint sensor. | laptops/ASUS-VivoBook-S15-S510UN/Features/), it does NOT include a fingerprint sensor. | None that I have found. |
* DeepPavlov’s model gave a wrong answer because Wikipedia contains an article about a game called Kudos.
BERT that was post-trained on a dataset for a particular task (not only on SQuAD) provided more accurate answers. XLNet outperforms BERT even without fine-tuning on a domain-specific dataset. The problem of underrepresentation of some domain-specific knowledge in SQuAD can be solved by post-training on a small domain-specific dataset (like in the case of BERT-for-RRC) or by adding much more data while training the language model (like in the case of XLNet). XLNet handled long sequences better and produced longer answers than BERT. Document texts in our dataset are typically longer than SQuAD’s because we merged multiple reviews of the same product into one. This merging, however, did not affect performance.
Neural network architectures based on Transformer have been reported to outperform humans on the SQuAD 2.0 dataset. Real-life datasets can be more challenging than datasets developed in laboratories for NLP competitions as customers have unlimited imagination on what they can ask. A well-designed question answering system can offer significant assistance to customer services augmenting support personnel. As these systems continue to improve, they may soon be capable of performing as well as, or even better, then their human counterparts.