Your product is good. Your mobile app has great features, but your app store rating is low. What happened?
In a previous blog post we demonstrated that high crash rates can cause low app ratings. In this post, we explain how we analyzed negative reviews to determine what motivated customers to write them.
We collected the most relevant negative app reviews (2 stars or fewer) from six major retailers in the US on one of the major mobile application platforms. Then, we applied natural language processing to analyze the reviews and distill the specific topics behind the complaints. We carried out this topic modeling phase with Latent Dirichlet Allocation (LDA) to better understand how the negative reviews are distributed across the topics that users most often complain about.
So let's dive into the details on how we did our negative review analysis with NLP.
The first step in topic modeling is to convert sentences to individual words and phrases. These words and phrases will then be fed to the model to generate common topics. Typically, sentence breakdown consists of two parts: lemmatization, and removal of stop words and punctuation.
One key feature of human language (the English language in particular) is that one word can take a number of inflected forms. For instance, “I ran five miles yesterday”, “I will run five miles tomorrow”, “She is running”, and “He runs” all convey the same action. English changes the form of the word “run” to indicate when the event takes place and who performs the action. To a computer, all variations of “run” should point to the same action, i.e. “run”. In this case, “run” is the lemma of “runs”, “ran”, “running”, etc. The process of converting all inflected forms to their lemma is called lemmatization. Fortunately, the Python NLTK library provides WordNetLemmatizer, which uses its corpus database to look up lemmas for words.
Having lemmatized the reviews, we want to remove stop words: words that occur with high frequency but contribute little meaning to the sentence. Stop words include, but are not limited to, “and”, “but”, “as”, “whom”, and “at”. Similarly, we generally omit punctuation before model fitting. What is left after removing stop words and punctuation is usually a set of verbs and adjectives, which carries more meaning per word than the original text with all its stop words and punctuation.
Below is a code snippet that demonstrates lemmatization and removal of stop words and punctuation:
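A minimal sketch of these two steps follows. To stay self-contained it uses only the standard library: the small `STOP_WORDS` set and `LEMMAS` table are hand-written stand-ins (assumptions for illustration), where a real pipeline would rely on NLTK's stop word corpus and WordNetLemmatizer. The function name `preprocess` is likewise hypothetical.

```python
import string

# Toy stand-ins (assumptions): a real pipeline would use NLTK's stop word
# corpus and WordNetLemmatizer instead of these hand-written tables.
STOP_WORDS = {"i", "she", "he", "is", "will", "and", "but", "as", "whom", "at", "the", "a"}
LEMMAS = {"ran": "run", "runs": "run", "running": "run", "miles": "mile"}


def preprocess(review):
    """Lowercase, strip punctuation, lemmatize, then drop stop words."""
    # Remove ASCII punctuation characters before splitting into tokens.
    cleaned = review.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = cleaned.split()
    # Map each inflected form to its lemma; unknown words pass through unchanged.
    lemmas = [LEMMAS.get(tok, tok) for tok in tokens]
    # Keep only the tokens that are not stop words.
    return [tok for tok in lemmas if tok not in STOP_WORDS]


print(preprocess("I ran five miles yesterday, and she is running."))
# ['run', 'five', 'mile', 'yesterday', 'run']
```

Lemmatizing before filtering mirrors the order described above; both "ran" and "running" end up as the single token "run".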
Now, we are ready to conduct topic modeling with Latent Dirichlet Allocation. This can be done with the help of the Gensim library available in Python. There are several crucial steps in LDA as follows:
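The exact steps depend on the pipeline, but Gensim-based workflows typically begin by mapping every distinct token to an integer id and encoding each review as a bag of (id, count) pairs, which is the input LDA is fitted on. The sketch below illustrates that encoding in plain Python rather than with Gensim itself; the helper names `build_vocabulary` and `to_bow` and the sample tokens are assumptions made for illustration.

```python
from collections import Counter


def build_vocabulary(docs):
    """Assign an integer id to every distinct token, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            # setdefault only inserts when the token is not yet known.
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Encode one tokenized document as sorted (token_id, count) pairs."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())


# Hypothetical, already-preprocessed reviews.
docs = [
    ["card", "balance", "payment", "card"],
    ["app", "crash", "checkout"],
]
vocab = build_vocabulary(docs)
corpus = [to_bow(doc, vocab) for doc in docs]
print(corpus)
# [[(0, 2), (1, 1), (2, 1)], [(3, 1), (4, 1), (5, 1)]]
```

In Gensim, the `Dictionary` class and its `doc2bow` method play the roles of these two helpers.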
"inputed[sic] first Macy's card in and no issues to pay or check balance. once card was upgraded it would not show any balance, or able to mqke[sic] payment."
LDA assigns this review to Dominant_Topic 0.0 (the aforementioned credit card issues) with a probability of 78%. Manually double-checking assignments like this one gives a rough idea of how well the algorithm performs.
According to our topic modeling results, 64.4% of all negative reviews fall into three broad categories:
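Picking the dominant topic amounts to taking the topic with the largest weight in a review's topic distribution. The sketch below assumes the distribution is a list of (topic_id, weight) pairs, which is the shape Gensim's `get_document_topics` returns; the helper name `dominant_topic` is hypothetical, and only the 78% weight for topic 0 comes from the example above. The other weights are made up.

```python
def dominant_topic(topic_weights):
    """Return the (topic_id, weight) pair with the largest weight."""
    return max(topic_weights, key=lambda pair: pair[1])


# Hypothetical topic distribution for the review quoted above.
review_topics = [(0, 0.78), (1, 0.12), (2, 0.10)]

topic_id, weight = dominant_topic(review_topics)
print(f"Dominant topic {topic_id} with weight {weight:.0%}")
# Dominant topic 0 with weight 78%
```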
Here is the complete breakdown of the issues:
Crashes and slow response times accounted for 29.3% of all negative reviews. Obviously, crash rates should be as low as technically possible. Psychologically, a negative review carries more weight than a positive one, so the industry-accepted crash rate of 1% may still be too high for large enterprises. We generally work with customers to keep the crash rate at 0.2% or lower, which prevents crashes almost entirely and protects app ratings.
The checkout process generated 18.4% of the negative reviews. This includes long and complicated checkout processes, a bug-ridden payment experience, etc. Users become extremely frustrated when they have selected the items they want only to find that they cannot smoothly complete the transaction. Features like a smooth, one-page checkout not only close the proverbial circle for potential customers, but also reflect the brand image and the company’s attention to detail.
Shopping cart issues accounted for 16.7% of negative reviews. Dissatisfied users report the disappearance of shopping cart items, issues with non-inventory items, etc. According to the Baymard Institute, around 70% of virtual carts are abandoned before purchase because of unexpected taxes and fees, a requirement to create an account, or an overly complicated checkout process.
Negative feedback occurs on all sites. This investigation was conducted using data from several large, national retailers. However, the results are likely typical of most mobile retailers. Retailers should conduct a detailed analysis of their own negative reviews, using the techniques described in this post, to find the causes. This “free QA” is particularly valuable because it is unsolicited and the writers expect nothing in return. Such reviews tend to represent the “naked truth”. Fixing the issues discovered in this analysis addresses users' concerns head-on, ultimately leading to higher app ratings, a salvaged brand reputation and increased revenue.
Topic modeling with Latent Dirichlet Allocation is a powerful tool for uncovering hidden commonalities in a vast swath of information. It helps identify hidden topics and the relation between a sentence and the topic to which it is most closely related. In this study, we have condensed over 4,000 negative reviews into eight categories. It is important to note, however, that LDA does not take into account the correlation between topics. For instance, one could argue that discounts and promotions should be grouped with the payment process. After all, discounts are applied at the time of payment, so problems with discounts and promotions are highly correlated with problems with payment. However, there is no readily available way for LDA to recognize the correlation between these topics.
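As a quick sanity check, the three shares quoted above can be summed to confirm the 64.4% headline figure. The dictionary keys are shorthand labels chosen for this sketch; the percentages are the ones reported in this post.

```python
# Share of negative reviews per category, in percent (figures from this post).
shares = {
    "crashes and slow response": 29.3,
    "checkout process": 18.4,
    "shopping cart": 16.7,
}

# Round to one decimal place to absorb floating-point noise.
top_three = round(sum(shares.values()), 1)
remaining = round(100 - top_three, 1)

print(top_three)  # 64.4
print(remaining)  # 35.6
```

The remaining 35.6% is spread across the other categories found by the model.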
Additionally, the bag-of-words model on which LDA relies is concerned mainly with unique words and their occurrence counts; it cannot capture higher-level concepts such as semantic structure. Lastly, LDA is an unsupervised algorithm, which may not be the best option for training and testing on a tagged dataset.
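The bag-of-words limitation is easy to see with a small example: two sentences that use the same words in a different order, and therefore mean different things, collapse into identical word counts. The helper `bag_of_words` and both sentences are made up for this illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, discarding word order entirely."""
    return Counter(text.lower().split())


a = bag_of_words("the app crashed but checkout worked")
b = bag_of_words("the checkout crashed but app worked")

# Different complaints, yet indistinguishable as bags of words.
print(a == b)  # True
```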
We hope to revisit mobile application reviews using other natural language processing algorithms in the future to better parse and understand users’ intent with each review. In any case, we hope our current study proves illuminating for you and your business so that you can reach more users with the help of mobile applications and boost your mobile revenue in the years to come.
For assistance conducting analysis on your site, please contact Grid Dynamics.