How to use suffix arrays to combat common limitations of full-text search

Jun 27, 2017 • 8 min read

Open source full-text search engines provide rich functionality out of the box. However, there are use cases where a naive implementation may lead to a terrible customer experience. We faced one such use case while working with a patent management company.

Part of a patent officer's job is to sift through an international patent database to make sure that new patent filings do not conflict with any existing patents. To do that, they need to find patents by small marker words that are often parts of bigger compound words or phrases. Most full-text search engines don't handle this problem well due to how they index documents, leading to poor performance and inefficient use of patent officers' time. In this article we will discuss the problem and present a solution in the context of Lucene-based engines.

The problem with full-text compound data and search

What is it about long, compound, full-text data that search engines find hard to cope with? Well, most search engines, including those based on Apache Lucene, operate in a manner analogous to the back-of-the-book index: terms are listed alphabetically. As such, they can only search efficiently by prefix. This causes major problems when processing phrases that are long and contain compound data, as in the case of patent files and filings.

The standard way of running a search on this type of query uses a method called the leading asterisk. This requires introducing what is called a wildcard symbol, the asterisk, to replace any unknown parts of the phrase. Unfortunately, this method doesn't work well in Lucene because of how data is indexed: it results in sluggish processing times of at least several seconds, or longer, depending on the index size and the number of entries.

People rarely use leading asterisks when searching, and waiting several seconds is usually acceptable, so this issue may not seem hugely problematic. But when a new patent filing needs to be checked against thousands of existing entries, all of which have long, compound, technical names, the delay becomes costly. Searching through thousands of complex records this way leads to major delays, mistakes, and inefficiency in the day-to-day work of patent officers.

Here are some examples to give you an idea of what types of elements can be part of such a query:

  • *CH2* (part of a chemical formula),
  • *kondensator* (the German word for "capacitor", which is often part of a longer compound word),
  • *ACCG* (part of a DNA sequence),
  • and anything else that constitutes the middle section of a phrase.

Notice that we’ve used asterisks to indicate the wildcard on each side of the search phrase. This means that the part of the phrase we’re using as a keyword is located somewhere in the middle of the indexed entry.
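
For illustration, here is roughly what such a double-wildcard query looks like when built directly against the Lucene API. This is a minimal sketch: the field name fulltext is just a placeholder, not a field from the actual project.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

public class DoubleWildcardExample {
    public static void main(String[] args) {
        // Leading and trailing "*" match any characters around the keyword.
        // Lucene rewrites this query against its term dictionary (an FST),
        // which is exactly the slow path discussed in this article.
        Query query = new WildcardQuery(new Term("fulltext", "*kondensator*"));
        System.out.println(query); // fulltext:*kondensator*
    }
}
```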

There are a few workarounds to the leading asterisk method. For example:

  • Splitting compound words into several words. 
    • For example, we can split and store the compound German word “Fallschirmspringerschule” (which means "parachute jumping school") into three separate words:
      • “Fallschirm” (parachute)
      • “springer” (jumper)
      • “schule” (school).
    • Theoretically, if you’re searching for an exact string that occurs in the middle of another string (in this case: "springer"), indexing separate variants of the string allows you to find your entry without using the wildcard. However, this approach assumes that you know the exact indexed string you’re searching for, and that it has indeed been indexed.
  • Adding an index for the reversed form of words.
    • This fakes the functionality of searching the index data back to front, turning a slow leading wildcard into a fast trailing one (see the sketch after this list).
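
Here is a minimal sketch of the reversed-form workaround using Lucene's ReverseStringFilter. The analyzer below is illustrative only; Solr users can get a similar effect out of the box from ReversedWildcardFilterFactory.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.reverse.ReverseStringFilter;

// Indexes every token in reversed form ("kondensator" -> "rotasnednok"),
// so a leading-wildcard query *kondensator can be rewritten as the fast
// trailing-wildcard (prefix) query rotasnednok* against this field.
public class ReversedTokenAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        TokenStream reversed = new ReverseStringFilter(source);
        return new TokenStreamComponents(source, reversed);
    }
}
```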

In some cases, this functionality is simply left as it is because of time limitations during development. This means that no solution is implemented, despite awareness of the issue and the lack of well-functioning alternatives. As a result, users are forced to perform a wildcard query based on the Finite State Transducer (FST), and face the resulting performance issues.

None of the above solutions truly solves the problem, especially the last one, which ignores it altogether. But, as we know, fast processing of complex phrases is a must for certain queries and business cases, such as the patent office example described above.

To cope with this problem, Grid Dynamics has successfully implemented a suffix array data structure for one of our clients. We set out to implement a universal solution that fixes the leading asterisk search performance problem out of the box. Let’s get into how it works.

How will a suffix array help?

We need to begin by explaining what a suffix array is. In general, a suffix array for a set of strings is a sorted list of all the suffixes of the strings in that set. A suffix here is any substring that starts at some position in a string and runs to its end, including the whole string itself.

As an example, let’s build a suffix array for one string: "apple".

We will begin by appending a null symbol to the end of the string. For this example, we're going to use the “$” symbol to signify null. We do this so that the search understands where a given suffix in the array ends. Therefore, we can define the following suffixes:

  • apple$
  • pple$
  • ple$
  • le$
  • e$
  • $

Now, we need to assign a number to each suffix.

0 apple$
1 pple$
2 ple$
3 le$
4 e$
5 $

Let’s sort these suffixes in alphabetical order. Remember, Lucene-based search works like an index in the back of a book.

5 $
0 apple$
4 e$
3 le$
2 ple$
1 pple$

Lastly, to optimize storage, we will save only the order of the suffixes, that is, their starting positions, rather than the suffix strings themselves.

In the table below, the “Position in array” row reflects each suffix’s position in alphabetical order (as seen in the sorted list above). The “Number of suffix” row refers to our original numbering, seen in the first list. We end up with this result:

| Position in array | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| Number of suffix  | 5 | 0 | 4 | 3 | 2 | 1 |
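
To make the construction concrete, here is a tiny, naive sketch that reproduces the order above. Sorting the actual substrings is quadratic in the string length; a production implementation would use a dedicated suffix array construction algorithm.

```java
import java.util.Comparator;
import java.util.stream.IntStream;

public class AppleSuffixArray {
    public static void main(String[] args) {
        String s = "apple$";
        // Sort the suffix start positions by comparing the suffixes themselves.
        // "$" (ASCII 36) sorts before all letters, just like in the lists above.
        int[] order = IntStream.range(0, s.length())
                .boxed()
                .sorted(Comparator.comparing(s::substring))
                .mapToInt(Integer::intValue)
                .toArray();
        for (int start : order) {
            System.out.println(start + " " + s.substring(start));
        }
        // Prints the suffixes in sorted order with their original numbers:
        // 5 $, 0 apple$, 4 e$, 3 le$, 2 ple$, 1 pple$
    }
}
```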


Implementation approach

To implement this solution, we're going to follow the methodology described above for each term in the index. This means we need to index all unique terms in all documents. It entails concatenating all the terms into one big string, separated by the null symbol (once again, for clarity, we're going to use the "$" sign). The result is one big suffix array that covers every term in the index.
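
A minimal sketch of that concatenation step, using the example terms from the next paragraph:

```java
import java.util.List;

public class ConcatTerms {
    public static void main(String[] args) {
        List<String> terms = List.of("port", "sort", "test", "rest");
        // Every term boundary is marked by "$", so a suffix array hit can
        // later be mapped back to the term whose region it starts in.
        String text = String.join("$", terms) + "$";
        System.out.println(text); // port$sort$test$rest$
    }
}
```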

The question remains: how will a suffix array help us find all the words containing a given string in the middle of the word? Let’s look at the following example to gain a better understanding.

Let's suppose we have a few terms: port, sort, test, rest. This would be the concatenation of all the strings: port$sort$test$rest$. And the suffix array would look like this:

$
$rest$
$sort$test$rest$
$test$rest$
est$
est$rest$
ort$sort$test$rest$
ort$test$rest$
port$sort$test$rest$
rest$
rt$sort$test$rest$
rt$test$rest$
sort$test$rest$
st$
st$rest$
t$
t$rest$
t$sort$test$rest$
t$test$rest$
test$rest$

If we query a partial term with a leading asterisk, we can find all the suffixes that start with that partial term. Since all the suffixes are sorted, this is easily accomplished with a binary search of logarithmic complexity. Our search then returns the complete terms from the index that match the given wildcard, not just their suffixes.

Let’s see how this works in practice. Given the data above, we will query the partial term "or" surrounded by asterisks. Our query will look like this: *or*.

In order to do this, we perform a binary search in this array to find the lower bound:

  • in this case the suffix ort$sort$test$rest$, which comes from port and starts at position 1 of the concatenated string,

and the upper bound:

  • in this case the suffix ort$test$rest$, which comes from sort and starts at position 6.

The suffix array makes this search blazing fast, because the search now looks up positions of indexed terms in a sorted array rather than scanning the index for the phrase itself.
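
Putting the walkthrough together, here is a self-contained sketch of the whole lookup. The construction and binary search are naive, and the helper names are our own, not part of the actual patch:

```java
import java.util.Comparator;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.IntStream;

public class InfixSearchDemo {
    public static void main(String[] args) {
        String text = "port$sort$test$rest$";
        // Naive suffix array: start positions sorted by their suffixes.
        int[] sa = IntStream.range(0, text.length())
                .boxed()
                .sorted(Comparator.comparing(text::substring))
                .mapToInt(Integer::intValue)
                .toArray();

        // *or*: find the range of suffixes that start with "or".
        int lo = lowerBound(text, sa, "or");
        int hi = lowerBound(text, sa, "or" + '\uffff'); // first suffix past the range

        // Map every matching suffix back to the full term that contains it.
        Set<String> terms = new TreeSet<>();
        for (int i = lo; i < hi; i++) {
            int start = sa[i];
            int termStart = text.lastIndexOf('$', start - 1) + 1; // 0 for the first term
            int termEnd = text.indexOf('$', start);
            terms.add(text.substring(termStart, termEnd));
        }
        System.out.println(terms); // [port, sort]
    }

    // First index in sa whose suffix compares >= q: a classic binary search
    // doing O(log n) comparisons.
    static int lowerBound(String text, int[] sa, String q) {
        int lo = 0, hi = sa.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (text.substring(sa[mid]).compareTo(q) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }
}
```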

It’s important to note that this solution requires some additional resources. Generally, a symbol in the Lucene index is represented by one byte. The suffix array approach keeps that byte in the concatenated string of all words, and additionally stores an int (four bytes) for the starting position of each suffix. So the suffix array takes at least five bytes per letter of every unique term in the index: roughly, 100 MB of unique term text would grow to about 500 MB. The overall overhead of constructing the suffix array is difficult to calculate precisely, as there are a lot of factors to consider.

One way to get around this issue is a compressed suffix array, which reduces the size of the structure and eases the RAM limitations, at the cost of some lookup speed. Additionally, constructing a suffix array takes a lot of time, so we do it on a dedicated thread pool.
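
As a sketch of that idea (the wiring below is hypothetical; real code would hook into the index lifecycle and fall back to the FST wildcard path until construction finishes):

```java
import java.util.Comparator;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.IntStream;

public class BackgroundSuffixArrayBuilder {
    // A dedicated pool, so construction never competes with query threads.
    private static final ExecutorService BUILDER = Executors.newSingleThreadExecutor();

    public static Future<int[]> buildAsync(String concatenatedTerms) {
        return BUILDER.submit(() -> buildSuffixArray(concatenatedTerms));
    }

    // Naive stand-in for a real construction algorithm.
    private static int[] buildSuffixArray(String text) {
        return IntStream.range(0, text.length())
                .boxed()
                .sorted(Comparator.comparing(text::substring))
                .mapToInt(Integer::intValue)
                .toArray();
    }
}
```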

Technical comparisons

To demonstrate the effectiveness of this solution, we ran a series of tests comparing processing speed against the classical FST-based wildcard approach. For the comparisons, we used dump files from Wikipedia. We stored only the title and full-text fields, and queried only the full-text field. The queries consisted of two or more letters with asterisks at the beginning and the end of the queried term. Caching was disabled so that no response would be served from the cache.

Solr was built from trunk (7.0.0-SNAPSHOT plus the suffix array patch), with Xmx set to 8G. All testing was done on an Intel Core i5-4300M running 64-bit Fedora 25 (Linux kernel 4.9.14). The results are measured in milliseconds (ms).

Here are the results:

First version of the index

40k documents from Wikipedia with a total index size of 600 MB. All measurements were taken after an initial run (i.e., after warming up the JVM).

Random queries of size 3:

| Metric | Wildcard based on FST | Suffix array |
|---|---|---|
| Mean qtime | 134.2 ms | 1.7 ms |

Random queries of size 4:

| Metric | Wildcard based on FST | Suffix array |
|---|---|---|
| Mean qtime | 133.6 ms | 0.1 ms |

Second version of the index

400k documents with a total index size of 2.78 GB. All measurements were taken after an initial run.

Random queries of size 3:

| Metric | Wildcard based on FST | Suffix array |
|---|---|---|
| Mean qtime | 380.4 ms | 5.4 ms |
| Max qtime | 452 ms | 144 ms |
| Min qtime | 373 ms | 1 ms |

Random queries of size 4:

| Metric | Wildcard based on FST | Suffix array |
|---|---|---|
| Mean qtime | 133.6 ms | 0.1 ms |
| Max qtime | 433 ms | 13 ms |
| Min qtime | 372 ms | 0 ms |

Full queries of size 3 (all combinations from “aaa” to “zzz”):

| Metric | Wildcard based on FST | Suffix array |
|---|---|---|
| Mean qtime | 379.9 ms | 5.15 ms |
| Max qtime | 476 ms | 432 ms |
| Min qtime | 373 ms | 1 ms |

2 separate threads, doing queries of size 3:

| Metric | Wildcard based on FST | Suffix array |
|---|---|---|
| Mean qtime | 392 ms | 5.2 ms |
| Max qtime | 570 ms | 450 ms |
| Min qtime | 374 ms | 1 ms |

3 separate threads, doing queries of size 3:

| Metric | Wildcard based on FST | Suffix array |
|---|---|---|
| Mean qtime | 490 ms | 7.8 ms |
| Max qtime | 696 ms | 571 ms |
| Min qtime | 373 ms | 1 ms |

As you can see, the implementation of suffix arrays had a major impact on query processing speed, with mean response times dropping by as much as 482 ms (from 490 ms to 7.8 ms in the three-thread test).

Conclusion

In this post we explained how implementing a suffix array can work around slow compound-string queries in a Lucene-based search engine. This approach has many real-life applications and has already been successfully implemented by Grid Dynamics for some of our clients. We have also submitted our patch to the Lucene JIRA (https://issues.apache.org/jira/browse/LUCENE-7639) and received extremely valuable feedback from the community.

As always, all our technology blueprints are open. If you have similar use cases, feel free to use this patch. It is especially helpful when your query is at least 3 symbols long: you will get extremely fast responses.

Don’t forget to subscribe to our blog to get the latest open-source blueprints for search, QA, real-time analytics, In-Stream Processing and more. And feel free to comment below.
