“Find all customers who have a birthday in the next 10 days”
“Which employees have achieved 10 years of service this month?”
These types of questions are simple for a human to answer with the aid of a calendar, but it is much more difficult for a computer to answer them on its own. Calendars have many rules built into them that are not immediately obvious to users that must be accounted for by a computer.
One of our customers was looking to easily query customer milestones like birthdays, anniversaries, graduation dates and years of service. To help them execute queries for their customer marketing campaigns, we built a data layer to support high performance range queries with Solr. This blog post will discuss the various issues we encountered, and how we solved them with a Solr search engine that leveraged range and function queries.
Let’s focus on one particular problem - birthday searches. Other date search-related use cases are conceptually similar. We need to find all customers with a birthday between today and some time in the near future, and we have available the following data:
So we want to find all of the customers who satisfy our search criteria, and sort them by the number of days until their birthday, starting from the closest birthday. This means that we will need to have a synthetic field in our search results, “days_to_birthday”, which is sorted in ascending order.
We will also have to consider the challenges presented by calendar issues such as:
We can obviously bruteforce this problem with a table scan, inspecting each customer record separately. However, this option works poorly for large customer databases, so we want to focus on more sophisticated index-based solutions.
There is a straightforward, but naive approach that could be taken - to index all birthdays within the next 20 years in separate fields. For example to have: “person_date_of_birth.2018”, “person_date_of_birth.2019”, … But this is not a good long term solution as you have to deal with ‘border’ cases, and join together years when requests cross into the following year. Moreover, it pollutes the database with many unnecessary fields, which is not ideal when using Solr/Lucene.
As long as date-related information is static, we can enrich our documents with additional “synthetic” fields. Given the “person_date_of_birth” initial field value, we can extract a useful piece of information from it; the ordinal of the day within the year, like in Java’s Calendar#DAY_OF_YEAR (e.g. “person_date_of_birth.yday”). It can be done either during the data ingestion phase (e.g. from DIH), or it can be a custom UpdateRequestProcessorFactory. In our GitHub sample, it is a separate Indexing component, which is responsible for such types of decoration: DocumentIndexer.
Example Input:
{
“person_first_name”: “John”,
“person_last_name”: “Smith”,
“person_date_of_birth”: “1984-02-15T00:00:00Z”
}
Solr output:
{
“person_first_name”: “John”,
“person_last_name”: “Smith”,
“person_date_of_birth”: “1984-02-15T00:00:00Z”,
“person_date_of_birth.yday”: 46
}
The following provides an overview of the solution:
q=person_date_of_birth.yday:[39 TO 49] (“next 10 days”)
fl=*,days_to_birthday:(sub(person_date_of_birth.yday,39))
sort=sub(person_date_of_birth.yday,39) asc
So far so good. This looks like a viable approach, and it forms the basis of our solution. But the devil is in the details. There are several scenarios that need to be handled very carefully to ensure that every query works 100% correctly. Several of these scenarios are presented below and allow us to outline the final solution step-by-step.
This approach will not work correctly when:
To refresh your knowledge of leap years - here is a Wiki link. To use a non-linear scan approach, we should be using Solr’s range queries to crop the matching docs. We will represent each birthday point as an ordinal ranging from 1 to 365 (or 366).
There is a problem dealing with leap vs. non-leap years when there are 365 vs. 366 days in the year. The calendar#DAY_OF_YEAR scale would work differently. Take a look:
Example: 10th of March from 2014 (69th ordinal) is not the same as the 10th of March from 2016 (70th ordinal). Therefore, we need to make sure that the day ordinals are comparable between leap and non-leap years. One of the possible solutions is to introduce a “fake” leap day which streamlines all years:
This unifies the search logic, and allows for simpler calendar arithmetics. Thus, during ingestion we will see something like this:
So how can we resolve the pesky leap year issue? If the current year is a non-leap year, we will need to join the ordinals of 60 (Feb 29) and 61 (Mar 1). People who were born on Feb 29th, will celebrate their birthdays on Mar 1st in non-leap years.
Let’s consider the following scenario: it is Feb 27, 2016 (a leap year) and we are looking for customers who have birthdays within the next five days:
It’s important to note that even though our query includes a leap day, we have properly compensated for the potential additional day at index time so are comparing “apples to apples”. Let’s call such a leap day, a “real” leap day (because it exists in the real world).
In this case, the function query to calculate “days_to_birthday” would look like (string returned from BirthdaySearchComponent#daysToBirthdayFunctionScore(...)):
sub(
person_date_of_birth.yday,
58
)
In the case of a non-leap year, when there will be a “fake” leap day, we will need to join together ordinals 60 and 61. So it would look like this:
We have to adjust the “days_to_birthday” field calculation, and we will have to expand the selecting query to be “yday:[58 TO 64]” rather than “yday:[58 TO 63]”.
The function-query in such a case would look like:
sub(
if(
ifLessThan(person_date_of_birth.yday, 60),
person_date_of_birth.yday,
sub(person_date_of_birth.yday, 1)
),
58
)
Here, we defined a new function-query ifLessThan(leftArg, rightArg). It is fairly easy to implement it (if in doubt, follow BirthdaySeachComponent#ifLessThan(...)). “60” here is the constant, which defines the ordinal of “fake” leap day (60 == 31 days in Jan + 29 days in Feb).
We need to be cautious around Christmas time because the ordinals are reset after New Years Day, so we need to handle this time period with our Solr query carefully. The overall picture might look like this:
There is nothing challenging about the Solr query itself, but we need to issue a disjunction query that will cover the end of the previous year, and the beginning of the new upcoming year (in Solr syntax):
person_date_of_birth.yday:[364 TO 366] person_date_of_birth.yday:[1 TO 2]
The interesting part is the function query that calculates ‘days_to_birthday’. Now let’s try and understand what that function query would look like. We need to cross the border of “366” -> “1” because sub(person_date_of_birth.yday, 364) works only until #person_date_of_birth.yday <= 366:
Human Date | person_date_of_birth.yday | sub(person_date_of_birth.yday, 364) | Expected Days-to-Birthday |
---|---|---|---|
December 29 | 364 | 0 | 0 |
December 30 | 365 | 1 | 1 |
December 31 | 366 | 2 | 2 |
January 1 | 1 | -363 | 3 |
January 2 | 2 | -362 | 4 |
… | … | … | … |
We could make use of modulo arithmetics (by modulo of 366), and turn all calculations into dealing with positive numbers only:
mod(sub(sum(person_date_of_birth.yday, 366), 364), 366):
Human Date | person_date_of_birth.yday | x := sub(sum(person_date_of_birth.yday, 366), 364) |
mod(x, 366) | Expected Days-to-Birthday |
---|---|---|---|---|
December 29 | 364 | 366 | 0 | 0 |
December 30 | 365 | 367 | 1 | 1 |
December 31 | 366 | 368 | 2 | 2 |
January 1 | 1 | 369 | 3 | 3 |
January 2 | 2 | 370 | 4 | 4 |
… | … | … | … | … |
There is nothing overly challenging composing the Solr query that will crop the ‘documents that satisfy the birthday criteria’. It is going to be a simple range-query against the person_date_of_birth.yday field.
Now let’s try to generalize a formula that calculates ‘days_to_birthday’, so that it addresses both scenario #1 and scenario #2. We will use the example that today is ‘29 Dec’ (or yday=364) and use a random searching criteria (let’s say “all people who have birthdays within the next 100 days”). This example provides us a time period that overlaps both the New Year (1 January) and a potential leap day (29 February).
It is important in the calculation to establish whether we are covering a “real” leap day or a “fake” leap day in the future. In the case of a fake leap day, the formula would look like this:
mod(
sub(
if(
ifLessThan(person_date_of_birth.yday, 60),
sum(person_date_of_birth.yday, 366),
sum(person_date_of_birth.yday, 365)
),
364
),
366
)
And in the case of a real leap day, it would look like this:
mod(
sub(
sum(person_date_of_birth.yday, 366),
364
),
366
)
You’ll always be able to know in advance whether you will or will not overlap fake or real leap days, so only a single version of the function query from above will be applied at the query-time.
Time zone differences do not create a major problem. This is because they simply require that you calculate your ordinal day depending on your desired time zone. For the sake of simplicity, we assume that all birthday dates are indexed, as dates in the UTC time zone. Therefore, in order to make proper calculations, you just need to calculate your ordinal day in a custom time zone. Typically, it is represented as a separate parameter in your query (i.e. &timeZone=...)
Example (birthday within the next 10 days):
This is to address the scenario where you want to adjust the “current day”, and execute your search from a particular day. You can think of it as overriding the system time, and issuing your ‘birthday search’ tomorrow instead of today.
In order to implement a start day offset, an additional query parameter is used (&birthdayOffset=...) to control how the day-ordinal is calculated. This allows you to exclude those entries in your result set that are having a birthday today.
Example (birthday within the next 10 days)
With this approach, we were able to design a viable solution using the Solr search engine. This solution leverages two key features of Solr, range queries and function queries. We can now solve this problem without needing to conduct a full scan, and the solution can be easily scaled within the search index.
While we were able to successfully solve this problem for our customer, their issue was certainly not unique. Many companies run complex, personalized marketing campaigns, and need to easily query their customer database in order to gather lists of customers with relevant milestones. Therefore, our approach has a wide range of useful applications across the retail sector.
For further information, please see the following links: