Spellcheck in Elasticsearch

After working as a backend developer for several years, the opportunity arose for me to begin integrating features using Elasticsearch (version 7.17). Everything was going smoothly until it was time to take a deep dive into a new feature.

Recently my team was asked to reach a milestone that appeared to be easy, at first glance. The task at hand was to check the spelling of a query. Simple enough, right? Well, it ended up causing us a bit of a headache.

Although I’m not an expert, I really enjoy a challenge and decided to face this one head-on. “Let’s do it!” said my brain.

Searching for Solutions

While researching how to check the spelling of a query in Elasticsearch, I found that the recommended way to return alternatives for a misspelled word is to use Suggesters, so I began reading the documentation.

Technically, to meet the goal set for the team, I just needed to add a Term Suggester. It appeared to fit the requirements, so it seemed that adding it should be enough to spellcheck queries.

Not so fast. Some suggestions made for a poor user experience, because the Term Suggester returns its suggestions token by token. In other words, the top suggestion was sometimes not good enough. For example, not all plural words came back with a plural as the first suggestion, so the shopper did not get the most relevant results:

Query:

{
  "suggest": {
    "text": "blue truoser",
    "spellcheck-term": {
      "term": {
        "field": "spellcheck"
      }
    }
  }
}

Response:

{
  "suggest": {
    "spellcheck-term": [
      {
        "text": "blue",
        "offset": 0,
        "length": 4,
        "options": [
        ]
      },
      {
        "text": "truoser",
        "offset": 5,
        "length": 7,
        "options": [
          {
            "text": "trouser",
            "score": 0.85714287,
            "freq": 80
          },
          {
            "text": "trousers",
            "score": 0.71428573,
            "freq": 533
          }
        ]
      }
    ]
  }
}

So I continued reading and found that this issue could be solved by using the Phrase Suggester in Elasticsearch.

Using the example above with the Phrase Suggester, the whole sentence is analyzed so that the first suggestion returned provides a more relevant result for the shopper:

Query:

{
  "suggest": {
    "text": "blue truoser",
    "spellcheck-phrase": {
      "phrase": {
        "field": "spellcheck"
      }
    }
  }
}

Response:

{
  "suggest": {
    "spellcheck-phrase": [
      {
        "text": "blue truoser",
        "offset": 0,
        "length": 12,
        "options": [
          {
            "text": "blue trousers",
            "score": 0.008513907
          },
          {
            "text": "blue trouser",
            "score": 0.0066427314
          }
        ]
      }
    ]
  }
}

But again, QA came back with poor results, this time for single-word queries. The next discovery was that the Phrase Suggester does not behave exactly like the Term Suggester, even though the documentation states that the Phrase Suggester adds logic on top of the Term Suggester. It seemed that, for single words, the added logic did not work well enough. Here’s an example of a query I used, with both suggesters in the same request:

Query:

{
  "suggest": {
    "text": "peac",
    "spellcheck-term": {
      "term": {
        "field": "spellcheck"
      }
    },
    "spellcheck-phrase": {
      "phrase": {
        "field": "spellcheck"
      }
    }
  }
}

Response:

{
  "suggest": {
    "spellcheck-phrase": [
      {
        "text": "peac",
        "offset": 0,
        "length": 4,
        "options": [
          {
            "text": "pack",
            "score": 0.029085064
          },
          {
            "text": "peach",
            "score": 0.02825284
          },
          {
            "text": "pearl",
            "score": 0.009542209
          },
          {
            "text": "peace",
            "score": 0.007914039
          }
        ]
      }
    ],
    "spellcheck-term": [
      {
        "text": "peac",
        "offset": 0,
        "length": 4,
        "options": [
          {
            "text": "peach",
            "score": 0.75,
            "freq": 280
          },
          {
            "text": "peace",
            "score": 0.75,
            "freq": 25
          },
          {
            "text": "pack",
            "score": 0.5,
            "freq": 780
          },
          {
            "text": "pearl",
            "score": 0.5,
            "freq": 59
          }
        ]
      }
    ]
  }
}
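
To compare the two suggesters programmatically, a small helper can pull the first option out of each entry in the response. This is only a minimal sketch, assuming the response has already been parsed into a Python dict; the function name top_suggestions is hypothetical, and the sample data is abridged from the response above.

```python
from typing import Dict, List, Optional


def top_suggestions(response: dict) -> Dict[str, List[Optional[str]]]:
    """Return the first option's text for every entry of every suggester.

    Entries with an empty "options" list yield None.
    """
    result = {}
    for name, entries in response.get("suggest", {}).items():
        result[name] = [
            entry["options"][0]["text"] if entry["options"] else None
            for entry in entries
        ]
    return result


# Abridged from the response above (first two options of each suggester).
sample = {
    "suggest": {
        "spellcheck-phrase": [
            {"text": "peac", "options": [
                {"text": "pack", "score": 0.029085064},
                {"text": "peach", "score": 0.02825284},
            ]}
        ],
        "spellcheck-term": [
            {"text": "peac", "options": [
                {"text": "peach", "score": 0.75, "freq": 280},
                {"text": "peace", "score": 0.75, "freq": 25},
            ]}
        ],
    }
}

print(top_suggestions(sample))
# {'spellcheck-phrase': ['pack'], 'spellcheck-term': ['peach']}
```

For the single word "peac", the Term Suggester puts "peach" first, while the Phrase Suggester puts "pack" first.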

In this response, the Term Suggester ranks "peach" first, while the Phrase Suggester ranks "pack" first. In most cases, the Term Suggester gives satisfactory results for a single-word query, but if the query contains more than one word, the best option is to use the Phrase Suggester.

Finally, case closed! The only thing left to solve the problem was to choose the suggester depending on the number of terms in the query.
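
As an illustration, the selection could look roughly like the following sketch. It is not our production code: the function name build_suggest_body is hypothetical, and counting terms by splitting on whitespace is an assumption (the analyzer of the spellcheck field may tokenize differently).

```python
import json


def build_suggest_body(query_text: str, field: str = "spellcheck") -> dict:
    """Build a suggest request body, choosing the suggester by term count."""
    # Assumption: whitespace splitting approximates the number of terms.
    term_count = len(query_text.split())
    if term_count > 1:
        # More than one word: let the Phrase Suggester look at the whole query.
        name, kind = "spellcheck-phrase", "phrase"
    else:
        # Single word: the Term Suggester gave us better results.
        name, kind = "spellcheck-term", "term"
    return {"suggest": {"text": query_text, name: {kind: {"field": field}}}}


print(json.dumps(build_suggest_body("blue truoser"), indent=2))
print(json.dumps(build_suggest_body("peac"), indent=2))
```

The two calls produce the same request bodies as the Phrase Suggester and Term Suggester queries shown earlier.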

Enhancement

Later on, we wanted to go a bit further with the results returned when searching for the suggested word. What if there were a way to know beforehand whether a suggestion that is returned will have results or not?

Elasticsearch had the answer: use collate.

Unfortunately, there is a catch: collate can only be used with the Phrase Suggester. The solution? Apply collate only when the Phrase Suggester is used.
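
A minimal sketch of that rule follows. The helper name add_collate and the searched field "title" are hypothetical (use whichever field the real search query runs against); the collate structure itself, with a templated query, params, and prune, follows the Phrase Suggester documentation.

```python
import copy


def add_collate(suggester: dict, search_field: str) -> dict:
    """Return a copy of the suggester with collate added, if it is a phrase suggester."""
    result = copy.deepcopy(suggester)
    if "phrase" in result:
        result["phrase"]["collate"] = {
            # {{suggestion}} is replaced by the text of each candidate suggestion.
            "query": {"source": {"match": {"{{field_name}}": "{{suggestion}}"}}},
            "params": {"field_name": search_field},
            # prune=True keeps every suggestion and flags each with collate_match.
            "prune": True,
        }
    # Term suggesters are returned unchanged, since they do not support collate.
    return result


phrase = add_collate({"phrase": {"field": "spellcheck"}}, "title")
term = add_collate({"term": {"field": "spellcheck"}}, "title")
print(phrase)
print(term)
```

With prune set to true, each option in the response carries a collate_match flag saying whether the collate query found matching documents for it; with the default (false), suggestions without matches are simply left out.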

Conclusion

After all of the research and experimentation laid out above, my team now has an approach that meets shoppers' requirements: the Term Suggester for single-word queries and the Phrase Suggester, with collate, for longer ones. The next step is to examine the results further and try to increase efficiency, but that will be my next challenge!