What is query understanding?
Query understanding is the process of analyzing search queries and translating them into an enhanced query that can produce better search results. It’s one of the most important keys to great search experiences.
What are some examples of query understanding?
Query rewriting, synonyms, spelling corrections, classification, NLP, vectorization, bigram, and trigram detection for query segmentation, semantic query understanding, personalization, localization, and query scoping (attribute mapping).
How important is query understanding to great search results?
Absolutely critical. People make a lot of spelling mistakes and language is also inherently ambiguous (“test”, “mobile”, “apps”, “summer”, “north” are all names; “Jaguar” is a car, an operating system, an animal, etc.). People also use slang and mention things that aren’t necessarily ever mentioned in result item text at all (e.g. “size 14 shoes”, “near a park”, “next Thursday evening”, etc.). All of this needs to be translated into something meaningful that can better query the underlying data structure.
In addition, query understanding can also be used to prioritize results. For example, historically, when people search for “license” data may show they are most often looking for the license renewal page - not the other hundreds of pages mentioning “license”. This historical performance data can automatically elevate the value of the preferred destination, even if there is no indication otherwise in the result itself that it is most important.
Conditional boosting is a powerful tool to use historical data to drive highly performant boosts at an individual query level
Query understanding is often a great test of search technology also. What if you need to map “size 14” to a size attribute, or “next Thursday” evening to a time filter, or automatically spell correct queries? Can your current implementation do this? If not, then you are frustrating your customers and also falling behind.
Where should I begin with query understanding techniques?
Many highly useful query understanding techniques belong to the family of query rewriting, which aims to modify the query to represent the query intent better and thus improve overall precision, recall, or both. The order these techniques are applied can significantly impact the query outcomes as well as the processing complexity. Below are some of those techniques most commonly used.
Synonyms and spell correction are universally useful. Wherever text is involved, these will have a positive impact. Typically these are embedded into search service offerings, but you may have to configure these yourself, particularly for industry-specific jargon.
Synonyms are fairly straightforward. They are typically run as an alternate reality, so if someone searched for “gigantic shoes” this may expand to “(gigantic OR enormous) shoes”. This is pretty simple but can be more complex if working out which to prioritize. It can also be even more complex if synonyms are generated using machine learning, e.g. via word embeddings.
Spell correction can work in different ways. One approach is to correct words that don’t exist to come up with a single likeliest replacement and then allow users to override it. This is how Sajari used to function, but we noticed “the best” alternative was not always correct (language is ambiguous remember) and the penalty for being wrong was very high, as people may lose confidence and give up. Note: Google does this well as they have virtually unlimited training data, but this is much harder for the average business.
Spell correction is a powerful query understanding tool that is deceptively difficult to execute well
Another approach to spell correction is to look at every possible variant and run them all. This can explode very quickly and slow down your queries a lot. There is search technology available that does exactly this, but to retain query speed, they have to bake the prioritization of the ranking algorithm into the data structure itself at the expense of query flexibility. Depending on your application, this may or may not be a good tradeoff.
Our most recent spell correction technique (as of this writing) is a hybrid that looks at alternatives but can weight them efficiently in terms of likelihood of the users intent using past query history and result field data. We then run only the top X most probable combinations. This is very effective and is trained from not only query data, but also result items themselves. For example, product names, titles, etc. so it will understand your data immediately and in any language and then continue to improve over time with more queries.
Semantic query understanding
Semantic query understanding is the process of actually trying to understand the intent of queries. Language is very ambiguous by nature and polysemy aptly describes why this is such an issue for search: “poly” meaning multiple and “semy” in this case meaning senses, or meanings. “Bank” is a classic example of this, does this mean a financial institution or the side of a river? Without added context, it is difficult to know. English, in particular, is littered with examples. Luckily there methods to handle this. For queries with multiple words, the context is typically more obvious. For single search terms, it is more difficult, but in these cases, the past query sequence history can be useful. For example, if someone searched for “atm” and then searched for “bank”, it would be unlikely the second query was to do with the side of a river!
Classification can also be used to understand the type of query intent. This is most useful for query boxes that search datasets with multiple distinct data types. An example is LinkedIn, where you can search for companies, people, jobs, etc. The historical patterns for each query can assist with predicting the most likely result type. Ecommerce is another example, where a support query is different from a product search, for example, searching for “return shoes” should probably not show shoes for purchasing.
LinkedIn is a good example where the search intent can be highly ambiguous. Data is key to getting the intent right more often than wrong
Natural Language Processing (NLP)
NLP is the process of analyzing unstructured text to infer structure and meaning. Structure, in this case, is referring to information that is highly defined, for example, a category or a number, much like fields in a database. It can also represent relationships between things. Common examples include sizes, colors, places, names, times, entities, and intent, but there are many more. NLP is most valuable when the underlying data has a lot of structure that can be mapped from the queries.
Natural language processing (NLP) is getting closer to the human level interpretation of written language
Query scoping can make ordinary text search appear highly intelligent. This technique attempts to find structure within the query that doesn’t necessarily match unstructured text (reverse indexes) but instead maps directly to structured data attributes.
An example may be the query “black size 14 basketball shoes”. Product information is highly unlikely to mention the text “size 14”, but the sizing is highly likely to appear in a list of sizes on relevant products. “Black” as color may be in the description, but may also map to a more specific color attribute. So this query may actually remove this text and query only for “basketball shoes”, but at the same time filter the result set to size=14 and color=black. This dramatically increases precision, and in the case of a strict AND based search can also increase recall.
The downside of the above technique is for understanding queries with potentially mixed meaning. For the example query “size 6 nike”, the sizing may refer to shoe size which could be men’s, women’s, kid’s, US, UK or other, a shirt size, bra size, or even a basketball size! Over-filtering can cause a significant loss of precision.
Despite the potential issues, the upside far outweighs any downsides. The typical approach to implementing this technique is to remove specific text identified from the query and convert it into a structured operation (i.e., filter or boost). This allows people to use natural language to describe what they want and have it transitioned into something much more meaningful than the original unstructured query.
Lastly, if a structure is not present on your records that doesn’t mean you can’t generate it yourself. Index time analysis and data extraction can be a powerful tool to add structure to your information. Clustering, classification, topic modeling, tagging, and entity extraction are just some of the powerful techniques available.
Query scoping allows text to be extracted from input queries and mapped directly to structured attributes
Vectorisation is the process of converting words into vectors (numbers) which allows their meaning to be encoded and processed mathematically. This underpins language translation and many other amazing applications. The magic in the vectors is they can also be added and subtracted, so the meaning of the text can also be added and subtracted. In practice, vectors are used for automating synonyms, clustering documents, detecting specific meanings, and intents in queries and ranking results.
word2vec illustrates how language can be converted into mathematical vectors that retain meaning. It's even more surprising the context can be added and subtracted mathematically that retains meaning
Query segmentation is the process of identifying sequences of tokens that mean much more together than they do individually. This assists in improving precision by not returning results for partial segments. Two and three-word phrases (bigram and trigrams in this context) are useful in that they can potentially have much more meaning than individual words (unigrams). For example, take the phrase “new york”. On their own, these two words don’t mean much, but together they obviously do! So instead of treating them on their own, they can be combined into a single term which can even have its reverse index. We could look up the indexes of both terms and work out when they are in sequence to derive the phrase matches but a) this is more complex and b) we possibly don’t want people searching for “new” to get results with “new york”. Also, if people search for “new” it is possible to match results containing "new york". This is also problematic for reinforcement learning where we don’t want machine learning to penalize partial matches unfairly. Hopefully, this hasn’t lost you, but the key point is maximizing information context.
Personalization is the process of adding additional information to a query based on the individual that is searching to change the results to be more relevant to the individual. This can be as simple as a clothing size preference, gender, location, or any other individual characteristic that can influence their results.
Typically the first stage of personalization is sending the information through with the queries to be recorded with other analytics. The performance impact can then be analyzed offline to determine if this is useful for purposes. It should be noted that this is useful in its own right from a business intelligence perspective, even if it is not used for personalization.
Often queries can be layered with personal information to augment and influence queries to produce highly personalized results
Localization is the process of using the searcher's location to enhance results. Meetup is a good example of this in action. It can also be extracted from the query text itself (see NLP above) as is typically done for real estate and job search engines.
IP detection locates me incorrectly at the ISP headquarters over 60 miles away!
Although be warned it can cause issues. For example, geolocation in Australia is less than 80% accurate, so you will put a big number of people in the wrong random city, possibly even over 1000 miles away. These quirks are usually navigable, but design with caution and always account for the UX flow assuming the location is incorrect. Meetup.com is an example where the location is so critical to the search that it's now deeply integrated into the search UX. It's often worth thinking about these the cost of a wrong guess. Query understanding is never going to be perfect, so often it's better to make it easy for the user to self-select.
Meetup.com integrates location selection directly into the search bar
Voice-based queries are on the rise, and that’s only further increased the need for advanced query understanding techniques, mainly because web search originally trained us to shorten queries (more text = fewer results), but the voice has since increased the query length and included much more structure in the query. Ironically voice translation may get words wrong, but the spelling is perfect, so this raises other challenges.
Query understanding tips
Query logs are your friend and should help you to set the priorities for query understanding. For example:
- Zero result searches with high volume point to opportunities. Why are they failing?
- If no one is using natural language and/or you don’t have nicely structured information, then query scoping may not be that useful.
- If business language is not translating well to customer speak, synonyms may be useful.
- Do your customers make spelling mistakes and how frequently? If they make lots of mistakes then corrections will undoubtedly help.
Advanced techniques like vectorization and personalization should come last as they require much more effort. While personalized search seems like the holy grail more often than not, there is much more value to be extracted by getting the basics right. After all, personalizing bad results is not going to be useful.
Sajari has modularized many of the techniques above to make building powerful search applications a highly efficient process. To learn more or discuss your application, please book a demo or say hello.