According to Forrester research from 2020, 43% of users head straight to a website’s search box on their first visit. However, many businesses underestimate the importance of a powerful search experience.
One of the challenges is that as your website or database grows, it becomes harder to make sure that every search query leads to a successful purchase, a new signup, or more visibility to your content. Building a search engine in-house that suggests tangentially related topics, has contextual awareness, or allows you to promote specific results is a huge undertaking.
This is where external search engines come in.
Integrating with a third-party search solution can help ensure that when users look for information, they not only find the most relevant results but also get suggestions for similar content and discover options they may not have known they needed. The best solutions offer this functionality while staying flexible, scalable, and fast.
There are a number of search engines you can plug into your website, but one of the most popular is Elasticsearch, a RESTful search and analytics engine. Released in 2010, Elasticsearch is written in Java, built on Apache Lucene, and capable of searching many kinds of data, both structured and unstructured.
Elasticsearch indexes your data by keyword in a structure known as an inverted index, which maps each keyword to the documents that contain it. This makes queries fast, because the engine looks up keywords rather than scanning the full text of every document. Its other benefits include plugins and client libraries for many programming languages, a robust REST API, typo tolerance, and ranking and sorting capabilities.
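To make the idea of an inverted index concrete, here is a minimal sketch in Python. It is a toy illustration, not how Lucene stores its data: the tokenizer, the sample documents, and the function names are all assumptions made for the example.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample data
docs = {
    1: "Red running shoes",
    2: "Blue running jacket",
    3: "Red rain jacket",
}
index = build_inverted_index(docs)
print(search(index, "red jacket"))
```

A query only touches the entries for its own terms, which is why lookups stay fast as the number of documents grows.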
Until 2021, Elasticsearch was open source software under the Apache License. It has since moved to the Elastic License and the Server Side Public License (SSPL) because of issues with Amazon’s usage of the software.
As great as Elasticsearch is, you can’t ignore its limitations. It’s notoriously difficult to set up, requires dedicated engineering resources, and is not well suited to handling dynamically changing data. This makes it a poor fit for social and e-commerce site searches.
In this guide, we’ll look at Elasticsearch’s architecture in more detail and dive into these limitations. Finally, we’ll look at Sajari, an AI-powered alternative that overcomes some of these weaknesses.
Basic Concepts: Cluster Architecture
To understand Elasticsearch at scale, you need to grasp some of the underlying architectural patterns it uses to store and index data.
First, a node is an instance of Elasticsearch. Each node usually runs on a single machine and communicates over a network, sharing read/write responsibilities with other nodes in its cluster.
A cluster is a collection of Elasticsearch nodes that communicate to read and write to indices. In a typical cluster, there’s a master node that organizes the communication between nodes, helping maintain consistency throughout the cluster.
Since a cluster is made of many nodes—each running on a single machine—you can scale Elasticsearch reasonably well by adding more nodes to the cluster. This is called horizontal scaling.
Finally, you need to understand indices and shards. An index is made up of shards, each of which is a Lucene index, distributed across the nodes in a cluster. Each primary shard can have replica shards on other nodes, so if the node holding the primary shard is unavailable, the replicas can be read instead. This adds redundancy and scalability to Elasticsearch’s architecture and makes it highly available. Shards can be configured to refresh automatically to offer near real-time search.
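The relationship between documents, shards, and nodes can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the node names, the shard and replica counts, the round-robin placement, and the use of crc32 as the hash are all stand-ins chosen for the example, and Elasticsearch's real routing and allocation logic is more involved.

```python
import zlib

NUM_PRIMARY_SHARDS = 3  # hypothetical value
NUM_REPLICAS = 1        # hypothetical value
NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names


def shard_for(doc_id):
    """Pick a primary shard from a hash of the document ID."""
    # crc32 is a stand-in hash for this sketch only.
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_PRIMARY_SHARDS


def placement(shard):
    """Put the primary and each replica on a different node (round robin)."""
    copies = NUM_REPLICAS + 1
    return [NODES[(shard + i) % len(NODES)] for i in range(copies)]


def readable_copies(shard, down_nodes):
    """Return the nodes that can still serve reads for a shard."""
    return [node for node in placement(shard) if node not in down_nodes]


shard = shard_for("sku-1042")
primary_node = placement(shard)[0]
print("shard:", shard, "copies on:", placement(shard))
print("still readable from:", readable_copies(shard, {primary_node}))
```

Even with the node holding the primary copy marked as down, the shard stays readable from its replica, which is the redundancy described above.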
A schema, which Elasticsearch calls a mapping, lists all the fields in a document along with their data types.
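As an illustration, the following Python snippet builds the kind of JSON body you might send when creating an index with an explicit mapping. The field names and the shard and replica counts are hypothetical; the field types shown (text, keyword, float, date) are standard Elasticsearch types.

```python
import json

# Hypothetical product index: field names and counts are examples only.
index_definition = {
    "settings": {"number_of_shards": 3, "number_of_replicas": 1},
    "mappings": {
        "properties": {
            "title": {"type": "text"},        # analyzed for full-text search
            "category": {"type": "keyword"},  # exact-match filtering and sorting
            "price": {"type": "float"},
            "created_at": {"type": "date"},
        }
    },
}


def field_types(definition):
    """Return a flat {field name: type} view of the mapping."""
    properties = definition["mappings"]["properties"]
    return {name: spec["type"] for name, spec in properties.items()}


print(json.dumps(index_definition, indent=2))
print(field_types(index_definition))
```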
Using Hosted Services
You can host Elasticsearch on your own servers or virtual machines, or you can pay a third-party provider to do it for you. Two of the most common options are Amazon’s Elasticsearch Service, and Elastic’s own Elastic Cloud.
Introduced in 2015, the Amazon Elasticsearch Service gives you a fully managed Elasticsearch instance that makes deploying, securing, and running the search engine somewhat easier. It also integrates with Logstash (which allows you to publish logs to Elasticsearch) and Kibana (which provides a graphical visualization interface for your published data). Together, Elasticsearch, Logstash, and Kibana are referred to as the ELK stack.
Amazon’s service arrived the same year that Elastic joined forces with Found, another cloud offering for fully managed Elasticsearch. Found later became Elastic Cloud, which now delivers hosted Elasticsearch as well as hosted Kibana. Finally, in 2017, Elastic Cloud Enterprise was released, allowing businesses to download a version of the hosted solution and run it themselves.
These hosted services give you a starting point for hosting Elasticsearch, but you still need a fair bit of engineering time to implement even basic site search capabilities. For larger enterprise search use cases, you also need to set up and maintain the Elasticsearch cluster for uptime and reliability.
After setting up your nodes and clusters, you need to upload or stream your data for indexing. Typically, you can use cURL or a scripting language that allows you to send HTTP requests to upload your initial data as JSON. Be careful with large datasets, though, as network instability could corrupt or stop streams. Setting up a queueing system is probably a good idea as it will let you track records that are incomplete and pick up where you left off when something goes wrong.
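A rough sketch of that upload-and-resume approach might look like the following. It builds newline-delimited JSON in the format used by Elasticsearch's bulk API and tracks which batch to resume from after a failure. The `send` callable is an assumption: it stands in for whatever HTTP client you use to POST the body to the `_bulk` endpoint, and the index name, record shape, and batch size are hypothetical.

```python
import json


def chunked(records, size):
    """Yield successive batches of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def bulk_body(index_name, batch):
    """Build a newline-delimited JSON body: an action line, then the document."""
    lines = []
    for record in batch:
        action = {"index": {"_index": index_name, "_id": str(record["id"])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"  # the body must end with a newline


def upload(records, index_name, send, batch_size=2, start_batch=0):
    """Send batches in order and return the number of the first unsent batch.

    Returns None when every batch was sent.
    """
    for number, batch in enumerate(chunked(records, batch_size)):
        if number < start_batch:
            continue  # already uploaded in an earlier run
        try:
            send(bulk_body(index_name, batch))
        except OSError:  # network failures such as ConnectionError
            return number  # checkpoint: resume from this batch next time
    return None


# Demo with a stand-in sender that just collects the request bodies.
records = [{"id": i, "name": f"item {i}"} for i in range(5)]
sent_bodies = []
checkpoint = upload(records, "products", sent_bodies.append)
print("batches sent:", len(sent_bodies), "resume from:", checkpoint)
```

In a real deployment the checkpoint would be persisted (for example in a queue or a small state file) so that an interrupted upload can pick up where it left off.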
The other big challenge in using Elasticsearch is indexing. Determining how to index your data is left largely up to you, so you’ll need to figure out how to index and shard it based on your use case. For most teams, this ends up being largely a process of trial and error.
Ultimately, even a hosted offering—which is designed to simplify setting up Elasticsearch—isn’t really that simple. Plus, you can’t ignore the costs.
Amazon’s prices range from $0.25/hour to $7.987/hour, depending on the storage optimization options you select. This cost will grow as you scale up (increasing hardware resources) or scale out (adding more nodes, which also means increasing hardware resources). Elastic isn’t much better: its pricing also starts as low as a quarter of a dollar, but it climbs as your deployment scales.
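To get a feel for how those hourly rates add up, here is a back-of-the-envelope calculation in Python. It assumes 730 hours in a month and identical nodes, and it ignores storage, data transfer, and any discounts, so treat the figures as rough orders of magnitude rather than a quote.

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12 months


def monthly_cost(hourly_rate, node_count):
    """Estimate the monthly instance cost for a cluster of identical nodes."""
    return hourly_rate * node_count * HOURS_PER_MONTH


# Hourly rates taken from the range quoted above; node counts are hypothetical.
for nodes in (1, 3, 6):
    low = monthly_cost(0.25, nodes)
    high = monthly_cost(7.987, nodes)
    print(f"{nodes} node(s): ${low:,.2f} to ${high:,.2f} per month")
```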
Downsides of Elasticsearch
Even with a managed option like Amazon Elasticsearch Service, Elasticsearch is still pretty difficult to set up. Besides understanding the architecture and how you’ll upload your data, you need to manually configure Elasticsearch and your search indices. This means that you’ll ultimately invest plenty of engineering resources to manage, maintain, and optimize Elasticsearch if you go this route.
One of the other concerns around using Elasticsearch is the license. When it changed its license in early 2021, it caused an uproar in the open source community.
In its statement, Elastic claimed that it is still committed to open source search despite the license change; however, the SSPL has been rejected as an open source license by the Open Source Initiative. Elastic made this change to “restrict cloud service providers from offering our software as a service,” but this restriction conflicts with criterion 6 of the Open Source Definition (OSD), which prohibits discrimination against fields of endeavor.
From a business perspective, one of the most significant downsides to Elasticsearch is its inflexibility in handling dynamic data. As stated before, Elasticsearch groups records by keyword and searches the indexed keywords when performing a query. However, the index data it has already written is never actually updated in place. That’s fine for log analytics or data that never changes, but it’s challenging for more dynamic content.
So, let’s say you make a small change in an item’s description. Regardless of how big or small the change, instead of updating the existing index entry for the item, Elasticsearch gives the old version a “deleted” flag and then indexes the item again as a new document. Because real-time updates are hard, Elasticsearch batches them and later pays a much larger cost to merge and reconcile the results. If the data changes frequently, Elasticsearch will spend far more time merging than it would if it could update incrementally.
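The update behavior described above can be modeled with a toy append-only index in Python. This is a deliberately simplified sketch, not Lucene's implementation: the class, its method names, and the way the delete flag is stored inside each entry are assumptions made for illustration.

```python
class SegmentedIndex:
    """Toy model of append-only segments with delete flags and merging."""

    def __init__(self):
        # Each segment is a list of [doc_id, text, deleted] entries.
        self.segments = [[]]

    def index(self, doc_id, text):
        """Flag any live copy of the document as deleted, then append the new one."""
        for segment in self.segments:
            for entry in segment:
                if entry[0] == doc_id and not entry[2]:
                    entry[2] = True  # simplification: the flag lives in the entry
        self.segments[-1].append([doc_id, text, False])

    def flush(self):
        """Seal the current segment and start a new one."""
        if self.segments[-1]:
            self.segments.append([])

    def merge(self):
        """Rewrite all segments into one, dropping entries flagged as deleted."""
        live = [e for segment in self.segments for e in segment if not e[2]]
        self.segments = [live]

    def get(self, doc_id):
        """Return the live text for a document, or None if it is absent."""
        for segment in self.segments:
            for entry in segment:
                if entry[0] == doc_id and not entry[2]:
                    return entry[1]
        return None

    def stored_entries(self):
        """Count every stored entry, including ones flagged as deleted."""
        return sum(len(segment) for segment in self.segments)


idx = SegmentedIndex()
idx.index("sku-1", "Red shoes")
idx.flush()
idx.index("sku-1", "Red running shoes")  # a small edit stores a second copy
print(idx.get("sku-1"), idx.stored_entries())
idx.merge()  # the stale copy is only reclaimed here
print(idx.get("sku-1"), idx.stored_entries())
```

The more often documents change, the more stale copies pile up between merges, and the more work each merge has to do.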
Although Elasticsearch is itself an analytics engine, when you use it for applications such as e-commerce search, it does not record by default what’s working and what’s not, and there’s no way to apply that information so that results improve over time. The technology is fundamentally a product of engineering configuration, which drastically underperforms newer machine-learning-optimized solutions.
When it came out, Elasticsearch was the perfect solution for most sites that needed search functionality. Over time, however, it has become clear that when it comes to dynamic data, there are other solutions that can improve the search experience and better handle datasets that scale.
Sajari is a user-friendly search platform that combines the power of full-text search and database search, including searching through PDF and DOCX documents. Sajari provides a blazingly fast search experience with minimal latency because it uses real-time indexing and binary encoding. Binary encoding creates and interprets sequences of bytes in standardized ways, which makes searches faster while requiring less storage space.
Flexible, Simple Configuration
Perhaps the biggest difference for engineers is in how easy it is to configure advanced search functionality with Sajari.
Sajari is configured through simple YAML files that define pipelines. Each pipeline outlines a series of steps that are executed when indexing and querying records. Because the steps are easy to understand, you can plan even highly complex search queries efficiently.
Pipelines even allow you to use conditional expressions, A/B test different search algorithms, and make changes in real-time without actually having to reindex the data.
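The general idea of a pipeline with conditional steps can be illustrated in Python. To be clear, this is not Sajari's YAML syntax or API: the step names, the query fields, and the condition format are all hypothetical, and the sketch only shows how an ordered list of steps with conditions might be applied to a query.

```python
def lowercase(query):
    """Normalize the query text."""
    query["q"] = query["q"].lower()
    return query


def boost_in_stock(query):
    """Ask the engine to favor in-stock items (hypothetical boost name)."""
    query["boosts"] = query.get("boosts", []) + ["in_stock"]
    return query


# Each entry pairs a condition with the step to run when it holds.
PIPELINE = [
    (lambda q: True, lowercase),
    (lambda q: q.get("channel") == "store", boost_in_stock),
]


def run_pipeline(pipeline, query):
    """Apply each step whose condition is true, in order, to a copy of the query."""
    query = dict(query)
    for condition, step in pipeline:
        if condition(query):
            query = step(query)
    return query


print(run_pipeline(PIPELINE, {"q": "Red SHOES", "channel": "store"}))
print(run_pipeline(PIPELINE, {"q": "Red SHOES"}))
```

Because the steps only change how a query is processed, swapping or reordering them in a model like this does not require touching the stored records.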
Sajari optimizes search results through reinforcement learning, which uses the feedback from previous searches to improve future search results.
This works best for large datasets, as it is statistically driven and reorders data based on randomized probability. Within days to weeks (the process can even be sped up with manual tuning), Sajari’s reinforcement learning will provide the best search results for each user, based on a better understanding of what they like, which results are most relevant to them, and more. This can lead to an improved click-through rate (CTR), more signups, and ultimately more revenue for online businesses.
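One common way to reorder results from click feedback using randomized probability is Thompson sampling, sketched below in Python. This is a stand-in technique chosen for illustration, not a description of Sajari's actual algorithm, and the class and method names are hypothetical.

```python
import random


class ClickRanker:
    """Reorder results from click feedback using Thompson sampling."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.clicks = {}
        self.impressions = {}

    def record(self, result_id, clicked):
        """Store one impression and whether it led to a click."""
        self.impressions[result_id] = self.impressions.get(result_id, 0) + 1
        if clicked:
            self.clicks[result_id] = self.clicks.get(result_id, 0) + 1

    def rank(self, result_ids):
        """Sort results by a click-through rate sampled from each one's history."""
        def sampled_ctr(result_id):
            clicks = self.clicks.get(result_id, 0)
            misses = self.impressions.get(result_id, 0) - clicks
            # Beta(clicks + 1, misses + 1): little data means a wide spread,
            # so rarely shown results still get a chance to appear near the top.
            return self.rng.betavariate(clicks + 1, misses + 1)

        return sorted(result_ids, key=sampled_ctr, reverse=True)


ranker = ClickRanker(seed=7)
for _ in range(200):
    ranker.record("popular-item", clicked=True)
    ranker.record("ignored-item", clicked=False)
print(ranker.rank(["ignored-item", "popular-item", "new-item"]))
```

With little feedback the ordering is close to random; as clicks accumulate, results that users actually choose settle toward the top, which is why the approach benefits from larger volumes of data.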
Sajari also provides an easy-to-use, drag-and-drop Search Interface Builder that allows your non-technical team to customize the look and feel of the search experience without heavily involving your developers.
RESTful APIs and SDKs
Sajari also provides UI components for your website, libraries for different frameworks/programming languages (React, PHP, Go, and more), and a REST-like API that allows you to connect your data from any source.
More like a database
Unlike Lucene-based search engines, Sajari works more like a hybrid database by maintaining a real-time index. Dynamic or frequently changing product information is updated instantly. The core goal of real-time, mutable search indexes like Sajari’s is to enable the search index to act more like a database and allow in-place reads and writes. This keeps the Sajari index very fast and also keeps updates extremely inexpensive compared to other search technology.
When it comes to static content, like log analysis, or building an analytics engine, Elasticsearch is a great solution, assuming you’re okay with the licensing. It’s a powerful choice for immutable data, well-known and understood, and thanks to hosted platforms, your engineering team can offload some of the setup and maintenance work.
That said, when it comes to dynamic content, such as e-commerce websites, Sajari is an ideal solution. It understands what users need and provides the results that will make them want to take action. With a fast, scalable search experience using the reinforcement learning described above, customers will get the best results that also contribute to the bottom line. You can do some of these things with Elastic (not all), but the effort will vastly outweigh the cost of a solution like Sajari, which is built to lift conversion out of the box.
About the author
Guest contributor Shahad Nasser is a full-stack developer with expertise in web development. She also loves writing technical articles, as they help her learn, become better, and spread her knowledge in the hopes that it will help someone out there. Follow her on Twitter.