In the past, search indexes have typically been immutable. Take Lucene, for example, the open source search index underpinning ElasticSearch, Solr and others. Lucene indexes are created and saved; then, aside from flagging deletes in a special "deletes file", they are never updated. Instead they are periodically merged and rewritten into new files as the data state diverges. Sajari is essentially the opposite: indexes are live, realtime and always reflect the actual current state (it's an eventually consistent model, but master writes are virtually instant). This post is a brief discussion of the differences and trade-offs, and why we've chosen a different path.
People use search indexes because databases are not designed for search-style queries at scale. Note I'm not talking about your WordPress blog; at this level the database will do fine, even with WordPress moving at its typical glacial pace. But as the data and queries scale and become more complicated, search indexes will far exceed database performance and also keep these queries from loading your database. The problem with running a database and a search index side by side then shifts to data synchronisation.
Synchronisation is painful. When planning a new search integration, this is typically the biggest pain point. It's so evident these days that even Lucene-based search indexes like ElasticSearch are now being used as hybrid databases themselves (i.e. there is no separate database). This raises an issue though: if you want to make a small update such as incrementing a counter or changing a product price or inventory level, what is the actual underlying data flow to achieve this? For databases this is often a row lock, some bit flips and it's done. For immutable search indexes though, often the whole document needs to be deleted and fully reindexed in a new segment. This overhead is extreme and unworkable if many smaller writes like counter increments are needed. In this case update flexibility has been traded for the benefits of immutability, which incidentally are many!
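To make that data flow concrete, here is a minimal Go sketch of the delete-and-reindex path. It is a toy model and not how Lucene is actually implemented: the `Doc`, `Segment` and `Index` types and the `UpdateStock` method are all hypothetical names invented for illustration. The point is only that changing one integer means tombstoning the old copy and writing the whole document again.

```go
package main

import "fmt"

// Doc is a hypothetical product document.
type Doc struct {
	ID    string
	Price int
	Stock int
}

// Segment is written once; only its deletes set changes afterwards.
type Segment struct {
	docs    []Doc
	deleted map[int]bool // position in docs -> tombstoned
}

// Index is a list of immutable segments, newest last.
type Index struct {
	segments []*Segment
}

// Add writes the document into a brand new segment.
func (ix *Index) Add(d Doc) {
	seg := &Segment{docs: []Doc{d}, deleted: map[int]bool{}}
	ix.segments = append(ix.segments, seg)
}

// Get returns the live copy of a document, skipping tombstoned copies.
func (ix *Index) Get(id string) (Doc, bool) {
	for i := len(ix.segments) - 1; i >= 0; i-- {
		seg := ix.segments[i]
		for pos, d := range seg.docs {
			if d.ID == id && !seg.deleted[pos] {
				return d, true
			}
		}
	}
	return Doc{}, false
}

// UpdateStock changes a single field, but still has to tombstone the old
// copy and reindex the whole document into a new segment.
func (ix *Index) UpdateStock(id string, delta int) bool {
	for _, seg := range ix.segments {
		for pos, d := range seg.docs {
			if d.ID == id && !seg.deleted[pos] {
				seg.deleted[pos] = true // flag the delete
				d.Stock += delta        // d is a copy; the segment data is untouched
				ix.Add(d)               // full rewrite in a new segment
				return true
			}
		}
	}
	return false
}

func main() {
	ix := &Index{}
	ix.Add(Doc{ID: "sku-1", Price: 1999, Stock: 10})
	ix.UpdateStock("sku-1", -1)
	d, _ := ix.Get("sku-1")
	// One field changed, yet a whole extra segment now exists.
	fmt.Println(d.Stock, len(ix.segments))
}
```

Every such update leaves a dead copy behind, which is what the periodic merge eventually has to clean up.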
Immutable indexes are a very logical choice and in many ways much easier to implement. Some advantages of immutable indexes include:
- They do not require locking and can thus be read by multiple readers simultaneously
- The differential adds some overhead, but in general the reads should be very fast
- They do not require index compaction; the merge can replace compaction and run in the background
- Compression and immutability can be used to take advantage of IO caching
- Known data sets can be further compressed (e.g. delta compression, etc)
- For replication, whole segments can be copied knowing they won't change during copy
These are great advantages, particularly for data sets that don't change. Concurrency is a breeze when updates are non-blocking, so immutability is no doubt designed for consistently high read speed. In many cases this is the right trade-off, but not always.
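As one example of the compression point, a list of document IDs that is known to be sorted and will never change can be stored as gaps between neighbours rather than full values. The Go sketch below uses the standard library's varint helpers in `encoding/binary`; the function names `encodePostings` and `decodePostings` are made up for this post, and real indexes use more elaborate schemes.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodePostings delta-encodes a sorted list of document IDs: each ID is
// stored as the gap from the previous one, written as a uvarint.
func encodePostings(ids []uint64) []byte {
	buf := make([]byte, 0, len(ids))
	var tmp [binary.MaxVarintLen64]byte
	prev := uint64(0)
	for _, id := range ids {
		n := binary.PutUvarint(tmp[:], id-prev)
		buf = append(buf, tmp[:n]...)
		prev = id
	}
	return buf
}

// decodePostings reverses encodePostings by summing the gaps back up.
func decodePostings(buf []byte) []uint64 {
	var ids []uint64
	prev := uint64(0)
	for len(buf) > 0 {
		gap, n := binary.Uvarint(buf)
		if n <= 0 {
			break // malformed input; stop rather than loop forever
		}
		prev += gap
		ids = append(ids, prev)
		buf = buf[n:]
	}
	return ids
}

func main() {
	ids := []uint64{1000000, 1000003, 1000010, 1000200}
	enc := encodePostings(ids)
	fmt.Println(len(enc), "bytes encoded,", 8*len(ids), "bytes as raw uint64s")
	fmt.Println(decodePostings(enc))
}
```

The catch is that inserting an ID into the middle of such a list changes every gap after it, which is exactly why this kind of encoding suits data that is written once.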
The core goal of realtime, mutable search indexes like Sajari is to enable the search index to act more like a database and allow in-place reads and writes, while keeping the key advantages that search indexes have over databases. This is a great design goal, but a very difficult one to achieve with concurrency and non-blocking data access in mind.
Some key advantages of realtime search indexes:
- Can read records in place without copying data (i.e. zero-copy)
- Can write records in place directly
- Record attributes can be updated without needing a full delete and reindex
- Non-blocking write-aheads allow appends to still be fully non-blocking
- Differential lookups and index merging are not required
Sajari uses realtime indexes, its own data layout/flow and its own binary encoding methodology (you can read about it here), which is between 10 and 1000 times faster than available encoding packages and uses less than half the space. The benefits of zero-copy encoding/decoding cannot be overstated. Zero-copy reads have no intermediate data copies, which essentially means a) reads are extremely fast and b) there is no garbage generated during reads. This keeps the Sajari index very fast, but also allows updates to remain extremely inexpensive when compared to other search technology.
Lucene was originally designed back when periodic index creation made sense and the gap between search indexes and databases was broad. In contrast, Sajari has deliberately chosen to be closer to a hybrid database by maintaining a realtime index. The general consensus internally at Sajari is that data synchronisation will continue to be an enormous pain point and that more hybrid search-database engines will appear in coming years to bypass the issue completely. We've spent a lot of time on our indexing structure to keep it extremely fast, yet still allow differential writes at very low cost. We're now very close to releasing our 10th major index version, encompassing our new, faster encoding and an all-new non-blocking concurrent architecture.
So, what if your search index was more like a database? We'd love to hear your thoughts!