Build a high performance, realtime search engine with hybrid database-like qualities. It must be fast, handle very high dimensional queries, integrate machine learning and work either single node or distributed. Although the focus is to perform well with complex queries, standard keyword queries, facets, filtering and standard search features are also required (in fact the nice byproduct is they are ridiculously fast!).
Search has quietly been going through a major increase in complexity over the 10 years, but strangely the same old technology is typically under the hood of almost all popular search engines today. Queries have progressed from simple keyword searches to complex "matching algorithms", which look at language structure, meaning, entity relationships, your interests, your location and much much more. This difference is subtle but from a technology approach, it's fundamentally different. We tried very complicated match queries across large datasets with standard search technology and they failed miserably. The design of matching technology is a simply a different problem. It's also computationally much, much more difficult.
Real time matching across large, fast changing datasets was our goal, and we tested until we knew it was possible. We believe that search and recommendations are intertwined, they are not meant to be separate systems, both are matching systems.
The typical approach to fixing slower search indexes is to add a distributed layer and add more hardware. That makes pretty good sense, particularly for search queries hitting a handful of indexes, but with highly dimensional queries this begins to run away very quickly, as the % of indexes interrogated for each query becomes significant, it's also often a unique combination (thus most queries can't be cached). An application, such as this, distributed with a relatively low number of documents per node will scale very poorly. Correspondingly we spent a lot of time writing our own search index from the ground up, specifically for speed.
Typically search engines keep indexes for each text term, which together essentially form a columnar based storage of a sparse matrix (terms vs documents). As the number of terms in the search query becomes large, the typical approach shifts from term based indexes to a reduced dimensional vector space representation (LSA, etc). This approximates the query in a concept space (dense matrix), but loses keyword based information. One of our goals was to keep the individual term indexes, but also to supplement them with higher order information representing the concept space. This allows search, match and recommendation queries to be powered from the same index. The index itself is fast enough to calculate a full bag of words (BOW) cosine similarity of one document (the query) to millions of others sub-second on a single node.
Search indexes are typically immutable (read only) for speed. Strangely people build "realtime" applications on this storage, where every change is stored in a differential and periodically merged to new index files on disk. This makes sense for various reasons, but doesn't really suit data that is being regularly modified. This type of search index is being used everywhere as a hybrid "realtime" database, even though the consequences from a memory, CPU and IO standpoint are brutal.
Read-only search indexes trying to be realtime databases are amusing, but it's easy to see why this is attractive once database to search index synchronization is considered. Data synchronization is painful. If the search index is to be realtime and always up to date, then every database update must be instantly pushed to the index, this is essentially like using the search index as a database, but also maintaining a database. If the update frequency is not so important, this means periodically creating and replacing indexes. Both options require significant technical investment, so many prefer to simply use a search engine for both and pay the penalty of the extra hardware required.
The Sajari storage engine was designed to exhibit database tendencies, but provide an extremely fast search index that is automatically synchronized to the data. Updates do actually update records, indexes, etc in realtime like a normal database. Objects can be added, retrieved, deleted much like a key-val store, indexes can be added on specific fields, numeric fields like counters can be atomically incremented, list fields can have items appended, etc. But the search indexes are automatically modified to reflect these changes at the same time.
Read-only indexes are great for reading. Data never changes, locking is not required, compression is easy and hence IO overhead can be minimised. Proper realtime indexes that act more like databases are very different, some locking is required to ensure data modification does not occur while readers are reading. This was a difficult goal. Search indexes were typically immutable specifically to achieve consistently fast read speeds, so this was a big change in approach. Accordingly it was critical to use non-blocking algorithms to limit the locking required, this is a huge part of how Sajari keeps very fast index read speed even during concurrent writes. The vast majority of writes using Sajari require zero locking.
One of the strangest things we've seen is people building databases and search engines on open source binary encoding packages! Getting data on and off disk quickly is key to designing a high performance search engine or database. That much is obvious, but less obvious is the size of data on disk and the garbage produced (for garbage collected languages) during encoding/decoding. We've tested our own encoding against the fastest encoding libraries out there, the best performing one uses 3x more disk space and is more than 10x slower. Aside from this, we allow zero-copy reads and writes of encoded data, so garbage production is zero for many operations.
Early on we realised TFIDF, BM25, Information Gain and other statistical term distributions could be radically improved using feedback. This is one of our key differentiators. Each Sajari search index learns and gets smarter over time. The index scoring (locally and globally) is floating instead of fixed. We can only do this because we specifically avoid immutable indexes and allow non-blocking data updates as per above.
We've also added machine learning models like Naive Bayes, Random Forests and Neural Networks into the query analysis and match scoring of our engine. Input documents and queries can be automatically classified. Specific text, language patterns, concepts and entities and more can be automatically extracted. By integrating these features, they can be processed inline without noticeably slowing down queries and document addition, which was part of our goals.
Models also play a role in match scoring and ranking. So for instance the result score may be a multivariable regression including many features. And much more...