Below is a recent presentation at the Go Sydney meetup on different types of binary encoding. Go as a language is core to Sajari's technology and binary encoding is one of the key areas in which we've done a huge amount of optimisation, as it is fundamental to building a very fast search index, which is the backbone of our technology.
Unlike other search technology, we are focused on building a fully real time index, as opposed to providing read-only index "segments", which are immutable, diverge and are periodically recombined. This has interesting impacts on some of the low-level encoding techniques and trade-offs.
Our index data is memory mapped and both readable and writable always. This is technically an interesting challenge given records like documents are not fixed length but actually vary greatly in size, thus our encoding to disk and memory mapping is managed in some unique ways that allows some great features like:
- Read full or partial records in place without decoding (e.g. directly from memory mapped file)
- Increment things like counters in place (this makes Sajari closer to a database in some aspects)
- Index compaction is never required, even though non-fixed length records are added/deleted constantly
- Data co-location can be optimised for memory paging efficiency
Binary encoding is core to those features, and we need it to be fast. We've spent some time comparing different encoding techniques and have collated those into our talk. You can find the updated slides here, which include finishing off Flatbuffers full encoding/decoding of examples including nested vectors of structs, which was not completed at the time of the video.
Many thanks to Owen Prime for early work on our internal binary encoding, Alec Thomas for his very useful Go serialization benchmarks and Wouter van Oortmerssen and Robert Winslow for their assistance with Flatbuffers.