Sajari is not a fully open-source product, but we have benefited immensely from open-source projects, so we open source stand-alone components where possible. Below is a list of our open-source projects. We also have a range of SDKs and other minor projects, so feel free to browse our GitHub repositories.
docconv is a document extraction tool for turning various document formats into text. Currently it supports PDF, DOC, DOCX, XML, HTML, RTF, ODT and Pages documents, plus images (optionally, via the gosseract extension).
It can be run as a standalone service (Docker and App Engine flex versions are also available), run as an executable, or imported directly into another Go pkg. Readability extraction is optionally supported when dealing with HTML.
We use this pkg to automatically convert inbound documents to text so they can be used as queries, or added to collections.
env is a configuration management tool for Go-based services. It streamlines the way service configuration is managed, making it easier for multiple engineers to collaborate and test locally and to deploy into Kubernetes clusters.
We use this pkg to simplify our engineering workflow. env makes configuration enforceable, portable, shareable and generally easier to transition from testing into production.
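To illustrate what "enforceable" configuration can look like, here is a minimal sketch that reads required settings from environment variables and reports every missing one at once. It does not use the env pkg's API; the `Config` struct, the `load` function and the `SEARCH_*` variable names are all hypothetical, chosen only for this example.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Config is a hypothetical service configuration used only for this sketch.
type Config struct {
	ListenAddr string
	DBHost     string
}

// load builds a Config from the given lookup function (os.LookupEnv in
// production, a map-backed function in tests). Every variable is required;
// all missing names are reported together in a single error.
func load(lookup func(string) (string, bool)) (Config, error) {
	var missing []string
	get := func(key string) string {
		v, ok := lookup(key)
		if !ok || v == "" {
			missing = append(missing, key)
		}
		return v
	}
	cfg := Config{
		ListenAddr: get("SEARCH_LISTEN_ADDR"), // hypothetical variable name
		DBHost:     get("SEARCH_DB_HOST"),     // hypothetical variable name
	}
	if len(missing) > 0 {
		return Config{}, fmt.Errorf("missing required configuration: %s",
			strings.Join(missing, ", "))
	}
	return cfg, nil
}

func main() {
	cfg, err := load(os.LookupEnv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Passing the lookup function in, rather than calling `os.LookupEnv` directly, keeps the same code path usable on a laptop, in tests and in a cluster where the variables are injected by the deployment.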
fuzzy is designed to solve two problems: spell checking and query autocompletion. It was originally just for spell correction, but later extended to assist with query autocompletion. Autocompletion often requires some spell correction or fuzzy matching, so these complement each other well, but can also be used solo.
fuzzy is written in Go and is consequently quite performant. Internally we wrap it with some additional logic to extend queries with multiple terms, multiple mistakes, etc. The wrapper also allows us to deploy as a stand-alone microservice, but the package itself is a building block.
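As a rough illustration of how spell correction and autocompletion relate, the sketch below picks the closest dictionary word by Levenshtein edit distance and completes a prefix by simple scanning. This is a deliberately naive stand-in, not the algorithm or API of the fuzzy pkg; the function names `levenshtein`, `suggest` and `complete` are assumptions made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

// levenshtein returns the edit distance between a and b, where each
// insertion, deletion or substitution costs 1.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// suggest returns the dictionary word closest to query within maxDist
// edits, or "" if none qualifies. Ties go to the earlier word.
func suggest(query string, dict []string, maxDist int) string {
	best, bestDist := "", maxDist+1
	for _, w := range dict {
		if d := levenshtein(query, w); d < bestDist {
			best, bestDist = w, d
		}
	}
	return best
}

// complete returns the dictionary words that start with prefix.
func complete(prefix string, dict []string) []string {
	var out []string
	for _, w := range dict {
		if strings.HasPrefix(w, prefix) {
			out = append(out, w)
		}
	}
	return out
}

func main() {
	dict := []string{"search", "server", "merge"}
	fmt.Println(suggest("serch", dict, 2))
	fmt.Println(complete("se", dict))
}
```

Scanning the whole dictionary costs one distance computation per word, which is fine for a demo but is exactly the cost a purpose-built index is meant to avoid.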
regression is a multivariable linear regression pkg written in Go. It supports both training and prediction from the models it produces. R², variance, residuals and coefficients are also available.
We use this as part of our analysis to predict which variables influence search and matching results for a given set of training data.
The pkg contains both a server and a client, or it can be imported directly. This allows models to be loaded, served and queried from external pkgs, via the command line, or directly from inside other programs. It supports the creation of expressions, so it's possible to add and subtract words, find similar, etc.
We use this pkg in many different ways. We have models for key application spaces that are used to predict the meaning of text. We also use it for synonym generation, clustering of text and the creation of document vectors for matching algorithms.
storage is a Go package that abstracts file systems (local, in-memory, Google Cloud Storage, S3) into a few interfaces. It includes convenience wrappers for simplifying common file system use cases such as caching, prefix isolation and more!
We use this pkg extensively internally, as it abstracts away complexity from the file storage layer of our applications. For instance, we often use a fallback caching mechanism that first looks for a file in memory, then on local disk, then falls back to a redundant block store (e.g. Google Cloud Storage or Amazon S3). This helps to create fast applications that are also very resilient to failure. The interface is also very extensible, so we use it to automatically create file paths based on hashes, meta info and more.