Pipelines

Overview

The configuration of an intelligent search algorithm can be extremely complicated. Pipelines break this problem down into smaller pieces that can be mixed, matched, and combined to build a powerful search experience.

Pipelines are easily configurable YAML-based scripts that define a series of steps that are executed sequentially when indexing a record (record pipeline) or performing a query (query pipeline).

There are several advantages to pipelines versus the approach most search engines take today:

  1. Each component does one thing, so it is easy to understand.
  2. State is passed from step to step, so it is easy to build highly complex workflows.
  3. Each step can be turned on or off using conditional expressions. For example, personalization can dynamically boost results based on information in user profiles.
  4. The highly complex engine query requests are constructed for you at runtime.
  5. Pipelines can be versioned, A/B tested, and much more.

Pipeline types

Sajari uses two types of pipelines to provide this flexibility at indexing time and at query time.

Query pipelines

Query pipelines define the query execution and results ranking strategies used when searching the records in your collection. Steps in a query pipeline can be used for:

  • Query understanding - query rewrites, spelling, NLP, …
  • Filtering results - based on any attribute in the index. For example, location-based or customer-specific results.
  • Changing the relevance logic - dynamically boosting different aspects based on the search query, parameters, or data models
  • Constructing the engine query - as opposed to the input query, the engine query is what is actually executed, and it can be extremely complex

Record pipelines 

The record pipeline can update and augment information as it is indexed. Steps can include:

  • Data transformation - e.g. trimming a title
  • Data enrichment - e.g. generating a latitude and longitude from an address
  • Classification - labeling content with a category based on an existing model
  • Vectorization - clustering uncategorized records
  • Image recognition - detecting objects and faces, reading printed and handwritten text, extracting metadata

Steps

A step is a unit of work in the pipeline flow that performs one of the individual tasks listed above.

Steps are made up of several components:

  • Constants - Constants are used to configure steps. They are fixed and can't be changed at query time.
  • Parameters - Params are key-value pairs that are initialized with the request and passed from step to step. Each step can add or modify params and pass them on to subsequent steps. Once the pipeline has been executed, the modified params become available as output values of the pipeline.
  • Conditions - Each step can be conditionally executed based on the input values. Conditions are boolean expressions that use operators (AND/OR and =, ~, >, <, !=, etc.) to evaluate the pipeline param values. If the condition is satisfied, the step executes; otherwise, it is bypassed.
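The relationship between constants, params, and conditions can be sketched in a few lines of Python. This is only an illustrative model, not the Sajari API or its YAML syntax: the Step class, the execute function, the step names, and the param keys (q, segment, boost) are all hypothetical, and conditions are written as plain Python functions rather than pipeline expressions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Params = Dict[str, str]


def always(params: Params) -> bool:
    """Default condition: the step always runs."""
    return True


@dataclass
class Step:
    """Hypothetical model of a pipeline step (not the Sajari API)."""
    name: str
    run: Callable[[Params, Dict[str, str]], Params]
    constants: Dict[str, str] = field(default_factory=dict)  # fixed configuration
    condition: Callable[[Params], bool] = always              # gate on param values


def execute(steps: List[Step], params: Params) -> Params:
    """Run steps in order, passing params from one step to the next."""
    params = dict(params)  # copy so the caller's input is not mutated
    for step in steps:
        if step.condition(params):  # bypass the step when the condition fails
            params = step.run(params, step.constants)
    return params  # the modified params are the pipeline output


def lowercase_query(params: Params, constants: Dict[str, str]) -> Params:
    params["q"] = params.get("q", "").lower()
    return params


def add_boost(params: Params, constants: Dict[str, str]) -> Params:
    params["boost"] = constants["boost"]  # value comes from the step's constants
    return params


steps = [
    Step("lowercase", lowercase_query),
    Step(
        "personalize",
        add_boost,
        constants={"boost": "category:shoes"},
        condition=lambda p: p.get("segment") == "runner",
    ),
]

out = execute(steps, {"q": "Trail SHOES", "segment": "runner"})
```

In this sketch the personalize step only runs when the request carries segment=runner, so a request without that param passes through with no boost added.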

Pre-steps and Post-steps

Pre-steps and post-steps split the pipeline into two parts: one that runs before the request is sent to the search index and another that runs afterward.

Pre-steps and post-steps in query pipelines

When running a query, the pipeline post-steps have access to the result set. This makes it possible to act on the results before sending them back to the caller.
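As a rough illustration of this split, the following Python sketch runs pre-steps on the params, executes a search, and then hands the results to the post-steps. Everything here is an assumption made for the example: the in-memory INDEX, the search function, and the step names stand in for the real search index and real pipeline steps.

```python
from typing import Callable, Dict, List, Tuple

Params = Dict[str, str]
Results = List[Dict[str, str]]

# Hypothetical in-memory stand-in for the search index.
INDEX: Results = [
    {"title": "Trail shoes", "in_stock": "true"},
    {"title": "Road shoes", "in_stock": "false"},
    {"title": "Rain jacket", "in_stock": "true"},
]


def normalize_query(params: Params) -> Params:
    """Pre-step: clean up the input query before it reaches the index."""
    params = dict(params)
    params["q"] = params.get("q", "").strip().lower()
    return params


def search(params: Params) -> Results:
    """Stand-in for executing the engine query against the index."""
    return [r for r in INDEX if params["q"] in r["title"].lower()]


def drop_out_of_stock(params: Params, results: Results) -> Tuple[Params, Results]:
    """Post-step: act on the result set before it is returned."""
    kept = [r for r in results if r["in_stock"] == "true"]
    params = dict(params, result_count=str(len(kept)))
    return params, kept


def run_query_pipeline(
    params: Params,
    pre_steps: List[Callable[[Params], Params]],
    post_steps: List[Callable[[Params, Results], Tuple[Params, Results]]],
) -> Tuple[Params, Results]:
    for step in pre_steps:       # runs before the request is sent to the index
        params = step(params)
    results = search(params)
    for step in post_steps:      # runs afterward, with access to the results
        params, results = step(params, results)
    return params, results


params, results = run_query_pipeline(
    {"q": "  SHOES "}, [normalize_query], [drop_out_of_stock]
)
```

Only the post-step can filter on stock status here, because it is the only stage that sees the result set.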

Pre-steps and post-steps in record pipelines

For indexing operations, pre-steps are used to update and augment the record before it is stored in the index. The pipeline post-steps only run when creating new records. They do not execute when updating records.
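The create-versus-update behavior can be modeled with a minimal Python sketch. The dictionary store, the index_record function, and the trim_title and log_created steps are hypothetical names chosen for illustration, and the example assumes that a record counts as new when its key is not yet in the store.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def trim_title(record: Record) -> Record:
    """Pre-step: transform the record before it is stored."""
    return dict(record, title=record.get("title", "").strip())


def index_record(
    store: Dict[str, Record],
    key: str,
    record: Record,
    pre_steps: List[Callable[[Record], Record]],
    post_steps: List[Callable[[Record], None]],
) -> Record:
    for step in pre_steps:        # always run, for creates and updates
        record = step(record)
    is_new = key not in store     # assumption: a new key means a new record
    store[key] = record
    if is_new:                    # post-steps run only when creating a record
        for step in post_steps:
            step(record)
    return record


store: Dict[str, Record] = {}
created: List[str] = []


def log_created(record: Record) -> None:
    """Post-step: note which records were newly created."""
    created.append(record["title"])


index_record(store, "sku-1", {"title": "  Trail shoes "}, [trim_title], [log_created])
index_record(store, "sku-1", {"title": "Trail shoes v2"}, [trim_title], [log_created])
```

After both calls, the store holds the updated record, while the post-step has run only once, for the initial create.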