## API clients

We support a range of API clients, mostly built around our gRPC definitions. gRPC can also generate SDKs in other languages not listed here.

# Documentation

### Terminology

To begin, we need to explain that we refer to an particular search index as a collection (i.e. a collection of records that make up the index). Accounts in the system are called projects.

• collection a schema and collection of records which make up a search index.
• schema definition of fields which can be defined on records, including properties and constraints that can be applied fields.
• record field-value pairs (i.e. structured key-value pairs) representing an object stored in the search index, analogous to a row in a database.
• field is a sub-component of a record, analogous to a column in a database.

# Collections

## Schema

A schema is a list of fields which are valid record fields for a collection.

Name
The name of the field. This uniquely defines the field.
Type
Type of data stored in the field (STRING. INTEGER, FLOAT,TIMESTAMP).
Repeated
The field contains a single value or repeated values of the same type.
Indexed
Whether the field should be query-able in index-based queries.
Required
Whether this field should be required (i.e. set) for all records.
Unique
Values in this field must be unique (i.e. they uniquely define a record within a collection).

### Types

Name Description Example
STRING String of text "I'm the walrus"
INTEGER Integer value 42
FLOAT Float value 12.34
TIMESTAMP Date/time value 1234567890 (UNIX timestamp) or 2009-02-13T23:31:30+00:00 (RFC3339)

## Records

A record is a set of field-value pairs of data that can be queried in a collection. The database equivalent is a “row”, in information systems it could be referred to as a “document” or “object”.

New records can be added to a collection. Adding a record stores the field information in two ways:

• All fields not marked as non-stored are added to meta storage.
• Any transforms associated with the add request are applied to the fields to produce a set of terms, for any fields marked as indexed in the schema, these terms are added to reverse indexes.

Notes:

• transforms are applied sequentially, order is important.
• by default a splitter and english stop word stemmer are applied
• different transforms can be used for different languages and various other requirements

### Patch Record

Allows stored records to be modified.

### Delete Record

Allows records to be deleted.

### Terminology

• body is a weighted free-text component of a query.
• term is a tokenised piece of text, e.g. a word or number.
• intersection describes where terms and records intersect, e.g. the term exists in the record.
• index is a list of intersections. A forward index describes the terms in a particular record. A reverse index describes all the records which contain a particular term.
• index query is a query component that evaluates the indexed text from each result record. This deals with term - record intersections.
• feature query is a query component that evaluates features extracted from each result record.
• instance boost changes the importance of a term - record intersection instance based on specified criteria.
• feature boost changes the importance of an extracted feature based on specified criteria.
• filter a rule which is evaluated against the fields of a record.
• sort allows the results to be sequentially sorted instead of default score based sorting.
• offset return results beginning with the one at offset.
• limit is the number of records to return in a response.
• fields which fields are returned with each record in results.
• aggregations are used to build aggregated analytical information from a set of search query results.
• transforms modify queries in various ways. They can create new information or augment existing query information.

### Basic search concepts

Searching follows a basic flow:

1. Identify records to include in the result set
1. Apply boosts to adjust rankings

The first step, identifying records to include in the result set, can be a combination of:

• Keyword-based queries: free-text, pre-tokenised text, include records with indexed fields which contain keywords/terms.
• Filters: only include documents which satisfy some field-based rule.

A search is made up of two main components, the index query and feature query. These two can be used in isolation, or together. Both can be boosted based on the records other fields.

Keyword based searches start with the index query, which deals with traditional inverse index lookups (i.e. query matches in documents). The initial index query score is generated based on the overlap of the input query terms and the records indexed fields. Where as the feature query deals with creating isolated features from the record that each make up set percentages of the result score.

Matching queries mainly utilise the feature query component, as this allows many components to be created with set weightings to create a cumulative match score.

Queries including an index query component are very fast and efficient as they immediately filter the potential result set to those with matching terms.

#### Why split into two components?

Put simply, these two components make the scoring system very, very flexible. Some scoring components are multiplicative with respect to index intersections, others are additive (focused on field information) and ignore the index calculation altogether. Simplistically this can be thought of as:

Search applications tend to focus on the index query, where matching text is critical to scoring results. Any boosting on the index query is multiplicative. An example of this is weighting fields differently, for web content it’s common that a “title” match is more important than other fields, a “product name” is more important for ecommerce, etc.

Matching applications tend to use many feature query components to create their own match score. Machine learning can help determine what each of these components should be worth. By allowing these to be weighted in the query, this also allows applications to experiment to optimise weightings. Whereas pure search applications may not use any feature query components at all, but it may be more important for keyword matches to occur in certain fields (e.g. “title”).

#### Scoring the two components

The score for each result document is between 0-1. The relative maximum weighting of the index query component is determined by adding the allocated weightings of all the feature query components together. For example, if there are two feature query components worth 0.2 and 0.1 respectively, then the maximum index query weighting will be worth:

Note: the total of all feature query components added together cannot exceed 1.0. If the total equals 1.0 then the index query component is worth zero. If there are no feature query components then the index query is worth the full maximum of 1.0. If there are no index query components, the index query score component is the remainder of 1.0 minus the feature query components.

### Index query

For keyword based searching the index query is an extremely fast method to find documents containing the search terms. By default the index lookup uses OR notation (if a record has any of the search terms, it will be included in the result set). Additionally, sequential query terms also appearing in result records sequentially are worth higher scores. The maximum index score will occur when the input search phrase occurs in the result record (i.e. the phrase sequence is important in the score).

Individual term scores are weighted use an Adaptive Inverse Document Frequency (AIDF) methodology. Ignoring the adaptive component, IDF notation is a logarithmic function making popular terms (large doc_freq) worth less than infrequent terms (low doc_freq) when compared to the document population size (num_docs):

IDF is very common in information retrieval and works reasonably, but it’s not perfect. We augment this with our own methodology that adapts the IDF score based on historical performance. This allows your customer feedback to train the score distribution over time to provide better results.

### Feature query

Feature query components do not interact with term indexes, they instead evaluate field values against the query to create feature components of the score. For each feature, the query component contains an input value to be compared against a field value, this comparison creates the feature value used for scoring.

An example use of the feature query is a credit score algorithm, where the “payment history” is worth 35%, the “credit balance” is worth 30%, etc. In this mode term indexes are not relevant to the score of each component and thus they are calculated independently.

feature query components used without index query components cause a full scan of all records, which can be slower if the number of records is very large. However this can be extremely useful, in the above credit score example, this means records can be sorted and/or filtered by credit score, or aggregations can be used to look at the credit score distribution over all records.

feature query components used in tandem with index query components are both extremely fast and powerful. The index query component is a very fast way to filter out a subset of records on which more complex feature query components can be calculated. An example here is job searching algorithms, where the query text is important, but this can also be layered with features, such as salary, industry and skill predictors for the searchers profile.

### Sorting

By default, results are sorted by their score if no sorts are defined. If sorts are defined, they will be used in order to sort the results using specific field values instead of their score. If multiple sorts are provided they are processed in sequential order with descending priority. The direction of each sort can be either ascending (ASC) or descending (DESC).

Example uses of sorting are time, distance, price, size, name, etc.

### Filtering

A filter is a rule which is checked against the values of a record: a record either satisfies a filter, or it does not. At the top level of a query, filters can be used to remove records from a result set, but they can also be used in FilterFieldBoost to boost records which are matched.
In order to boost or exclude records matching specific conditions, a single filter or boolean combination of multiple filters is used. In the case of excluding records from results, the filter is used as a filter, matching records are immediately dropped from the result set. Whereas in the case of boosting, matching records are either increased or decreased in score.

#### Filter types

Filters can be field based filters or geo based filters (a special case for lat-lng coordinates) which are run against record values, and combinators which are used to combine filters.

##### Geo filter

The geo filter is constructed from a geographical area (a centre point given by a lat-lng coordinates and a radius in km). Records can be matched if they are inside or outside this area. satisfied by a record if filter that creates a radius around a lat-lng geo-point and defines the condition of being either “inside” or “outside” this radius. The radius uses the “haversine” based calculation for distance.

#### Combinators

Are used to group filters to satisfy boolean conditions. Combinators can also include other combinators to achieve a range of conditions.

Combinator Usage
ALL All of the filters must be satisfied.
ANY Any of the filters must be satisfied.
ONE One of the filters (and only one) must be satisfied.
NONE None of the filters must be satisfied.

### Boosting

Under certain conditions, boosts either increase or decrease the score for matching records. A boost is therefore a method for setting both a condition and an associated value if that condition is met.

There are two main types of boosts, which differ due to the underlying data they act on. The first are field boosts, which act on fields. The second are instance boosts, which operate on intersections of terms and documents (i.e. indexes).

#### Boost usage

Instance boosts can only be used on index query components.

Field boosts can be used in both index and feature based query components. A boost component on the index query multiplies the score by the boost value, hence a boost value of zero will drag the index query component score to zero and exclude the record from the result set. If this same boost is used in the feature query component it will only drag that single feature score to zero, so this won’t exclude the record from the result set. Keep this in mind when using field boosts.

#### Field boosts

Each of the various field types can be boosted in a number of different ways.

Field boost type Usage Example
Filter boost when a condition is met if a record is in a particular category, increase it’s value
Interval boost numeric values based on a series of intervals records older than two weeks should be worth zero, then increase in value the closer they are to “now”
Element boost based on the overlap of two lists increase the value of records based on how well their list of “tags” overlap with each other
Text boost based on the overlap of two strings increase the value of profiles where the “summary” text is more similar
##### Filter field boost

Apply a boost to records matching a filter. These are often used to boost categories, e.g. a search made from the “support” section of a website may want to increase the value of “support” related articles. This is a simple example, but can be extended further to combine many complex boolean conditions.

Filter boosts are also used extensively in matching applications, where a series of conditions contribute set percentages of the potential score for each record.

##### Interval field boost

Used on numeric fields, an interval field boost defines a boost based on a sequence of (field-value, boost) pairs.
When a field value falls between two interval points then the associated boost is determined linearly using the boost values from the two points. Field values that fall outside the interval are ignored (so a boost value equivalent to 1).

For example: An interval boost on the field num with points: (value: 0, boost: 0.5), (value: 1, boost: 1.5), would behave as follows:

num Boost
0 0.5
0.5 1.0
1 1.5
2 1.0

Common use cases: time-based weighting (e.g. older than now worth zero, future events with decaying value into the future). This is also used for bucketing values, such as salary bands in human resources, etc.

##### Element field boost

Compares lists of elements using a cosine overlap formula (measures the angle between the two vectors and takes the cosine, which returns a value between 0.0 - 1.0).

Common use cases: social network searches (more friends in common are boosted), dating (comparing lists of interests or hobbies) and human resources (comparing skills), etc.

##### Text field boost

Text field boosting is similar to the element boost, except the inputs are strings. Each string is tokenised into a “bag of words” where each word becomes an element in an array. The arrays of the input and result record are then compared using cosine similarity.

Note: There is no effort to understand word or phrase meaning, this is a very simple comparison. Normally an index query would handle this anyway, but occasionally when comparing things like profiles, it works well to run this on a specific field.

#### Instance boosts

Each intersection is a data point in an index detailing how a term relates to the record it occurred in. This includes information such as which field it occurred in, the position it occurred in the field and a localised instance score that identifies the performance of this intersection in results (for more information on how this is calculated, see the Learning section).

Currently instance boosts are used in several ways:

Instance boost type Usage Example
field weight specific fields more than others favour title intersections for site search
instance score favour instances people prefer in results favour intersections with higher click through rates

### Aggregations

Aggregations are used to perform analytic operations over query result sets. These can be metric operations, such as average, maximum, minimum or count. They can also use custom “buckets”, where each bucket is defined using filters.

#### Numeric aggregations

Numeric aggregations are used to perform analytical queries on a query result set.

#### Bucket aggregations

Bucket aggregations provide a method to group results into custom groupings. Each “bucket” has a “name” used to identify it in the response and a filter to determine which records should be included in the bucket grouping. Given filters can also be grouped, nested and applied on any field, the conditions defining each bucket grouping can be as complex as required.

### Tracking

Feedback is used to incrementally improve results and power analytics. There are two types of feedback learning used with differing impact on search ranking scores.

#### click tracking

Creates a token for each potential search result record, which is passed around with the results and used to create a “click through” URL for each result. By requesting the URL containing the embedded token, the action of a click will be recorded for this particular search result and optionally the user will be redirected on to a new destination.

The token itself contains encoded information about the search generating the results and the result item it references. This means it can always be connected back to the search request that generated it.

The impact of click based tracking is to create an intersection score for every index component involved in a result. For instance if a result underperforms for a given query in a given result position, the instance score for the intersection will reduce. If an instance score boost is used, this will favour results that have performed better at an individual search level for the query. Your customers are actually automatically training your search to become more intelligent.

#### pos-neg tracking

Similar to click tracking, except the result position and expected click through rate (CTR) is not used as an indicator, instead direct feedback is submitted on results. This is more applicable to “like” - “dislike” applications such as dating and human resources applicant processing.

Pos-neg tracking is commonly used for applications where clicks / non-clicks on results don’t correspond to a relevance. Examples include interfaces where results are fed to a user one by one (Tinder style).

### Transforms

Transforms are used to process information on queries and records. They are extremely flexible and can also be chained together and executed sequentially.

#### Query transforms

A query transform takes a query, transforms it and returns a new query. The transform process itself can do anything.

Example uses include autocomplete, spell checking, classification, natural language processing, sentiment analysis, stemming, tokenisation and any other type of feature extraction.

Query transforms can also be applied to search queries either before or after the query is executed.

#### Record transforms

A record transform takes a record, transforms it and returns a new record. Transforms can do anything. The transform process itself can do anything.

Example uses include stemming, tokenisation, classification, clustering, word embedding and any other type of feature extraction.

#### Using transforms for matching

Transforms can be used to extract features from both queries and documents for comparison during the search process.