To begin, we need to explain that we refer to an particular search index as a
collection (i.e. a collection of
records that make up the index). Accounts in the system are called
- collection a
schemaand collection of
recordswhich make up a search index.
- schema definition of fields which can be defined on
records, including properties and constraints that can be applied fields.
- record field-value pairs (i.e. structured key-value pairs) representing an object stored in the search index, analogous to a row in a database.
- field is a sub-component of a record, analogous to a column in a database.
A schema is a list of fields which are valid record fields for a collection.
- The name of the field. This uniquely defines the field.
- Type of data stored in the field (
- The field contains a single value or repeated values of the same type.
- Whether the field should be query-able in index-based queries.
- Whether this field should be required (i.e. set) for all records.
- Values in this field must be unique (i.e. they uniquely define a record within a collection).
||String of text||
A record is a set of field-value pairs of data that can be queried in a collection. The database equivalent is a “row”, in information systems it could be referred to as a “document” or “object”.
New records can be added to a
collection. Adding a record stores the field information in two ways:
transformsassociated with the add request are applied to the fields to produce a set of
terms, for any fields marked as
schema, these terms are added to reverse indexes.
Allows stored records to be modified.
Allows records to be deleted.
- body is a weighted free-text component of a query.
- term is a tokenised piece of text, e.g. a word or number.
- intersection describes where terms and records intersect, e.g. the term exists in the record.
- index is a list of intersections. A forward index describes the terms in a particular record. A reverse index describes all the records which contain a particular term.
- index query is a query component that evaluates the indexed text from each result
record. This deals with
- feature query is a query component that evaluates features extracted from each result
- instance boost changes the importance of a
recordintersection instance based on specified criteria.
- feature boost changes the importance of an extracted feature based on specified criteria.
- filter a rule which is evaluated against the
- sort allows the results to be sequentially sorted instead of default score based sorting.
- offset return results beginning with the one at
- limit is the number of records to return in a response.
- fields which fields are returned with each
- aggregations are used to build aggregated analytical information from a set of search query results.
- transforms modify queries in various ways. They can create new information or augment existing query information.
Searching follows a basic flow:
The first step, identifying
records to include in the result set, can be a combination of:
A search is made up of two main components, the
index query and
feature query. These two can be used in isolation, or together. Both can be boosted based on the
Keyword based searches start with the
index query, which deals with traditional inverse index lookups (i.e. query matches in documents). The initial
index query score is generated based on the overlap of the input query
terms and the
fields. Where as the
feature query deals with creating isolated
features from the
record that each make up set percentages of the result score.
Matching queries mainly utilise the
feature query component, as this allows many components to be created with set weightings to create a cumulative match score.
Queries including an
index query component are very fast and efficient as they immediately filter the potential result set to those with matching
Put simply, these two components make the scoring system very, very flexible. Some scoring components are multiplicative with respect to index
intersections, others are additive (focused on field information) and ignore the index calculation altogether. Simplistically this can be thought of as:
Search applications tend to focus on the
index query, where matching text is critical to scoring results. Any boosting on the
index query is multiplicative. An example of this is weighting fields differently, for web content it’s common that a “title” match is more important than other fields, a “product name” is more important for ecommerce, etc.
Matching applications tend to use many
feature query components to create their own match score. Machine learning can help determine what each of these components should be worth. By allowing these to be weighted in the query, this also allows applications to experiment to optimise weightings. Whereas pure search applications may not use any
feature query components at all, but it may be more important for keyword matches to occur in certain fields (e.g. “title”).
The score for each result document is between 0-1. The relative maximum weighting of the
index query component is determined by adding the allocated weightings of all the
feature query components together. For example, if there are two
feature query components worth 0.2 and 0.1 respectively, then the maximum
index query weighting will be worth:
Note: the total of all
feature query components added together cannot exceed 1.0. If the total equals 1.0 then the
index query component is worth zero. If there are no
feature query components then the
index query is worth the full maximum of 1.0. If there are no
index query components, the
index query score component is the remainder of 1.0 minus the
feature query components.
For keyword based searching the
index query is an extremely fast method to find documents containing the search terms. By default the index lookup uses
OR notation (if a
record has any of the search terms, it will be included in the result set). Additionally, sequential query terms also appearing in result
records sequentially are worth higher scores. The maximum index score will occur when the input search phrase occurs in the result
record (i.e. the phrase sequence is important in the score).
Individual term scores are weighted use an Adaptive Inverse Document Frequency (AIDF) methodology. Ignoring the adaptive component, IDF notation is a logarithmic function making popular terms (large
doc_freq) worth less than infrequent terms (low
doc_freq) when compared to the document population size (
IDF is very common in information retrieval and works reasonably, but it’s not perfect. We augment this with our own methodology that adapts the IDF score based on historical performance. This allows your customer feedback to train the score distribution over time to provide better results.
Feature query components do not interact with term
indexes, they instead evaluate
field values against the query to create feature components of the score. For each feature, the query component contains an input value to be compared against a
field value, this comparison creates the feature value used for scoring.
An example use of the
feature query is a credit score algorithm, where the “payment history” is worth 35%, the “credit balance” is worth 30%, etc. In this mode term
indexes are not relevant to the score of each component and thus they are calculated independently.
feature query components used without
index query components cause a full scan of all records, which can be slower if the number of
records is very large. However this can be extremely useful, in the above credit score example, this means
records can be sorted and/or filtered by credit score, or
aggregations can be used to look at the credit score distribution over all
feature query components used in tandem with
index query components are both extremely fast and powerful. The
index query component is a very fast way to filter out a subset of
records on which more complex
feature query components can be calculated. An example here is job searching algorithms, where the query text is important, but this can also be layered with
features, such as salary, industry and skill predictors for the searchers profile.
By default, results are sorted by their
score if no
sorts are defined. If
sorts are defined, they will be used in order to sort the results using specific field values instead of their score. If multiple sorts are provided they are processed in sequential order with descending priority. The direction of each sort can be either ascending (ASC) or descending (DESC).
Example uses of sorting are time, distance, price, size, name, etc.
A filter is a rule which is checked against the values of a
record: a record either satisfies a filter, or it does not. At the top level of a query, filters can be used to remove
records from a result set, but they can also be used in
FilterFieldBoost to boost records which are matched.
In order to boost or exclude
records matching specific conditions, a single
filter or boolean combination of multiple
filters is used. In the case of excluding
records from results, the filter is used as a filter, matching
records are immediately dropped from the result set. Whereas in the case of
records are either increased or decreased in score.
Filters can be
field based filters or
geo based filters (a special case for lat-lng coordinates) which are run against record values, and
combinators which are used to combine filters.
geo filter is constructed from a geographical area (a centre point given by a lat-lng coordinates and a radius in km). Records can be matched if they are inside or outside this area. satisfied by a record if filter that creates a radius around a lat-lng geo-point and defines the condition of being either “inside” or “outside” this radius. The radius uses the “haversine” based calculation for distance.
Are used to group filters to satisfy boolean conditions. Combinators can also include other combinators to achieve a range of conditions.
||All of the filters must be satisfied.|
||Any of the filters must be satisfied.|
||One of the filters (and only one) must be satisfied.|
||None of the filters must be satisfied.|
Under certain conditions, boosts either increase or decrease the score for matching
records. A boost is therefore a method for setting both a condition and an associated value if that condition is met.
There are two main types of boosts, which differ due to the underlying data they act on. The first are
field boosts, which act on fields. The second are
instance boosts, which operate on
intersections of terms and documents (i.e. indexes).
Instance boosts can only be used on
index query components.
Field boosts can be used in both
feature based query components. A boost component on the
index query multiplies the score by the boost value, hence a boost value of zero will drag the
index query component score to zero and exclude the
record from the result set. If this same boost is used in the
feature query component it will only drag that single feature score to zero, so this won’t exclude the
record from the result set. Keep this in mind when using field boosts.
Each of the various field types can be boosted in a number of different ways.
|Field boost type||Usage||Example|
||boost when a condition is met||if a
||boost numeric values based on a series of intervals||
||boost based on the overlap of two lists||increase the value of
||boost based on the overlap of two strings||increase the value of profiles where the “summary” text is more similar|
Apply a boost to
records matching a filter. These are often used to boost categories, e.g. a search made from the “support” section of a website may want to increase the value of “support” related articles. This is a simple example, but can be extended further to combine many complex boolean conditions.
Filter boosts are also used extensively in
matching applications, where a series of conditions contribute set percentages of the potential score for each
Used on numeric fields, an interval field boost defines a boost based on a sequence of (field-value, boost) pairs.
When a field value falls between two interval points then the associated boost is determined linearly using the boost values from the two points. Field values that fall outside the interval are ignored (so a boost value equivalent to 1).
For example: An interval boost on the field
num with points: (value: 0, boost: 0.5), (value: 1, boost: 1.5), would behave as follows:
Common use cases: time-based weighting (e.g. older than now worth zero, future events with decaying value into the future). This is also used for bucketing values, such as salary bands in human resources, etc.
Compares lists of elements using a cosine overlap formula (measures the angle between the two vectors and takes the cosine, which returns a value between 0.0 - 1.0).
Common use cases: social network searches (more friends in common are boosted), dating (comparing lists of interests or hobbies) and human resources (comparing skills), etc.
Text field boosting is similar to the element boost, except the inputs are strings. Each string is tokenised into a “bag of words” where each word becomes an element in an array. The arrays of the input and result
record are then compared using cosine similarity.
Note: There is no effort to understand word or phrase meaning, this is a very simple comparison. Normally an
index query would handle this anyway, but occasionally when comparing things like profiles, it works well to run this on a specific field.
intersection is a data point in an index detailing how a
term relates to the
record it occurred in. This includes information such as which field it occurred in, the position it occurred in the
field and a localised
instance score that identifies the performance of this intersection in results (for more information on how this is calculated, see the
Currently instance boosts are used in several ways:
|Instance boost type||Usage||Example|
||weight specific fields more than others||favour
||favour instances people prefer in results||favour
Aggregations are used to perform analytic operations over query result sets. These can be metric operations, such as average, maximum, minimum or count. They can also use custom “buckets”, where each bucket is defined using
Numeric aggregations are used to perform analytical queries on a query result set.
Bucket aggregations provide a method to group results into custom groupings. Each “bucket” has a “name” used to identify it in the response and a
filter to determine which records should be included in the bucket grouping. Given
filters can also be grouped, nested and applied on any field, the conditions defining each bucket grouping can be as complex as required.
Feedback is used to incrementally improve results and power analytics. There are two types of feedback learning used with differing impact on search ranking scores.
Creates a token for each potential search result
record, which is passed around with the results and used to create a “click through” URL for each result. By requesting the URL containing the embedded token, the action of a
click will be recorded for this particular
search result and optionally the user will be redirected on to a new destination.
The token itself contains encoded information about the search generating the results and the result item it references. This means it can always be connected back to the search request that generated it.
The impact of click based tracking is to create an
intersection score for every index component involved in a result. For instance if a result underperforms for a given query in a given result position, the
instance score for the
intersection will reduce. If an
instance score boost is used, this will favour results that have performed better at an individual search level for the query. Your customers are actually automatically training your search to become more intelligent.
Similar to click tracking, except the result position and expected click through rate (CTR) is not used as an indicator, instead direct feedback is submitted on results. This is more applicable to “like” - “dislike” applications such as dating and human resources applicant processing.
Pos-neg tracking is commonly used for applications where clicks / non-clicks on results don’t correspond to a relevance. Examples include interfaces where results are fed to a user one by one (Tinder style).
Transforms are used to process information on queries and records. They are extremely flexible and can also be chained together and executed sequentially.
A query transform takes a query, transforms it and returns a new query. The transform process itself can do anything.
Example uses include autocomplete, spell checking, classification, natural language processing, sentiment analysis, stemming, tokenisation and any other type of feature extraction.
Query transforms can also be applied to search queries either before or after the query is executed.
A record transform takes a record, transforms it and returns a new record. Transforms can do anything. The transform process itself can do anything.
Example uses include stemming, tokenisation, classification, clustering, word embedding and any other type of feature extraction.
Transforms can be used to extract features from both queries and documents for comparison during the search process.