Typical search indexes are static. Unless you have a lot of coding experience, you're stuck with periodic index updates (e.g. all content is crawled weekly and then frozen until the next crawl). Static indexes are great from an administration perspective: they are read-only, which makes them very predictable and easy to scale. But the downsides are many: your content is out of date immediately, new content and edits are ignored, deleted content remains, and both the performance of your content and how users interact with it are completely ignored.
A static index also cannot reflect how people interact with your content. If an article is newer, shouldn't it rank higher in search results? If it's popular, shouldn't that influence results as well? We think so. An index that updates continuously not only lets you optimize how your content is searched, it also lets you recommend related, popular or recent content, all from the same index!
Sajari also allows you to customize how your content is indexed, both in how pages are processed and by enriching your page data with additional fields. Both approaches are explained briefly below.
Your website might already contain rich information about your products or events, such as dates, locations, categories, product codes, prices, and more. Sajari can capture this automatically when your web pages are indexed, and the captured data can then be used for searching and for powering recommendations. Changing a price, location or any other data on your site will then be automatically synchronized to the search system.
Note: when using the API to add objects to a collection, you can specify whatever fields you like. The examples shown below are for customers that want automatic indexing of their website.
Custom fields are defined in HTML by adding data attributes to elements. To avoid name clashes with other systems, all data attributes used by the crawler have the prefix data-sj-.
Note: Any new fields encountered by the crawler are created as STRING fields by default. If you need a different type (TIME, for example), first create the field using the Schema tab of the Console.
By default the crawler reads <meta> tags within <head>, but only keeps standard fields (title, description, keywords, etc). Add a data-sj-field="fieldname" attribute to override this behaviour and create a custom field from the meta tag's content attribute. This example shows an otherwise ignored <meta> tag being converted into a custom field:
<meta property="custom meta field" data-sj-field="fieldname" content="fieldvalue"/>
To capture data already rendered within an element, add data-sj-field="fieldname" to it:
<span data-sj-field="random">This text is the value</span>
This sets the custom field random="This text is the value".
If you don't want the data rendered on the page, you can instead set the field value with the data-sj-value attribute:
<span data-sj-field="fieldname" data-sj-value="fieldvalue">This text is not used because the data attribute has a value</span>
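The three ways of setting a field value described above (a <meta> tag's content attribute, an element's rendered text, and an explicit data-sj-value) can be illustrated with a small parser. This is a minimal sketch built on Python's standard html.parser module, not Sajari's actual crawler: the FieldExtractor class and extract_fields function are hypothetical names, and the handling of nested elements (outer fields also receive the text of inner elements) is an assumption.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they can never contain text.
VOID_TAGS = {"meta", "link", "br", "hr", "img", "input"}


class FieldExtractor(HTMLParser):
    """Collects data-sj-field values from a page (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._open = []  # (tag, field name being captured or None)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("data-sj-field")
        capture = None
        if name is not None:
            if attrs.get("data-sj-value") is not None:
                # An explicit data-sj-value wins over any rendered text.
                self.fields[name] = attrs["data-sj-value"]
            elif tag == "meta":
                # For <meta> tags the value is the content attribute.
                self.fields[name] = attrs.get("content") or ""
            else:
                # Otherwise collect the element's text as it is parsed.
                self.fields[name] = ""
                capture = name
        if tag not in VOID_TAGS:
            self._open.append((tag, capture))

    def handle_endtag(self, tag):
        # Close the most recent matching open element, if there is one.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                del self._open[i:]
                break

    def handle_data(self, data):
        # Text belongs to every open element that is capturing a field.
        for _, capture in self._open:
            if capture is not None:
                self.fields[capture] += data


def extract_fields(html):
    """Returns a dict of field name to value for one HTML document."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return {name: value.strip() for name, value in parser.fields.items()}
```

Calling extract_fields on a page containing the three snippets above would yield one entry per data-sj-field attribute, with the text inside the data-sj-value example discarded.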
Fields are useful in many ways. They can be displayed in results, or used to influence how queries are sorted and filtered. Below are some samples of how custom fields can be used with HTML to automatically change the behaviour of search boxes and recommendation widgets.
Problem: I want a search box, but I only want to show results from a particular category.
Solution: On each page with an associated category, add the category to the page as a custom field (note: this does not need to be visible), e.g.
<span data-sj-field="category" data-sj-value="unstructured data"></span>
In the above case, the attribute data-sj-field="category" indicates this page has an associated "category" that should be added as a field.
Problem: I only want to show results that have products with price greater than $10.00.
Solution: On each page with an associated product, add the product details (including price) to the page as custom fields, e.g.
<div data-sj-field="sku" data-sj-value="12345"> <span data-sj-field="product">blue widgets</span> <span data-sj-field="price">20.00</span> </div>
In the above case, the page will have an associated "sku", "product" and "price".
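Both problems above come down to filtering on a field. The sketch below shows the intended filter semantics on a handful of hypothetical records; it is not the Sajari query API, and the pages list, in_category and price_above are made-up names. Because crawler-created fields are STRING by default, the price is parsed before it is compared.

```python
from decimal import Decimal, InvalidOperation

# Hypothetical records, shaped like the fields captured in the examples above.
pages = [
    {"url": "/widgets/blue", "category": "widgets", "price": "20.00"},
    {"url": "/widgets/red", "category": "widgets", "price": "8.50"},
    {"url": "/blog/unstructured", "category": "unstructured data"},
]


def in_category(page, category):
    """True if the page carries exactly this category value."""
    return page.get("category") == category


def price_above(page, minimum):
    """True if the page has a parseable price strictly greater than minimum."""
    try:
        return Decimal(page["price"]) > Decimal(minimum)
    except (KeyError, InvalidOperation):
        # Pages without a price, or with a malformed one, never match.
        return False


category_matches = [p["url"] for p in pages if in_category(p, "unstructured data")]
price_matches = [p["url"] for p in pages if price_above(p, "10.00")]
```

Comparing the raw strings instead would sort "8.50" above "20.00", which is one reason to create a numeric field in the schema before crawling if you plan to filter on price.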
Problem: I have very locally targeted content and wish to recommend local content based on my site visitor location.
Solution: On each "locally" targeted content page, add two pieces of meta information as follows. e.g.
<span data-sj-field="lat" data-sj-value="-33.867487"></span><span data-sj-field="lng" data-sj-value="151.3615434"></span>
In the above case, the data-sj-field attribute indicates this is information specific to the page. So data-sj-field="lat" indicates this page has a field called "lat" with the corresponding value -33.867487.
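To show how "lat" and "lng" fields could drive local recommendations, here is a minimal sketch that orders pages by great-circle distance from a visitor using the haversine formula. It is an illustration under assumptions, not how Sajari ranks results: haversine_km and nearest_first are hypothetical helpers, the page records are invented, and the field values are treated as strings that must be converted to floats.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_first(pages, visitor_lat, visitor_lng):
    """Returns pages sorted by distance from the visitor, closest first."""
    return sorted(
        pages,
        key=lambda p: haversine_km(
            visitor_lat, visitor_lng, float(p["lat"]), float(p["lng"])
        ),
    )


# Hypothetical pages with string field values, as a crawler might store them.
local_pages = [
    {"url": "/melbourne", "lat": "-37.8136", "lng": "144.9631"},
    {"url": "/sydney", "lat": "-33.867487", "lng": "151.20699"},
]
ranked = nearest_first(local_pages, -33.9, 151.2)
```

Latitude must lie between -90 and 90 and longitude between -180 and 180; values outside those ranges are worth rejecting before they reach the index.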
If you want to use Sajari functionality on a page, but you don't want it indexed, there are several ways to deal with this. If you add data-sj-noindex to any element on a page, Sajari will not index that page. Because this usually goes hand in hand with blocking other indexers, such as web search engines, we frequently see it combined with the robots noindex meta tag, e.g.
<meta name="robots" content="noindex" data-sj-noindex />
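The rule that data-sj-noindex on any element excludes the whole page can be sketched as a simple attribute scan. This is only an illustration using Python's standard html.parser; NoIndexDetector and should_index are hypothetical names and do not describe the crawler's internals.

```python
from html.parser import HTMLParser


class NoIndexDetector(HTMLParser):
    """Flags a page if any element carries data-sj-noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        # The attribute has no value, so only its presence matters.
        if any(name == "data-sj-noindex" for name, _ in attrs):
            self.noindex = True


def should_index(html):
    """True unless the page opts out via data-sj-noindex."""
    detector = NoIndexDetector()
    detector.feed(html)
    detector.close()
    return not detector.noindex
```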
The second option is to create a ranking rule to exclude specific pages. These rules can help exclude entire directories, specific pages, etc.
It's typical for some content to appear on every page of a website: menus, navigation, footers etc. To avoid this being added to the search index as part of the body of the page, our crawler builds the page _body via a summarisation algorithm which works on paragraphs of text extracted from the page.
When considering paragraphs to include in the summary, the crawler ignores anything inside <footer> tags. The crawler also ignores any text within HTML elements with the data-sj-ignore attribute:
<p data-sj-ignore>This paragraph of text will not be passed to the summarisation system.</p>
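The paragraph selection step can be sketched as follows. This covers only which paragraphs would be handed to a summariser, not the summarisation algorithm itself, and it is an assumption-laden illustration rather than the crawler's real logic: ParagraphCollector and collect_paragraphs are hypothetical names, and treating everything nested inside an ignored element as ignored is an assumption.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they are never tracked as open.
VOID_TAGS = {"meta", "link", "br", "hr", "img", "input"}


class ParagraphCollector(HTMLParser):
    """Gathers <p> text, skipping <footer> and data-sj-ignore subtrees."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._open = []       # (tag, ignored) for each open element
        self._current = None  # text pieces of the <p> being read

    def _ignoring(self):
        return bool(self._open) and self._open[-1][1]

    def handle_starttag(self, tag, attrs):
        names = {name for name, _ in attrs}
        # An element is ignored if its parent is, or if it opts out itself.
        ignored = self._ignoring() or tag == "footer" or "data-sj-ignore" in names
        if tag not in VOID_TAGS:
            self._open.append((tag, ignored))
        if tag == "p" and not ignored:
            self._current = []

    def handle_endtag(self, tag):
        if tag == "p" and self._current is not None:
            text = "".join(self._current).strip()
            if text:
                self.paragraphs.append(text)
            self._current = None
        # Close the most recent matching open element, if there is one.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                del self._open[i:]
                break

    def handle_data(self, data):
        if self._current is not None and not self._ignoring():
            self._current.append(data)


def collect_paragraphs(html):
    """Returns the paragraph texts that would be passed on for summarising."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    return collector.paragraphs
```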