Website crawling

Sajari adds a small piece of JavaScript to your website pages, which allows it to automatically crawl and index your content and make it instantly searchable. With our JavaScript plugin your index is updated in real time: Sajari discovers and indexes a new page as soon as it is viewed, and drops an old one as soon as it is removed. No more periodic crawls! The plugin also tracks the popularity of your content, so that popularity can influence your search results and recommendations.

Instant indexing of your site

You add a small piece of JavaScript and your site is instantly searchable. That's it. Easy.

What's the big deal?

Typical search indexes are static. Unless you have a lot of coding experience, you're stuck with periodic index updates (e.g. all content is crawled weekly and the index is frozen until the next week). Static indexes are great from an administration perspective: they are read-only, so they're very predictable and easy to scale. But the downsides are many: your content is out of date immediately, new content and edits are ignored, deleted content remains, and both the performance of your content and the way users interact with it are completely ignored.

Instant indexing

By using a JavaScript ping back to Sajari servers as your content is viewed, we automatically manage your index synchronization for you. If the ping back sends a page that is already indexed, its "view" counter is incremented and its "lastseen" timestamp is updated. If the page changes status, e.g. it is moved or removed, or a noindex tag is added, the index is updated within seconds. If the page is not yet indexed, it is crawled and indexed automatically. If the page links to other pages that are new, they are also crawled and indexed.
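
The ping-back flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Sajari's implementation: the names SiteIndex, PageRecord, handle_ping and apply_crawl are hypothetical, and the real service's data model is not documented here.

```python
import time
from dataclasses import dataclass, field


@dataclass
class PageRecord:
    views: int = 0
    lastseen: float = 0.0


@dataclass
class SiteIndex:
    """Hypothetical in-memory stand-in for the hosted index."""
    pages: dict = field(default_factory=dict)        # url -> PageRecord
    crawl_queue: list = field(default_factory=list)  # urls waiting to be crawled

    def handle_ping(self, url, now=None):
        """Handle one ping back sent when a visitor views `url`."""
        now = time.time() if now is None else now
        record = self.pages.get(url)
        if record is None:
            # Page is not indexed yet: schedule it for crawling.
            self._enqueue(url)
            return
        # Page is already indexed: bump the counter and the timestamp.
        record.views += 1
        record.lastseen = now

    def apply_crawl(self, url, indexable, links, now=None):
        """Record the outcome of crawling `url`."""
        now = time.time() if now is None else now
        if indexable:
            self.pages.setdefault(url, PageRecord()).lastseen = now
        else:
            # Moved, removed or marked noindex: drop it from the index.
            self.pages.pop(url, None)
        if url in self.crawl_queue:
            self.crawl_queue.remove(url)
        # Newly discovered links are crawled too.
        for link in links:
            if link not in self.pages:
                self._enqueue(link)

    def _enqueue(self, url):
        if url not in self.crawl_queue:
            self.crawl_queue.append(url)
```

In this sketch a first view of an unknown page only queues it; later views of the same page increment its counter once it has been crawled.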

Popularity and recency

The other problem with a static index is that the index cannot reflect how people are interacting with your content. If an article is newer, shouldn't it be higher in search rankings? If it's popular, should that impact results as well? We think so. This approach allows you to not only optimize how you search your content, but also enables you to recommend related, popular or recent content. All from the same automatic index!
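
As a rough illustration of how popularity and recency could feed into ranking, the sketch below multiplies a text relevance score by a view-count boost and an age decay. The formula, the 30-day half-life and the function name boosted_score are assumptions made for this example; they are not Sajari's ranking algorithm.

```python
import math


def boosted_score(text_score, views, age_days, half_life_days=30.0):
    """Combine text relevance with popularity and recency (illustrative only)."""
    # Logarithmic boost so very popular pages do not drown out everything else.
    popularity = 1.0 + math.log1p(views)
    # Exponential decay: the boost halves every `half_life_days`.
    recency = 0.5 ** (age_days / half_life_days)
    return text_score * popularity * recency
```

With equal text scores, a page with more views or a more recent "lastseen" timestamp ranks higher under this scheme.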


Sajari also allows you to customize how your content is indexed, both from a processing perspective as well as enriching your page data with additional fields. Both these approaches are explained briefly below.

Adding custom fields

Your website might already contain rich information about your products or events, such as dates, locations, categories, product codes and prices. Sajari can capture this automatically when your web pages are indexed, and the data can then be used in searches and to power recommendations. Changing a price, location or any other data on your site is then automatically synchronized to the search system.

Note: when using the API to add objects to a collection, you can specify whatever fields you like. The examples shown below are for customers that want automatic indexing of their website.

Adding fields via HTML

Custom fields are defined in HTML by adding data attributes to elements. To avoid name clashes with other systems, all data attributes used by the crawler have the prefix data-sj-.

Note: Any new fields encountered by the crawler are created as STRING fields by default. If you instead need a different type (INTEGER, FLOAT, TIME etc) then first create the field using the Schema tab of the Console.

Defining custom fields in <head> elements

By default the crawler reads <meta> tags within <head>, but only keeps standard fields (title, description, keywords, etc). Add a data-sj-field="fieldname" attribute to override this behaviour and create a custom field from the meta tag's content attribute. This example shows an otherwise ignored <meta> tag being converted into a custom field fieldname="fieldvalue":

<meta property="custom meta field" data-sj-field="fieldname" content="fieldvalue"/>
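
To make the rule concrete, here is a minimal sketch of how a crawler could apply it, using Python's standard html.parser module. The class name MetaFieldParser and the STANDARD_META set are assumptions for illustration; the actual list of standard fields kept by the Sajari crawler may differ.

```python
from html.parser import HTMLParser

# Illustrative subset of the standard fields kept by default (an assumption).
STANDARD_META = {"title", "description", "keywords"}


class MetaFieldParser(HTMLParser):
    """Collect fields from <meta> tags found inside <head>."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._in_head = False

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self._in_head = True
            return
        if tag != "meta" or not self._in_head:
            return
        attr = dict(attrs)
        content = attr.get("content")
        if content is None:
            return
        custom_name = attr.get("data-sj-field")
        if custom_name:
            # data-sj-field overrides the default and creates a custom field.
            self.fields[custom_name] = content
        elif attr.get("name") in STANDARD_META:
            self.fields[attr["name"]] = content
        # Any other <meta> tag is ignored.

    def handle_endtag(self, tag):
        if tag == "head":
            self._in_head = False
```

Feeding the example tag above to this parser yields a field named fieldname with the value fieldvalue, while an untagged custom <meta> tag is dropped.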

Defining custom fields in <body> elements

To capture data already rendered within an element, add data-sj-field="fieldname" to it:

<span data-sj-field="random">This text is the value</span>

This will set custom field random="This text is the value".

If you don't want the data rendered on the page, you can instead set the field value with the data-sj-value attribute:

<span data-sj-field="fieldname" data-sj-value="fieldvalue">This text is not used because the data attribute has a value</span>
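
The two behaviours (captured text versus an explicit data-sj-value) can be sketched as follows. This is a simplified illustration that assumes well-formed HTML where every non-void element is closed; BodyFieldParser is a hypothetical name, not part of any Sajari library.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class BodyFieldParser(HTMLParser):
    """Collect data-sj-field values from elements (assumes well-formed HTML)."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._stack = []  # one (field name or None, text chunks) per open element

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        name = attr.get("data-sj-field")
        value = attr.get("data-sj-value")
        if name and value is not None:
            # An explicit value wins; the element text is not captured.
            self.fields[name] = value
            name = None
        if tag not in VOID_TAGS:
            self._stack.append((name, []))

    def handle_data(self, data):
        # Text counts towards every enclosing element that captures a field.
        for name, chunks in self._stack:
            if name:
                chunks.append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._stack:
            return
        name, chunks = self._stack.pop()
        if name:
            # Collapse runs of whitespace in the captured text.
            self.fields[name] = " ".join("".join(chunks).split())
```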

Examples using custom fields data

Fields are useful in many ways: they can be displayed in results, or used to sort or filter queries. Below are some examples of how custom fields defined in HTML can change the behaviour of search boxes and recommendation widgets.


Problem: I want a search box, but I only want to show results from a particular category.

Solution: On each page with an associated category we add the category to the page as a custom field (note: this does not need to be visible). e.g.

<span data-sj-field="category" data-sj-value="unstructured data"></span>

In the above case, the attribute data-sj-field="category" indicates this page has an associated "category" that should be added as a field.
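
Once the field is indexed, restricting results to one category amounts to an equality filter on that field. The sketch below shows the idea over plain dictionaries; the search function and the record layout are hypothetical and do not represent the Sajari query API.

```python
def search(records, query, category=None):
    """Return records whose title contains `query`, optionally in one category."""
    query = query.lower()
    return [
        record for record in records
        if query in record["title"].lower()
        and (category is None or record.get("category") == category)
    ]


# Hypothetical indexed pages, each carrying the captured "category" field.
pages = [
    {"title": "Indexing unstructured data", "category": "unstructured data"},
    {"title": "Indexing relational data", "category": "databases"},
]
results = search(pages, "indexing", category="unstructured data")
```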


Problem: I only want to show results that have products with price greater than $10.00.

Solution: On each page with an associated product we add the product details (including price) to the page as meta data. e.g.

<div data-sj-field="sku" data-sj-value="12345">
    <span data-sj-field="product">blue widgets</span>
    <span data-sj-field="price">20.00</span>
</div>

In the above case, the page will have an associated "sku", "product" and "price".
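
Because new fields are created as STRING by default, a numeric comparison on price only works if the field is created with a numeric type first, or if the value is converted before comparing. The sketch below illustrates the conversion approach over plain dictionaries; price_above is a hypothetical helper, not a Sajari API call.

```python
from decimal import Decimal, InvalidOperation


def price_above(records, minimum):
    """Return records whose "price" field is numerically greater than `minimum`."""
    minimum = Decimal(minimum)
    matches = []
    for record in records:
        try:
            price = Decimal(record["price"])
        except (KeyError, InvalidOperation):
            # Skip pages without a price, or with a non-numeric one.
            continue
        if price > minimum:
            matches.append(record)
    return matches


# Hypothetical indexed pages with prices captured as strings.
products = [
    {"sku": "12345", "product": "blue widgets", "price": "20.00"},
    {"sku": "67890", "product": "red widgets", "price": "9.99"},
]
expensive = price_above(products, "10.00")
```

Comparing the raw strings would give the wrong answer here, since "9.99" sorts after "10.00" alphabetically.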


Problem: I have very locally targeted content and wish to recommend local content based on my site visitor location.

Solution: On each "locally" targeted content page, add two pieces of meta information, the latitude and the longitude, e.g.

<span data-sj-field="lat" data-sj-value="-33.867487"></span><span data-sj-field="lng" data-sj-value="151.206990"></span>

In the above case, the prefix data-sj-field indicates this is information specific to the page. So data-sj-field="lat" indicates this page has a property called "lat" with corresponding value -33.867487.
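
With "lat" and "lng" fields on each page, local recommendations can be produced by sorting pages by distance from the visitor. The sketch below uses the standard haversine formula; the function names and the sample coordinates are assumptions for illustration, and how Sajari itself computes distance is not described here.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest(records, visitor_lat, visitor_lng, limit=3):
    """Return up to `limit` records closest to the visitor."""
    located = [r for r in records if "lat" in r and "lng" in r]
    return sorted(
        located,
        # Field values are strings by default, so convert before measuring.
        key=lambda r: haversine_km(visitor_lat, visitor_lng,
                                   float(r["lat"]), float(r["lng"])),
    )[:limit]


# Hypothetical pages tagged with approximate city coordinates.
local_pages = [
    {"title": "Melbourne events", "lat": "-37.8136", "lng": "144.9631"},
    {"title": "Sydney events", "lat": "-33.8688", "lng": "151.2093"},
    {"title": "About us"},
]
suggestions = nearest(local_pages, -33.87, 151.21)
```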

Prevent pages being indexed

If you want to use Sajari functionality on a page, but you don't want it indexed, there are two ways to deal with this. The first is to add data-sj-noindex to any element on the page, in which case Sajari will not index that page. Because this usually goes hand in hand with blocking other indexers, such as web search engines, we frequently see it combined with the robots noindex meta tag, e.g.

<meta name="robots" content="noindex" data-sj-noindex />
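
The data-sj-noindex check can be sketched as a scan for the attribute on any element. This is an illustrative example using Python's standard html.parser; NoIndexDetector and should_index are hypothetical names.

```python
from html.parser import HTMLParser


class NoIndexDetector(HTMLParser):
    """Flag a page if any element carries the data-sj-noindex attribute."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        # The attribute has no value, so only its name is checked.
        if any(name == "data-sj-noindex" for name, _ in attrs):
            self.noindex = True


def should_index(html):
    """Return False when the page opts out of indexing."""
    detector = NoIndexDetector()
    detector.feed(html)
    detector.close()
    return not detector.noindex
```

Note that in this sketch a plain robots noindex tag without data-sj-noindex does not block indexing; only the data attribute is checked.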

The second option is to create a crawling rule to exclude specific pages. These rules can exclude entire directories, specific pages, etc. Once you are logged in, "Crawling rules" is an option in the "Setup" menu.

It's typical for some content to appear on every page in a website: menus, navigation, footer etc. To avoid this being added into the search index as part of the body of the page, our crawler builds the page _body via a summarisation algorithm which works on paragraphs of text extracted from the page.

When considering paragraphs to include in the summary, the crawler ignores anything inside <head>, <script>, <header> and <footer> tags. The crawler also ignores any text within HTML elements with the data-sj-ignore attribute:

<p data-sj-ignore>This paragraph of text will not be passed to the summarisation system.</p>
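
The filtering step that runs before summarisation can be sketched as below. The example only collects candidate paragraphs; the summarisation algorithm itself is not shown. It assumes well-formed HTML with closed elements, and ParagraphCollector is a hypothetical name.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"head", "script", "header", "footer"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ParagraphCollector(HTMLParser):
    """Collect <p> text outside ignored regions (assumes well-formed HTML)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._open = []       # per open element: True if it starts an ignored region
        self._ignored = 0     # number of ignored regions currently open
        self._current = None  # text chunks of the paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        ignore = tag in SKIPPED_TAGS or any(n == "data-sj-ignore" for n, _ in attrs)
        self._open.append(ignore)
        self._ignored += ignore
        if tag == "p" and not self._ignored:
            self._current = []

    def handle_data(self, data):
        if self._current is not None and not self._ignored:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._open:
            return
        self._ignored -= self._open.pop()
        if tag == "p" and self._current is not None:
            text = " ".join("".join(self._current).split())
            if text:
                self.paragraphs.append(text)
            self._current = None
```

Text inside a nested element marked data-sj-ignore is also dropped, while the rest of the surrounding paragraph is kept.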
