Indexing Data

Crawling a website

First, create a Website Search Collection for your main domain. The crawler will start indexing the pages under that domain. You can preview search results for your Collection to check whether your pages have been correctly indexed.

The speed of the initial crawl depends largely on the size and speed of your site. It can take anywhere from a few seconds to several hours (if you have hundreds of thousands of pages) to complete.

Managing domains

Often you will find that the content on your website is spread across multiple domains, whether it's a blog subdomain like blog.example.com or a completely different domain like another-example.com. In that case, you can add multiple domains to your Collection.

Add a domain to a Collection

  1. Navigate to the Domains section.
  2. Click "Add domain" at the top right of the page.
  3. Enter the URL of the domain you want to add.
  4. Ensure the "Crawling" checkbox is checked if you want to crawl and index the content of the domain.
  5. Click "Add".

If crawling is enabled, the crawler will immediately begin indexing pages from the new domain.

Domain configuration

You can configure the following settings for each domain:

  • Crawling: If enabled, the crawler will periodically visit pages in this domain and update your Collection with any changes. If turned off, the pages on this domain will not be updated in your Collection.

  • Search from domain: If enabled, search requests coming from this domain are authorized, so any search interface embedded in this domain can make search requests to this Collection. A common case for enabling only this setting is a staging website or testing environment (e.g. Netlify or CodeSandbox): add the domain of the testing environment (e.g. testsite.netlify.com) as an additional domain and turn off crawling for it.
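To illustrate how the two settings interact, here is a minimal Python sketch. The DomainSettings class and both helper functions are hypothetical names invented for this example and are not part of any product API; the sketch only models the rules described above.

```python
from dataclasses import dataclass


@dataclass
class DomainSettings:
    """Hypothetical model of the two per-domain settings."""
    crawling: bool
    search_from_domain: bool


def should_crawl(domains, host):
    # Only domains that were added AND have Crawling enabled are visited.
    settings = domains.get(host)
    return settings is not None and settings.crawling


def search_authorized(domains, host):
    # Search requests are accepted only from added domains
    # that have "Search from domain" enabled.
    settings = domains.get(host)
    return settings is not None and settings.search_from_domain


# Example: a production site plus a staging site that may search but is not crawled.
domains = {
    "www.example.com": DomainSettings(crawling=True, search_from_domain=True),
    "testsite.netlify.com": DomainSettings(crawling=False, search_from_domain=True),
}
```

With this configuration, the staging domain can embed a search interface without its pages ever entering the Collection.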

Remove a domain from a Collection

  1. Log in to the Console and select the relevant Collection.
  2. Navigate to the Domains section.
  3. For the domain you want to remove, click the three dots (⋯) on the right side, then click "Delete".

How the crawler works

The crawler visits the domains that you have added and enabled for crawling. First, it checks for a sitemap (see Using Sitemaps below) and indexes the pages listed there in order of priority. Next, it crawls the homepage of the domain (e.g. www.example.com), then the pages linked from the homepage, then the pages linked from those pages, and so on. Linked pages are crawled only if they are hosted on a domain you have added.
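The visiting order described above can be sketched in Python as a sitemap pass followed by a breadth-first traversal from the homepage. This is an illustrative model, not the crawler's actual implementation: the crawl_order function, the in-memory links map (standing in for real page fetches), and the decision not to expand links during the sitemap pass are all assumptions made for the example.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_order(sitemap, homepage, links, allowed_domains):
    """Return URLs in the order a sitemap-first, breadth-first crawl would index them.

    sitemap: list of (url, priority) pairs (hypothetical input shape).
    links:   dict mapping a page URL to the URLs it links to (stands in for HTTP fetches).
    """
    def allowed(url):
        return urlparse(url).hostname in allowed_domains

    order = []
    indexed = set()

    # Step 1: pages listed in the sitemap, highest priority first.
    for url, _priority in sorted(sitemap, key=lambda entry: entry[1], reverse=True):
        if allowed(url) and url not in indexed:
            indexed.add(url)
            order.append(url)

    # Step 2: breadth-first traversal starting at the homepage.
    queued = {homepage}
    queue = deque([homepage])
    while queue:
        url = queue.popleft()
        if url not in indexed:
            indexed.add(url)
            order.append(url)
        for target in links.get(url, []):
            # Links to domains that were not added are ignored.
            if allowed(target) and target not in queued:
                queued.add(target)
                queue.append(target)
    return order
```

Given a homepage that links to a page on blog.example.com and a page on an unrelated domain, only the blog page is followed, provided blog.example.com is in allowed_domains.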

You can help the crawler discover content on your website in a number of ways:

  • adding a sitemap to your website
  • setting up Instant Indexing
  • manually pointing the crawler at specific URLs

Note: The crawler will only visit pages from domains that have Crawling enabled.

Using Sitemaps

A sitemap is a web standard that provides a list of URLs available for crawling. It must be present at the root of the domain with the name "sitemap.xml" (e.g. www.example.com/sitemap.xml). The crawler looks for sitemaps on domains that are being indexed and will visit the URLs in any sitemap it finds. If the crawler does not find your sitemap for some reason, you can point it manually to the sitemap file:

  1. Navigate to Domains > Diagnose.
  2. Enter the URL of the sitemap (e.g. www.example.com/sitemap.xml) and press "Diagnose".
  3. Press "Add to Index".
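To check what a crawler would find in your sitemap before submitting it, you can list its entries yourself. The following Python sketch uses only the standard library and assumes a plain sitemap.xml in the sitemaps.org 0.9 format (not a sitemap index file); the parse_sitemap function name is made up for this example. Entries without a priority are given 0.5, the default defined by the sitemap protocol.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return a list of (url, priority) pairs from sitemap XML text."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = (url.findtext("sm:loc", default="", namespaces=SITEMAP_NS) or "").strip()
        if not loc:
            continue  # skip malformed entries without a location
        raw = (url.findtext("sm:priority", default="", namespaces=SITEMAP_NS) or "").strip()
        priority = float(raw) if raw else 0.5
        entries.append((loc, priority))
    return entries


EXAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc><priority>1.0</priority></url>
  <url><loc>https://www.example.com/about</loc></url>
</urlset>"""

for loc, priority in parse_sitemap(EXAMPLE):
    print(priority, loc)
```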

Instant Indexing

The best way to manage crawling on your site is to set up Instant Indexing. Instant Indexing ensures that new and updated pages become available in search as soon as they are visited, without having to wait for a full crawl cycle to complete.

It is enabled by adding a small snippet of JavaScript, also known as ping-back code, to pages on your site. When an end user visits a page, the snippet triggers a lightweight background request to the crawler, which checks whether the page is new or updated and needs to be reindexed.

You can find the snippet tailored to your Collection in the Instant Indexing section in the Console.
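The "new or updated" check can be pictured with a small Python sketch. How the crawler actually detects changes is not documented here; comparing a hash of the page content against the last stored hash is an assumption chosen for illustration, and needs_reindex and the index dictionary are hypothetical names.

```python
import hashlib


def needs_reindex(url, content, index):
    """Return True if the page is new or its content changed since the last ping.

    index: dict mapping URL -> SHA-256 hex digest of the last indexed content
           (a hypothetical stand-in for the crawler's stored state).
    """
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if index.get(url) == digest:
        return False  # unchanged since the last visit, nothing to do
    index[url] = digest  # remember the new version
    return True


index = {}
print(needs_reindex("https://www.example.com/", "<h1>Hello</h1>", index))  # new page
print(needs_reindex("https://www.example.com/", "<h1>Hello</h1>", index))  # unchanged
```

Because the check is cheap when nothing has changed, a ping on every page view stays lightweight under this model.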

Popularity

Using the ping-back code also records popularity metrics for each page, which can then be used in the search algorithm to prioritize popular content.

Check or add URLs manually

The Diagnose feature in the Domains section provides information on the status of URLs in your domains, including:

  • whether the URL has already been crawled
  • whether the URL redirects to another URL
  • when the URL was last visited by the crawler
  • crawling errors (if any)

URLs that are not in your Collection can also be added using the Diagnose tool, and existing URLs can be manually reindexed.

  1. Navigate to the Domains section.
  2. Click the "Diagnose" button.
  3. Enter the URL you want to diagnose.
  4. Press "Add to Index" to crawl the URL.
  5. Check the status of the page by re-diagnosing the URL.

Note: The status might be "Pending" if a large number of indexing operations are running. A page is usually indexed instantly, but in some cases it might take a few minutes.
