Indexing pipeline customization tools overview

This article provides an overview of the Coveo tools and features that you can use to customize how each candidate item is processed through the Coveo indexing pipeline. You can sometimes choose between a few tools to achieve the same indexing customization goal, but some of them may impact performance, or may only be available for certain connectors.

Example

As a developer, your first choice might be to use indexing pipeline extensions (IPEs) because they’re scripts and offer great flexibility. However, IPEs can also decrease indexing performance, and there’s often another tool or feature that can achieve the same goal with less overhead.

This article lists the customization tools in order of preference, starting with the ones that are generally the most appropriate because of their effectiveness, ease of use, or performance.

URL filters

Indexing goal

Control the indexing scope, choosing which repository pages or sections to include in your source.

How to use

Advantages

  • Efficiently processed by the Crawling stage.

  • Wildcard or regex flexibility.

  • For Web sources, easy configuration from the Administration Console (see Add or edit a Web source).

Disadvantages

  • Only applicable to URL-based source types (for example, Web and Sitemap).

  • For Sitemap sources, configuration from the source JSON can be challenging.
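The way inclusion and exclusion filters narrow the indexing scope can be sketched as follows. This is a minimal illustration, not Coveo's implementation. It assumes that a URL is kept when it matches at least one inclusion filter and no exclusion filter, and that wildcard filters behave like shell-style patterns. The function names and the example URLs are hypothetical.

```python
import re
from fnmatch import fnmatchcase


def matches(url, pattern, kind):
    """Return True if the URL matches a wildcard or regex filter."""
    if kind == "wildcard":
        # Shell-style matching: "*" matches any run of characters.
        return fnmatchcase(url, pattern)
    if kind == "regex":
        return re.search(pattern, url) is not None
    raise ValueError(f"unknown filter kind: {kind}")


def in_scope(url, inclusions, exclusions):
    """Assumed rule: included by at least one filter, excluded by none."""
    included = any(matches(url, p, k) for p, k in inclusions)
    excluded = any(matches(url, p, k) for p, k in exclusions)
    return included and not excluded


# Hypothetical filters: keep the docs section, skip its archive.
inclusions = [("https://example.com/docs/*", "wildcard")]
exclusions = [(r"/docs/archive/", "regex")]

for url in (
    "https://example.com/docs/setup",
    "https://example.com/docs/archive/2019",
    "https://example.com/blog/post",
):
    print(url, in_scope(url, inclusions, exclusions))
```

Because the decision depends only on the URL, it can be made before a page is downloaded, which is consistent with URL filters being handled efficiently at the Crawling stage.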

Mappings

Indexing goal

  • Extract original document metadata values to populate specific Coveo index fields.

  • Customize or create an index item body for object-based source types, such as Salesforce or a database.

How to use

Advantages

  • Efficiently processed by the Mapping stage.

  • Conditional mappings based on item type.

  • Can concatenate one or more metadata values and include custom text with the Literal option.

  • Can edit body content (see Add or edit a body mapping).

  • Can get metadata values from a specific stage with the origin suffix (see Mapping rules syntax).

Disadvantages

  • Can’t programmatically process metadata values.
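To illustrate how a mapping rule combines metadata values with literal text, here is a minimal sketch of placeholder resolution. It assumes placeholders written as %[name], optionally with an origin suffix as in %[name:origin], and it assumes that the most recent value is used when no origin is given. The helper resolve_rule, the metadata names, and the origins shown are hypothetical and only model the idea; they don't reproduce the Mapping stage itself.

```python
import re

# Matches %[name] and %[name:origin].
PLACEHOLDER = re.compile(r"%\[([^\]:]+)(?::([^\]]+))?\]")


def resolve_rule(rule, history):
    """Replace each placeholder with a metadata value; keep other text as is.

    history is a list of (origin, name, value) tuples in pipeline order.
    """
    def lookup(match):
        name, origin = match.group(1), match.group(2)
        # Walk backwards so the latest stage wins when no origin is given.
        for item_origin, item_name, value in reversed(history):
            if item_name == name and origin in (None, item_origin):
                return value
        return ""  # assumption: a missing value resolves to an empty string

    return PLACEHOLDER.sub(lookup, rule)


# Hypothetical metadata collected at two stages for one item.
history = [
    ("crawler", "author", "jsmith"),
    ("crawler", "department", "Support"),
    ("converter", "author", "Jane Smith"),
]

print(resolve_rule("Written by %[author] (%[department])", history))
print(resolve_rule("%[author:crawler]", history))
```

The sketch also shows the limitation listed above: a rule can select and concatenate values, but it can't transform them (for example, change their case or parse a date).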

Web scraping configuration

Indexing goal

  • Exclude specific web page sections.

  • Extract specific content to create metadata.

  • Create sub-items.

How to use

Advantages

  • Efficiently processed by the Crawling stage.

  • Coveo Labs Chrome extension is available to easily create web scraping configurations (see web-scraper-helper).

  • Exclusion of repeating web page parts from index (for example, header, sidebar, footer) (see Elements to exclude).

  • Extraction of content from HTML elements with XPath and CSS locators to enrich metadata (see Metadata to extract).

  • Splitting web page parts into multiple index items (see SubItems).

Disadvantages

  • Only available for Web and Sitemap sources.

  • Requires developer skills to create the JSON web scraping configuration and take full advantage of XPath and CSS expressions.
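As a rough illustration of what such a JSON configuration can look like, the sketch below builds one in Python and prints it as JSON. The property names (for, exclude, metadata, subItems), the selector type labels, and all selectors are assumptions to verify against the web scraping configuration reference; the page structure they target is hypothetical.

```python
import json

# Hypothetical configuration covering the three indexing goals above.
scraping_config = [
    {
        # Assumed: regex patterns selecting the pages this block applies to.
        "for": {"urls": [".*"]},
        # Page sections to leave out of the indexed item.
        "exclude": [
            {"type": "CSS", "path": "header"},
            {"type": "CSS", "path": "footer"},
        ],
        # Metadata extracted from the page content.
        "metadata": {
            "author": {
                "type": "XPATH",
                "path": "//meta[@name='author']/@content",
            },
        },
        # Page parts to index as separate items.
        "subItems": {
            "faq": {"type": "CSS", "path": "div.faq-entry"},
        },
    }
]

print(json.dumps(scraping_config, indent=2))
```

Building the structure in code and serializing it guarantees syntactically valid JSON, which removes one common source of errors when the configuration is edited by hand.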

Indexing pipeline extension

Indexing goal

  • When it’s not possible with the above tools to:

    • Add/modify metadata.

    • Add/modify data streams.

    • Reject items (in pre-conversion scripts).

    • Exclude specific web page sections.

  • Use external resources and services (for example, use an image recognition API to inject metadata).

  • Add/modify item permissions (for example, for a Push source for which the crawler doesn’t associate permissions).

How to use

Advantages

  • Access to third-party services and databases.

  • Flexibility of the Python language and available libraries (see Python modules available to IPEs).

  • Extension code reuse with conditional execution and extension parameters.

  • Index item processing to:

Disadvantages

  • Requires developer skills to create the Python scripts.

  • Each extension execution affects indexing performance.

  • Limit of 10 indexing pipeline extensions per organization.

  • Extension script execution limited to 5 seconds.
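The following sketch shows the kind of logic an extension script typically contains: reading metadata, rejecting an item, and adding a metadata value. To keep it runnable outside the indexing pipeline, it uses a hypothetical StubDocument class as a stand-in for the document object that the pipeline provides to a script. The method names (get_meta_data_value, add_meta_data, reject) are modeled on that object as an assumption and should be checked against the extension API reference. The status and titlewordcount metadata names are invented for the example.

```python
class StubDocument:
    """Hypothetical stand-in for the `document` object given to an extension."""

    def __init__(self, metadata):
        self.metadata = {key: list(values) for key, values in metadata.items()}
        self.rejected = False

    def get_meta_data_value(self, name):
        # Assumed to return a list of values, empty when the name is unknown.
        return self.metadata.get(name, [])

    def add_meta_data(self, values):
        for key, value in values.items():
            self.metadata[key] = value if isinstance(value, list) else [value]

    def reject(self):
        self.rejected = True


def run_extension(document, log):
    """Sketch of an extension body: reject drafts, enrich everything else."""
    status = document.get_meta_data_value("status")
    if status and status[-1].lower() == "draft":
        log("Rejecting draft item")
        document.reject()
        return
    title = document.get_meta_data_value("title")
    word_count = len(title[-1].split()) if title else 0
    document.add_meta_data({"titlewordcount": word_count})


# Local dry run with two hypothetical items.
messages = []
draft = StubDocument({"status": ["Draft"], "title": ["Work in progress"]})
published = StubDocument(
    {"status": ["Published"], "title": ["Indexing pipeline overview"]}
)
run_extension(draft, messages.append)
run_extension(published, messages.append)
print(draft.rejected, published.metadata.get("titlewordcount"), messages)
```

Keeping the script to cheap, local operations like these, and returning early when there's nothing to do, helps it stay within the execution time limit and reduces its impact on indexing performance.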