Creating an indexing pipeline extension involves writing Python code that uses the document object to manipulate item properties (see Creating an Indexing Pipeline Extension with the API and Coveo Cloud V2 Indexing Pipeline).

This topic provides reference information describing the object methods and their parameters.

Get URI

This method is used to get the item URI.

Get Metadata

This method is used to get all item metadata.

It returns a list of MetaDataValue objects (see Document Object JSON Schema).

Get Metadata Value

Used to get the value of a given metadata, optionally for a given origin.

Parameters

Parameter | Description
name | The name of the metadata
origin | [Optional] The component that set the metadata value: crawler (value set during the Crawling stage), converter (value set during the Processing stage), or mapping (value set during the Mapping stage). If no value is supplied and the reverse value is True, the most recent origin is considered, i.e. crawler in preconversion and mapping in postconversion.
reverse | [Optional] Boolean used to determine whether to get the metadata origin in reverse order or not. The default value is True, meaning that the value is fetched from the latest indexing pipeline stage with a non-empty value.

It returns a list of strings.
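
The origin and reverse lookup described above can be sketched as follows. This is only an illustration, not the actual implementation: the `resolve_metadata_value` helper, the `ORIGIN_ORDER` list, and the nested dictionary layout are hypothetical, and the behavior when reverse is False (earliest stage first) is an assumption.

```python
# Hypothetical pipeline order, earliest stage first (assumed from the stage names).
ORIGIN_ORDER = ["crawler", "converter", "mapping"]


def resolve_metadata_value(values_by_origin, name, origin=None, reverse=True):
    """Return the list of strings stored for `name`.

    values_by_origin maps an origin name to a dict of {metadata name: [values]}.
    This layout is an assumption made for the sketch.
    """
    if origin is not None:
        # An explicit origin restricts the lookup to that single stage.
        return list(values_by_origin.get(origin, {}).get(name, []))
    # reverse=True walks from the latest stage back to the earliest one.
    order = list(reversed(ORIGIN_ORDER)) if reverse else ORIGIN_ORDER
    for stage in order:
        values = values_by_origin.get(stage, {}).get(name, [])
        if values:  # the first stage with a non-empty value wins
            return list(values)
    return []


item = {
    "crawler": {"title": ["Raw title"], "author": ["jdoe"]},
    "converter": {"title": ["Converted title"]},
    "mapping": {},
}
latest_title = resolve_metadata_value(item, "title")
crawler_title = resolve_metadata_value(item, "title", origin="crawler")
earliest_title = resolve_metadata_value(item, "title", reverse=False)
author = resolve_metadata_value(item, "author")
```

In this sketch, the default call skips the empty mapping stage and falls back to the converter value, which mirrors the "latest stage with a non-empty value" rule.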

Add Metadata

Used to add metadata to an item.

Replace metadata_name and metadata_value with the chosen values.

Used to unset item metadata.

Log

Use this method in your extension script to tag source items with a relevant indexing message that is sent to the Coveo Cloud V2 source logs. Log messages are useful when you want to edit, debug, or troubleshoot an extension script. For instance, it is common practice to use a try/except block to log an error as a string in the source logs. Use the Log method rather than outputting text to a field as a form of logging, which can seriously bloat the index. For instance, using the Get Metadata method to output the metadata content to a field is a bad practice.

Parameters

Parameter | Type | Description
message | string (required) | The message that you want to log when applying an extension script.
severity | string | Optionally used to indicate the message severity type. Default value is Normal. The allowed case-insensitive severity values are: Debug, Detail, Error, Fatal, Important, Normal, Notification, Warning.

Script Example

This script example uses the Log method twice. First, the try block modifies the metadata before logging a success message. When the try block fails, the except block catches the exception and sends a log containing the error message.
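
A minimal sketch of such a script follows. It runs outside the indexing pipeline, so it relies on a hypothetical `StubDocument` class; the `add_meta_data` and `log` method names, their signatures, and the `run_extension` helper are assumptions made for illustration only.

```python
ALLOWED_SEVERITIES = {
    "debug", "detail", "error", "fatal",
    "important", "normal", "notification", "warning",
}


class StubDocument:
    """Hypothetical stand-in for the document object, for local testing only."""

    def __init__(self):
        self.metadata = {}
        self.logs = []

    def add_meta_data(self, values):
        # Assumed behavior: metadata is passed as name/value pairs.
        if not isinstance(values, dict):
            raise TypeError("metadata must be a dict of name/value pairs")
        self.metadata.update(values)

    def log(self, message, severity="Normal"):
        # Severity values are case insensitive; Normal is the default.
        if severity.lower() not in ALLOWED_SEVERITIES:
            raise ValueError("unknown severity: " + severity)
        self.logs.append((severity.capitalize(), message))


def run_extension(document, new_values):
    try:
        document.add_meta_data(new_values)
        # Success message: the severity argument is omitted, so Normal applies.
        document.log("Metadata updated")
    except Exception as error:
        # Failure message: the exception text is logged with the Error severity.
        document.log("Extension failed: " + str(error), "Error")


ok_document = StubDocument()
run_extension(ok_document, {"category": "manual"})

failing_document = StubDocument()
run_extension(failing_document, ["not", "a", "dict"])
```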

Output

In the preceding extension script, the first occurrence of the Log method is called when the script runs without raising an error. In this case, the second argument is omitted, so the default value Normal defines the log message severity. The log message generated by the extension script can be seen in an added subsection named Logs.

In the preceding extension script, the second occurrence of the Log method is called when an exception is raised. The except block catches this exception and sends a message documenting the error to the Logs subsection.

Applying an extension populates the documentLogEntries.meta.logs field, which contains all log messages and their severity type strings. The length of this field is limited to approximately 4K characters, after which the content is truncated. When the combined length of multiple log messages exceeds the limit, you can still view all the messages that fit within the limit, but the log message that crosses the limit is replaced with a truncated... mention and the following messages are ignored. For instance, when a single very long string exceeds the limit, even if it is the only log that applies to your extension, the whole string is replaced with the truncated... mention.

Get Permissions

Used to get all item permissions.

It returns a list of PermissionLevel objects (see Document Object JSON Schema).

Clear Permissions

Used to clear all item permissions.

Add Allowed Permission

Used to add an allowed security identity.

Parameters

Parameter | Description
identity | The name of the allowed security identity to add
identity_type | The security identity type can be: user, group, virtualgroup, or unknown
security_provider | The name of the security identity provider
additional_info | A collection of key value pairs that can be used to uniquely identify the security identity.

Add Denied Permission

Used to add a denied security identity.

Parameters

Parameter | Description
identity | The name of the denied security identity to add
identity_type | The security identity type can be: user, group, virtualgroup, or unknown
security_provider | The name of the security identity provider
additional_info | A collection of key value pairs that can be used to uniquely identify the security identity.
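
The four parameters shared by the allowed and denied permission methods can be illustrated with a small local sketch. The `StubDocument` class, the `add_allowed`, `add_denied`, and `clear_permissions` method names, and the "Example Provider" value are hypothetical placeholders, and the validation of identity_type is an assumption.

```python
IDENTITY_TYPES = {"user", "group", "virtualgroup", "unknown"}


class StubDocument:
    """Hypothetical stand-in for the document object, for local testing only."""

    def __init__(self):
        self.allowed = []
        self.denied = []

    @staticmethod
    def _identity(identity, identity_type, security_provider, additional_info):
        # Assumed validation: only the four documented identity types are accepted.
        if identity_type not in IDENTITY_TYPES:
            raise ValueError("unsupported identity type: " + identity_type)
        return {
            "identity": identity,
            "identity_type": identity_type,
            "security_provider": security_provider,
            "additional_info": dict(additional_info or {}),
        }

    def add_allowed(self, identity, identity_type, security_provider, additional_info=None):
        self.allowed.append(
            self._identity(identity, identity_type, security_provider, additional_info)
        )

    def add_denied(self, identity, identity_type, security_provider, additional_info=None):
        self.denied.append(
            self._identity(identity, identity_type, security_provider, additional_info)
        )

    def clear_permissions(self):
        self.allowed = []
        self.denied = []


document = StubDocument()
document.clear_permissions()  # start from an empty permission list
document.add_allowed("support-team", "group", "Example Provider", {})
document.add_denied(
    "contractor@example.com", "user", "Example Provider", {"department": "external"}
)
```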

Set Permissions

Used to set item permissions.

The permission model complexity can range from allowing full anonymous access to requiring the resolution of permissions for several permission levels, each containing one or more permission sets.

Example
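
As a rough sketch of the two ends of that range, the model can be pictured as nested levels and sets. The `Permission`, `PermissionSet`, and `PermissionLevel` dataclasses below, including every field name, are hypothetical and are not taken from the Document Object JSON Schema; they only illustrate the nesting.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Permission:
    identity: str
    identity_type: str
    security_provider: str


@dataclass
class PermissionSet:
    name: str
    allow_anonymous: bool = False
    allowed: List[Permission] = field(default_factory=list)
    denied: List[Permission] = field(default_factory=list)


@dataclass
class PermissionLevel:
    name: str
    permission_sets: List[PermissionSet] = field(default_factory=list)


# Simplest case: one level with one set that allows full anonymous access.
public_model = [
    PermissionLevel("Public", [PermissionSet("Everyone", allow_anonymous=True)])
]

# More complex case: two levels, the second one holding two permission sets.
provider = "Example Provider"  # placeholder provider name
restricted_model = [
    PermissionLevel(
        "Site level",
        [PermissionSet("Members", allowed=[Permission("site-members", "group", provider)])],
    ),
    PermissionLevel(
        "Item level",
        [
            PermissionSet("Owners", allowed=[Permission("owner@example.com", "user", provider)]),
            PermissionSet(
                "Reviewers",
                allowed=[Permission("reviewers", "group", provider)],
                denied=[Permission("intern@example.com", "user", provider)],
            ),
        ],
    ),
]


def count_permission_sets(levels):
    """Count the permission sets across all levels."""
    return sum(len(level.permission_sets) for level in levels)
```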

Get Data Streams

You can optionally read an item's data streams when you need to read or modify them. To optimize indexing performance, include and process a data stream only when you need it.

Used to get the item data streams. 

Used to get a data stream for a given origin.

Tip:

For Web and Sitemap type sources, it is recommended to use the web scraping feature rather than extensions to do common HTML content processing such as excluding sections and extracting metadata (see Web Scraping Configuration).

Parameters

Parameter | Description
name

The available item data streams are:

  • documentdata
    The complete item binary content extracted by the Crawling stage of the indexing pipeline (see Coveo Cloud V2 Indexing Pipeline).

    Example:

    The documentdata of a PDF file is the actual PDF file.

    The documentdata of a web page is the page HTML markup.

    You may want to retrieve the documentdata of an item in a Preconversion extension in rare cases where you want to modify the original item content.

    Example:
    You indexed scanned items that are saved as image files. You want to index the text content of the images. You use a preconversion extension to read each image documentdata, send it to a third-party optical character recognition (OCR) service, and save the returned text back in the documentdata so that the Processing stage can prepare the text content for the Indexing stage.

    Getting the documentdata can significantly degrade indexing performance because each item binary data has to be fetched, decompressed, and decrypted.
    There is generally no point in getting and modifying the documentdata in a postconversion extension because the Indexing stage does not process it.

    In the Coveo Cloud administration console Add/Edit an Extension panel, the documentdata is referred to as the Original file.

  • body_text
    The complete textual content of an item extracted by the converter in the Processing stage of the indexing pipeline (see Coveo Cloud V2 Indexing Pipeline).
    You can get the body_text of each item in a postconversion extension for rare cases where you want to access and possibly modify the item text content.
    There is no point in getting and modifying the body_text in a preconversion extension because the Processing stage would overwrite it.

    Note:

    For index size and performance optimization, the body_text is limited in size to 50 MB. This means that for rare items with a larger body_text, the excess text is not indexed, and therefore not searchable.


  • body_html
    The complete HTML representation of an item created by the converter in the Processing stage of the indexing pipeline (see Coveo Cloud V2 Indexing Pipeline). The body_html appears in the Quick View of a search result item.
    You can get the body_html of each item in a postconversion extension for cases where you want to access and possibly modify the item text content.

    Example:

    Your source indexes a question and answer website. Each question and each answer is indexed as a separate item even if they can come from the same HTML page. Your indexed items do not have the <head> elements from the original HTML page and therefore are missing resources such as CSS. Consequently, the Quick View for these items does not look good.

    You get the body_html in an extension and inject the appropriate <head> elements.

    There is no point in getting and modifying the body_html in a preconversion extension because the Processing stage would overwrite it.

    Notes:

    When you can define your desired body_html content as a static HTML markup containing metadata placeholders, it is generally simpler to use a mapping on the body field (see Add/Edit Mapping).

    For index size and performance optimization, the body_html is limited in size to 10 MB. This means that the Quick View of items with larger body_html will be truncated.

  • $thumbnail$
    The thumbnail image created by the converter in the Processing stage of the indexing pipeline for specific file types (Microsoft Word, Excel, PowerPoint, and Visio as well as many image file types such as JPG, BMP, GIF, TIF, PSD, PNG...).
    You can get the $thumbnail$ in a postconversion extension in the rare cases where you want to modify the thumbnail or extract information from the thumbnail image.
    Your thumbnail image can have any size, resolution, or format (as long as a browser can display it), but it is a good practice to stick to a normalized image size and resolution.

    If you want to overwrite the thumbnail (or create one) you do not need to get the $thumbnail$.


origin

[Optional] The stream value set by one of the following components:

Name | Description
crawler | The stream value set during the Crawling stage
converter | The stream value set during the Processing stage
mapping | The stream value set during the Mapping stage

If no value is supplied and the reverse value is True, the most recent origin is considered, i.e. crawler in preconversion and mapping in postconversion.

reverse

[Optional] Boolean used to determine whether to get the stream origin in reverse order or not. The default value is True, meaning that the stream is fetched from the latest indexing pipeline stage with a non-empty stream.

It returns a stream of in-memory bytes (see Python Buffered Streams).
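
Reading from an in-memory byte stream works like reading from any Python binary file object. The sketch below uses the standard `io.BytesIO` class; the `StubDocument` class and its `get_data_stream` method name are hypothetical stand-ins, and the UTF-8 encoding of the body_text is an assumption.

```python
import io


class StubDocument:
    """Hypothetical stand-in for the document object, for local testing only."""

    def __init__(self, streams):
        self._streams = streams  # {stream name: bytes}

    def get_data_stream(self, name):
        # Return a fresh in-memory byte stream for the requested name.
        return io.BytesIO(self._streams[name])


document = StubDocument({"body_text": "Hello, index!".encode("utf-8")})

stream = document.get_data_stream("body_text")
text = stream.read().decode("utf-8")  # read everything, then decode (assumed UTF-8)

stream.seek(0)  # rewind before reading again
first_five = stream.read(5)  # read only the first five bytes
```

Reading only the bytes you need, as in the last line, is one way to limit the cost of processing large streams.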

Add Data Stream

Used to add a data stream.

Example
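
As an illustration of the question and answer scenario described for body_html, the sketch below reads a stream, injects <head> elements, and stores the result. Everything here is hypothetical: the `StubDocument` class, the `get_data_stream` and `add_data_stream(name, stream)` signatures, the `inject_head` helper, and the `styles.css` file name are assumptions made so that the example runs locally.

```python
import io

HEAD_MARKUP = '<head><link rel="stylesheet" href="styles.css"></head>'


def inject_head(html, head_markup):
    """Insert head_markup right after the opening <html> tag, if there is one."""
    marker = "<html>"
    position = html.find(marker)
    if position == -1:
        # No <html> tag: put the head markup in front of the content.
        return head_markup + html
    insert_at = position + len(marker)
    return html[:insert_at] + head_markup + html[insert_at:]


class StubDocument:
    """Hypothetical stand-in for the document object, for local testing only."""

    def __init__(self, streams):
        self.streams = dict(streams)  # {stream name: bytes}

    def get_data_stream(self, name):
        return io.BytesIO(self.streams[name])

    def add_data_stream(self, name, stream):
        # Store the full content of the new stream under the given name.
        self.streams[name] = stream.getvalue()


document = StubDocument({"body_html": b"<html><body><p>Answer</p></body></html>"})

original = document.get_data_stream("body_html").read().decode("utf-8")
updated = io.BytesIO(inject_head(original, HEAD_MARKUP).encode("utf-8"))
document.add_data_stream("body_html", updated)
```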

Reject

Used to set the item as rejected.

Document Object JSON Schema

The Document object can be represented with the following JSON schema.


