Creating an indexing pipeline extension involves writing Python code that uses the document object to manipulate item properties (see Creating an Indexing Pipeline Extension with the API and Coveo Cloud V2 Indexing Pipeline).

This topic provides reference information describing the object methods and their parameters.

Get URI

This method is used to get the item URI.

Get Metadata

This method is used to get all item metadata.

It returns a list of MetaDataValue objects (see Document Object JSON Schema).

Get Metadata Value

Used to get the values of a metadata for a given origin.

Parameters

  • name: The name of the metadata.
  • origin: [Optional] The metadata value set by one of the following components:
      • crawler: The metadata value set during the Crawling stage
      • converter: The metadata value set during the Processing stage
      • mapping: The metadata value set during the Mapping stage
    If no value is supplied and the reverse value is True, the most recent origin is considered, i.e. crawler in preconversion and mapping in postconversion.
  • reverse: [Optional] Boolean that determines whether the metadata origins are considered in reverse order. The default value is True, meaning that the value is fetched from the latest indexing pipeline stage with a non-empty value.

It returns a list of strings.
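
The origin and reverse rules described above can be modelled in plain Python. The following is a minimal sketch, not the extension API itself: the resolve_metadata_value function, the STAGE_ORDER list, and the dictionary layout are hypothetical names chosen for illustration, and the behavior for reverse set to False (earliest stage first) is an assumption inferred from the parameter description.

```python
# Hypothetical model of the origin/reverse lookup; not the Coveo API.
STAGE_ORDER = ["crawler", "converter", "mapping"]  # earliest to latest stage


def resolve_metadata_value(values_by_origin, origin=None, reverse=True):
    """Return the list of values for one metadata name.

    values_by_origin maps an origin name to the list of strings that
    origin set (an illustrative layout assumed for this sketch).
    """
    if origin is not None:
        # An explicit origin is returned as-is, even when it is empty.
        return list(values_by_origin.get(origin, []))
    # Assumption: reverse=True walks from the latest stage backwards,
    # reverse=False walks from the earliest stage forwards.
    stages = reversed(STAGE_ORDER) if reverse else STAGE_ORDER
    for stage in stages:
        values = values_by_origin.get(stage, [])
        if values:  # the first stage with a non-empty value wins
            return list(values)
    return []


title = {"crawler": ["Raw title"], "converter": [], "mapping": ["Mapped title"]}
latest = resolve_metadata_value(title)
earliest = resolve_metadata_value(title, reverse=False)
from_converter = resolve_metadata_value(title, origin="converter")
```

In this model, omitting origin skips stages that set no value, while naming an origin explicitly returns exactly what that stage set, even if it is an empty list.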

Add Metadata

Used to add an item metadata.

Replace metadata_name and metadata_value with the chosen values.

Used to unset an item metadata.

Log

Use this method in your extension script to send a log entry to the Coveo Cloud source logs.

A good practice is to log messages only for debugging or troubleshooting purposes, such as when an error condition occurs. Avoid systematically sending log messages for all processed items to prevent flooding the logs with useless information.

You can view logged messages from the Coveo Cloud administration console Log Browser page (see Log Browser).

Parameters

  • message (string): The logged message text. The message length is limited to approximately 4K characters. Longer messages are truncated.
  • severity (string): Indication of the message severity or type. The allowed severity values are:
      • fatal
      • error
      • important
      • normal
      • debug
      • notification
      • warning
      • detail
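
The logging guidance above can be wrapped in a small helper. This is a sketch under stated assumptions: safe_log, record, and MAX_MESSAGE_LENGTH are hypothetical names, log_fn stands in for whatever logging callable your extension environment provides, and 4096 is only an interpretation of "approximately 4K characters".

```python
# Hypothetical helper; log_fn stands in for the extension logging callable.
ALLOWED_SEVERITIES = {
    "fatal", "error", "important", "normal",
    "debug", "notification", "warning", "detail",
}
MAX_MESSAGE_LENGTH = 4096  # assumption: reading of "approximately 4K characters"


def safe_log(log_fn, message, severity):
    """Validate the severity, trim the message, then forward to log_fn."""
    severity = severity.lower()
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError("Unsupported severity: " + severity)
    # Trim locally so the cut-off point is predictable.
    log_fn(message[:MAX_MESSAGE_LENGTH], severity)


# Stand-in logger that records entries so the sketch runs anywhere.
entries = []


def record(message, severity):
    entries.append((severity, message))


try:
    int("not a number")  # simulated failure while processing an item
except ValueError as error:
    # Log only when an error condition occurs, not for every item.
    safe_log(record, "Could not parse value: " + str(error), "error")
```

Passing the logging callable as a parameter keeps the helper testable outside the indexing pipeline; inside an extension you would pass the real logging function instead of record.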


Get Permissions

Used to get all item permissions.

It returns a list of PermissionLevel objects (see Document Object JSON Schema).

Clear Permissions

Used to clear all item permissions.

Add Allowed Permission

Used to add an allowed security identity.

Parameters

  • identity: The name of the allowed security identity to add.
  • identity_type: The security identity type, which can be:
      • user
      • group
      • virtualgroup
      • unknown
  • security_provider: The name of the security identity provider.
  • additional_info: A collection of key-value pairs that can be used to uniquely identify the security identity.

Add Denied Permission

Used to add a denied security identity.

Parameters

  • identity: The name of the denied security identity to add.
  • identity_type: The security identity type, which can be:
      • user
      • group
      • virtualgroup
      • unknown
  • security_provider: The name of the security identity provider.
  • additional_info: A collection of key-value pairs that can be used to uniquely identify the security identity.
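
The four parameters shared by the allowed and denied permission methods can be checked before they are passed along. The sketch below is illustrative only: build_identity and the dictionary layout are hypothetical and do not reproduce the document object schema, and the identity and provider names are placeholders.

```python
# Hypothetical helper; the dictionary layout is illustrative only.
IDENTITY_TYPES = {"user", "group", "virtualgroup", "unknown"}


def build_identity(identity, identity_type, security_provider, additional_info=None):
    """Collect the four parameters described above into one dictionary."""
    if identity_type not in IDENTITY_TYPES:
        raise ValueError("Unsupported identity type: " + identity_type)
    return {
        "identity": identity,
        "identity_type": identity_type,
        "security_provider": security_provider,
        # Key-value pairs that help identify the identity uniquely.
        "additional_info": dict(additional_info or {}),
    }


# Placeholder names; replace them with identities from your own provider.
allowed = build_identity("alice@example.com", "user", "Example Provider")
denied = build_identity(
    "Contractors", "group", "Example Provider", {"domain": "example.com"}
)
```

Validating identity_type up front surfaces a typo such as "users" in the extension itself rather than later in the pipeline.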

Set Permissions

Used to set item permissions.

The permission model complexity can range from allowing full anonymous access to requiring the resolution of permissions for several permission levels, each containing one or more permission sets.

Example

Get Data Streams

You can optionally get the data streams of an item when you need to read or modify them. To optimize indexing performance, include and process a data stream only when you need it.

Used to get the item data streams. 

Used to get a data stream for a given origin.

Tip:

For Web and Sitemap type sources, it is recommended to use the web scraping feature rather than extensions to do common HTML content processing such as excluding sections and extracting metadata (see Web Scraping Configuration).

Parameters

name: The name of the data stream to get. The available item data streams are:

  • documentdata
    The complete item binary content extracted by the Crawling stage of the indexing pipeline (see Coveo Cloud V2 Indexing Pipeline).

    Example:

    The documentdata of a PDF file is the actual PDF file.

    The documentdata of a web page is the page HTML markup.

    You may want to retrieve the documentdata of an item in a Preconversion extension in rare cases where you want to modify the original item content.

    Example:
    You indexed scanned items that are saved as image files. You want to index the text content of the images. You use a preconversion extension to read the documentdata of each image, send it to a third-party optical character recognition (OCR) service, and save the returned text back in the documentdata so that the Processing stage can prepare the text content for the Indexing stage.

    Getting the documentdata can significantly degrade indexing performance because the binary data of each item has to be fetched, decompressed, and decrypted.
    There is generally no point in getting and modifying the documentdata in a postconversion extension because the Indexing stage does not process it.

    In the Coveo Cloud administration console Add/Edit an Extension panel, the documentdata is referred to as the Original file.

  • body_text
    The complete textual content of an item extracted by the converter in the Processing stage of the indexing pipeline (see Coveo Cloud V2 Indexing Pipeline).
    You can get the body_text of each item in a postconversion extension for rare cases where you want to access and possibly modify the item text content.
    There is no point in getting and modifying the body_text in a preconversion extension because the Processing stage would overwrite it.

    Note:

    For index size and performance optimization, the body_text is limited in size to 50 MB. This means that for the rare items with a larger body_text, the excess text is not indexed and is therefore not searchable.


  • body_html
    The complete HTML representation of an item created by the converter in the Processing stage of the indexing pipeline (see Coveo Cloud V2 Indexing Pipeline). The body_html appears in the Quick View of a search result item.
    You can get the body_html of each item in a postconversion extension for cases where you want to access and possibly modify the item text content.

    Example:

    Your source indexes a question and answer website. Each question and each answer is indexed as a separate item even if they can come from the same HTML page. Your indexed items do not have the <head> elements from the original HTML page and therefore are missing resources such as CSS. Consequently, the Quick View for these items does not look good.

    You get the body_html in an extension and inject the appropriate <head> elements.

    There is no point in getting and modifying the body_html in a preconversion extension because the Processing stage would overwrite it.

    Notes:

    When you can define your desired body_html content as a static HTML markup containing metadata placeholders, it is generally simpler to use a mapping on the body field (see Add/Edit Mapping).

    For index size and performance optimization, the body_html is limited in size to 10 MB. This means that the Quick View of items with larger body_html will be truncated.

  • $thumbnail$
    The thumbnail image created by the converter in the Processing stage of the indexing pipeline for specific file types (Microsoft Word, Excel, PowerPoint, and Visio, as well as many image file types such as JPG, BMP, GIF, TIF, PSD, and PNG).
    You can get the $thumbnail$ in a postconversion extension in the rare cases where you want to modify the thumbnail or extract information from the thumbnail image.
    Your thumbnail image can have any size, resolution, or format (as long as a browser can display it), but it is a good practice to stick to a normalized image size and resolution.

    If you want to overwrite the thumbnail (or create one) you do not need to get the $thumbnail$.


origin: [Optional] The stream value set by one of the following components:

  • crawler: The stream value set during the Crawling stage
  • converter: The stream value set during the Processing stage
  • mapping: The stream value set during the Mapping stage

If no value is supplied and the reverse value is True, the most recent origin is considered, i.e. crawler in preconversion and mapping in postconversion.

reverse: [Optional] Boolean that determines whether the stream origins are considered in reverse order. The default value is True, meaning that the stream is fetched from the latest indexing pipeline stage with a non-empty stream.

It returns a stream of in-memory bytes (see Python Buffered Streams).
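
Because the returned object is an in-memory byte stream, it can be handled like a standard Python io.BytesIO object. The sketch below builds its own stand-in stream instead of calling the extension API, and it assumes UTF-8 text; the actual encoding of a given data stream is not stated here and should be verified before decoding.

```python
import io

# Stand-in for a stream returned for a text data stream such as body_text.
# Assumption: the content is UTF-8 encoded; verify the real encoding first.
stream = io.BytesIO("Hello, wörld".encode("utf-8"))

raw = stream.read()           # read all remaining bytes
text = raw.decode("utf-8")    # bytes -> str

stream.seek(0)                # rewind before reading again
first_five = stream.read(5)   # read only the first five bytes

# Build a modified copy as a new in-memory stream.
modified = io.BytesIO()
modified.write(text.upper().encode("utf-8"))
modified_bytes = modified.getvalue()
```

Reading consumes the stream, so call seek(0) before a second read; getvalue() returns the full buffer regardless of the current position.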

Add Data Stream

Used to add a data stream.

Example

Reject

Used to set the item as rejected.

Document Object JSON Schema

The Document object can be represented with the following JSON schema.


