Search and indexing mechanics
Afi uses custom parsing, indexing, and search logic for each workload kind, tailored to the workload type and its searchable properties and fields. The service can match query terms across all searchable fields at once (the default mode) or search for matches only in specified fields (for example, emails with a certain term in the subject). When a query contains several search terms, it matches only the items that contain all of them. You can also apply additional filters to limit the search scope by item date and by item visibility in the selected backup snapshot.
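The matching behavior described above (all terms must match, either across every field or within one chosen field) can be illustrated with a minimal sketch. The `matches` function and the item layout (a dictionary that maps field names to sets of indexed tokens) are hypothetical and are not part of the Afi service or its API.

```python
def matches(item, terms, field=None):
    """Return True if every search term occurs in the item (AND semantics).

    item  -- hypothetical index entry: field name -> set of lowercase tokens
    terms -- list of search terms
    field -- if given, only that field is searched; otherwise all fields are
    """
    if field is None:
        # Default mode: a term may match in any searchable field.
        tokens = set().union(*item.values())
    else:
        tokens = item.get(field, set())
    return all(term.lower() in tokens for term in terms)


email = {
    "subject": {"quarterly", "budget"},
    "body": {"please", "review", "attached", "report"},
}

print(matches(email, ["budget", "report"]))                   # True
print(matches(email, ["budget", "report"], field="subject"))  # False
print(matches(email, ["budget", "invoice"]))                  # False
```

In the default mode the two terms may be found in different fields of the same item, while a field-restricted search requires all of them in that one field.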
For text-based content, the service offers full-text search that supports exact term matching as well as language detection and wordform-based matching for English, French, German, Spanish, Portuguese, Italian, and a number of other European languages. For the CJK (Chinese, Japanese, Korean) language group, Afi uses bigram-based matching.
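As an illustration of bigram-based matching in general, the sketch below splits text into overlapping two-character tokens and treats a query as a match when all of its bigrams are present in the indexed text. The function names are hypothetical, and the exact rules Afi applies to CJK content may differ.

```python
def cjk_bigrams(text):
    """Split text into overlapping two-character tokens (bigrams)."""
    chars = [ch for ch in text if not ch.isspace()]
    if len(chars) < 2:
        # A single character is kept as its own token.
        return chars
    return ["".join(pair) for pair in zip(chars, chars[1:])]


def bigram_match(indexed_text, query):
    """Return True if every bigram of the query occurs in the indexed text."""
    indexed = set(cjk_bigrams(indexed_text))
    return all(gram in indexed for gram in cjk_bigrams(query))


print(cjk_bigrams("東京都庁"))          # ['東京', '京都', '都庁']
print(bigram_match("東京都庁", "都庁"))  # True
print(bigram_match("東京都庁", "大阪"))  # False
```

Bigrams avoid the need for a dictionary-based word segmenter, at the cost of occasionally matching character pairs that span a word boundary (for example, `京都` in the text above).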
During data indexing, Afi preprocesses item fields and properties based on their type: it cleans up the content, removes stopwords (short common words like `the`, `of`, `an`), detects the content language where applicable, and performs tokenization and stemming. The same preprocessing is applied to the search queries provided by a user. Tokenization rules for the different kinds of fields, as well as the search capabilities these rules enable, are explained below.
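The idea of running the same preprocessing over indexed content and search queries can be sketched as follows. This is a simplified illustration and not Afi's implementation: the stopword list is a small illustrative subset, `stem` is a toy suffix stripper standing in for a real language-aware stemmer, and language detection is omitted.

```python
import re

STOPWORDS = {"the", "of", "an", "a", "to"}  # illustrative subset only
SUFFIXES = ("ing", "es", "s")               # toy stemmer, English-like text only


def stem(token):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize, drop stopwords, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(token) for token in tokens if token not in STOPWORDS]


indexed = preprocess("The reports of an auditor")
print(indexed)                   # ['report', 'auditor']
print(preprocess("reporting"))   # ['report']
```

Because the query `reporting` and the indexed word `reports` reduce to the same stem, the query matches the item even though the exact word form differs.
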
Field type | Matching rules |
---|---|
Text-based fields | Text-based fields include email, calendar event, or chat message subject and body, as well as similar fields for other workloads that contain arbitrary text-based content. During indexing, text-based fields are cleaned up and tokenized based on the detected content language to enable word-form matching. |
Email addresses | Email address indexing rules apply to Gmail, Exchange, Google Chat, and Team Chat message recipients and senders, email fields in other object types (for example, email fields for Google and Exchange contact objects), as well as to email addresses present in arbitrary text-based content. During indexing, email fields are parsed and tokenized to enable matching both by name and domain components. For example, an email address `John Doe <john.doe@test.example.com>` is searchable by the following keywords: `john`, `doe`, `john.doe`, `test`, `example`, `com`, `example.com`, and `test.example.com`. |
Phone numbers | Phone number indexing rules apply to the corresponding fields in Google and Exchange contacts as well as to similar fields for other workloads, for example, phone number fields in Entra ID (Azure Active Directory) user items. During indexing, phone number fields are parsed to detect the extension and the main part, and tokenized to enable matching by a substring of digits from the original number. For example, a phone number `+1-305-200-22-33, ext. 100` is searchable by the following keywords: `305`, `200-22-33`, `2233`, `100`, as well as other number parts. |
File names | File name fields include file and folder names in Google Drive, OneDrive, and SharePoint, as well as item attachment names for all workloads where attachments are supported (emails, chat messages, tasks, etc.). During indexing, file name fields are split into tokens by special symbols (spaces, dots, commas, etc.), excluding short tokens (like `to` or `a`), and support matching by any resulting token (including the file extension) or by the exact field value. For example, a file name `2024_Sales reports-2024.06.01.docx` is searchable by the following keywords: `2024`, `sales`, `report`, `reports`, `2024.06.01`, `docx`, as well as `2024_sales reports-2024.06.01.docx`. To ensure exact file extension matching, use search terms like `.<extension>`, for example, `.docx`. |
Custom (keyword) fields | Custom fields include item names (except for file/folder names), identifiers, categories, etc. During indexing, custom fields are split by spaces and support exact matching by any space-separated part or by the exact field value. For example, a contact name `John Doe` is searchable by the `john` and `doe` keywords, as well as by `john doe`. |
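
The email, phone number, and custom field rules from the table can be approximated with the sketch below, which reproduces the documented examples. All function names and regular expressions are assumptions made for illustration; the actual parsing rules used by Afi are not published here and may handle more formats and edge cases.

```python
import re


def email_keywords(value):
    """Keywords for a value such as 'John Doe <john.doe@test.example.com>'."""
    keywords = set()
    display = value
    found = re.search(r"([\w.+-]+)@([\w.-]+)", value)
    if found:
        local, domain = found.group(1).lower(), found.group(2).lower()
        display = value[: found.start()]      # text before the address
        keywords.add(local)
        keywords.update(local.split("."))     # name components
        labels = domain.split(".")
        keywords.update(labels)               # individual domain labels
        # Domain suffixes: test.example.com, example.com, com
        keywords.update(".".join(labels[i:]) for i in range(len(labels)))
    keywords.update(re.findall(r"\w+", display.lower()))
    keywords.discard("")
    return keywords


def phone_matches(value, query):
    """Match a query by a substring of digits of the main part or extension."""
    digits = re.sub(r"\D", "", query)
    if not digits:
        return False
    found = re.search(r"ext\.?\s*(\d+)", value, flags=re.IGNORECASE)
    extension = found.group(1) if found else ""
    main = re.sub(r"\D", "", value[: found.start()] if found else value)
    return digits in main or digits in extension


def keyword_field_keywords(value):
    """Space-separated parts plus the exact (lowercased) field value."""
    lowered = value.lower().strip()
    return set(lowered.split()) | {lowered}


print(sorted(email_keywords("John Doe <john.doe@test.example.com>")))
print(phone_matches("+1-305-200-22-33, ext. 100", "200-22-33"))  # True
print(sorted(keyword_field_keywords("John Doe")))  # ['doe', 'john', 'john doe']
```

Stripping punctuation from both the stored number and the query is what lets `200-22-33` and `2233` match the same phone number regardless of how it was originally formatted. File name tokenization is not covered by this sketch.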