De-duplication in eDiscovery search results

This article describes how de-duplication of eDiscovery search results works and explains the limitations of the de-duplication algorithm.

When using Office 365 eDiscovery tools to export the results of an eDiscovery search, you have the option to de-duplicate the results that are exported. What does this mean? When you enable de-duplication (by default, de-duplication isn't enabled), only one copy of an email message is exported even though multiple instances of the same message might have been found in the mailboxes that were searched. De-duplication helps you save time by reducing the number of items that you have to review and analyze after the search results are exported. But it's important to understand how de-duplication works and be aware that there are limitations to the algorithm that might cause a unique item to be marked as a duplicate during the export process.

Content

How duplicate messages are identified

Limitations of the de-duplication algorithm

More information

How duplicate messages are identified

Office 365 eDiscovery tools use a combination of the following email properties to determine whether a message is a duplicate:

  • InternetMessageId    This property specifies the Internet message identifier of an email message, which is a globally unique identifier that refers to a specific version of a specific message. This ID is generated by the sender's email client program or host email system that sends the message. If a person sends a message to more than one recipient, the Internet message ID will be the same for each instance of the message. Subsequent revisions to the original message will receive a different message identifier.

  • ConversationTopic    This property specifies the subject of the conversation thread of a message. The value of the ConversationTopic property is the string that describes the overall topic of the conversation. A conversation consists of an initial message and all messages sent in reply to the initial message. Messages within the same conversation have the same value for the ConversationTopic property. The value of this property is typically the Subject line from the initial message that spawned the conversation.

  • BodyTagInfo    This is an internal Exchange store property. The value of this property is calculated by checking various attributes in the body of the message. This property is used to identify differences in the body of messages.

During the eDiscovery export process, these three properties are compared for every message that matches the search criteria. If all three properties are identical for two or more messages, those messages are treated as duplicates, and only one copy of the message is exported if de-duplication is enabled. The message that is exported is known as the "source item". Information about duplicate messages is included in the Results.csv and Manifest.xml reports that accompany the exported search results. In the Results.csv file, a duplicate message is identified by a value in the Duplicate to Item column. This value matches the value in the Item Identity column for the message that was exported.
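The comparison described above can be pictured with a minimal Python sketch. It is an illustration only, not the actual Office 365 implementation: the Message class and the deduplicate function are hypothetical names, BodyTagInfo is treated as an opaque string because its calculation is internal to the Exchange store, and the choice of the first message encountered as the source item is an assumption (the article doesn't say which copy is kept).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    # Hypothetical record type used only for this illustration.
    item_identity: str         # stand-in for the Item Identity value in the reports
    internet_message_id: str   # InternetMessageId
    conversation_topic: str    # ConversationTopic
    body_tag_info: str         # BodyTagInfo, treated here as an opaque value


def dedup_key(msg):
    """Combine the three properties that are compared during export."""
    return (msg.internet_message_id, msg.conversation_topic, msg.body_tag_info)


def deduplicate(messages):
    """Return (source_items, duplicate_of).

    source_items holds one message per distinct key. duplicate_of maps the
    identity of each duplicate to the identity of its source item.
    Assumption: the first message seen for a key becomes the source item.
    """
    first_seen = {}
    source_items = []
    duplicate_of = {}
    for msg in messages:
        key = dedup_key(msg)
        if key in first_seen:
            duplicate_of[msg.item_identity] = first_seen[key].item_identity
        else:
            first_seen[key] = msg
            source_items.append(msg)
    return source_items, duplicate_of
```

In this sketch, two messages are duplicates only when all three values match; a difference in any one of them produces a separate source item.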

The following graphics show how duplicate messages are displayed in the Results.csv and Manifest.xml reports that are exported with the search results. These reports don't include the email properties previously described, which are used in the de-duplication algorithm. Instead, the reports include the Item Identity property that is assigned to items by the Exchange store. 

Results.csv report (viewed in Excel)

[Screenshot: Viewing info about duplicate items in the Results.csv report]

Manifest.xml report (viewed in Excel)

[Screenshot: Viewing info about duplicate items in the Manifest.xml report]

Additionally, other properties from duplicate messages are included in the export reports. This includes the mailbox the duplicate message is located in, whether the message was sent to a distribution group, and whether the message was Cc'd or Bcc'd to another user.
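If you want to review which exported items stand in for duplicates, the Results.csv report can be processed with a short script. The following Python sketch groups duplicates under their source item. It relies only on the Item Identity and Duplicate to Item columns named in this article; the function name group_duplicates and the sample rows are invented for illustration, and a real report contains additional columns and may need a specific file encoding.

```python
import csv
import io
from collections import defaultdict


def group_duplicates(csv_file):
    """Map each source item's identity to the identities of its duplicates.

    csv_file is any open text file (or file-like object) in Results.csv form.
    A row counts as a duplicate when its Duplicate to Item column has a value.
    """
    duplicates = defaultdict(list)
    for row in csv.DictReader(csv_file):
        source_id = (row.get("Duplicate to Item") or "").strip()
        if source_id:
            duplicates[source_id].append(row["Item Identity"])
    return dict(duplicates)


# Invented sample data; real Item Identity values look different.
sample = io.StringIO(
    "Item Identity,Duplicate to Item\n"
    "item-001,\n"
    "item-002,item-001\n"
    "item-003,\n"
    "item-004,item-001\n"
)
print(group_duplicates(sample))  # {'item-001': ['item-002', 'item-004']}
```

To run it against an exported report, pass a file opened with open(path, newline="") instead of the in-memory sample.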

Limitations of the de-duplication algorithm

There are some known limitations of the de-duplication algorithm that might cause unique items to get marked as duplicates. It's important to understand these limitations so you can decide whether or not to use the optional de-duplication feature.

In one situation, the de-duplication feature might mistakenly identify a unique message as a duplicate and not export it (although the message is still cited as a duplicate in the export reports). This happens with messages that a user edits but doesn't send. For example, let's say a user selects a message in Outlook, copies the contents of the message, and then pastes them into a new message. Then the user changes one of the copies by removing or adding an attachment, or by changing the subject line or the body itself. If these two messages match the query of an eDiscovery search, only one of them will be exported if de-duplication is enabled when the search results are exported. Even though the original message or the copied message was changed, neither message was sent after it was revised, and therefore the values of the InternetMessageId, ConversationTopic, and BodyTagInfo properties weren't updated. But as previously explained, both messages will be listed in the export reports.

Unique messages can also be marked as duplicates when the Copy-on-Write page protection feature is enabled, as is the case when a mailbox is on Litigation Hold or In-Place Hold. The Copy-on-Write feature copies the original message (and saves it in the Versions folder of the user's Recoverable Items folder) before the revision to the original item is saved. In this case, the revised copy and the original message (in the Recoverable Items folder) might be considered duplicates, and therefore only one of them would be exported.
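The edited-but-unsent case can be made concrete with a small Python sketch. It is a conceptual illustration under stated assumptions, not a tool that works on real export reports: the dictionary keys id, subject, and attachments are hypothetical, the property values are invented, and this article doesn't state whether the export reports expose enough detail to run such a check. The sketch groups in-memory records by the three compared properties and returns any group whose visible content differs, which is exactly the kind of group where a unique item would be dropped.

```python
from collections import defaultdict


def dedup_key(msg):
    """The three properties compared during export."""
    return (msg["InternetMessageId"], msg["ConversationTopic"], msg["BodyTagInfo"])


def groups_with_differing_content(messages):
    """Return groups that share a key but differ in subject or attachments."""
    groups = defaultdict(list)
    for msg in messages:
        groups[dedup_key(msg)].append(msg)
    return [
        group
        for group in groups.values()
        if len({(m["subject"], tuple(m["attachments"])) for m in group}) > 1
    ]


# Invented example: a copy that was edited but never sent keeps the
# same three property values as the message it was copied from.
original = {
    "id": "draft-1",
    "InternetMessageId": "<abc@example.com>",
    "ConversationTopic": "Q3 forecast",
    "BodyTagInfo": "tag-1",
    "subject": "Q3 forecast",
    "attachments": ["forecast.xlsx"],
}
edited_copy = dict(original, id="draft-2", subject="Q3 forecast (revised)", attachments=[])

suspects = groups_with_differing_content([original, edited_copy])
print([[m["id"] for m in group] for group in suspects])  # [['draft-1', 'draft-2']]
```

Both records produce the same key, so only one of them would be exported with de-duplication enabled, even though their subject lines and attachments differ.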

Important: If the limitations of the de-duplication algorithm might impact the quality of your search results, then you shouldn't enable de-duplication when you export items. If the situations described in this section are unlikely to be a factor in your search results, and you want to reduce the number of items most likely to be duplicates, then you should consider enabling de-duplication.

More information
