Understand document similarity

Note: Advanced eDiscovery requires an Office 365 E5 subscription for your organization. If you don't have that plan and want to try Advanced eDiscovery, you can sign up for a trial of Office 365 Enterprise E5.

In Advanced eDiscovery, Document Similarity is the minimal level of resemblance required for two documents to be considered as near-duplicates.

Tip: For most business applications, it is recommended to use a Similarity value of 60%-75%. For very poor quality optical character recognition (OCR) material, lower Similarity values can be applied.

Note: After it's set and run for a given case, the Similarity value cannot be changed.

Within a Near-duplicate (ND) set, there may be documents with a level of resemblance below the Similarity threshold. For a document to join an ND set, there must be at least one document in the ND set with a level of resemblance exceeding the Similarity.

For example, assume the Similarity is set to 80%, document F1 resembles document F2 at a level of 85%, and document F2 resembles document F3 at a level of 90%.

However, document F1 may resemble document F3 at a level of only 70%, which is below the threshold. Nonetheless, in this example, documents F1, F2, and F3 all appear in the one ND set. Similarly, using a Similarity value of 80%, we may have created two sets, EquiSet-1 and EquiSet-2. EquiSet-1 contains documents E1 and E2. Equiset-2 contains documents F1, F2, and F3.

The levels of resemblance are illustrated as follows:

Document similarity

Assume that another document, X1, is now inserted. The resemblance between X1 and E3 is 87%. Similarly, the resemblance between X1 and F1 is 92%. As a result, EquiSet -1, EquiSet -2, and X1 are now combined into one ND set.

Document Similarity

Note: If any two documents are assigned to one ND set, they will remain together in the same ND set, even if additional documents are added to the set or if the sets are merged.

After sets are merged, the Pivot document can change when new documents are added to a set.

See Also

Office 365 Advanced eDiscovery

Setting Analyze options

Setting ignore text

Setting Analyze advanced settings

Viewing Analyze results

Share Facebook Facebook Twitter Twitter Email Email

Was this information helpful?

Great! Any other feedback?

How can we improve it?

Thank you for your feedback!

×