
Linked – Using Near-Duplication to Dedupe Document Collections Can be Dangerous

Image by CPOA

“The three major distinctions are:

Per Family (email + attachment) vs. Per Document
Deduplication is performed on the family level, while near-duplication is performed on the document level.
Textual Analysis vs. File Analysis
Near-duplicate detection uses only the text AND white space to compare documents, but deduplication uses a set of criteria based on the actual metadata of the files.
Duplicates vs. Similarities
Deduplication removes identical document families, while near-duplicate detection groups documents together by similarity.”
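
To make the contrast concrete, here is a minimal Python sketch of the two operations. It is an illustration under stated assumptions, not how any particular review platform works: a family is modeled as a list of strings (an email body followed by its attachments), a SHA-256 hash of the content stands in for whatever file and metadata criteria a real tool uses, and similarity is measured as Jaccard overlap of three-word shingles, which ignores the white space a real near-duplicate engine may consider. The function names and the 0.5 threshold are hypothetical.

```python
import hashlib
import re


def family_hash(family):
    """Exact-match key for a whole family (an email plus its attachments)."""
    digest = hashlib.sha256()
    for doc in family:
        # Assumption: content hashing stands in for tool-specific criteria.
        digest.update(hashlib.sha256(doc.encode("utf-8")).digest())
    return digest.hexdigest()


def dedupe_families(families):
    """Keep the first family seen for each key; drop identical later ones."""
    seen, kept = set(), []
    for family in families:
        key = family_hash(family)
        if key not in seen:
            seen.add(key)
            kept.append(family)
    return kept


def shingles(text, size=3):
    """Set of overlapping word n-grams taken from the lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Share of shingles the two texts have in common (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def group_near_duplicates(docs, threshold=0.5):
    """Greedy grouping: a document joins the first group whose first member
    it resembles. Nothing is removed; documents are only grouped."""
    groups = []
    for doc in docs:
        for group in groups:
            if jaccard(group[0], doc) >= threshold:
                group.append(doc)
                break
        else:
            groups.append([doc])
    return groups
```

The design point is in the return values: `dedupe_families` returns fewer families than it was given, while `group_near_duplicates` returns every document it was given, only arranged into groups. Using the second as if it were the first would discard documents that merely resemble each other.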

Deduplication is not the same as identifying near-duplicates. That said, there are good reasons to do both, as long as you understand the differences and what you are trying to accomplish with each.

I’m a big fan of using near-duplication technologies to cluster similar content together. Our brains simply function better when we can focus on one subject at a time, so document review done this way is more efficient, period.

