
Linked – Using Near-Duplication to Dedupe Document Collections Can be Dangerous

Image by CPOA

“The three major distinctions are:

Per Family (email + attachment) vs. Per Document
Deduplication is performed on the family level, while near-duplication is performed on the document level.

Textual Analysis vs. File Analysis
Near-duplicate detection uses only the text AND white space to compare documents, but deduplication uses a set of criteria based on the actual metadata of the files.

Duplicates vs. Similarities
Deduplication removes identical document families, while near-duplicate detection groups documents together by similarity.”
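
To make the contrast concrete, here is a minimal sketch, not any vendor's actual implementation. It assumes exact deduplication can be approximated by hashing each family's raw bytes (real tools use their own file and metadata criteria) and that near-duplicate detection can be approximated with a character-level similarity ratio from Python's standard `difflib`. The function names `family_hash` and `text_similarity` and the sample documents are hypothetical.

```python
import hashlib
from difflib import SequenceMatcher


def family_hash(family):
    """Hash an email family (parent first, then attachments) as one unit."""
    digest = hashlib.sha256()
    for part in family:
        # Hash each member separately so member boundaries affect the result.
        digest.update(hashlib.sha256(part).digest())
    return digest.hexdigest()


def text_similarity(a, b):
    """Similarity ratio in [0, 1] over raw text, whitespace included."""
    return SequenceMatcher(None, a, b).ratio()


email = b"Please see the attached contract."
family_a = [email, b"The supplier shall indemnify the buyer."]
family_b = [email, b"The supplier shall indemnify the buyer."]
family_c = [email, b"The supplier shall not indemnify the buyer."]

# Exact deduplication: only byte-identical families collapse together.
same_ab = family_hash(family_a) == family_hash(family_b)
same_ac = family_hash(family_a) == family_hash(family_c)

# Near-duplicate detection: compares the two attachments as documents.
score_ac = text_similarity(family_a[1].decode(), family_c[1].decode())

print(same_ab, same_ac, round(score_ac, 2))
```

In this sketch families A and B hash identically, so dropping one loses nothing. The attachments in A and C score above 0.9 on similarity even though one says "shall" and the other "shall not", which is exactly the kind of document you do not want silently removed as a "duplicate".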

Deduplication is not the same as identifying near-duplicates. That said, there are plenty of reasons to do both, so long as you understand how they differ and what you are trying to accomplish with each.

I’m a big fan of using near-duplicate detection to cluster similar content together. Our brains simply function better when we can focus on one subject at a time, so document review done this way is more efficient, period.
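
As a rough illustration of grouping for review rather than removal, the sketch below batches documents whose text is similar to the first document in a group. The greedy pivot approach, the 0.8 threshold, the `group_near_duplicates` name, and the document IDs are all assumptions made for this example; commercial review platforms use their own clustering methods.

```python
from difflib import SequenceMatcher


def group_near_duplicates(docs, threshold=0.8):
    """Greedy grouping: a document joins the first group whose first
    member (the pivot) it resembles at or above the threshold."""
    groups = []
    for doc_id, text in docs.items():
        for group in groups:
            pivot_text = docs[group[0]]
            if SequenceMatcher(None, pivot_text, text).ratio() >= threshold:
                group.append(doc_id)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append([doc_id])
    return groups


docs = {
    "DOC-001": "Q3 budget draft: travel is capped at 10k.",
    "DOC-002": "Q3 budget draft: travel is capped at 12k.",
    "DOC-003": "Lunch on Friday? The new place on Main St.",
    "DOC-004": "Q3 budget final: travel is capped at 12k.",
}

review_batches = group_near_duplicates(docs)
print(review_batches)
```

Here the three budget versions land in one batch and the lunch email in another. Nothing is deleted: every document stays in the collection, and the reviewer simply sees the related ones side by side.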

