Linked – Using Near-Duplication to Dedupe Document Collections Can be Dangerous


“The three major distinctions are:
–Per Family (email + attachment) vs. Per Document
Deduplication is performed on the family level, while near-duplication is performed on the document level.
–Textual Analysis vs. File Analysis
Near-duplicate detection uses only the text AND white space to compare documents, but deduplication uses a set of criteria based on the actual metadata of the files.
–Duplicates vs. Similarities
Deduplication removes identical document families, while near-duplicate detection groups documents together by similarity.”
Deduplication is not the same as identifying near-duplicates. That said, there are good reasons to do both, as long as you understand the differences and what you are trying to accomplish with each.
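To make the distinction concrete, here is a minimal Python sketch of the two operations. It is an illustration under stated assumptions, not how any particular review platform works: exact duplicates are found by hashing the raw bytes of each family (real tools typically hash a vendor-specific set of metadata fields as well), and near-duplicates are grouped by Jaccard similarity over three-word shingles of the text. The function names and the 0.5 threshold are hypothetical.

```python
import hashlib


def family_hash(family):
    """Exact-match key for a family (email plus attachments), given as a list of bytes.

    Assumption: identity is decided by file content alone.
    """
    h = hashlib.sha256()
    for data in family:
        # Hash each member separately so member boundaries affect the key.
        h.update(hashlib.sha256(data).digest())
    return h.hexdigest()


def dedupe_families(families):
    """Deduplication: keep the first copy of each identical family, drop the rest."""
    seen = set()
    kept = []
    for family in families:
        key = family_hash(family)
        if key not in seen:
            seen.add(key)
            kept.append(family)
    return kept


def shingles(text, k=3):
    """Set of k-word shingles; case and white space are normalized away."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(a, b):
    """Jaccard similarity between the shingle sets of two document texts."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def group_near_duplicates(texts, threshold=0.5):
    """Near-duplicate detection: group documents by similarity, removing nothing."""
    groups = []
    for text in texts:
        for group in groups:
            # Compare against the first document in each existing group.
            if similarity(group[0], text) >= threshold:
                group.append(text)
                break
        else:
            groups.append([text])
    return groups
```

Note that `dedupe_families` discards documents while `group_near_duplicates` only clusters them. Two texts that land in the same group can still differ in a word that matters, which is why dropping everything but one member of a near-duplicate group can lose content.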
I’m a big fan of using near-duplication technologies to cluster similar content together. Our brains simply function better when we can focus on one subject at a time, so document review done in this manner is more efficient, period.