Craig Ball does a great job describing how hash values are created and used to deduplicate identical copies of documents, and how that technology fails to identify the same content when it exists in different file types. That’s why having a near-duplicate tool is also a good thing. It can help you find the same content in different files, or very similar content across multiple files, by analyzing the content itself rather than just the hash values.
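To make the limitation concrete, here is a minimal Python sketch of hash-based deduplication. It is an illustration under stated assumptions, not how any particular review platform works: SHA-256 is used here, though many tools rely on MD5 or SHA-1, and the sample text and the `file_hash` helper are made up for the example. Re-encoding the same words as UTF-16 stands in for saving the same content as a different file type.

```python
import hashlib


def file_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes (hypothetical helper)."""
    return hashlib.sha256(data).hexdigest()


# Made-up sample content for illustration only.
text = "Q3 revenue forecast: confidential"

original = text.encode("utf-8")
exact_copy = bytes(original)       # byte-for-byte identical copy
resaved = text.encode("utf-16")    # same words, different bytes on disk

# Identical bytes always produce the same hash, so exact copies deduplicate.
same_hash = file_hash(original) == file_hash(exact_copy)

# The words are unchanged, but the bytes differ, so the hashes do not match.
different_format_matches = file_hash(original) == file_hash(resaved)

print("exact copy matches:", same_hash)
print("re-saved copy matches:", different_format_matches)
```

The hash only sees bytes. Anything that changes the bytes, whether a new file format, a different encoding, or a single edited character, produces an unrelated hash value, even when a human reader would call the two documents the same.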
That’s an important part of any eDiscovery workflow: finding out where content is being used in different document formats, or being altered slightly when a new copy of a file is created. Imagine, for example, an employee leaking corporate information. A hash match may find where she has been sending out exact copies of the files. Near-duplicate technology would also help locate where she scanned a printed copy, created a PDF, or made a small change to the data before sending it on.
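As a rough illustration of what "analyzing the content" can mean, the Python sketch below compares documents by their overlapping three-word sequences (often called shingles) and scores the overlap with Jaccard similarity. This is one common textbook approach, not necessarily what any commercial near-duplicate tool does. It assumes the text has already been extracted from each file (for a scanned page, that would mean running OCR first), and the `shingles` and `jaccard` helpers and the sample sentences are hypothetical.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Split text into lowercase words and return the set of k-word sequences."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of two texts: shared shingles / all distinct shingles."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Made-up sample documents for illustration only.
original = ("The merger with Acme will be announced on the first of March "
            "at the board meeting")
edited = ("The merger with Acme will be announced on the first of April "
          "at the board meeting")
unrelated = ("Please remember to submit your parking permit renewal before "
             "the end of the week")

score_edited = jaccard(original, edited)        # one word changed
score_unrelated = jaccard(original, unrelated)  # different subject entirely

print(f"edited copy:    {score_edited:.2f}")
print(f"unrelated memo: {score_unrelated:.2f}")
```

A hash comparison would treat the edited copy as a completely different file. The similarity score instead stays high for the edited copy and drops to zero for the unrelated memo, which is the signal a reviewer needs in order to catch the "small change before sending it on" scenario. Where to set the cutoff between near-duplicate and unrelated is a judgment call that depends on the document set.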