Recently, though, I saw a good reminder of how much machine learning still depends on the proper inputs, and of course, that means someone giving the machine the proper data.
You may have seen a recent Washington Post story about Family Tree Now, a website that seems to crawl various public databases and grab all sorts of information about people: when they were born, where they lived, etc. Yeah, it’s creepy to think about all of that information being collected for the world to look at, and the Post article focuses on letting us know how to opt out of that site.
Naturally, my wife and I decided to look ourselves up and see what the site knew about us. Clearly it had access to a lot of public records, including a variety of address records going back years. Creepy? Sure. The site also had a list of possible relatives and associates, and that’s where someone seems to have made some poor choices when it came to inputs.
The first thing she noticed about her information was that the first possible relative listed was my first wife. As you might imagine, she was not thrilled, or impressed with the AI. Clearly, Family Tree Now missed some public records, like my divorce! As for me, yeah, my first wife was listed as a possible relative, as were her parents and siblings. Again, they missed a record, but fair enough. I also noticed a long list of potential associates, people I had no connection to at all. Upon further inspection, I realized that much of that list seemed to be made up of people who lived at one of my former addresses well after I had left. I’m not sure who decided that made for a potential associate.
In short, the technology to crawl through public records seems pretty decent, if maybe incomplete. The learning about what makes for a connection seems pretty illogical. But that all goes back to the programmers. The AI, I assume, was programmed to crawl, but someone didn’t include the records that would have made it clear that some family relationships had legally ended. It also used overly simple logic, matching up records without looking at the end dates of residences. The machine knew that I lived somewhere in 1996-1997, and it knew I had a different address after that (it said so), but it was still looking at people at the same address 10 years later and assuming a connection. That’s a logical fallacy, and the machine didn’t come up with it on its own. 😉
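To make the end-date point concrete, here is a minimal Python sketch. It is not how Family Tree Now works (I have no idea what their code looks like); the `Residency` record, the function names, and the sample people and address are all made up for illustration. It simply contrasts matching on address alone with matching only when two people's stays at that address overlap in time.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Residency:
    """Hypothetical record: one person's stay at one address."""
    person: str
    address: str
    start: date
    end: Optional[date] = None  # None means "still lives there"


def overlaps(a: Residency, b: Residency) -> bool:
    """True only if both stays are at the same address AND share some time."""
    if a.address != b.address:
        return False
    a_end = a.end if a.end is not None else date.max
    b_end = b.end if b.end is not None else date.max
    # Two closed date ranges intersect when each starts before the other ends.
    return a.start <= b_end and b.start <= a_end


def naive_associates(me: Residency, others: List[Residency]) -> List[str]:
    """Address-only matching: ignores when anyone actually lived there."""
    return [o.person for o in others if o.address == me.address]


def possible_associates(me: Residency, others: List[Residency]) -> List[str]:
    """Address matching that also requires overlapping dates."""
    return [o.person for o in others if overlaps(me, o)]


# Made-up sample data, loosely modeled on the 1996-1997 example above.
me = Residency("me", "12 Elm St", date(1996, 1, 1), date(1997, 12, 31))
others = [
    Residency("roommate", "12 Elm St", date(1996, 6, 1), date(1997, 3, 1)),
    Residency("later tenant", "12 Elm St", date(2007, 1, 1)),
]

naive = naive_associates(me, others)
checked = possible_associates(me, others)
```

With this sample data, `naive` contains both the roommate and the later tenant, while `checked` contains only the roommate. The only difference is that the second function bothers to look at the end date.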
Why is this important? Because whether you’re talking about Big Data analytics for business and marketing, or technology-assisted review (TAR) in the eDiscovery industry, if the inputs and algorithms aren’t correct, you may end up with the wrong results. Don’t just assume the machine knows; make sure it’s measuring what you think it should be.