Promise and Drawbacks of Data Mining

When I was in Washington State last week, I saw a demo of EntityCube from Microsoft Research. You can see it for yourself at http://entitycube.research.microsoft.com. Here’s a description of EntityCube from its ReadMe page:

EntityCube generates summaries of Web entities from billions of public Web pages that contain information about people, locations, and organizations, and allows for exploration of their relationships. For example, users can use EntityCube to find an automatically generated biography page and social-network graph for a person, and use it to discover a relationship path between two people.

There are also a number of caveats on this page that you’ll want to read before fully exploring this tool on your own.

The demo I saw focused on famous people and it seemed to work pretty well. Try George Clooney. Not the example I was given, but it seems to work.

Naturally, I’m enough of an egoist that I decided to put my own name into EntityCube. And here’s where things got really interesting for me since I believe I have a good picture of my own social network.

EntityCube pulled up a mostly correct biography for me. It supposedly uses a static set of web pages rather than the current internet, so that could account for my being listed in the position I held just before my current one.

It correctly identified me as being associated with James Jacobs, Jim Jacobs and Shinjoung Yeo – my partners at Free Government Information. It also correctly associated me with Jim Simard and Sue Sherif, two of my work colleagues. So far so good.

It also identified me as being associated with Rebecca Moorman, a smart and nice technical services librarian whom I haven’t spoken to in ten years, since she left Juneau. It also connects me with a Velma Wallis; the only connection I can think of is that we were both on the speaker schedule at the Alaska Library Association conference in Barrow in 2005. EntityCube also failed to turn up some very important relationships in my life: it missed my spouse and my two best friends.

Saving the best mischaracterization for last, EntityCube believes that I’m associated with Thomas Jefferson. In fact, when I tried the “six degrees of separation” tool to see what connection I might have to President Obama, the chain started with Thomas Jefferson and went through George Washington and Abraham Lincoln before getting to the current President. Clearly, on the internet, no one knows you are dead.

The point of this post isn’t to knock Microsoft. The ReadMe page I referenced earlier says that this is a work in progress and that the information found through EntityCube could be wrong and shouldn’t be relied on.

What does concern me is the knowledge that state and federal governments have been turning to data mining in a big way in an effort to locate terrorists and criminals before they strike. The systems they use are classified for the most part. But how do you and I know that they are so much better than EntityCube, which was put together by some very sharp computer programmers? How many erroneous conclusions are being drawn and acted upon? Is this diversion of resources hurting the fight against crime and terrorism?

