Tuesday, June 2, 2009

Who's the Boss, Steinbrenner or Springsteen?

Since I started playing with Mathematica when it first came out (too late for me to use it for the yucky path integrals in my dissertation), I just had to try Wolfram|Alpha. The vanity search didn't work; assuming that's what most people find, it's probably the death knell for W|A as a search engine. Starting with something more appropriately nerdy, I asked W|A about "Star Trek"; it responded with facts about the new movie and suggested some other movies I might mean, apparently unaware that a television show preceded it. Looking for some subtlety with a deliberately ambiguous query, I asked about "House", and it responded "Assuming 'House' is a unit | Use as a surname or a character or a book or a movie instead". My whole family is a big fan of Hugh Laurie, so I clicked on "character" and was very amused to see that to Wolfram|Alpha, the character "House" is Unicode character x2302, "⌂". Finally, not really expecting very much, I asked it about the Boss.

In New Jersey, where I live, there's only one person who is "The Boss", and that's Bruce Springsteen. If you leave off the "The", and you're also a Yankees fan, then maybe George Steinbrenner could be considered a possible answer, and Wolfram|Alpha gets it exactly right, which is impressive considering that somewhere inside Wolfram|Alpha is Mathematica crunching data. The hype around Wolfram|Alpha is that it runs on a huge set of "curated data", so this got me wondering what sort of curated dataset knows who "The Boss" really is. To me, "curated" implies that someone has studied and evaluated each component item, and somehow I doubt that anyone at Wolfram has thought about the Boss question.

The Semantic Web community has been justifiably gushing about "Linked Data", and the linked datasets available are getting to be sizable. One of the biggest is DBpedia, a community effort to extract structured information from Wikipedia and make that information available on the Web. According to its "about" page, the dataset describes 2.6 million "things" and currently consists of 274 million RDF triples. It may well be that Wolfram|Alpha has consumed this dataset and entered facts about Bruce Springsteen into its "curated data" set. (The Wikimedia Foundation is listed as a reference on its "the boss" page.) If you look at Bruce's Wikipedia page, you'll see that "The Boss" is included as the "Alias" entry in the structured information block that appears when you pull up the "edit this page" tab, so the scenario seems plausible.
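
If you want to poke at DBpedia yourself, a quick way is its public SPARQL endpoint. Here is a rough Python sketch using the SPARQLWrapper library; the property http://dbpedia.org/property/alias is my guess at how the infobox "Alias" field gets exposed, so treat the details as assumptions rather than a recipe.

    # Ask DBpedia's public SPARQL endpoint which resources have "The Boss"
    # recorded as an alias. The alias property URI is an assumption about how
    # the Wikipedia infobox field is mapped into DBpedia.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
        SELECT DISTINCT ?person WHERE {
            ?person <http://dbpedia.org/property/alias> ?alias .
            FILTER regex(str(?alias), "The Boss", "i")
        } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)

    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["person"]["value"])   # with luck, .../resource/Bruce_Springsteen shows up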

Still, you have to wonder how any machine can consume lots of data and make good judgments about who is "The Boss". Wikipedia's "Boss" disambiguation page lists 74 different interpretations of "The Boss". Open Data's Uriburner has 1327 records for "the boss" (1676 triples, 1660 properties), but I can't find the Alias relationship to Bruce Springsteen. How can Wolfram|Alpha, or indeed any agent trying to make sense of the web of Linked Data, deal with this ever-increasing flood of data?

Two weeks ago, I had the good fortune to spend some time with Atanas Kiryakov, the CEO of Ontotext, a Bulgarian company that is a leading developer of core semantic technology. Their product OWLIM is claimed to be "the fastest and most scalable RDF database with OWL inference", and I don't doubt it, considering the depth of understanding that Mr. Kiryakov displayed. I'll write more about what I learned from him, but for the moment I'll just focus on a few bits about how semantic databases work.

The core of any RDF-based database is a triple store; this might be implemented as a single huge three-column table in conventional database management software. I'm not sure exactly what OWLIM does, but it can handle a billion triples without much fuss. When a new triple is added to the triple store, the semantic database also does "inference": it looks at all the data schemas related to the new triple and tries to infer the additional triples the new one implies. So if you were to add the triples ("I", "am a fan of", "my dog") and ("my dog", "is also known as", "the Boss"), a semantic database will store both, and depending on the knowledge model used, it might also add a triple for ("I", "am a fan of", "the Boss"). If the database has also consumed "is a fan of" data for millions of other people, then it might be able to figure out with a single query that Bruce Springsteen, with a million fans, is a better answer to the question "Who is known as 'the Boss'?" than your dog, who, though very friendly, has only one fan.
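
To make that inference step concrete, here is a toy sketch in Python using the rdflib library. It models "is also known as" as owl:sameAs and applies the one relevant rule by hand; the URIs and property names are made up for the example, and a real store like OWLIM would materialize these entailments automatically, at vastly larger scale.

    # Toy illustration of forward-chaining inference over a tiny triple store.
    # "is also known as" is modeled here as owl:sameAs; the single hand-coded
    # rule below (propagate a property across sameAs) stands in for the full
    # rule set a real reasoner would apply as triples are loaded.
    from rdflib import Graph, Namespace
    from rdflib.namespace import OWL

    EX = Namespace("http://example.org/")   # made-up namespace for this sketch
    g = Graph()

    g.add((EX.me,    EX.isAFanOf, EX.myDog))     # ("I", "am a fan of", "my dog")
    g.add((EX.myDog, OWL.sameAs,  EX.theBoss))   # ("my dog", "is also known as", "the Boss")

    # One inference pass: if (x, p, y) and (y, sameAs, z), then add (x, p, z).
    for x, p, y in list(g):
        for _, _, z in g.triples((y, OWL.sameAs, None)):
            g.add((x, p, z))

    print((EX.me, EX.isAFanOf, EX.theBoss) in g)  # True -- the inferred triple

Once triples like these have been materialized, ranking the candidates is just a SPARQL query that counts fans per "Boss" and sorts by the count.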

As you can imagine, a poorly designed data schema can result in an explosion of triples. For example, you would not want your knowledge model to support a property such as "likes the same music", because then the semantic database would have to add a triple for every pair of people who like the same music; if a million people liked Bruce Springsteen's music, you would need roughly a trillion triples to support the "likes the same music" property. So part of the answer to my question about how software agents can make sense of linked data floods is that they need well thought-out knowledge models. Perhaps that's what Wolfram means when they talk about "curated datasets".
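
The arithmetic behind that explosion is just pairwise counting; here is a quick sanity check of the numbers above (the million-fan figure is this post's hypothetical, not real data):

    # Pairwise "likes the same music" links among n fans: n * (n - 1) ordered pairs.
    n = 1_000_000          # hypothetical number of Springsteen fans
    print(n * (n - 1))     # 999999000000 -- roughly a trillion triples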
