Thursday, January 13, 2005

Google Still Doesn't Trust Metadata

Semantic Web Ontologies: What Works and What Doesn't "A friend of mine just asked can I send him all the URLs on the web that have dot-RDF, dot-OWL, and a couple other extensions on them; he couldn't find them all. I looked, and it turns out there's only around 200,000 of them. That's about 0.005% of the web. We've got a ways to go."

A lot of RDF is serialized as XML or N3 and lives in files ending in dot-XML, dot-N3, dot-RSS, dot-ZIP, dot-GZ, etc., so a count by file extension will miss it. A lot of this data is going to be fairly invisible.
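As a rough illustration of why counting extensions undercounts, here is a minimal sketch that looks at the content instead of the file name. It assumes the payload is already in hand as bytes, and the helper name `looks_like_rdf` is made up for this post. It is only a heuristic: it checks for the RDF namespace URI, so N3 that never declares that prefix would still slip through, and ZIP archives are not handled at all.

```python
import gzip
import zlib

# The RDF syntax namespace URI, which RDF/XML and RSS 1.0 documents declare.
RDF_NS = b"http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def looks_like_rdf(data: bytes) -> bool:
    """Heuristic guess: True if the payload mentions the RDF namespace."""
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        try:
            data = gzip.decompress(data)
        except (OSError, EOFError, zlib.error):
            return False  # damaged archive, nothing to inspect
    return RDF_NS in data
```

A file called feed.xml or data.gz would pass this check if it carries RDF, even though neither name ends in dot-RDF or dot-OWL.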

"The best place where ontologies will work is when you have an oligarchy of consumers who can force the providers to play the game. Something like the auto parts industry, where the auto manufacturers can get together and say, 'Everybody who wants to sell to us do this.' They can do that because there's only a couple of them. In other industries, if there's one major player, then they don't want to play the game because they don't want everybody else to catch up. And if there's too many minor players, then it's hard for them to get together."

P2P searching shows that you don't need a rich, top-level ontology to find songs by Britney Spears, Birtney Speares, or however you spell it. To find a good-quality song you need a bit of metadata. To find a specific performance of a song you need even better metadata. This is how it can grow in a bottom-up manner: RDF and OWL allow you to grow out. If you don't need to change the code to support new metadata as your ontology grows, that is a positive thing for users and developers.
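To make the "no code changes" point concrete, here is a minimal sketch of a store that keeps metadata as subject/predicate/value statements, loosely in the spirit of RDF triples. It is not an RDF library; the `TripleStore` class, the track identifiers, and the property names are all made up for illustration.

```python
from collections import defaultdict


class TripleStore:
    """Toy store of (subject, predicate, value) statements."""

    def __init__(self):
        self._by_subject = defaultdict(list)

    def add(self, subject, predicate, value):
        # Any predicate is accepted; there is no fixed schema to update.
        self._by_subject[subject].append((predicate, value))

    def find(self, **criteria):
        """Return subjects that have every requested predicate=value pair."""
        return sorted(
            subject
            for subject, pairs in self._by_subject.items()
            if all((p, v) in pairs for p, v in criteria.items())
        )


store = TripleStore()

# Start with the bare minimum of metadata.
store.add("track1", "artist", "Britney Spears")
store.add("track2", "artist", "Britney Spears")

# Later, richer metadata appears; the class above does not change.
store.add("track2", "bitrate_kbps", 320)
store.add("track2", "performance", "live")

print(store.find(artist="Britney Spears"))
print(store.find(artist="Britney Spears", performance="live"))
```

The first query works with the thin metadata alone. The second only became possible once someone bothered to describe the performance, and nothing in the code had to be touched for that to happen.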

"So there's a problem of spelling correction; there's a problem of transliteration from another alphabet such as Arabic into a Roman alphabet; there's a problem of abbreviations, HP versus Hewlett Packard versus Hewlett-Packard, and so on. And there's a problem with identical names: Michael Jordan the basketball player, the CEO, and the Berkeley professor."

HP/Hewlett Packard/Hewlett-Packard all cluster together statistically. The technology for this is sufficiently sophisticated; it's not very different from deciding what's spam and what's not. Or a human can do it, or most likely a combination of the two. Being able to tell which Michael Jordan you're talking about is a problem that is solved by metadata.
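The two halves of that argument can be sketched in a few lines. This is not statistical clustering: it swaps in a hand-written alias table plus plain string similarity from Python's standard `difflib`, which is enough to show the idea. The `ALIASES` table, the 0.8 threshold, and the `role` field are assumptions made for this example.

```python
import re
from difflib import SequenceMatcher

# Hand-maintained abbreviation table (an assumption; a real system
# might learn these from co-occurrence statistics instead).
ALIASES = {"hp": "hewlett packard"}


def normalize(name):
    """Lowercase, collapse punctuation to spaces, expand known aliases."""
    key = re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()
    return ALIASES.get(key, key)


def similar(a, b, threshold=0.8):
    """True if two names are close enough to be treated as one."""
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold


# Identical names are a different problem: only metadata separates them.
PEOPLE = [
    {"name": "Michael Jordan", "role": "basketball player"},
    {"name": "Michael Jordan", "role": "CEO"},
    {"name": "Michael Jordan", "role": "Berkeley professor"},
]


def lookup(name, **metadata):
    """Return people matching the name and every given metadata field."""
    return [
        p for p in PEOPLE
        if normalize(p["name"]) == normalize(name)
        and all(p.get(k) == v for k, v in metadata.items())
    ]
```

With this, `similar("HP", "Hewlett-Packard")` and `similar("Britney Spears", "Birtney Speares")` both come out true, while `lookup("Michael Jordan")` returns all three people until a `role` is supplied to pick one.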

"What this indicates is, one, we've got a lot of work to do to deal with this kind of thing, but also you can't trust the metadata. You can't trust what people are going to say. In general, search engines have turned away from metadata, and they try to hone in more on what's exactly perceivable to the user."

I can trust my metadata and I might trust yours. People may try to cheat Google's ranking algorithms, but they won't cheat themselves. Where people care about their own metadata and have to rely on it, it will improve over time.
