Organic SEO Blog


Thursday, February 10, 2005

Recent Interview with Google Director of Search Quality - Peter Norvig
Peter Norvig confirms that content is king and that metadata cannot be trusted

Advancements are being made to target cloaking, sneaky redirects, and sites with misleading meta tags. The closing paragraph of this interview is a must-read for any party involved in the marketing of a website.

Semantic Web Ontologies: What Works and What Doesn't
Google's director of search quality discusses the challenges of automation, knowledge, spam, and more.

Peter Norvig: (Mr. Norvig is director of search quality at Google.)
Here are the four leading challenges for organic search at Google.

First is a chicken-and-egg problem: How do we build this information? What's the point of building the tools unless you've got the information, and what's the point of putting the information in there unless you have the tools? A friend of mine just asked if I could send him all the URLs on the web that have dot-RDF, dot-OWL, and a couple of other extensions on them, because he couldn't find them all himself. I looked, and it turns out there are only around 200,000 of them. That's about 0.005% of the web. We've got a ways to go.
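To put those figures in perspective, here is a quick back-of-the-envelope check in Python (ours, not Norvig's); the 200,000 and 0.005% numbers come straight from the interview:

```python
# Back-of-the-envelope check: if ~200,000 URLs carry semantic-web
# extensions and that is ~0.005% of the web, the implied total is
# about 4 billion URLs, consistent with the 4.3 billion cited later.
semantic_urls = 200_000
share = 0.005 / 100  # 0.005% expressed as a fraction

print(f"Implied web size: {semantic_urls / share:,.0f} URLs")
# Implied web size: 4,000,000,000 URLs
```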

The next problem is competing ontologies. Everybody's got a different way to look at it. You have some tools to address it; we'll see how far that will scale. Then there's the Cyc problem, which is a problem of background knowledge, and the spam problem. That's something I have to face every day. As you get out of the lab and into the real world, there are people who have a monetary incentive to try to defeat you.

So, the chicken-and-egg problem. That's "What interesting information is in these kinds of semantic technologies, and where is the other information?" It turns out most of the interesting information is still in text, so what we concentrate on is how you get it out of text. Here's an example, a little demo called IO Knot. You can type a natural language question, and it pulls out documents from text and pulls out semantic entities. And you see, it's not quite perfect; it couldn't quite resolve the spelling problem. But this is all automated, so there's no work in putting this information into the right place.
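To make Norvig's point concrete, here is a minimal Python sketch (ours, not Google's, and nothing like the real IO Knot demo) of pulling candidate entities straight out of raw text with a crude capitalization heuristic:

```python
import re

def extract_entities(text):
    """Crude entity spotter: runs of capitalized words become
    candidate named entities. Real extractors use trained models,
    but the point is the same: the facts are sitting in plain text."""
    return re.findall(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*", text)

print(extract_entities("Peter Norvig is director of search quality at Google."))
# ['Peter Norvig', 'Google']
```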

In general, it seems like semantic technology is good for defining schemas, but then what goes into the schemas? There's a lot of work to get it there. Here's another example. This is a Google News page from last night, and what we've done here is apply clustering technology to put the news stories together in categories, so you see the top story there about Blair, and there are 658 related stories that we've clustered together. Now imagine what it would be like if, instead of using our algorithms, we relied on the news suppliers to put in all the right metadata and label their stories the way they wanted to: "Is my story a story that's going to be buried on page 20, or is it a top story? I'll put my metadata in. Are the people I'm talking about terrorists or freedom fighters? What's the definition of patriot? What's the definition of marriage?" Just defining these kinds of ontologies when you're talking about these kinds of political questions, rather than about part numbers, becomes a political statement.

People get killed over less than this. These are places where ontologies are not going to work; there are going to be arguments over them, and you've got to fall back on some other kinds of approaches. The best place where ontologies will work is when you have an oligarchy of consumers who can force the providers to play the game. Something like the auto parts industry, where the auto manufacturers can get together and say, "Everybody who wants to sell to us, do this." They can do that because there are only a couple of them. In other industries, if there's one major player, it doesn't want to play the game, because it doesn't want everybody else to catch up. And if there are too many minor players, then it's hard for them to get together.
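For illustration, the kind of story grouping Norvig describes can be sketched in a few lines of Python; this is our own toy version using bag-of-words cosine similarity and a greedy single pass, not Google News's actual algorithm:

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster(headlines, threshold=0.3):
    """Greedy single-pass clustering: each story joins the first
    cluster whose seed story it resembles, else starts a new one."""
    clusters = []  # list of (seed_vector, member_headlines)
    for headline in headlines:
        vec = Counter(headline.lower().split())
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(headline)
                break
        else:
            clusters.append((vec, [headline]))
    return [members for _, members in clusters]

stories = [
    "Blair wins vote on education reform",
    "Education reform vote: Blair prevails",
    "Oil prices rise on supply fears",
]
print(cluster(stories))
# [['Blair wins vote on education reform',
#   'Education reform vote: Blair prevails'],
#  ['Oil prices rise on supply fears']]
```

The grouping comes from the text itself, with no publisher-supplied metadata anywhere in the loop, which is exactly the contrast Norvig is drawing.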

Semantic technologies are good for breaking up information into chunks, but essentially you get down to the part that's in between the angle brackets. One of our founders, Sergey Brin, was quoted as saying, "Putting angle brackets around things is not a technology by itself." The problem is what goes into the angle brackets. You can say, "Well, my database has a person-name field, and your database has a first-name field and a last-name field, and we'll have a concatenation between them to match them up." But it doesn't always work that smoothly.
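Here is a small Python sketch of that naive field-concatenation approach and two of the ways it breaks down in practice (the examples are ours):

```python
def match_person(full_name, first, last):
    """Naive schema alignment: person_name == first + " " + last."""
    return full_name == f"{first} {last}"

print(match_person("John Smith", "John", "Smith"))   # True: the easy case
print(match_person("Smith, John", "John", "Smith"))  # False: "last, first" ordering
print(match_person("Ludwig van Beethoven", "Ludwig", "Beethoven"))
# False: the "van" particle lives in neither field
```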

Here's an example from a couple of days' worth of queries at Google, all of which we've spelling-corrected to one canonical form. It's one of our more popular queries, and there were something like 4,000 different spelling variations over the course of a week. Somebody's got to do that kind of canonicalization. So the problem of understanding content hasn't gone away; it's just been forced down to smaller pieces between angle brackets. There's a problem of spelling correction; there's a problem of transliteration from another alphabet, such as Arabic, into the Roman alphabet; there's a problem of abbreviations, HP versus Hewlett Packard versus Hewlett-Packard, and so on. And there's a problem of identical names: Michael Jordan the basketball player, the CEO, and the Berkeley professor.
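As an illustration of that canonicalization step, here is a minimal Python sketch (ours, not Google's production spelling corrector) that maps a misspelled query to its nearest canonical form by edit distance; the "britney spears" examples reflect the famous misspelling list Google has published:

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]

def canonicalize(query, canonical_forms, max_dist=2):
    """Map a query to the nearest canonical form, if close enough."""
    best = min(canonical_forms, key=lambda form: edit_distance(query, form))
    return best if edit_distance(query, best) <= max_dist else query

print(canonicalize("britney speers", ["britney spears"]))  # britney spears
print(canonicalize("britny spears", ["britney spears"]))   # britney spears
```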

And now we get to this problem of background knowledge. The Cyc project went about trying to define all the knowledge that was in a dictionary, a Dublin Core type of thing, and then found that what we need is the stuff that isn't in the dictionary or encyclopedia. Lenat and Guha said there's this vast storehouse of general knowledge that you rarely talk about, common-sense things like "Water flows downhill" and "Living things get diseases." I thought we could launch a big project to try to do this kind of thing. Then I decided to simplify a little: just put quote marks around it and type it in. So I typed "water flows downhill" and got 1,200 hits. The first hit says, "lesson plan by Emily, kindergarten teacher." It actually explains why water flows downhill, and it's the kind of thing you don't find in an encyclopedia. The conclusion here is that Lenat was 99.999993% right, because only 1,200 out of those 4.3 billion pages actually talked about water flowing downhill.

But that's enough, and you can go on from there. You can use the web to do voting: you type "water flows uphill" and that only happens 275 times, so downhill wins, 1,200 to 275. Essentially what we're doing here is using the power of masses of untrained people who you aren't paying to do all your work for you, as opposed to trying to get trained people to use a well-defined formalism and write text in that formalism; let's just use the stuff that's already out there. I'm all for this idea of harvesting this "unskilled labor" and trying to put it to use with statistical techniques over masses of data, filtering through it yourself, rather than trying to closely define it on your own.
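The voting idea fits in a few lines of Python; this is our sketch, and the hit counts are the ones Norvig quotes above:

```python
# "Voting with the web": the statement with more supporting pages wins.
hits = {
    "water flows downhill": 1200,
    "water flows uphill": 275,
}
winner = max(hits, key=hits.get)
print(f'"{winner}" wins, {max(hits.values())} to {min(hits.values())}')
# "water flows downhill" wins, 1200 to 275
```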

The last issue is the spam issue. When you're in the lab and you're defining your ontology, everything looks nice and neat. But then you unleash it on the world, and you find out how devious some people are. Here's an example; it looks like two pages, but it's actually one page. On the left is the page as Googlebot sees it, and on the right is the page as any other user agent sees it. This website, when it sees Googlebot come, serves up the page that it thinks will most convince us to match against it, and then when a regular user comes, it shows the page that it wants to show (CLOAKING).

What this indicates is, one, we've got a lot of work to do to deal with this kind of thing (CLOAKING), but also you CAN'T TRUST THE METADATA.

You can't trust what people are going to say.

In general, search engines have turned away from metadata, and they try to home in more on what's exactly perceivable to the user.

For the most part we throw away the meta tags, unless there's a good reason to believe them, because they tend to be more deceptive than they are helpful.

And the more there's a marketplace in which people can make money off of this deception, the more it's going to happen.

Humans are very good at detecting this kind of spam, and machines aren't necessarily that good.

So if more of the information flows between machines, this is something you're going to have to look out for more and more.

This text is excerpted from SDForum's Semantic Technologies Seminar, cohosted by www.AlwaysOn-network.com

--

Comments from Jack Roberts, Peak Positions, LLC

Peter Norvig is absolutely correct: metadata cannot be trusted.

It seems that refining page-text semantic matching systems and extending crawler capabilities to focus strictly on page text must become a reality.

Page text, or text content, is all that can be fully trusted.

Filtering and sorting search results solely on page text, and employing robot agents with anonymous IPs to detect cloaked URLs, would ensure significant improvements in search quality and results-page relevance.
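A crude version of such a robot agent can be sketched in Python: fetch the same URL with a bot-like and a browser-like User-Agent and compare the responses. This is our sketch, not Google's crawler, and the URL and user-agent strings are illustrative:

```python
import urllib.request

BOT_UA = "Googlebot/2.1 (+http://www.google.com/bot.html)"
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

def fetch(url, user_agent):
    """Fetch a URL presenting the given User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read()

def looks_cloaked(url):
    """Flag a URL whose content differs for a bot vs. a browser."""
    return fetch(url, BOT_UA) != fetch(url, BROWSER_UA)

# Example with a hypothetical URL:
# print(looks_cloaked("http://example.com/"))
```

Two caveats: dynamic pages can differ between any two fetches (timestamps, rotating ads), so a serious checker would compare extracted text rather than raw bytes; and many cloakers key on the requester's IP address rather than the User-Agent header, which is exactly why the anonymous IPs mentioned above matter.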

One can only hope that post-IPO Google will remain focused on the quality of its organic keyword search results. All external indications are that quality improvements are on the horizon at Google.

Jack Roberts
Vice President, Director of Client Services
Peak Positions, LLC
http://www.peakpositions.com


*Join Peak Positions, the Kansas City Star, and several senior-level search engineers from Google at our upcoming private Search Engine Optimization Seminar in Kansas City, Missouri (USA), Spring 2005.