So, Bing have published their response to Google’s accusation that they copied Google’s results. I don’t think Bing were as clear in their response as they could have been.
Today I’m going to write the post that I think Bing should have written. To be clear – I have absolutely no relationship with either of these companies (beyond using gmail, analytics and adsense), I’m just putting this out here from my view of the situation. After this I’ll get back to blogging about cool web stuff, and stop ruining my chances of ever working for Google…
In the beginning, search engines used the text on a page to determine whether that page was relevant to a query. This works reasonably well, especially when you weight how often a keyword appears within the page (its relative keyword density) against how rare that keyword is across all the other documents in your collection. This method of information retrieval is known in academia as TFxIDF (term frequency × inverse document frequency).
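Purely as an illustration of the idea (a minimal sketch of my own; real engines are far more sophisticated about weighting and normalisation):

    import math
    from collections import Counter

    def tf_idf_score(query_terms, page_tokens, collection):
        """Score one page for a query: term frequency within the page,
        weighted by how rare each term is across the whole collection."""
        counts = Counter(page_tokens)
        n_docs = len(collection)
        score = 0.0
        for term in query_terms:
            tf = counts[term] / len(page_tokens)                # keyword density within the page
            df = sum(1 for doc in collection if term in doc)    # documents containing the term
            idf = math.log((1 + n_docs) / (1 + df))             # rare terms count for more
            score += tf * idf
        return score

    docs = [["cheap", "flights", "london"], ["london", "weather"], ["cheap", "hotels"]]
    print(tf_idf_score(["cheap", "flights"], docs[0], docs))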
The trouble with TFxIDF-based search is that it was never designed to cope with people deliberately manipulating the search results. By stuffing your document with carefully chosen keywords, you can pretty easily game this ‘bag of words’ approach to search.
In 1998, Larry Page and Sergey Brin demonstrated the power of using the web’s link structure (how many pages link to a page, and how important those linking pages themselves are) as a signal of that page’s quality. They termed this ‘PageRank’ in their seminal paper, “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, and it formed the basis of their search engine, Google.
A great side-effect of this kind of link-graph based approach is that it takes the measure of how relevant a page is out of the control of the author of that page – your rank is determined by how many other people link to you, which is much harder to game (though not impossible).
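For intuition, here’s a toy version of the idea, a simplified power-iteration sketch rather than anything resembling a production implementation:

    def pagerank(links, damping=0.85, iterations=50):
        """Tiny PageRank sketch: `links` maps each page to the pages it links to.
        A page's rank is fed by the ranks of the pages that link to it."""
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
            for page, outlinks in links.items():
                if not outlinks:
                    continue
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] = new_rank.get(target, 0.0) + share
            rank = new_rank
        return rank

    # A tiny toy web: the page everyone links to ends up with the highest rank.
    print(pagerank({"a": ["c"], "b": ["c"], "c": ["a"]}))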
In the years following the rise of Google, every other search company adopted some form of PageRank, and the quality of search results across the board improved dramatically.
Today, you could build a reasonable enough search engine by just indexing lots of pages, calculating the link graph between them and combining the TFxIDF measure with data about the inbound links.
But to really make a search engine that’s head-and-shoulders above the competition, you have to start using other signals. For example, Google recently announced that they were starting to look at the speed of a page as a signal of quality. Bing use thousands of signals: inbound links, number of tweets, sentiment analysis, link-sharing data and so on.
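In the simplest terms, ranking then becomes a weighted blend of many normalised signals. Here’s a deliberately naive sketch of that idea; every signal name and weight below is invented by me for illustration, and bears no relation to what any real engine does:

    def combined_score(signals, weights):
        """Blend several normalised ranking signals into one score."""
        return sum(weights.get(name, 0.0) * value for name, value in signals.items())

    page_signals = {
        "tf_idf": 0.62,         # text relevance
        "inbound_links": 0.40,  # link-graph score
        "tweets": 0.10,
        "page_speed": 0.85,
        "clickstream": 0.05,
    }
    weights = {"tf_idf": 0.4, "inbound_links": 0.3, "tweets": 0.05,
               "page_speed": 0.05, "clickstream": 0.2}
    print(combined_score(page_signals, weights))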
Bing’s motivation was: “We want to show good quality links on Bing. How do people express the quality of a link? By clicking on them!”. So for people who opted in to the Bing toolbar, Microsoft started to collect information on which links they clicked. This info was anonymised and sent to MS encrypted, so no-one could spy on their users. In this way an association between pages was built up: not just the fact that a page links to 5 other pages, but also which of those 5 pages are more frequently clicked on. This gives Bing a relationship between the page a link appears on and the page the link points to.

Bing did this for every site, whether the link appears on a blog, a search engine or a shopping site. Bing collected info on what links people were clicking, and used that as one of the signals to help determine page quality, the theory being that quality pages get more clicks. What’s also great about this is that, like PageRank, it takes control over search ranking away from the author: your rank is decided by other web users, so in theory it means less spam and a more relevant, higher quality search engine.
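I obviously have no idea what Bing’s real pipeline looks like, but the general shape of such a click signal might be something like this (all names and structures here are hypothetical):

    from collections import defaultdict

    # Hypothetical sketch: each anonymised toolbar report says "on page X, the
    # user clicked the link pointing to page Y". Counting those pairs gives a
    # click-weighted edge between the two pages.
    click_counts = defaultdict(int)

    def record_click(source_url, clicked_url):
        click_counts[(source_url, clicked_url)] += 1

    def click_signal(source_url, clicked_url):
        """Share of clicks this link gets among all links clicked on that page."""
        total = sum(count for (src, _), count in click_counts.items() if src == source_url)
        return click_counts[(source_url, clicked_url)] / total if total else 0.0

    record_click("http://example.com/blog", "http://example.com/a")
    record_click("http://example.com/blog", "http://example.com/a")
    record_click("http://example.com/blog", "http://example.com/b")
    print(click_signal("http://example.com/blog", "http://example.com/a"))  # 0.666...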
When Google set up their experiment, they created special pages on their site containing words that didn’t exist anywhere else on the web, then installed the Bing toolbar and clicked on links from those ‘synthetic’ pages. The Bing toolbar sent MS that data, just as it would for any page, and their system incorporated that clickstream data alongside the other signals as normal.
But that meant that when Google then went to Bing and searched for those unique words, none of the other signals had any input at all; the only data Bing had came from the clickstream signal, so that was what the system used. It’s not surprising that Bing returned the same pages Google did, because that was all the data that existed in the whole world about those queries! Bing didn’t collect that data because it came from Google; Bing collected it because it came from Bing toolbar users.
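To make that concrete with the kind of toy numbers used in the sketch above: for a made-up word that appears on no real page, every signal except the click data is zero, so whatever the clickstream says is, by definition, the whole ranking.

    # Toy numbers only: with a synthetic query, every signal except the
    # clickstream one is zero, so the click data alone decides the result.
    weights = {"tf_idf": 0.4, "inbound_links": 0.3, "tweets": 0.05,
               "page_speed": 0.05, "clickstream": 0.2}
    synthetic_page = {"tf_idf": 0.0, "inbound_links": 0.0, "tweets": 0.0,
                      "page_speed": 0.0, "clickstream": 0.7}
    score = sum(weights[name] * value for name, value in synthetic_page.items())
    print(score)  # 0.14, every bit of it from the click signal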
Microsoft wrote a neat little tool to gather clickstream data from any site, and Google shouted because it also worked on their site. If Google had spoken to Bing first instead of making all these accusations, I think the whole situation could have been calmed down. But even so, now that Bing are aware the clickstream data can be taken advantage of like this, they’ll be looking deeply into how best to stop it happening.
So that’s what I’d have liked to see Bing say in their reply to Google. The fact that they weren’t as clear-cut in their denial as they could have been suggests that my picture of events is not quite correct, or that Bing had other things they wanted to communicate in their message.
Ok, I think I’m done on this issue now – I just got tired of the misinformation in comment threads everywhere on this subject.
UPDATE: Oh boy. Matt Cutts has just written a fantastic blog post with some further evidence. I have to swallow my pride here and say: It really looks like Bing were indeed specifically targeting Google.
Matt points to a Microsoft research paper which contains damning evidence:
we “reverse-engineer” the parameters from the URLs of these [query formulation] sessions, and deduce how each search engine encodes both a query and the fact that a user arrived at a URL by clicking on the spelling suggestion of the query
The paper is available here: Learning Phrase-Based Spelling Error Models from Clickthrough Data.
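To give a rough sense of what “reverse-engineering the parameters from the URLs” means in practice, here’s a trivially simplified sketch; the paper itself goes much further, and the parameter name below is just the common ‘q’ convention, not anything taken from it:

    from urllib.parse import urlparse, parse_qs

    def extract_query(result_url):
        """Illustrative only: pull the search query out of a result-page URL
        by reading its query-string parameters (the common 'q' parameter)."""
        params = parse_qs(urlparse(result_url).query)
        return params.get("q", [None])[0]

    print(extract_query("https://www.google.com/search?q=example+query"))  # "example query"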
My apologies to Matt and Google. I think the data you initially presented wasn’t enough to support the conclusion you reached, and hence my blog post. But this revelation strongly points to Bing deliberately and specifically targeting Google result pages, which is very clearly copying from Google. Shame on you Bing, how could you have thought this was OK to do?
#1 by karma karl on February 3, 2011 - 4:59 pm
hey bro!
been reading your stuff this evening….very cool, very well informed and very well written!
keep up the good work
x
#2 by Pete Hamill on February 3, 2011 - 5:02 pm
Before jumping to conclusions and apologizing to the HollierThanYou behemoth, please read MarkB’s comment (February 3, 2011 at 4:48 pm) on
http://www.mattcutts.com/blog/google-bing/
I found it to be the clearest rebuttal of the whole misdirection story.
#3 by sb on February 3, 2011 - 8:27 pm
Just want to point out that this paper seems to be from MSR, which is basically academia. There is no reason to believe that the Bing product actually makes use of the paper’s result or techniques.
That being said, I think the biggest failure from MS is not any ‘copying’ – because I personally don’t see how parsing URLs to extract query terms is actually copying – it’s about being clear as to what gets transmitted in these so-called ‘improvement programs’ that you opt into when installing the Bing toolbar.
The biggest loser is not Google but the people who feel betrayed because they did something they never explicitly agreed to do (transmit their web-site preferences to Bing).
John Langford (from Yahoo) has an insightful blog post about this as well:
http://hunch.net/?p=1660
#4 by sb on February 3, 2011 - 8:39 pm
And I agree with Pete Hamill, MarkB’s comment is quite insightful.
Here’s my attempt at a direct link:
http://www.mattcutts.com/blog/google-bing/#comment-712677
#5 by Mythic Tech on February 27, 2011 - 12:23 am
This whole mess between the two of them is just crazy. Thanks for the information, it was a good read.
Trevor Seabrook
Mythic Tech