Saturday, February 12, 2011

Having search engines learn from usage data is good for everyone.

As many already know by now, Google and Bing have been going back and forth regarding what Google perceives as cheating by Bing (see articles here, here, here, here, and here).

The situation can be summarized as follows.

  • Over the past few months, Google has discovered that many of Bing's search results have become disconcertingly similar to Google's.

  • Some Internet Explorer or toolbar users send usage data to Microsoft. This includes what search results they clicked when searching using Google. Google thinks perhaps Microsoft has been using this data to augment Bing's search results.

  • Google then conducted an experiment whereby some rare queries had bogus results manually injected into the search results. A number of Google employees then issued these searches and clicked on the bogus results while using Microsoft Internet Explorer.

  • After a short waiting period, a small fraction of these bogus results appeared in Bing's search results as well. This suggests that, when other relevance signals are weak or non-existent, the only active signal in Bing's ranker is whether or not users clicked on the search result on the Google results page.

  • As a result, Google is accusing Microsoft of cheating by copying Google's search results.
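To see why planted results would surface only for rare queries, here is a toy sketch in Python. It is not Bing's actual ranker, whose details are not public; the linear scoring function, the `click_weight` value, and all URLs and numbers are hypothetical and chosen only for illustration.

```python
def toy_score(content_signals, cross_engine_clicks, click_weight=0.1):
    """Hypothetical score: content-based signals plus a small weight on observed clicks."""
    return sum(content_signals) + click_weight * cross_engine_clicks


def rank(candidates):
    """Sort candidate results from highest to lowest toy score."""
    return sorted(
        candidates,
        key=lambda c: toy_score(c["signals"], c["clicks"]),
        reverse=True,
    )


# A common query: content signals are strong, so clicks only nudge the order.
common = [
    {"url": "a.example", "signals": [2.0, 1.5], "clicks": 3},
    {"url": "b.example", "signals": [0.5, 0.2], "clicks": 20},
]

# A rare query: every content signal is zero, so the click term decides alone.
rare = [
    {"url": "unrelated.example", "signals": [0.0, 0.0], "clicks": 0},
    {"url": "planted.example", "signals": [0.0, 0.0], "clicks": 5},
]

print([c["url"] for c in rank(common)])
print([c["url"] for c in rank(rare)])
```

In this toy model the heavily clicked page still loses on the common query, while on the rare query a handful of clicks is enough to put the planted page first.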

    My response: of course Bing uses that kind of data. How could they not use such a valuable resource? I actually find it surprising that Google didn't expect Microsoft to be doing this already.

    Numerous studies in the literature have provided us with overwhelming evidence that click data is one of the most useful signals for improving search quality. If I worked at Bing, I would be pushing to make use of such data.

    Many other search companies use click data on third-party search results as well, for example Surf Canyon. Surf Canyon has an installable add-on that can dynamically re-rank search results on Google, Bing, Yahoo!, and Craigslist. This re-ranker is, of course, trained in part using click data harvested via their toolbar from users issuing queries on other search engines. Surf Canyon also has a native search engine, which I expect is also optimized using click data on Google's search results gathered from their own toolbars. That is basically the same thing Bing is doing. Now, none of these other search companies comes close to the size of Bing, so maybe Google just didn't care or notice until Bing started doing this in a more obvious way.

    I personally think that I should own my search logs. If I am allowed to share my usage data with any company of my choosing, then I think that's a win for everyone (except perhaps for the company currently holding a monopoly over the usage data, of course). As mentioned elsewhere, this would lower the barrier for competition and innovation.

    In reality, clicks on search results make up only a small part of the equation. Suppose a Google Chrome user is sending usage data to Google. Google sees a log where the user

    1) issued a query on Google
    2) clicked on a search result
    3) immediately issued the same query on Bing
    4) clicked on a search result
    5) browsed around on the landing website for 15 minutes

    Would Google ignore this entry simply because it contains a Bing query? I think the existence of the Bing query is almost beside the point in this case, because harvesting just clicks on search results does not tell the whole story. Usage data also includes the actions users take after leaving the search results page, which can be just as valuable (see example study). Leveraging usage data of all varieties is the future, and it benefits everyone.
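As a minimal sketch of what "actions after leaving the search results page" could look like as a signal, the snippet below computes dwell time per click from a session log shaped like the five steps above. The `Event` record, its field names, the timestamps, and the URLs are all assumptions made for illustration; real usage logs are far messier, and this is not any search engine's actual pipeline.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Event:
    t: int                        # seconds since the session started (hypothetical)
    kind: str                     # "query", "click", or "leave"
    engine: Optional[str] = None  # which search engine the event happened on
    detail: str = ""              # query text or clicked URL


def dwell_times(events: List[Event]) -> List[Tuple[Optional[str], str, int]]:
    """For each click, report the time until the next logged event.

    A click that is the last event in the log has no following timestamp
    and is skipped.
    """
    out = []
    for cur, nxt in zip(events, events[1:]):
        if cur.kind == "click":
            out.append((cur.engine, cur.detail, nxt.t - cur.t))
    return out


# The five-step session from the text, with made-up timestamps.
session = [
    Event(0, "query", "google", "some query"),
    Event(5, "click", "google", "first.example"),
    Event(12, "query", "bing", "some query"),
    Event(18, "click", "bing", "second.example"),
    Event(18 + 15 * 60, "leave"),
]

for engine, url, seconds in dwell_times(session):
    print(engine, url, seconds)
```

Under these assumed timestamps, the first click is abandoned after 7 seconds while the second holds the user for 900 seconds. That contrast is informative about both results regardless of which engine served them.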

    Mathieu said...

    A potential problem with learning from a competitor's results is that it can lead to "circular" learning. Let's say that Google learns from clicks on Bing results and Bing learns from clicks on Google results. This seems potentially dangerous to me, as it means that Google and Bing don't really learn anything new; they will just brush up on what they already know.

    Yisong Yue said...

    That is definitely a concern, but I think we're still quite far from that point. Furthermore, if we start to integrate other forms of usage data (such as general browsing behavior), then we can incorporate signals far beyond just what results show up on Google's and Bing's results pages.

    Yisong Yue said...

    I should also add that the World Wide Web is not a static object. New content is being added all the time, which also defuses some concerns regarding "circular" learning.