Thursday, October 28, 2010

An Implication of Heavy-Tailed Distributions

What is the probability that a search engine user that issued one query will issue a second query in the same session? For example, the user could first search for "best cars to buy" and then follow up with a search for "Toyota vs Honda". Similarly, what is the probability that a search engine user that issued two queries will issue a third query in the same session? How do these two probabilities compare?

There are three possible answers to the last question posed above. The first probability could either be greater than, less than, or equal to the second probability. To answer this question, one needs to consider the distribution of issued queries in a given query session, which is known to follow a heavy-tailed distribution

One of the most prevalent phenomena in our society is abundance of heavy-tailed distributions. For those unfamiliar with the concept, a heavy-tailed distribution is simply one whose tail is not exponentially bounded. Heavy-tailed distributions arise basically everywhere. For example, the distribution of phone call activities is almost always heavy-tailed for any call network (e.g., the the number of people who made X calls per month, F(X), is heavy-tailed in a dataset I'm examining). The number of citations any research paper receives also follows a heavy-tailed distribution.

Coming back to search engine users, let P2 denote the probability of a user issuing a second query conditioned on already having issued one query, and let P3 denote the probability of a user issuing a third query conditioned on already having issued two queries. Then, assuming that query session length follows a heavy-tailed distribution, we can conclude that P3 > P2. In fact, any distribution which results in P3 being equal to or less than P2 must have an exponentially bounded tail.

What this means is that, the more queries a user issues, the more confident we can be that the user will continue issuing more queries in this particular search session. This certainly has implications on how we can think about automatically incorporating session-based memory into information services such as search engines. And in fact, if we can predict when the user is about to engage in a long information gathering session, then the search engine can potentially preemptively show the user a more diverse set of results (i.e., exploration) so that the user's feedback (e.g., which results she clicks on) can help the search engine produce more relevant results in down-stream queries.

This topic came up during a discussion with Paul Bennett.

No comments: