Thursday, March 25, 2010

Stream Sampling Problem

Assume you are processing a large stream of N distinct elements (e.g., strings, characters, numbers, etc). Your goal is to sample a set of K elements (without replacement) uniformly at random (such that all N choose K sets are sampled equally often). Devise a strategy that requires only one pass through the stream has space complexity that does not depend on N.

7 comments:

Daniel said...

Reservoir sampling.

Yisong said...

I guess these questions aren't that interesting if you already know the answer in advance =)

Nikos said...

Technically, you cannot do it without any dependence on n, even reservoir sampling needs to store the index of the current item which requires log(n) bits.

Yisong said...

Maybe I'm mistaken about what reservoir sampling is (I hadn't heard of the term before). But if the elements arrive in a stream, there isn't a need to keep an index.

Nikos said...

This is a pretty good description: http://blogs.msdn.com/spt/archive/2008/02/05/reservoir-sampling.aspx

Do you have a radically different solution in mind?

Yisong said...

No, that was exactly the solution I arrived at when I first considered the problem.

What I meant to say was that you don't need to store the index of element if you just store the element itself. After all, storing the index implies you'd want to reprocess the stream in order to make use of your sample.

Of course, now that I think about it, the storage size of each individual element is at least log(N) since the problem statement said N unique items. So I guess you're right.

John Carrino said...

Also, you are keeping track of the position j. You need lg(N) bits to represent j. I love this question and use it for interviews. I ask a couple streaming algorithms questions.

Finding one random element uses constant space in the same way that hashtables are constant time.