drill-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Grabbing Random Sample of rows based on a column
Date Mon, 28 Nov 2016 17:45:08 GMT

The answer is probably yes. (Get it?)

If you just want a plain random sample of the rows, try random() < p as a qualifier in the
WHERE clause, where p is the fraction of rows to keep (100,000 out of a billion rows would be
p = 0.0001).

If you want samples where the likelihood varies with the value of a column, the answer is
slightly more elaborate. For instance, suppose you want about a thousand samples from each
city in the data. This means that you should have p=1 for all cities that have fewer than a
thousand rows in total, and p=1000/n otherwise, where n is the number of rows for the current
city. In other words, p = min(1, 1000/n). So what you want is a two-pass query that counts the
rows for each city and then uses these counts to get probabilities. I am not up for typing that
on a phone, but it should be straightforward.
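
To make the two-pass idea concrete, here is a rough sketch in Python rather than a Drill query. The function name two_pass_sample, the rows-as-dicts layout and the "city" key are all assumptions made up for the illustration. The first pass counts rows per city, and the second keeps each row with probability min(1, target/n).

```python
import random
from collections import Counter


def two_pass_sample(rows, key, target=1000, rng=None):
    """Keep roughly `target` rows per distinct value of `key`.

    `rows` must be re-iterable (for example a list), because it is read twice.
    The name and signature are hypothetical, chosen only for this sketch.
    """
    rng = rng or random.Random()

    # Pass 1: count how many rows each group (e.g. each city) has.
    counts = Counter(row[key] for row in rows)

    # Pass 2: keep each row with probability p = min(1, target / n).
    sample = []
    for row in rows:
        p = min(1.0, target / counts[row[key]])
        # rng.random() is in [0, 1), so p == 1.0 always keeps the row.
        if rng.random() < p:
            sample.append(row)
    return sample
```

Note that this gives about a thousand rows per large city, not exactly a thousand, since each row is kept or dropped independently.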

This same task can be done in a single pass by using what is called reservoir sampling. You
can use two levels of reservoir sampling with a counter to bias the results, but that will
require a user-defined aggregator that can work on two levels, and I don't think that is
possible/easy yet with Drill.
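
For reference, here is a minimal Python sketch of single-pass reservoir sampling with one reservoir per city. It is a simplification that runs outside Drill, not the two-level aggregator described above, and the name reservoir_sample_by_key, the rows-as-dicts layout and the parameter k are assumptions for the illustration. Each reservoir is maintained with the classic "Algorithm R" replacement rule.

```python
import random
from collections import defaultdict


def reservoir_sample_by_key(rows, key, k=1000, rng=None):
    """Single pass: keep a uniform sample of at most k rows per value of `key`.

    `rows` can be any iterable, including a one-shot stream.
    The name and signature are hypothetical, chosen only for this sketch.
    """
    rng = rng or random.Random()
    reservoirs = defaultdict(list)  # group value -> sampled rows
    seen = defaultdict(int)         # group value -> rows seen so far

    for row in rows:
        group = row[key]
        seen[group] += 1
        bucket = reservoirs[group]
        if len(bucket) < k:
            # The first k rows of a group always go in.
            bucket.append(row)
        else:
            # Afterwards the i-th row replaces a random slot with probability k/i.
            j = rng.randrange(seen[group])
            if j < k:
                bucket[j] = row
    return dict(reservoirs)
```

Unlike the random() < p filter, this returns exactly min(n, k) rows for each city, at the cost of holding up to k rows per city in memory.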

Sent from my iPhone

> On Nov 28, 2016, at 8:27, John Omernik <john@omernik.com> wrote:
> Is there a way to grab a random return of data from Drill?
> For example, let's say I have a table with 1 billion rows, and I want to
> return 100,000 at random based on a sampling of a specific column... is
> that possible?
> Thanks
> John
