drill-user mailing list archives

From John Omernik <j...@omernik.com>
Subject Re: Grabbing Random Sample of rows based on a column
Date Mon, 28 Nov 2016 19:19:18 GMT
This is less about a random sample of one column and more about sampling
at random based on the values in a column, but returning the whole row...

On Mon, Nov 28, 2016 at 11:45 AM, Ted Dunning <ted.dunning@gmail.com> wrote:

> The answer is probably yes. (Get it?)
> If you just want a random sample of one column, try random() < p as a
> qualifier in the where clause.
> If you want samples where the likelihood varies with the value of a
> column, the answer is slightly more elaborate.  For instance, suppose you
> want about a thousand samples from each city in the data. This means that
> you should have p=1 for all cities with fewer than a thousand samples in
> total, and p=1000/n otherwise, where n is the number of samples for the
> current city. So what you want is a two-pass query that counts the rows per
> city and then uses these counts to get probabilities. I am not up for typing
> that on a phone, but it should be straightforward.
> This same task can be done in a single pass by using what is called
> reservoir sampling. You can use two levels of reservoir sampling with a
> counter to bias the results, but that will require a user-defined aggregator
> that can work on two levels, and I don't think that is possible/easy yet
> with Drill.
> Sent from my iPhone
> > On Nov 28, 2016, at 8:27, John Omernik <john@omernik.com> wrote:
> >
> > Is there a way to grab a random return of data from Drill?
> >
> > For example, let's say I have a table with 1 billion rows, and I want to
> > return 100,000 at random based on a sampling of a specific column... is
> > that possible?
> >
> > Thanks
> >
> > John
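
For anyone finding this thread later, here is a rough Python sketch of the two approaches Ted describes in the quoted message, written outside Drill to show the logic only. The function names, the `key` and `target` parameters, and the dict-shaped rows are all made up for illustration. The first function is the two-pass idea (count rows per city, then keep each row with probability min(1, target/n)). The second is a plain one-reservoir-per-key version of reservoir sampling, which is simpler than the two-level scheme mentioned above and assumes one reservoir per distinct key fits in memory.

```python
import random
from collections import Counter, defaultdict


def two_pass_sample(rows, key, target, rng):
    """Two passes over `rows` (must be re-iterable, e.g. a list).

    Pass 1 counts rows per key value; pass 2 keeps each row with
    probability min(1, target / n), so each group yields roughly
    `target` rows, or all of its rows if it has fewer than that.
    """
    counts = Counter(row[key] for row in rows)            # pass 1
    return [
        row for row in rows                               # pass 2
        if rng.random() < min(1.0, target / counts[row[key]])
    ]


def reservoir_sample_by_key(rows, key, k, rng):
    """Single pass: keep a uniform sample of at most k rows per key.

    Uses the classic reservoir scheme (Algorithm R) independently for
    each key value. Returns a dict mapping key value -> list of rows.
    """
    seen = defaultdict(int)         # rows seen so far per key value
    reservoirs = defaultdict(list)  # current sample per key value
    for row in rows:
        group = row[key]
        seen[group] += 1
        reservoir = reservoirs[group]
        if len(reservoir) < k:
            reservoir.append(row)
        else:
            # Replace a random slot with probability k / seen[group].
            j = rng.randrange(seen[group])
            if j < k:
                reservoir[j] = row
    return dict(reservoirs)


if __name__ == "__main__":
    # Hypothetical data: whole rows are returned, not just the key column.
    data = [{"city": "A", "id": i} for i in range(5000)]
    data += [{"city": "B", "id": i} for i in range(50)]
    rng = random.Random(42)
    print(len(two_pass_sample(data, "city", 1000, rng)))
    print({c: len(r) for c, r in
           reservoir_sample_by_key(data, "city", 1000, rng).items()})
```

The two-pass version gives an approximate sample size per group (it varies from run to run), while the reservoir version gives exactly k rows for any group that has at least k.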
