drill-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Grabbing Random Sample of rows based on a column
Date Mon, 28 Nov 2016 20:58:48 GMT
What does "grab random from that column" mean?

Does it mean using the value in that column to determine the probability of picking the row?

How bad is it to make two passes through the data?
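
For concreteness, here is a rough sketch of the two-pass idea from the quoted message below, written in plain Python rather than as a Drill query. The names two_pass_sample, rows, key and target are made up for illustration, and the sketch assumes the rows fit in a list that can be scanned twice.

```python
import random
from collections import Counter


def two_pass_sample(rows, key, target=1000, rng=None):
    """Keep roughly `target` rows per distinct value of `key`.

    `rows` must be re-iterable (for example a list of dicts), because
    the data is scanned twice.
    """
    rng = rng or random.Random()

    # Pass 1: count how many rows each key value has.
    counts = Counter(row[key] for row in rows)

    # Pass 2: keep each row with probability p = min(1, target / n),
    # where n is the row count for that row's key value.
    return [
        row
        for row in rows
        if rng.random() < min(1.0, target / counts[row[key]])
    ]
```

Groups with fewer than `target` rows are kept whole (p = 1); larger groups come back with about `target` rows each, not exactly `target`.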



On Mon, Nov 28, 2016 at 11:19 AM, John Omernik <john@omernik.com> wrote:

> This is less about a random sampling of one column, more about based on a
> column, grab random from that column, but return the whole row...
>
> On Mon, Nov 28, 2016 at 11:45 AM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
>
> >
> > The answer is probably yes. (Get it?)
> >
> > If you just want a random sample of one column, try random() < p as a
> > qualifier in the where clause.
> >
> > If you want samples where the likelihood varies with the value of a
> > column, the answer is slightly more elaborate.  For instance, suppose you
> > want about a thousand samples from each city in the data. This means that
> > you should have p=1 for all cities with fewer than a thousand rows and
> > p=1000/n otherwise, where n is the number of rows for the current city.
> > So what you want is a two-pass query that counts the rows per city and
> > then uses these counts to get probabilities. I am not up for typing
> > that on a phone, but it should be straightforward.
> >
> > This same task can be done in a single pass by using what is called
> > reservoir sampling. You can use two levels of reservoir sampling with a
> > counter to bias the results, but that will require a user-defined
> > aggregator that can work on two levels, and I don't think that is
> > possible/easy yet with Drill.
> >
> > Sent from my iPhone
> >
> > > On Nov 28, 2016, at 8:27, John Omernik <john@omernik.com> wrote:
> > >
> > > Is there a way to grab a random return of data from Drill?
> > >
> > > For example, let's say I have a table with 1 billion rows, and I want
> > > to return 100,000 at random based on a sampling of a specific column...
> > > is that possible?
> > >
> > > Thanks
> > >
> > > John
> >
>
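
As a rough illustration of the single-pass alternative mentioned in the thread, the sketch below keeps one reservoir per key value (classic "Algorithm R"). It is a simplification, not the two-level scheme described above, and it is plain Python rather than a Drill aggregator. The names reservoir_sample_by_key, rows, key and k are hypothetical.

```python
import random


def reservoir_sample_by_key(rows, key, k=1000, rng=None):
    """Return {key value: up to k uniformly sampled rows} in one pass."""
    rng = rng or random.Random()
    reservoirs = {}  # key value -> list of at most k rows
    seen = {}        # key value -> number of rows seen so far

    for row in rows:
        group = row[key]
        n = seen.get(group, 0) + 1
        seen[group] = n

        reservoir = reservoirs.setdefault(group, [])
        if len(reservoir) < k:
            # The first k rows of each group always go in.
            reservoir.append(row)
        else:
            # The n-th row replaces a random slot with probability k / n.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = row

    return reservoirs
```

Unlike the two-pass version, this returns exactly k rows for every group that has at least k rows, at the cost of holding k rows per distinct key value in memory.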
