drill-user mailing list archives

From John Omernik <j...@omernik.com>
Subject Re: Grabbing Random Sample of rows based on a column
Date Mon, 28 Nov 2016 21:35:10 GMT
So I may have data that is 20 columns wide. What I am looking for is: based
on a single column, pick a random sample from that column, but return the
whole row... I guess it doesn't matter much which column, random is
random; I just want the whole row, not just a single column.  Two
passes wouldn't be horrible, I was just trying to see how others approach
this problem.
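
For reference, the two-pass idea from the quoted thread (count rows per
column value, then keep each row with probability p = min(1, target/n))
can be sketched in plain Python. This is not Drill SQL, only an
illustration of the logic; the function name sample_rows_per_key, the
dict-per-row layout, and the "city" column are assumptions made up for
the example.

```python
import random
from collections import Counter


def sample_rows_per_key(rows, key, target, rng=None):
    """Two-pass sample: keep roughly `target` whole rows per value of `key`.

    `rows` must be re-iterable (e.g. a list), since it is scanned twice.
    """
    rng = rng or random.Random()

    # Pass 1: count how many rows carry each value of the chosen column.
    counts = Counter(row[key] for row in rows)

    # Pass 2: keep a row with probability p = min(1, target / n),
    # where n is the row count for that row's key value.
    sampled = []
    for row in rows:
        p = min(1.0, target / counts[row[key]])
        if rng.random() < p:
            sampled.append(row)  # the whole row, not just the key column
    return sampled


if __name__ == "__main__":
    # Hypothetical data: a small city and a large one.
    data = [{"city": "A", "v": i} for i in range(5)]
    data += [{"city": "B", "v": i} for i in range(10000)]
    picked = sample_rows_per_key(data, "city", 10)
    print(Counter(r["city"] for r in picked))
```

The per-key sample size is only approximate: each row is an independent
coin flip, so a key with more than `target` rows yields about `target`
rows, not exactly that many.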

On Mon, Nov 28, 2016 at 2:58 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> What does "grab random from that column" mean?
>
> Does it mean use that row to determine the probability of picking the row?
>
> How bad is it to make two passes through the data?
>
>
>
> On Mon, Nov 28, 2016 at 11:19 AM, John Omernik <john@omernik.com> wrote:
>
> > This is less about a random sampling of one column, more about based on a
> > column, grab random from that column, but return the whole row...
> >
> > On Mon, Nov 28, 2016 at 11:45 AM, Ted Dunning <ted.dunning@gmail.com>
> > wrote:
> >
> > >
> > > The answer is probably yes. (Get it?)
> > >
> > > If you just want a random sample of one column, try random() < p as a
> > > qualifier in the where clause.
> > >
> > > If you want samples where the likelihood varies with the value of a
> > > column, the answer is slightly more elaborate.  For instance, suppose
> > > you want about a thousand samples from each city in the data. This
> > > means that you should have p=1 for all cities where there are fewer
> > > than a thousand samples in total, and p=1000/n otherwise, where n is
> > > the number of samples for the current city. So what you want is a
> > > two-pass query that counts the rows per city and then uses these
> > > counts to get probabilities. I am not up for typing that on a phone,
> > > but it should be straightforward.
> > >
> > > This same task can be done in a single pass by using what is called
> > > reservoir sampling. You can use two levels of reservoir sampling with a
> > > counter to bias the results, but that will require a user-defined
> > > aggregator that can work on two levels, and I don't think that is
> > > possible/easy yet with Drill.
> > >
> > > Sent from my iPhone
> > >
> > > > On Nov 28, 2016, at 8:27, John Omernik <john@omernik.com> wrote:
> > > >
> > > > Is there a way to grab a random return of data from Drill?
> > > >
> > > > For example, let's say I have a table with 1 billion rows, and I
> > > > want to return 100,000 at random based on a sampling of a specific
> > > > column... is that possible?
> > > >
> > > > Thanks
> > > >
> > > > John
> > >
> >
>
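
The single-pass alternative mentioned in the quoted thread, reservoir
sampling, can be sketched as follows. This is a minimal plain-Python
illustration of the basic one-level version (uniform sample of k rows
from a stream of unknown length), not the two-level biased variant
described above and not a Drill aggregator; the name reservoir_sample is
an assumption for the example.

```python
import random


def reservoir_sample(rows, k, rng=None):
    """Return up to k rows chosen uniformly at random in one pass over `rows`."""
    rng = rng or random.Random()
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Row i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


if __name__ == "__main__":
    # Works on any iterable, including a generator that is read only once.
    print(reservoir_sample((n * n for n in range(1000)), 5))
```

Memory use is bounded by k rows regardless of input size, which is the
main attraction over the two-pass approach when a second scan is costly.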
