drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Grabbing Random Sample of rows based on a column
Date Mon, 28 Nov 2016 22:56:06 GMT
OK. Suppose that you have a column called random_key in the main table t1.
Suppose that you have a second table t2 with two columns, random_key and p.

This should do your job:

select * from t1 join t2 using random_key where random() < t2.p;

This will give you an extra column p (relative to t1). If we had
expressions like t1.* we could limit that.

Clearly t2 doesn't have to be pre-defined.

In fact, here is a query that should work to sample about 100 of different
ranges of b

with t1 as  (select round(b) as k, random() z, b from maprfs.ted.`xyz/*`) ,
     t2 as  (select k, count(k) cnt from t1 group by k),
     t3 as  (select k, case when cnt < 100 then 1.0 else 100.0/cnt end p
from t2),
     t4 as  (select t1.z as z, t1.b as b, t1.k as k, t3.p as p from t1 join
t3 on (t1.k = t3.k))
select * from t4 where z < p;

xyz has lots of data in it:

0: jdbc:drill:> with t1 as  (select round(b) as k, rand() z, b from
maprfs.ted.`xyz/*`)
. . . . . . . > select k, count(k) cnt from t1 group by k;
+-------+---------+
|   k   |   cnt   |
+-------+---------+
| 1.0   | 306696  |
| 2.0   | 26540   |
| -2.0  | 188484  |
| -3.0  | 9852    |
| -4.0  | 8       |
| 3.0   | 156     |
| 0.0   | 841312  |
| -1.0  | 724104  |
+-------+---------+

Here is an example of how well this query does. Note that there is no way
to get more -4's since there are only 8 in the original data.

0: jdbc:drill:> with t1 as  (select round(b) as k, random() z, b from
maprfs.ted.`xyz/*`) ,
. . . . . . . >      t2 as  (select k, count(k) cnt from t1 group by k),
. . . . . . . >      t3 as  (select k, case when cnt < 10 then 1.0 else
100.0/cnt end p from t2),
. . . . . . . >      t4 as  (select t1.z as z, t1.b as b, t1.k as k, t3.p
as p from t1 join t3 on (t1.k = t3.k))
. . . . . . . > select k,count(1) from t4 where z <= p group by k order by k
. . . . . . . > ;
+-------+---------+
|   k   | EXPR$1  |
+-------+---------+
| -4.0  | 8       |
| -3.0  | 107     |
| -2.0  | 91      |
| -1.0  | 104     |
| 0.0   | 93      |
| 1.0   | 110     |
| 2.0   | 105     |
| 3.0   | 92      |
+-------+---------+
8 rows selected (2.169 seconds)


Unfortunately, there is a bug in the rand() function (this is a very old
bug). Use random() instead.

0: jdbc:drill:> select round(b) as k, rand() bad, random() z, b from
maprfs.ted.`xyz/*` limit 10;
+-------+----------------------+----------------------+----------------------+
|   k   |         bad          |          z           |          b
  |
+-------+----------------------+----------------------+----------------------+
| 1.0   | 0.46045930988550243  | 0.5743224359285806   | 0.9330797837593181
  |
| 0.0   | 0.46045930988550243  | 0.314046977888911    | 0.2428613780673916
  |
| 1.0   | 0.46045930988550243  | 0.21483988823160682  | 0.6581640175620538
  |
| 1.0   | 0.46045930988550243  | 0.2660185130331417   | 0.5177771442646559
  |
| 1.0   | 0.46045930988550243  | 0.31355327439484393  | 1.1932869828247252
  |
| 1.0   | 0.46045930988550243  | 0.8150798519837753   | 0.5030685771327987
  |
| 1.0   | 0.46045930988550243  | 0.08149861547426307  | 0.9183712166274609
  |
| 1.0   | 0.46045930988550243  | 0.5245412734834898   | 0.777984343330063
 |
| -1.0  | 0.46045930988550243  | 0.9605005092215819   | -1.141443398317655
  |
| -2.0  | 0.46045930988550243  | 0.4892755331157771   | -1.8316618040095816
 |
+-------+----------------------+----------------------+----------------------+
10 rows selected (0.434 seconds)



On Mon, Nov 28, 2016 at 1:35 PM, John Omernik <john@omernik.com> wrote:

> So I may have data that is 20 columns wide, what I am looking for is based
> on a single column, pick a random sampling from that column, but return the
> the whole row... I guess it doesn't matter much on the column, random is
> random, I just want the whole row, not just just a a single column.  Two
> passes wouldn't be horrible, I was just trying to see how others approach
> this problem.
>
> On Mon, Nov 28, 2016 at 2:58 PM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
>
> > What does "grab random form that column"?
> >
> > Does it mean use that row to determine the probability of picking the
> row?
> >
> > How bad is it to make two passes through the data?
> >
> >
> >
> > On Mon, Nov 28, 2016 at 11:19 AM, John Omernik <john@omernik.com> wrote:
> >
> > > This is less about a random sampling of one column, more about based
> on a
> > > column, grab random from that column, but return the whole row...
> > >
> > > On Mon, Nov 28, 2016 at 11:45 AM, Ted Dunning <ted.dunning@gmail.com>
> > > wrote:
> > >
> > > >
> > > > The answer is probably yes. (Get it?)
> > > >
> > > > If you just want a random sample of one column, try random() < p as
a
> > > > qualifier in the where clause.
> > > >
> > > > If you want samples where the likelihood varies with the value of a
> > > > column, the answer is slightly more elaborate.  For instance, suppose
> > you
> > > > want about a thousand samples from each city in the data. This means
> > that
> > > > you should have p=1 for all cities where there are less than a
> thousand
> > > > samples at all and p=1000/n where n is the number of samples for the
> > > > current city. So what you want is a two pass query that counts the
> > cities
> > > > and then uses these counts to get probabilities. I am not up for
> typing
> > > > that on a phone, but it should be straightforward.
> > > >
> > > > This same task can be done in a single pass by using what is called
> > > > reservoir sampling. You can use two levels of reservoir sampling
> with a
> > > > counter to bias the results but that will require a user defined
> > > aggregator
> > > > that can work on two levels and I don't think that is possible/easy
> yet
> > > > with drill.
> > > >
> > > > Sent from my iPhone
> > > >
> > > > > On Nov 28, 2016, at 8:27, John Omernik <john@omernik.com> wrote:
> > > > >
> > > > > Is there a way to grab a random return of data from Drill?
> > > > >
> > > > > For example, let's say I have a table with 1 billion rows, and I
> want
> > > to
> > > > > return 100,000 at random based on a sampling of a specific
> column...
> > is
> > > > > that possible?
> > > > >
> > > > > Thanks
> > > > >
> > > > > John
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message