drill-user mailing list archives

From Marcin Karpinski <mkarpin...@opera.com>
Subject Re: Counting large numbers of unique values
Date Tue, 07 Apr 2015 16:19:27 GMT
@Ted, ideally I'd like to get exact results, but if that causes real
problems, we could perhaps settle for approximate counting. Does Drill
already have such functionality?

Cheers,
Marcin

On Tue, Apr 7, 2015 at 5:20 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> How precise do your counts need to be?  Can you accept a fraction of a
> percent statistical error?
>
>
>
> On Tue, Apr 7, 2015 at 8:11 AM, Aman Sinha <asinha@maprtech.com> wrote:
>
> > Drill already does most of this type of transformation.  If you do an
> > 'EXPLAIN PLAN FOR <your count(distinct) query>'
> > you will see that it first does a grouping on the column and then applies
> > the COUNT(column).  The first level grouping can be done either based on
> > sorting or hashing and this is configurable through a system option.
> >
> > Aman
> >
> > On Tue, Apr 7, 2015 at 3:30 AM, Marcin Karpinski <mkarpinski@opera.com>
> > wrote:
> >
> > > Hi Guys,
> > >
> > > I have a specific use case for Drill, in which I'd like to be able to
> > > count unique values in columns with tens of millions of distinct
> > > values. The COUNT DISTINCT method, unfortunately, does not scale well
> > > either time- or memory-wise. The idea is to sort the data beforehand by
> > > the values of that column (let's call it ID), to have the row groups
> > > split at a new ID boundary, and to extend Drill with an alternative
> > > version of COUNT that would simply count the number of times the ID
> > > changes throughout the entire table. This way, we could expect that
> > > counting unique values of pre-sorted columns would have complexity
> > > comparable to that of the regular COUNT operator (a full scan). So, to
> > > sum up, I have three questions:
> > >
> > > 1. Can such a scenario be realized in Drill?
> > > 2. Can it be done in a modular way (e.g., a dedicated UDAF or an
> > > operator), without heavy hacking throughout all of Drill?
> > > 3. How to do it?
> > >
> > > Our initial experience with Drill was very good - it's an excellent
> > > tool. But in order to be able to adopt it, we need to sort out this one
> > > central issue.
> > >
> > > Cheers,
> > > Marcin
> > >
> >
>
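
For readers of this thread, the idea in the original question (counting how often the ID changes in a column that is already sorted) can be sketched in a few lines of plain Python. This is only an illustration of the technique, not Drill code or a Drill UDAF; the function name count_distinct_sorted is hypothetical, and the sketch assumes the input really is sorted (or at least that equal values are adjacent).

```python
def count_distinct_sorted(ids):
    """Count distinct values in an iterable whose equal values are adjacent.

    Assumption: the input is pre-sorted by value. The function makes a single
    pass and keeps only the previous value, so memory use is constant.
    """
    count = 0
    first = True
    previous = None
    for value in ids:
        # A new distinct value starts at the first row and at every change.
        if first or value != previous:
            count += 1
            previous = value
            first = False
    return count


if __name__ == "__main__":
    sorted_ids = [3, 3, 7, 7, 7, 12, 40, 40]
    print(count_distinct_sorted(sorted_ids))
```

On unsorted input this counts runs rather than distinct values, which is why the pre-sorting step in the proposal matters. A hash-based approach such as len(set(ids)) needs no sorting, but holds every distinct value in memory.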
