drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jinfeng Ni <...@apache.org>
Subject Re: Counting large numbers of unique values
Date Tue, 07 Apr 2015 22:45:09 GMT
ASAIK, Drill's planner currently does not expose the sort-ness of
underlying data; if the data is pre-sorted, Drill planner would not
recognize that, and still would require a sort operator for sort-based
aggregation. Part of the reason is that Drill does not have a centralized
meta-store, to keep track of the meta data of the data/tables/files.
Therefore, if the the underlying data does not expose the sort-ness,
planner does not have such knowledge.

We do plan to enhance the planer / storage plugin interface, such that
planner would be able to utilize the sort-ness / partition columns
information.



On Tue, Apr 7, 2015 at 3:25 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> Marcin,
>
> They did comment.  The answer is that the default is to use hashed
> aggregation (which will be faster when there is lots of memory) with the
> option to use sort aggregation (which is basically what you were
> suggesting).
>
> Did you mean to suggest that your data is already known to be sorted and
> thus the sort step should be omitted?
>
>
> On Tue, Apr 7, 2015 at 3:21 PM, Marcin Karpinski <mkarpinski@opera.com>
> wrote:
>
> > @Jacques, thanks for the information - I'm definitely going to check out
> > that option.
> >
> > I'm also curious that none of you guys commented on my original idea of
> > counting distinct values by a simple aggregation of pre-sorted data - is
> it
> > because it doesn't make sense to you guys, or because you think your
> > suggestions are easier to implement?
> >
> > On Tue, Apr 7, 2015 at 5:55 PM, Jacques Nadeau <jacques@apache.org>
> wrote:
> >
> > > Two additional notes here:
> > >
> > > Drill can actually do an aggregation using either a hash table based
> > > aggregation or a sort based aggregation.  By default, generally the
> hash
> > > aggregation will be selected first.  However, you can disable hash
> based
> > > aggregation if you specifically think that a sort based aggregation
> will
> > > perform better for use case.  You can do this by running the command
> > ALTER
> > > SESSION SET `planner.enable_hashagg` = FALSE;
> > >
> > > We have always had it on our roadmap to implement an approximate count
> > > distinct function but haven't gotten to it yet.  As Ted mentions, using
> > > this technique would substantially reduce data shuffling and could be
> > done
> > > with a moderate level of effort since our UDAF interface is pluggable.
> > >
> > >
> > >
> > > On Tue, Apr 7, 2015 at 8:20 AM, Ted Dunning <ted.dunning@gmail.com>
> > wrote:
> > >
> > > > How precise do your counts need to be?  Can you accept a fraction of
> a
> > > > percent statistical error?
> > > >
> > > >
> > > >
> > > > On Tue, Apr 7, 2015 at 8:11 AM, Aman Sinha <asinha@maprtech.com>
> > wrote:
> > > >
> > > > > Drill already does most of this type of transformation.  If you do
> an
> > > > > 'EXPLAIN PLAN FOR <your count(distinct) query>'
> > > > > you will see that it first does a grouping on the column and then
> > > applies
> > > > > the COUNT(column).  The first level grouping can be done either
> based
> > > on
> > > > > sorting or hashing and this is configurable through a system
> option.
> > > > >
> > > > > Aman
> > > > >
> > > > > On Tue, Apr 7, 2015 at 3:30 AM, Marcin Karpinski <
> > mkarpinski@opera.com
> > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Guys,
> > > > > >
> > > > > > I have a specific use case for Drill, in which I'd like to be
> able
> > to
> > > > > count
> > > > > > unique values in columns with tens millions of distinct values.
> The
> > > > COUNT
> > > > > > DISTINCT method, unfortunately, does not scale both time- and
> > > > memory-wise
> > > > > > and the idea is to sort the data beforehand by the values of
that
> > > > column
> > > > > > (let's call it ID), to have the row groups split at new a new
ID
> > > > boundary
> > > > > > and to extend Drill with an alternative version of COUNT that
> would
> > > > > simply
> > > > > > count the number of times the ID changes through out the entire
> > > table.
> > > > > This
> > > > > > way, we could expect that counting unique values of pre-sorted
> > > columns
> > > > > > could have complexity comparable to that of the regular COUNT
> > > operator
> > > > (a
> > > > > > full scan). So, to sum up, I have three questions:
> > > > > >
> > > > > > 1. Can such a scenario be realized in Drill?
> > > > > > 2. Can it be done in a modular way (eg, a dedicated UDAF or
an
> > > > operator),
> > > > > > so without heavy hacking throughout entire Drill?
> > > > > > 3. How to do it?
> > > > > >
> > > > > > Our initial experience with Drill was very good - it's an
> excellent
> > > > tool.
> > > > > > But in order to be able to adopt it, we need to sort out this
one
> > > > central
> > > > > > issue.
> > > > > >
> > > > > > Cheers,
> > > > > > Marcin
> > > > > >
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message