And for DISTINCT issues?
On 18 Jun 2015 15:16, "Jihoon Son" <ghoonson@gmail.com> wrote:
> As far as I know, TAJO-256, TAJO-259, and TAJO-1420 are issues for data
> cube and grouping sets.
> You can create new issues if you want. Even if some of them turn out to
> be duplicates, that's ok.
> On Thu, Jun 18, 2015 at 10:39 AM, Atri Sharma <atri.jiit@gmail.com> wrote:
>
> > Do we have a ticket around that?
> > On 18 Jun 2015 15:07, "Jihoon Son" <jihoonson@apache.org> wrote:
> >
> > > That sounds like a good way to start.
> > > Any questions are welcome!
> > >
> > > Jihoon
> > >
> > > On Thu, Jun 18, 2015 at 3:39 AM, Atri Sharma <atri.jiit@gmail.com> wrote:
> > >
> > > > So distinct aggregation is one area, thanks.
> > > >
> > > > I am trying to get enough knowledge of the internals of the
> > > > aggregation engine and the query planner to be able to work on
> > > > rollup and cube, so I am picking smaller tickets first.
> > > > On 18 Jun 2015 02:42, "Jihoon Son" <jihoonson@apache.org> wrote:
> > > >
> > > > > As far as I know, there aren't any plans for improvement except
> > > > > for distinct aggregation. I think that our code for distinct
> > > > > aggregation is too complicated, and its performance should also
> > > > > be improved.
> > > > >
> > > > > So, when you design the implementation of your algorithm on Tajo,
> > > > > I think you don't have to consider the distinct aggregation part.
> > > > >
> > > > > On Thu, Jun 18, 2015 at 2:16 AM, Atri Sharma <atri.jiit@gmail.com> wrote:
> > > > >
> > > > > > Thank you.
> > > > > >
> > > > > > Is there any improvement in aggregates that we are looking at,
> > > > > > please?
> > > > > > On 16 Jun 2015 17:07, "Jihoon Son" <jihoonson@apache.org> wrote:
> > > > > >
> > > > > > > In Tajo, aggregation is very similar to that in Hadoop
> > > > > > > MapReduce. Let me consider an example. Given a query of
> > > > > > > "select *k*, count(*) from *t* group by *k*", Tajo generates
> > > > > > > a LogicalPlan as follows.
> > > > > > >
> > > > > > > group by (k)
> > > > > > > |
> > > > > > > scan (t)
> > > > > > >
> > > > > > > This LogicalPlan is translated into a MasterPlan as follows.
> > > > > > >
> > > > > > > -----------------
> > > > > > > Stage2
> > > > > > > group by *k*
> > > > > > > -----------------
> > > > > > > |
> > > > > > > shuffle tuples with *k*
> > > > > > > |
> > > > > > > -----------------
> > > > > > > Stage1
> > > > > > > group by *k*
> > > > > > > |
> > > > > > > scan *t*
> > > > > > > -----------------
> > > > > > >
> > > > > > > As you can see in this example, the query plan consists of
> > > > > > > two stages. The stages are executed sequentially because the
> > > > > > > result of Stage 1 is used as the input of Stage 2. Each stage
> > > > > > > is divided into multiple tasks, one for each input split, as
> > > > > > > follows.
> > > > > > >
> > > > > > > Stage1
> > > > > > >
> > > > > > > Task 1
> > > > > > > group by *k*
> > > > > > > |
> > > > > > > scan *t* (0 - 99)
> > > > > > >
> > > > > > > Task 2
> > > > > > > group by *k*
> > > > > > > |
> > > > > > > scan *t* (100 - 199)
> > > > > > > ...
> > > > > > >
> > > > > > > Each task is executed by a TajoWorker. As you can see, the
> > > > > > > tasks of the first stage perform a local aggregation after
> > > > > > > scanning their input splits. The local aggregation results
> > > > > > > are then shuffled among TajoWorkers by the aggregation key
> > > > > > > *k*, and the final aggregation is computed in the second
> > > > > > > stage.
> > > > > > >
> > > > > > > Stage1 and Stage2 are similar to Map and Reduce of MapReduce.
> > > > > > > The local aggregation of Stage1 is similar to the Combiner of
> > > > > > > Hadoop MapReduce.
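> > > > > > >
> > > > > > > For illustration, here is a minimal sketch of that two-phase
> > > > > > > idea in plain Java (a hypothetical example, not Tajo's actual
> > > > > > > code or classes): each Stage1 task builds partial counts for
> > > > > > > its own split, and Stage2 merges the partial counts that were
> > > > > > > shuffled to it by the key *k*.
> > > > > > >
> > > > > > > import java.util.HashMap;
> > > > > > > import java.util.List;
> > > > > > > import java.util.Map;
> > > > > > >
> > > > > > > class TwoPhaseCountSketch {
> > > > > > >     // Stage1 (combiner-like): local aggregation over one input split.
> > > > > > >     static Map<String, Long> localCount(List<String> split) {
> > > > > > >         Map<String, Long> partial = new HashMap<>();
> > > > > > >         for (String k : split) {
> > > > > > >             partial.merge(k, 1L, Long::sum);
> > > > > > >         }
> > > > > > >         return partial;
> > > > > > >     }
> > > > > > >
> > > > > > >     // Stage2: final aggregation over the partial results shuffled by key.
> > > > > > >     static Map<String, Long> mergeCounts(List<Map<String, Long>> partials) {
> > > > > > >         Map<String, Long> total = new HashMap<>();
> > > > > > >         for (Map<String, Long> p : partials) {
> > > > > > >             p.forEach((k, c) -> total.merge(k, c, Long::sum));
> > > > > > >         }
> > > > > > >         return total;
> > > > > > >     }
> > > > > > >
> > > > > > >     public static void main(String[] args) {
> > > > > > >         Map<String, Long> p1 = localCount(List.of("a", "b", "a")); // split 0 - 99
> > > > > > >         Map<String, Long> p2 = localCount(List.of("b", "c"));      // split 100 - 199
> > > > > > >         System.out.println(mergeCounts(List.of(p1, p2)));          // {a=2, b=2, c=1} (order may vary)
> > > > > > >     }
> > > > > > > }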
> > > > > > >
> > > > > > > I hope that this will be helpful to you.
> > > > > > > If you have any further questions, please feel free to ask.
> > > > > > > Jihoon
> > > > > > >
> > > > > > > On Tue, Jun 16, 2015 at 7:28 AM, Atri Sharma <atri.jiit@gmail.com> wrote:
> > > > > > >
> > > > > > > > Thanks.
> > > > > > > >
> > > > > > > > What are your thoughts on parallel aggregation? Generating
> > > > > > > > query plans that allow partial aggregation states to be
> > > > > > > > computed independently and then recombined?
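> > > > > > > >
> > > > > > > > For example (a hypothetical Java sketch of the idea, not
> > > > > > > > existing Tajo code), an average could be carried as a
> > > > > > > > (sum, count) state that each worker builds independently
> > > > > > > > and a final step merges:
> > > > > > > >
> > > > > > > > // Hypothetical recombinable aggregation state for avg().
> > > > > > > > class AvgState {
> > > > > > > >     double sum;
> > > > > > > >     long count;
> > > > > > > >
> > > > > > > >     // Built independently on each worker.
> > > > > > > >     void add(double v) {
> > > > > > > >         sum += v;
> > > > > > > >         count++;
> > > > > > > >     }
> > > > > > > >
> > > > > > > >     // Recombined in the final stage.
> > > > > > > >     AvgState merge(AvgState other) {
> > > > > > > >         AvgState merged = new AvgState();
> > > > > > > >         merged.sum = this.sum + other.sum;
> > > > > > > >         merged.count = this.count + other.count;
> > > > > > > >         return merged;
> > > > > > > >     }
> > > > > > > >
> > > > > > > >     // Final result.
> > > > > > > >     double result() {
> > > > > > > >         return count == 0 ? 0.0 : sum / count;
> > > > > > > >     }
> > > > > > > > }
> > > > > > > >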
> > > > > > > > On 16 Jun 2015 05:25, "Jihoon Son" <jihoonson@apache.org> wrote:
> > > > > > > >
> > > > > > > > > Hi Atri, thanks for your question.
> > > > > > > > >
> > > > > > > > > First of all, maybe you already did, but I recommend that
> > > > > > > > > you read this article
> > > > > > > > > <http://www.hadoopsphere.com/2015/02/technical-deep-dive-into-apache-tajo.html>
> > > > > > > > > before you start the implementation. It was written by
> > > > > > > > > Hyunsik and describes Tajo's overall infrastructure.
> > > > > > > > > Afterwards, I think you may want to ask more detailed
> > > > > > > > > questions.
> > > > > > > > >
> > > > > > > > > Here, I'll roughly list some important classes for the
> > > > > > > > > aggregation implementation.
> > > > > > > > >
> > > > > > > > > - SQLParser.g4 contains our SQL parsing rules. It is
> > > > > > > > >   written in ANTLR.
> > > > > > > > > - SQLAnalyzer is our parser, based on the rules defined in
> > > > > > > > >   SQLParser.g4.
> > > > > > > > >    - SQLAnalyzer translates a SQL query into a tree of
> > > > > > > > >      Expr, which represents an algebraic expression.
> > > > > > > > > - LogicalPlanner translates the Expr tree into a
> > > > > > > > >   LogicalPlan that logically describes how the given query
> > > > > > > > >   will be executed.
> > > > > > > > > - GlobalPlanner translates the LogicalPlan into a
> > > > > > > > >   MasterPlan (a distributed query execution plan) that
> > > > > > > > >   describes how the given query will be executed in a
> > > > > > > > >   distributed cluster.
> > > > > > > > > - Once a MasterPlan is created, QueryMaster starts query
> > > > > > > > >   processing. A query consists of multiple stages, which
> > > > > > > > >   are individually processed in some order.
> > > > > > > > >    - For example, a simple aggregation query is executed
> > > > > > > > >      in two stages, one for parallel aggregation and one
> > > > > > > > >      for combining the aggregates. These stages are
> > > > > > > > >      executed sequentially.
> > > > > > > > > - A stage is concurrently processed by multiple tasks,
> > > > > > > > >   each of which is executed by a TajoWorker.
> > > > > > > > > - Each task contains meta information about its input data
> > > > > > > > >   and the LogicalPlan of the stage. This LogicalPlan is
> > > > > > > > >   translated into a PhysicalExec by PhysicalPlanner.
> > > > > > > > >    - PhysicalExec describes how the query is actually
> > > > > > > > >      executed.
> > > > > > > > >    - For example, there are two types of AggregationExec,
> > > > > > > > >      i.e., HashAggregateExec and SortAggregateExec, for
> > > > > > > > >      hash-based aggregation and sort-based aggregation,
> > > > > > > > >      respectively (see the sketch below).
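> > > > > > > > >
> > > > > > > > > To make that last point concrete, here is a rough,
> > > > > > > > > hypothetical sketch in plain Java (not Tajo's actual
> > > > > > > > > implementations): hash-based aggregation builds a table
> > > > > > > > > keyed by the group key, while sort-based aggregation
> > > > > > > > > assumes its input is already sorted by the key and emits
> > > > > > > > > a group whenever the key changes.
> > > > > > > > >
> > > > > > > > > import java.util.HashMap;
> > > > > > > > > import java.util.List;
> > > > > > > > > import java.util.Map;
> > > > > > > > >
> > > > > > > > > class AggregationStrategySketches {
> > > > > > > > >     // Hash-based: one pass, memory grows with the number of groups.
> > > > > > > > >     static Map<String, Long> hashAggregate(List<String> keys) {
> > > > > > > > >         Map<String, Long> groups = new HashMap<>();
> > > > > > > > >         for (String k : keys) {
> > > > > > > > >             groups.merge(k, 1L, Long::sum);
> > > > > > > > >         }
> > > > > > > > >         return groups;
> > > > > > > > >     }
> > > > > > > > >
> > > > > > > > >     // Sort-based: input must already be sorted by key; constant memory,
> > > > > > > > >     // a group is emitted each time the key changes.
> > > > > > > > >     static void sortAggregate(List<String> sortedKeys) {
> > > > > > > > >         String current = null;
> > > > > > > > >         long count = 0;
> > > > > > > > >         for (String k : sortedKeys) {
> > > > > > > > >             if (!k.equals(current)) {
> > > > > > > > >                 if (current != null) System.out.println(current + ": " + count);
> > > > > > > > >                 current = k;
> > > > > > > > >                 count = 0;
> > > > > > > > >             }
> > > > > > > > >             count++;
> > > > > > > > >         }
> > > > > > > > >         if (current != null) System.out.println(current + ": " + count);
> > > > > > > > >     }
> > > > > > > > > }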
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > Jihoon
> > > > > > > > >
> > > > > > > > > On Mon, Jun 15, 2015 at 11:32 PM, Atri Sharma <atri.jiit@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Folks,
> > > > > > > > > >
> > > > > > > > > > I am looking into parallel aggregates/combining
> > > > > > > > > > aggregates. I have a plan around it which I think can
> > > > > > > > > > work.
> > > > > > > > > >
> > > > > > > > > > Please update me on the current infrastructure and
> > > > > > > > > > point me to the relevant parts of the existing code
> > > > > > > > > > base. Also, ideas around it would be most welcome.
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Regards,
> > > > > > > > > >
> > > > > > > > > > Atri
> > > > > > > > > > *l'apprenant*
> > > > > > > > > >