calcite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jacques Nadeau <jacq...@apache.org>
Subject Re: Adding Exchange operator and Distribution trait
Date Wed, 11 Feb 2015 23:39:33 GMT
Is there any proposal for partition range (hash or ordered) awareness?

On Wed, Feb 11, 2015 at 3:29 PM, John Pullokkaran <
jpullokkaran@hortonworks.com> wrote:

> Hive uses greedy Join Order Algorithm (LoptOptimizeJoinRule).
> We are thinking of dividing the join graph to sub graphs to work around
> scalability issues (if it arises).
>
> I thought in VolcanoPlanner you could specify a time bound.
>
>
> On 2/11/15, 3:23 PM, "Jinfeng Ni" <jinfengni99@gmail.com> wrote:
>
> >About John's comment about put bounds in the plan search space, does
> >Calcite allow us to specify some bounds in the planner, and stop the
> >searching with the best plan found so far after that bounds are meet?
> >
> >AFAIK, in TpchTest, if I turn on "calcite.test.slow", then some queries
> >like TPCH  Q5, Q7, Q8 seem to not come back with a plan, after several
> >minutes, when run on my laptop.  If Calcite has the ability to specify the
> >search bounds ( say # of rules are fired, or # of possible plans
> >enumerated), then it should return a plan within a reasonable amount of
> >time, in stead of keeping on searching, searching, and searching, and
> >possibly never end.
> >
> >
> >
> >On Wed, Feb 11, 2015 at 2:49 PM, Julian Hyde <julianhyde@gmail.com>
> wrote:
> >
> >> Aman,
> >>
> >> RelDistribution will be an interface, and there¹s no reason why Drill
> >> shouldn¹t have its own values or even sub-classes. As long as
> >> RelDistributionTraitDef is able to canonize them. So you could, for
> >> instance, sub-class "Hash[1, 3]² and specify which hash function is
> >>being
> >> used.
> >>
> >> I¹ve addressed the comment about logical exchange already ‹ you can go
> >> straight to physical.
> >>
> >>
> >> On Feb 11, 2015, at 2:34 PM, Aman Sinha <asinha@maprtech.com> wrote:
> >>
> >> > I am neutral on this for now until we give it more thought.  The
> >>reason
> >> > being that since Calcite is not aware of the execution engine's
> >> capability
> >> > and configuration parameters for distribution (e.g Drill has a few
> >> > parameters, including just true/false type of flags that determine
> >> whether
> >> > or not an Exchange node is even inserted in the plan and if it is
> >>used,
> >> > what type of Exchange it is etc.).  In that sense, if the logical plan
> >> > produced by Calcite contains a LogicalExchange, it is possible that
> >>Drill
> >> > may not be able use it directly while building the physical plan.
> >> >
> >> > I do however see the benefits in terms of trait propagation, combining
> >> > distribution and collation traits and consolidating the subsumption
> >>logic
> >> > in some base class such that it is useful for other consumers of
> >>Calcite.
> >> >
> >> > Aman
> >> >
> >> > On Wed, Feb 11, 2015 at 2:21 PM, Jinfeng Ni <jinfengni99@gmail.com>
> >> wrote:
> >> >
> >> >> Drill currently  do query planing in two phases : 1) logical
> >>planning,
> >> >> which handles join order, logical filter/project push down etc, and
> >>2)
> >> >> physical planning, which makes decision between different physical
> >> >> operators ( different join / aggregation method), filter/project push
> >> down
> >> >> (storage-specific rule), and insert EXCHANGE.   Part of the reason
to
> >> put
> >> >> into two phases is when the two phases are merged together, the
> >>planning
> >> >> time is increased significantly ( since the planner need to enumerate
> >> >> different join orders, multiplied by different choices of EXCHANGE).
> >> >>
> >> >> The new rules that you are proposing seems to want to build plan in
> >>one
> >> >> single logical planing phase.  I'm not sure how it will impact the
> >> overall
> >> >> planning time.
> >> >>
> >> >>
> >> >>
> >> >> On Wed, Feb 11, 2015 at 1:38 PM, Jinfeng Ni <jinfengni99@gmail.com>
> >> wrote:
> >> >>
> >> >>> I think it's a good proposal to put Exchange/Distribution into
> >>Calcite
> >> >>> library.
> >> >>>
> >> >>> Make sense to me.  +1
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Wed, Feb 11, 2015 at 12:45 PM, Julian Hyde <jhyde@apache.org>
> >> wrote:
> >> >>>
> >> >>>> Drill guys: What do you think of the proposal?
> >> >>>>
> >> >>>> On Feb 11, 2015, at 11:34 AM, Ashutosh Chauhan
> >><hashutosh@apache.org>
> >> >>>> wrote:
> >> >>>>
> >> >>>> Overall proposal sounds good to me. +1
> >> >>>>
> >> >>>> On Tue, Feb 10, 2015 at 3:35 PM, Julian Hyde <jhyde@apache.org>
> >> wrote:
> >> >>>>
> >> >>>> I've had some discussions about adding an Exchange operator
and
> >> >>>> Distribution trait to Hive's cost-based optimizer, which uses
> >>Calcite.
> >> >>>> Ashutosh has logged a bug [
> >> >>>> https://issues.apache.org/jira/browse/CALCITE-594 ] and pull
> >>request
> >> >>>> containing a proof-of-concept [
> >> >>>> https://github.com/apache/incubator-calcite/pull/52/files ].
> >> >>>>
> >> >>>> I know that Drill has a Distribution trait and several sub-classes
> >>of
> >> >>>> Exchange operator (DrillDistributionTrait, ExchangePrel,
> >> >>>> BroadcastExchangePrel, HashToMergeExchangePrel,
> >> >> HashToRandomExchangePrel,
> >> >>>> OrderedPartitionExchangePrel and SimpleMergeExchangePrel, in
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>
> >>
> >>
> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/
> >>org/apache/drill/exec/planner/physical
> >> >>>> )
> >> >>>>
> >> >>>> I propose to create a Distribution trait and Exchange operator
base
> >> >> class
> >> >>>> in Calcite, with the goal that both Drill and Hive would use
them.
> >>(I
> >> am
> >> >>>> adopting Drill terminology -- Distribution rather than Partition,
> >> >> Exchange
> >> >>>> rather than Shuffle -- but I am pretty sure that the concepts
are
> >>the
> >> >>>> same.)
> >> >>>>
> >> >>>> public abstract class Exchange extends SingleRel {
> >> >>>> public final RelDistribution distribution;
> >> >>>>
> >> >>>> protected Exchange(RelCluster cluster, RelTraitSet traitSet,
> >>RelNode
> >> >>>> input, RelDistribution distribution) {
> >> >>>>   super(cluster, traitSet, input);
> >> >>>>   this.distribution = distribution;
> >> >>>> }
> >> >>>> }
> >> >>>>
> >> >>>> public interface RelDistribution extends RelMultipleTrait {
> >> >>>> enum DistributionType {
> >> >>>>   SINGLETON,
> >> >>>>   HASH_DISTRIBUTED,
> >> >>>>   RANGE_DISTRIBUTED,
> >> >>>>   RANDOM_DISTRIBUTED,
> >> >>>>   ROUND_ROBIN_DISTRIBUTED,
> >> >>>>   BROADCAST_DISTRIBUTED
> >> >>>> }
> >> >>>>
> >> >>>> public DistributionType getType();
> >> >>>> public ImmutableIntList getFields();
> >> >>>> }
> >> >>>>
> >> >>>> Calcite would not contain any particular exchange algorithms.
> >>However,
> >> >>>> since it is common to combine sort and exchange, I would create
a
> >>base
> >> >>>> class for it:
> >> >>>>
> >> >>>> public abstract class SortExchange extends Exchange {
> >> >>>> public final Collation collation;
> >> >>>>
> >> >>>> ...
> >> >>>> }
> >> >>>>
> >> >>>> The physical operators would remain in Drill/Hive and would
likely
> >>be
> >> >>>> fully
> >> >>>> specified by the distribution and collation; they would not
need
> >>any
> >> >>>> additional attributes. We would not be able to port
> >> >>>> DrillDistributionTraitDef.convert directly -- it would create
a
> >> >>>> LogicalExchange (analogous to how RelCollationTraitDef.convert
> >> creates a
> >> >>>> LogicalSort) and then Drill rules would need to kick in to
convert
> >> that
> >> >> to
> >> >>>> HashToRandomExchangePrel etc.
> >> >>>>
> >> >>>> I do not think that RelDistribution needs to be a "multiple"
trait
> >> >>>> (compare
> >> >>>> with RelCollation extends RelMultipleTrait, which allows a
RelNode
> >>to
> >> >> have
> >> >>>> more than one sort-order) but I may be wrong.
> >> >>>>
> >> >>>> The advantages of making Exchange a first-class operator and
> >> >> Distribution
> >> >>>> a
> >> >>>> trait are clear. We will be able to build a library of rules
(e.g.
> >> >>>> FilterExchangePushRule, ExchangeRemoveRule), a RelMdDistribution
> >> >> metadata
> >> >>>> interface, and start working on stats and cost model.
> >> >>>>
> >> >>>> Drill and Hive stakeholders, please let me know what you think
of
> >>this
> >> >>>> plan.
> >> >>>>
> >> >>>> Julian
> >> >>>>
> >> >>>
> >> >>>
> >> >>
> >>
> >>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message