calcite-dev mailing list archives

From John Pullokkaran <jpullokka...@hortonworks.com>
Subject Re: Adding Exchange operator and Distribution trait
Date Wed, 11 Feb 2015 23:29:54 GMT
Hive uses a greedy join-ordering algorithm (LoptOptimizeJoinRule).
We are thinking of dividing the join graph into subgraphs to work around
scalability issues (if they arise).

I thought in VolcanoPlanner you could specify a time bound.
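
To illustrate what a bounded search could look like, here is a minimal sketch. It is not VolcanoPlanner's API: BoundedSearch, cheapestWithin, and both bound parameters are hypothetical names, and the "plans" are whatever objects the caller supplies. The loop stops at a candidate-count bound or a time bound, whichever comes first, and returns the cheapest candidate seen so far.

```java
import java.util.Iterator;
import java.util.function.ToDoubleFunction;

/** Hypothetical helper, not part of Calcite: scans candidates until a budget runs out. */
final class BoundedSearch {
  private BoundedSearch() {}

  /**
   * Returns the cheapest candidate examined before either bound is reached,
   * or null if no candidate was examined at all.
   *
   * @param maxCandidates upper bound on the number of candidates costed (assumed knob)
   * @param maxNanos      upper bound on elapsed wall-clock time in nanoseconds (assumed knob)
   */
  static <P> P cheapestWithin(Iterator<P> candidates, ToDoubleFunction<P> cost,
      long maxCandidates, long maxNanos) {
    final long start = System.nanoTime();
    P best = null;
    double bestCost = Double.POSITIVE_INFINITY;
    long examined = 0;
    // Both bounds are checked before each candidate is pulled from the iterator.
    while (candidates.hasNext()
        && examined < maxCandidates
        && System.nanoTime() - start < maxNanos) {
      P next = candidates.next();
      examined++;
      double c = cost.applyAsDouble(next);
      if (c < bestCost) {
        best = next;
        bestCost = c;
      }
    }
    return best;
  }
}
```

A caller would pass an iterator over alternative plans and a cost function. With a bound of two candidates over the costs 9, 4, 7, 1, the result is the plan costing 4, which is the "best so far" behaviour asked about in the quoted message below.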
 

On 2/11/15, 3:23 PM, "Jinfeng Ni" <jinfengni99@gmail.com> wrote:

>About John's comment about putting bounds on the plan search space: does
>Calcite allow us to specify some bounds in the planner, and stop the
>search with the best plan found so far once those bounds are met?
>
>AFAIK, in TpchTest, if I turn on "calcite.test.slow", then some queries
>like TPCH Q5, Q7, Q8 seem not to come back with a plan after several
>minutes when run on my laptop.  If Calcite has the ability to specify
>search bounds (say, the number of rules fired or the number of possible
>plans enumerated), then it should return a plan within a reasonable amount
>of time, instead of searching on and on and possibly never ending.
>
>
>
>On Wed, Feb 11, 2015 at 2:49 PM, Julian Hyde <julianhyde@gmail.com> wrote:
>
>> Aman,
>>
>> RelDistribution will be an interface, and there's no reason why Drill
>> shouldn't have its own values or even sub-classes, as long as
>> RelDistributionTraitDef is able to canonize them. So you could, for
>> instance, sub-class "Hash[1, 3]" and specify which hash function is
>> being used.
>>
>> I've addressed the comment about logical exchange already -- you can go
>> straight to physical.
>>
>>
>> On Feb 11, 2015, at 2:34 PM, Aman Sinha <asinha@maprtech.com> wrote:
>>
>> > I am neutral on this for now until we give it more thought.  The
>> > reason is that Calcite is not aware of the execution engine's
>> > capability and configuration parameters for distribution (e.g. Drill has
>> > a few parameters, including simple true/false flags, that determine
>> > whether or not an Exchange node is even inserted in the plan and, if it
>> > is, what type of Exchange it is).  In that sense, if the logical plan
>> > produced by Calcite contains a LogicalExchange, it is possible that
>> > Drill may not be able to use it directly while building the physical plan.
>> >
>> > I do however see the benefits in terms of trait propagation, combining
>> > distribution and collation traits, and consolidating the subsumption
>> > logic in some base class such that it is useful for other consumers of
>> > Calcite.
>> >
>> > Aman
>> >
>> > On Wed, Feb 11, 2015 at 2:21 PM, Jinfeng Ni <jinfengni99@gmail.com> wrote:
>> >
>> >> Drill currently does query planning in two phases: 1) logical
>> >> planning, which handles join order, logical filter/project push-down,
>> >> etc., and 2) physical planning, which chooses between different
>> >> physical operators (different join/aggregation methods), applies
>> >> storage-specific filter/project push-down rules, and inserts EXCHANGE.
>> >> Part of the reason for splitting this into two phases is that when the
>> >> two phases are merged, the planning time increases significantly (since
>> >> the planner needs to enumerate different join orders, multiplied by
>> >> different choices of EXCHANGE).
>> >>
>> >> The new rules that you are proposing seem to want to build the plan in
>> >> one single logical planning phase.  I'm not sure how that will impact
>> >> the overall planning time.
>> >>
>> >>
>> >>
>> >> On Wed, Feb 11, 2015 at 1:38 PM, Jinfeng Ni <jinfengni99@gmail.com> wrote:
>> >>
>> >>> I think it's a good proposal to put Exchange/Distribution into the
>> >>> Calcite library.
>> >>>
>> >>> Makes sense to me.  +1
>> >>>
>> >>>
>> >>>
>> >>> On Wed, Feb 11, 2015 at 12:45 PM, Julian Hyde <jhyde@apache.org> wrote:
>> >>>
>> >>>> Drill guys: What do you think of the proposal?
>> >>>>
>> >>>> On Feb 11, 2015, at 11:34 AM, Ashutosh Chauhan <hashutosh@apache.org> wrote:
>> >>>>
>> >>>> Overall proposal sounds good to me. +1
>> >>>>
>> >>>> On Tue, Feb 10, 2015 at 3:35 PM, Julian Hyde <jhyde@apache.org> wrote:
>> >>>>
>> >>>> I've had some discussions about adding an Exchange operator and
>> >>>> Distribution trait to Hive's cost-based optimizer, which uses Calcite.
>> >>>> Ashutosh has logged a bug [
>> >>>> https://issues.apache.org/jira/browse/CALCITE-594 ] and pull request
>> >>>> containing a proof-of-concept [
>> >>>> https://github.com/apache/incubator-calcite/pull/52/files ].
>> >>>>
>> >>>> I know that Drill has a Distribution trait and several sub-classes of
>> >>>> Exchange operator (DrillDistributionTrait, ExchangePrel,
>> >>>> BroadcastExchangePrel, HashToMergeExchangePrel, HashToRandomExchangePrel,
>> >>>> OrderedPartitionExchangePrel and SimpleMergeExchangePrel, in
>> >>>> https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical
>> >>>> )
>> >>>>
>> >>>> I propose to create a Distribution trait and Exchange operator base
>> >>>> class in Calcite, with the goal that both Drill and Hive would use them.
>> >>>> (I am adopting Drill terminology -- Distribution rather than Partition,
>> >>>> Exchange rather than Shuffle -- but I am pretty sure that the concepts
>> >>>> are the same.)
>> >>>>
>> >>>> public abstract class Exchange extends SingleRel {
>> >>>>   public final RelDistribution distribution;
>> >>>>
>> >>>>   protected Exchange(RelOptCluster cluster, RelTraitSet traitSet,
>> >>>>       RelNode input, RelDistribution distribution) {
>> >>>>     super(cluster, traitSet, input);
>> >>>>     this.distribution = distribution;
>> >>>>   }
>> >>>> }
>> >>>>
>> >>>> public interface RelDistribution extends RelMultipleTrait {
>> >>>>   enum DistributionType {
>> >>>>     SINGLETON,
>> >>>>     HASH_DISTRIBUTED,
>> >>>>     RANGE_DISTRIBUTED,
>> >>>>     RANDOM_DISTRIBUTED,
>> >>>>     ROUND_ROBIN_DISTRIBUTED,
>> >>>>     BROADCAST_DISTRIBUTED
>> >>>>   }
>> >>>>
>> >>>>   public DistributionType getType();
>> >>>>   public ImmutableIntList getFields();
>> >>>> }
>> >>>>
>> >>>> Calcite would not contain any particular exchange algorithms. However,
>> >>>> since it is common to combine sort and exchange, I would create a base
>> >>>> class for it:
>> >>>>
>> >>>> public abstract class SortExchange extends Exchange {
>> >>>>   public final RelCollation collation;
>> >>>>
>> >>>>   ...
>> >>>> }
>> >>>>
>> >>>> The physical operators would remain in Drill/Hive and would likely be
>> >>>> fully specified by the distribution and collation; they would not need
>> >>>> any additional attributes. We would not be able to port
>> >>>> DrillDistributionTraitDef.convert directly -- it would create a
>> >>>> LogicalExchange (analogous to how RelCollationTraitDef.convert creates a
>> >>>> LogicalSort) and then Drill rules would need to kick in to convert that
>> >>>> to HashToRandomExchangePrel etc.
>> >>>>
>> >>>> I do not think that RelDistribution needs to be a "multiple" trait
>> >>>> (compare with RelCollation extends RelMultipleTrait, which allows a
>> >>>> RelNode to have more than one sort-order) but I may be wrong.
>> >>>>
>> >>>> The advantages of making Exchange a first-class operator and
>> >>>> Distribution a trait are clear. We will be able to build a library of
>> >>>> rules (e.g. FilterExchangePushRule, ExchangeRemoveRule), a
>> >>>> RelMdDistribution metadata interface, and start working on stats and
>> >>>> cost model.
>> >>>>
>> >>>> Drill and Hive stakeholders, please let me know what you think of this
>> >>>> plan.
>> >>>>
>> >>>> Julian
>> >>>>
>> >>>
>> >>>
>> >>
>>
>>
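
The thread mentions that an engine could sub-class a distribution such as "Hash[1, 3]" to record its hash function, provided the trait definition can canonize the values. The following is a rough, self-contained sketch of that idea in plain Java. It does not use Calcite's classes: SimpleDistribution, NamedHashDistribution, DistributionInterner, and the "murmur3" name in the usage note are all hypothetical, and only the enum constants come from the proposal. Canonizing is modelled here as interning, so equal values share one instance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Mirrors the DistributionType enum from the proposal. */
enum DistributionType {
  SINGLETON, HASH_DISTRIBUTED, RANGE_DISTRIBUTED,
  RANDOM_DISTRIBUTED, ROUND_ROBIN_DISTRIBUTED, BROADCAST_DISTRIBUTED
}

/** Hypothetical immutable distribution value: a type plus key field ordinals. */
class SimpleDistribution {
  final DistributionType type;
  final List<Integer> fields;

  SimpleDistribution(DistributionType type, Integer... fields) {
    this.type = Objects.requireNonNull(type);
    this.fields = Collections.unmodifiableList(new ArrayList<>(Arrays.asList(fields)));
  }

  // Equality includes the runtime class, so a sub-class never equals its base value.
  @Override public boolean equals(Object o) {
    if (this == o) return true;
    if (o == null || getClass() != o.getClass()) return false;
    SimpleDistribution that = (SimpleDistribution) o;
    return type == that.type && fields.equals(that.fields);
  }

  @Override public int hashCode() {
    return Objects.hash(getClass(), type, fields);
  }
}

/** Hypothetical engine-specific sub-class that also records the hash function. */
final class NamedHashDistribution extends SimpleDistribution {
  final String hashFunction;

  NamedHashDistribution(String hashFunction, Integer... fields) {
    super(DistributionType.HASH_DISTRIBUTED, fields);
    this.hashFunction = Objects.requireNonNull(hashFunction);
  }

  @Override public boolean equals(Object o) {
    // super.equals already guarantees that o has the same class, so the cast is safe.
    return super.equals(o) && hashFunction.equals(((NamedHashDistribution) o).hashFunction);
  }

  @Override public int hashCode() {
    return 31 * super.hashCode() + hashFunction.hashCode();
  }
}

/** Hypothetical interner: hands back one shared instance per distinct value. */
final class DistributionInterner {
  private final Map<SimpleDistribution, SimpleDistribution> canonical = new HashMap<>();

  SimpleDistribution canonize(SimpleDistribution d) {
    SimpleDistribution existing = canonical.putIfAbsent(d, d);
    return existing != null ? existing : d;
  }
}
```

For example, canonizing two separately constructed HASH_DISTRIBUTED values on fields 1 and 3 returns the same instance both times, while a NamedHashDistribution("murmur3", 1, 3) stays distinct from the plain value because its class and hash function take part in equality.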

