calcite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alexey Roytman <alexey.royt...@oracle.com>
Subject Re: ProjectableFilterableTable.scan() and expensive columns
Date Tue, 07 Nov 2017 14:55:49 GMT
Luis Fernando, I've got the idea. But there are two "but"s:

1. While in BindableTableScan.computeSelfCost(), I can look at 
table.getRowType() columns and to select these with indices in 
"projects". That's fine. But where do I define the policy of what are 
the costs? I don't want to hardcode BindableTableScan.computeSelfCost()! 
There should be some interface to be implemented that knows to give the 
price.

2. If we extend the idea with the price of filters, when having no 
filters is bad-bad-bad for me as I need to bring a lot of data, we may 
think about any CALCITE-1933's simplification...


And yes, in TranslatableTable I have more power, but it's far more 
complex thing for me. For now I'm fighting the scenario of CALCITE-2039 
("select 0 as c1, ...") with TranslatableTable terms... And yes, 
debugging-digging-creeping...


On 11/06/2017 07:42 PM, Luis Fernando Kauer wrote:
>   Alexey,
>
> The example of computeSelfCost multiplies the cost by projects.size() / identity().size(),
where projects.size() is the number of columns to be read (used) and identity.size() is the
total number of columns of the table. So the cost is reduced if less columns need to be read
instead of reading them all.
> You could use projects.size() / table.getRowType().getFieldCount() instead, but it's
the same thing.
> Or you could use something more complex like in DruidQuery.computeSelfCost.
> If you plan on using ProjectableFilterableTable then you'll have to fix BindableTableScan.computeSelfCost,
because that's what is used to calculate the cost and currently it does not use the projects
size.
> But you can use TranslatableTable just like int CSV Adapter.  The problem with not pushing
the used projects when using some aggregate functions is also solved by implementing computeSelfCost
in CsvTableScan:
>   @Override
>    public RelOptCost computeSelfCost(RelOptPlanner planner, RelMetadataQuery mq) {
>      return super.computeSelfCost(planner, mq).multiplyBy(fields.length/table.getRowType().getFieldCount());
>    }
>
> And by using TranslatableTable you have full control on how to compute the cost to tailor
your needs without having to change anything in Calcite, just in your adapter.
> If you need to push filters down to your table, you'll need to implement that through
rules.  This can be more advanced because you'll need to learn how to create rules but it
will be worth it, because you will understand Calcite better and be able to solve other problems
by yourself.
>
>      Em segunda-feira, 6 de novembro de 2017 12:09:06 BRST, Alexey Roytman <alexey.roytman@oracle.com>
escreveu:
>   
>   Hello, Julian.
> The changes that belong to the Calcite itself -- definitely, I will
> contribute them. For now I have only javadoc-related in pull request.
>
> I understand that my approach shall be flexible and universal enough,
> and aligned with Calcite's ideology (that I'm not acquainted with).
>
> So, for now my only idea is to have an interface
> ProjectableFilterableTableScanCostEstimator that has a:
>
> @Override public RelOptCost computeScanCost(RelOptPlanner planner,
> RelMetadataQuery mq, final List<RexNode> list, final int[] projected);
>
> And at some moment during VolcanoPlanner.findBestExp() flow, we need to
> check whether a (table.table instanceof
> ProjectableFilterableTableScanCostEstimator) and if yes, then multiply
> by computeScanCost(). Maybe, the right place is somewhere near
> BindableTableScan.computeSelfCost(), but I don't remember whether I have
> an access to table.table there...
>
> Regarding your approach
> (https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_CALCITE-2D1933&d=DwIFaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=LhjCnSmzU2hPgGoBHx67hhcdFfWo4OSoQob6udt6lpA&m=5OEmY_TsEOZbYvGoMiSh3sTIgXzykHXpC-WCHVmVXRQ&s=B7LN0iqge8XRZ9QuqkTwfe9N4nwm2pu0vMPnQKyX1wA&e=)
of
> ExpressionComplexity -- I don't get where shall I have an instance of
> this, and how do I push such a beast to ProjectableFilterableTable
> instance. I.e. I don't understand the idea of data composition here...
>
> - Alexey.
>
> On 11/01/2017 10:37 PM, Julian Hyde wrote:
>> Alexey,
>>
>> Do you intend to contribute your changes back to Calcite? If not, feel free to disregard
the following (but also downgrade the amount of free advice that you expect from this list;
we all have other priorities).
>>
>> If you do intend to contribute back, you should keep in mind that some changes will
be more acceptable than others. It’s not sufficient that the changes solve your problem;
your  changes must not cause problems for other people.
>>
>> For instance, adding a field to SqlTypeName is probably not good because it doesn’t
allow people to customize the cost. Also, when extending a class or interface by adding methods
and fields, ask yourself whether it was intended to be extended.
>>
>> The absolute most useful code you can contribute is a test case that covers all of
your requirements and is simple for everyone else to run. Then we can safely refactor your
implementation and know that you are still getting what you need.
>>
>> Are you aware of https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_CALCITE-2D1933&d=DwIFaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=LhjCnSmzU2hPgGoBHx67hhcdFfWo4OSoQob6udt6lpA&m=wxCJy0-IChQRB53y7qCSONbcgJj24vGDvV1P8DylhMg&s=LBpaGzJP6rHVxjUmLc_MXY_w4P_jkJSIx5QhtdXGuiw&e=
<https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_CALCITE-2D1933&d=DwIFaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=LhjCnSmzU2hPgGoBHx67hhcdFfWo4OSoQob6udt6lpA&m=wxCJy0-IChQRB53y7qCSONbcgJj24vGDvV1P8DylhMg&s=LBpaGzJP6rHVxjUmLc_MXY_w4P_jkJSIx5QhtdXGuiw&e=>
?  There I suggested adding a new kind of metadata called ExpressionComplexity. You could
write your own handler that takes into account SqlTypeName or whatever you like.
>>
>> Julian
>>
>>
>>
>>> On Oct 31, 2017, at 10:13 AM, Luis Fernando Kauer <lfkauer@yahoo.com.br.INVALID>
wrote:
>>>
>>> Hi,
>>> I'm also newbie and I'm just trying to help with I'm learned so far.
>>> It's all about the rules.  I strongly recommend taking a look into:
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__calcite.apache.org_docs_howto.html-23tracing&d=DwIFaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=LhjCnSmzU2hPgGoBHx67hhcdFfWo4OSoQob6udt6lpA&m=wxCJy0-IChQRB53y7qCSONbcgJj24vGDvV1P8DylhMg&s=8gp7l0GOuDYTZ5Ur5JDPNADLkOckZM6W53FnSfWbCXk&e=
>>>
>>> Specially about setting the logger lever to DEBUG.
>>>
>>> By doing this you can check which rules are being fired, the costs and chosen
plan.
>>> As I mentioned in my last reply, the problem seems to be that BindableTableScan.computeSelfCost
does not use the information about used projects and filters to compute the cost of the BindableTableScan.
>>> Many rules are applied that convert the nodes and at the end the plan with the
lowest cost is chosen.  Among the generated plans are plans that push the projects to the
BindableTableScan and others that do not.  The total cost of the plan that pushes the projects
and the filter is the probably the same or higher than the cost of the plan that pushes only
the filter and removes the projects (through AggregateProjectMergeRule) and the planner ends
up choosing the latter.
>>> So maybe just by fixing computeSelfCost it solves the problem.
>>> Just for the purpose of testing this, I changed computeSelfCost to:
>>> @Override public RelOptCost computeSelfCost(RelOptPlanner planner,
>>>            RelMetadataQuery mq) {
>>>          return super.computeSelfCost(planner, mq).multiplyBy((0.01 + projects.size())/identity().size());
>>>        }
>>> And only by doing this, the planner chose a plan that pushes the projects and
filters to BindableTableScan when used with aggregation.
>>> You can also override getStatistic() for your implementation of ProjectableFilterableTable
to provide some statistics to be used in computing costs.
>>> Calcite also allows other ways to provide information and statistics but that's
beyond my knowledge.
>>> Regards,
>>> Luis Fernando
>>>
>>>
>>>      Em terça-feira, 31 de outubro de 2017 13:38:05 BRST, Alexey Roytman <alexey.roytman@oracle.com>
escreveu:
>>>
>>> Sorry for the delay, Luis Fernando.
>>>
>>> Please see below, as there are a number of answers.
>>>
>>>
>>> On 10/26/2017 09:37 PM, Luis Fernando Kauer wrote:
>>>>      I'm sorry, but I have no idea what you are talking about.
>>>> Cassandra Adapter has code to translate the plan to run the query in Cassandra
Server.
>>>> If you are only interested in querying CSV files I don't see how copying
that code without understanding it will help you.
>>> [Alexey]I need neither Cassandra nor CSV adapter. Cassandra was
>>> mentioned by Julian, so I started investigated it. The CVS files were
>>> used because this way I can create a working test to share with
>>> community to ask questions.
>>>
>>>> First of all, you need to decide whether you will use ProjectableFilterableTable
or TranslatableTable.
>>> [Alexey] I would like to use ProjectableFilterableTable, but it does a
>>> poor work for me: starting from a certain level of nesting, it wants all
>>> projections. And for my task it's a too heavy query: my numeric columns
>>> are remotely calculated, with different (and unpredictable) amount of
>>> work for each column.
>>>
>>> [Alexey] So, as Julian mentioned Cassandra's interface, I started with
>>> it. There was as complication: it has both Translatable and both
>>> Queriable interfaces. For Queriable interfaces, the RexNode are
>>> translated to List<String> (all these translateBinary2() functions etc)
>>> and passed via reflection call to the query. But for me, at the query
>>> level, I need the RexNodes themselves, and I don't want to parse these
>>> List<String> back!
>>>
>>> [Alexey] So, I've started again, but with the Druid adapter, which uses
>>> only Translatable interface.
>>>
>>>> You must try to understand how the rules work and how to check which rules
are being fired and which ones are being chosen.
>>> [Alexey] I do try. And when I reach certain understanding, then
>>> obviously I won't ask such newbie questions that I ask for now :-))
>>>> Did you follow the tutorial for creating CSV Adapter? It creates a rule to
push the used projects to the table scan. That is a great start.
>>> [Alexey] I did. But even in flavor=translatable it is very simplistic,
>>> and did not do the job of having both filters and projections at the
>>> lowest level...
>>>
>>>> It's a good idea to take a look at the built in rules available in Calcite
too.
>>>> You should take a look into FilterTableScanRule and ProjectTableScanRule,
which are the rules that push the projects and filters used with ProjectableFilterableTable
into a BindableTableScan, and the other rules int Bindables.java.
>>> [Alexey] I totally agree with you. But when I look at the code there, I
>>> understand even less than now. This will improve with time, but for now
>>> this is what I have...
>>>> The rules work fine when there is no aggregate function, pushing both filter
and projects into BindableTableScan.  The problem seems to be with AggregateProjectMergeRule
which removes the Project from the plan.
>>>> If you remove the filter from your test cases you'll see that the projects
are pushed to the BindableTableScan.
>>>> I was able to simulate your problem using ScannableTableTest.testProjectableFilterable2WithProject
changing the query into "select \"k\", count(*) from (select \"k\",\"j\" from \"s\".\"beatles\"
where \"i\" = 4) x group by \"k\"".
>>>> The plan:
>>>> LogicalAggregate(group=[{0}], EXPR$1=[COUNT()])
>>>>        LogicalProject(k=[$1])
>>>>          LogicalFilter(condition=[=($0, 4)])
>>>>            LogicalProject(i=[$0], k=[$2])
>>>>              LogicalTableScan(table=[[s, beatles]])
>>>>
>>>> PhysicalPlan:
>>>> EnumerableAggregate(group=[{2}], EXPR$1=[COUNT()]): rowcount = 10.0, cumulative
cost = {61.25 rows, 50.0 cpu, 0.0 io}, id = 112
>>>>        EnumerableInterpreter: rowcount = 100.0, cumulative cost = {50.0
rows, 50.0 cpu, 0.0 io}, id = 110
>>>>          BindableTableScan(table=[[s, beatles]], filters=[[=($0, 4)]]):
rowcount = 100.0, cumulative cost = {1.0 rows, 1.01 cpu, 0.0 io}, id = 62
>>>>
>>>>
>>>> If I disable AggregateProjectMergeRule, the physical plan is:
>>>> EnumerableAggregate(group=[{0}], EXPR$1=[COUNT()]): rowcount = 10.0, cumulative
cost = {61.25 rows, 50.0 cpu, 0.0 io}, id = 102
>>>>        EnumerableInterpreter: rowcount = 100.0, cumulative cost = {50.0
rows, 50.0 cpu, 0.0 io}, id = 100
>>>>          BindableTableScan(table=[[s, beatles]], filters=[[=($0, 4)]],
projects=[[2]]): rowcount = 100.0, cumulative cost = {1.0 rows, 1.01 cpu, 0.0 io}, id = 78
>>> [Alexey] Well, but this does not advances me in the direction of fixing
>>> ProjectableFilterable loosing projections...
>>>
>>> Thank you very much for your hints and patience!
>>>
>>> - Alexey.
>>>
>>>> Regards,
>>>>
>>>> Luis Fernando
>>>>
>>>>
>>>>
>>>>          Em quinta-feira, 26 de outubro de 2017 13:19:46 BRST, Alexey
Roytman <alexey.roytman@oracle.com> escreveu:
>>>>      
>>>>      Thanks for the hints.
>>>>
>>>> I've tried to use [i.e. copy-pasted a lot of] Cassandra*.java for my
>>>> CSV-files example. It's really too wordy! So lot of code I need to
>>>> understand later!..
>>>>
>>>> But what bothers me most for now is the fact that I just cannot pass
>>>> List<RexNode> to [my modification of] CassandraTable.query(); I need
to
>>>> convert it to some string form within List<String> using
>>>> CassandraFilter.Translator, and then, when passed to [my modification
>>>> of] CassandraTable.query(), I need to parse these List<String> back...
>>>> Is there way to eliminate this back-and-forth serialization-deserialization?
>>>>
>>>> - Alexey.
>>>>
>>>> (P.S. Sorry for not keeping the email thread for now...)
>>>>
>>>> Julian Hyde wrote wrote:
>>>>> By "write a rule" I mean write a class that extends RelOptRule. An
>>>>> example is CassandraRules.CassandraFilterRule.
>>>>> ProjectableFilterableTable was "only" designed for the case that
>>>>> occurs 80% of the time but requires 20% of the functionality. Rules
>>>>> run in a richer environment so have more power and flexibility.
>    


Mime
View raw message