calcite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Julian Hyde <jh...@apache.org>
Subject Re: Using indexes rather than table scans with Calcite
Date Fri, 29 May 2020 17:34:27 GMT
Materialized views are not a hack, as Vladimir claims. Materialized views are a fundamental
concept in relational algebra, and they are an elegant way - in my opinion the correct way
-  to model indexes (and many other structures).

In Calcite materialized views are a feature in the planner that allows you to declare a table
as equivalent to a query. (You do not need to issue a CREATE MATERIALIZED VIEW statement.
They are internal.) Then you can use algebra to substitute one for the other, and apply cost-based
optimization to choose between plans with & without the materialized view.


Other people have used this approach in the past. Search this list and you should find discussions.
Also I have a talk with Maryann Xue about planning for indexes in Apache Phoenix[1].

Julian

[1] https://www.slideshare.net/julianhyde/costbased-query-optimization-in-apache-phoenix-using-apache-calcite
<https://www.slideshare.net/julianhyde/costbased-query-optimization-in-apache-phoenix-using-apache-calcite>

> On May 29, 2020, at 5:28 AM, Roman Kondakov <kondakov87@mail.ru.INVALID> wrote:
> 
> Hi Tim,
> 
> In Apache Ignite we've already faced this challenge. We solved it using
> materialized views and FilterableTable. Let's consider your example:
> 
>> select * from users where country='UK' and some_other_column='foo';
> 
> with a primary index and a sorted secondary index (B+Tree?) over the
> 'country' field.
> 
> As a first step Ignite registers all indexes that might be helpful for
> the query plan in VolcanoPlanner as materialized views. For now we
> register all indexes in all tables that participate in the query.
> Registering all indexes might be excessive, but maybe we will apply some
> pruning later. So it's ok for now to register all indexes.
> 
> Index materialized view is very simple: it's just a sorted table scan:
> 
> 'SELECT * from tbl ORDER BY idx_fields'
> 
> After registering indexes as materialized views, the optimizer's search
> space will look like this:
> 
> Project([*])
>  Filter[country='UK' and some_other_column='foo']
>    Set0
> 
> where Set0 consists of table and index scans:
> 
> TableScan('tbl', collation=[])
> IndexScan('PK', collation=[PK_field])
> IndexScan('country_idx', collation=[country])
> 
> At this step we have our index scans registered in the optimizer within
> the same equivalence set as a TableScan.
> 
> The next step is a pushing filters down to the scans. We do it with a
> rule which is similar to 'FilterTableScanRule'. After applying this rule
> we have a search space that is looking like this:
> 
> Project([*])
>  TableScan('tbl', collation=[], filter=[country='UK' and
> some_other_column='foo'])
>  IndexScan('PK', collation=[PK_field], filter=[country='UK' and
> some_other_column='foo'])
>  IndexScan('country_idx', collation=[country], filter=[country='UK' and
> some_other_column='foo'])
> 
> And the final step is adjusting the cost model to make it select the
> scan with the lower cost which depends on the filter conditions within
> the Scan. For example, full table scan with filters
> 
> TableScan('tbl', collation=[], filter=[country='UK' and
> some_other_column='foo'])
> 
> will cost, say, 100. Because it have to scan all rows and then filter
> out some set of them. On the other hand the index scan that can do index
> lookup instead of full scan will have a less cost. For example
> 
> IndexScan('country_idx', collation=[country], filter=[country='UK' and
> some_other_column='foo'])
> 
> will have cost about ~10, or so because it has a good index condition
> `country='UK'` which can be used for index lookup that that returns only
> 10% of rows. And therefore this IndexScan should be chosen as the best
> plan by the optimizer.
> 
> We've recently implemented this approach in Apache Ignite and it works
> well for us. You can find it in [1]. This PR has many changes that are
> unrelated to the main topic. So particularly you can look at
> `IgnitePlanner.materializations()' method which registers indexes as
> materialized views and `IgniteTableScan` which performs filter
> conditions assessment and index scan cost estimation.
> 
> 
> [1] https://github.com/apache/ignite/pull/7813
> 
> -- 
> Kind Regards
> Roman Kondakov
> 
> 
> On 29.05.2020 11:44, Tim Fox wrote:
>> Hi,
>> 
>> I'm building a query engine with Calcite - really enjoying working with
>> Calcite so far!
>> 
>> When creating a plan, it seems Calcite always creates a plan where the
>> sources are table scans, however in my implementation the tables can have
>> indexes on them so a table scan is not always the right choice.
>> 
>> I was wondering if there was any way of making Calcite "index aware" - e.g.
>> perhaps providing hints to the table scan instance that, actually, an index
>> scan or a primary key lookup should be used instead of actually scanning
>> the table. E.g. On the table meta-data if we provided information about any
>> indexes on the table, then Calcite could figure out what parts of the query
>> to push to the table scan and which to keep in the rest of the plan.
>> 
>> There are two specific cases I really care about:
>> 
>> 1. Queries that contain a primary key lookup:
>> 
>> select * from some_table where key_column=23 AND some_other_column='foo';
>> 
>> In the above case the 'select * from some_table where key_column=23' can be
>> implemented as a simple PK lookup in the source table, not requiring a
>> scan, thus leaving just the filter corresponding to
>> 'some_other_column='foo'' in the rest of the plan
>> 
>> 2. Queries with expressions on a column which has a secondary index
>> 
>> select * from users where country='UK' and some_other_column='foo';
>> 
>> We have many users, and let's say 10% of them are from UK (still a lot). We
>> have a secondary index in the country column in the source table so we can
>> do an efficient index scan to retrieve the matching records.
>> 
>> I found this document
>> https://calcite.apache.org/docs/materialized_views.html which seems like it
>> might help me in some way.
>> 
>> The idea being if I can think of my indexes as materialized views then the
>> query can be written against those materialized views as sources instead of
>> the original table sources. There appears to be a rule
>> 'MaterializedViewRule' that does this already (?).
>> 
>> This seems to get me a bit further, however, for this approach to work, it
>> seems I would have to create materialized views _dynamically_ during
>> evaluation of the query, register them, rewrite the query, execute it, then
>> deregister the materialized view.
>> 
>> E.g. for the primary key lookup example above, for the following query:
>> 
>> select * from some_table where key_column=23 AND some_other_column='foo';
>> 
>> I would need to dynamically create a materialized view corresponding to:
>> 
>> select * from some_table where key_column=23
>> 
>> Then rewrite the query using MaterializedViewRule.
>> 
>> In the general case, in order to figure out what materialized views I need
>> to dynamically create I would need to examine the query, figure out which
>> columns in expressions have indexes on them and from them work out the best
>> materialized view to create based on that information. This seems non
>> trivial.
>> 
>> Does anyone have any suggestions or pointers for how to implement this kind
>> of thing? I suspect I'm not the first person to have tried to do this, as
>> using indexes on tables seems a pretty common thing in many systems (?)
>> 


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message