calcite-dev mailing list archives

From Christian Beikov
Subject Re: Materialization performance
Date Tue, 29 Aug 2017 10:11:19 GMT
For now I'd like to stay focused on figuring out how to support outer
joins. Once I have an implementation for that, I'll look into the filter
tree index if you haven't done it by then.

Kind regards,
*Christian Beikov*
On 28.08.2017 at 20:01, Jesus Camacho Rodriguez wrote:
> Christian,
> The implementation of the filter tree index is what I was referring to
> indeed. In the initial implementation I focused on the rewriting coverage,
> but now that the first part is finished, it is at the top of my list as
> I think it is critical to make the whole query rewriting algorithm work
> at scale. However, I have not started yet.
> The filter tree index will help to discard views not only based on the
> tables used by a given query, but also when the equivalence class
> conditions, filter conditions, etc. are not met. We could implement all the
> preconditions mentioned in the paper, and we could add our own additional
> ones. I also think that in a second version, we might need to add
> some kind of ranking/limit, as many views might meet the preconditions for
> a given query.
> It seems you understood how it should work, so if you could help to
> kick-start that work by implementing a first version of the filter
> tree index with a couple of basic conditions (table matching and EC matching?),
> that would be great. I could review any of the contributions you make.
> -Jesús
> On 8/28/17, 3:22 AM, "Christian Beikov" wrote:
>> If the metadata were cached, that would be awesome, especially because
>> that would also improve the performance of the metadata retrieval
>> for the query currently being planned, although I am not sure how the
>> caching would work since the RelNodes are mutable.
>> Have you considered implementing the filter tree index explained in the
>> paper? As far as I understood, the whole thing only works when
>> redundant table elimination is implemented. Is that the case? If so, or
>> if it can be done easily, I'd propose we initialize all the lookup
>> structures during registration and use them during planning. This will
>> improve planning time drastically and essentially handle the scalability
>> problem you mention.
>> What other MV-related issues are on your personal to-do list, Jesus? I
>> read the paper now and think I can help you in one place or another if
>> you want.
>> Kind regards,
>> ------------------------------------------------------------------------
>> *Christian Beikov*
>> On 28.08.2017 at 08:13, Jesus Camacho Rodriguez wrote:
>>> Hive does not use the Calcite SQL parser, thus we follow a different path
>>> and did not experience the problem on the Calcite end. However, FWIW we
>>> avoided reparsing the SQL every time a query was being planned by
>>> creating/managing our own cache too.
>>> The metadata providers implement some caching, thus I would expect that once
>>> you avoid reparsing every MV, the retrieval time of predicates, lineage, etc.
>>> would improve (at least after using the MV for the first time). However,
>>> I agree that the information should be inferred when the MV is loaded.
>>> In fact, maybe just making some calls to the metadata providers while the MVs
>>> are being loaded would do the trick (Julian should confirm this).
>>> Btw, probably you will find another scalability issue as the number of MVs
>>> grows large with the current implementation of the rewriting, since the
>>> pre-filtering implementation in place does not discard many of the views that
>>> are not valid to rewrite a given query, and rewriting is attempted with all
>>> of them.
>>> This last bit is work that I would like to tackle shortly, but I have not
>>> created the corresponding JIRA yet.
>>> -Jesús
>>> On 8/27/17, 10:43 PM, "Rajat Venkatesh" wrote:
>>>> Thread safety and repeated parsing are problems. We have experience with
>>>> managing tens of materialized views. Repeated parsing takes more time than
>>>> execution of the query itself. We also have a similar problem where
>>>> concurrent queries (potentially with a different set of materialized views)
>>>> may be planned at the same time. We solved it by maintaining a cache
>>>> and carefully setting the cache in a thread local.
>>>> Relevant code for inspiration:
>>>> On Sun, Aug 27, 2017 at 6:50 PM Christian Beikov wrote:
>>>>> Hey, I have been looking a bit into how materialized views perform
>>>>> during planning because of a very long test run
>>>>> (MaterializationTest#testJoinMaterializationUKFK6), and the current
>>>>> state is problematic.
>>>>> CalcitePrepareImpl#getMaterializations always reparses the SQL, and down
>>>>> the line there is a lot of expensive work (e.g. predicate and lineage
>>>>> determination) done during planning that could easily be pre-calculated
>>>>> and cached during materialization creation.
>>>>> There is also a bit of a thread safety problem with the current
>>>>> implementation. Unless there is a different safety mechanism that I
>>>>> don't see, the sharing of the MaterializationService and thus also the
>>>>> maps in MaterializationActor via a static instance between multiple
>>>>> threads is problematic.
>>>>> Since I mentioned thread safety, how is Calcite supposed to be used in
>>>>> a multi-threaded environment? Currently I use a connection pool that
>>>>> initializes the schema on new connections, but that is not really nice.
>>>>> I suppose caches are also bound to the connection? A thread-safe context
>>>>> that can be shared between connections would be nice to avoid all that
>>>>> repetitive work.
>>>>> Are these known issues that you have already thought about fixing, or
>>>>> should I log JIRAs for these and fix them to the best of my knowledge?
>>>>> I'd more or less keep the service shared but would implement it using a
>>>>> copy-on-write strategy, since I'd expect schema changes to be rare after
>>>>> startup. Regarding the repetitive work that partly happens during
>>>>> planning, I'd suggest doing that during materialization registration
>>>>> instead, as is already mentioned in
>>>>> CalcitePrepareImpl#populateMaterializations. Would that be ok?
>>>>> --
>>>>> Kind regards,
>>>>> ------------------------------------------------------------------------
>>>>> *Christian Beikov*
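
The copy-on-write strategy proposed in the original message for the shared
materialization service could look roughly like the following. This is a
minimal sketch under assumptions, not Calcite code: CopyOnWriteRegistry and
its methods are hypothetical names, and the generic keys and values stand in
for whatever the maps in MaterializationActor actually hold.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical registry illustrating copy-on-write; not part of Calcite. */
final class CopyOnWriteRegistry<K, V> {
    // The current immutable snapshot; replaced as a whole on every write.
    private final AtomicReference<Map<K, V>> current =
        new AtomicReference<>(Collections.<K, V>emptyMap());

    /** Readers get the current snapshot without taking any lock. */
    Map<K, V> snapshot() {
        return current.get();
    }

    /** Writers copy the map, add the entry, and publish the copy atomically. */
    void register(K key, V value) {
        current.updateAndGet(old -> {
            Map<K, V> copy = new HashMap<>(old);
            copy.put(key, value);
            return Collections.unmodifiableMap(copy);
        });
    }
}
```

A planner thread that calls snapshot() keeps a consistent view for the whole
planning run even if another thread registers a materialization meanwhile.
Each registration copies the whole map, so this only pays off if schema
changes after startup are rare, as the message assumes.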

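The idea discussed in the thread of building lookup structures when a view is
registered and using them to pre-filter views during planning can be
illustrated with a simple table-based index. This is a simplified sketch, not
the filter tree from the paper and not Calcite's implementation: TableIndex,
register and candidates are hypothetical names, and the matching rule used
here (a view is kept only if the query references every table the view reads)
is an assumption chosen purely for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical pre-filter index keyed by table name; not part of Calcite. */
final class TableIndex {
    // Inverted index: table name -> names of views that read that table.
    private final Map<String, Set<String>> viewsByTable = new HashMap<>();
    // Forward map: view name -> tables the view reads.
    private final Map<String, Set<String>> tablesByView = new HashMap<>();

    /** Called once when a view is registered, not during planning. */
    void register(String view, Set<String> tables) {
        tablesByView.put(view, new HashSet<>(tables));
        for (String table : tables) {
            viewsByTable.computeIfAbsent(table, k -> new TreeSet<>()).add(view);
        }
    }

    /** Returns views whose tables are all referenced by the query (assumed rule). */
    Set<String> candidates(Set<String> queryTables) {
        Set<String> result = new TreeSet<>();
        for (String table : queryTables) {
            Set<String> views =
                viewsByTable.getOrDefault(table, Collections.<String>emptySet());
            for (String view : views) {
                if (queryTables.containsAll(tablesByView.get(view))) {
                    result.add(view);
                }
            }
        }
        return result;
    }
}
```

Only views that share at least one table with the query are ever inspected,
so the lookup cost depends on the matching views rather than on the total
number of registered views. Further conditions, such as the equivalence class
matching mentioned in the thread, would be additional levels on top of this.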