calcite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Christian Beikov <christian.bei...@gmail.com>
Subject Re: Materialization performance
Date Tue, 29 Aug 2017 10:11:19 GMT
I'd like to stick to trying to figure out how to support outer joins for 
now and when I have an implementation for that, I'd look into the filter 
tree index if you haven't done it by then.


Mit freundlichen Grüßen,
------------------------------------------------------------------------
*Christian Beikov*
Am 28.08.2017 um 20:01 schrieb Jesus Camacho Rodriguez:
> Christian,
>
> The implementation of the filter tree index is what I was referring to
> indeed. In the initial implementation I focused on the rewriting coverage,
> but now that the first part is finished, it is at the top of my list as
> I think it is critical to make the whole query rewriting algorithm work
> at scale. However, I have not started yet.
>
> The filter tree index will help to filter not only based on the tables used
> by a given query, but also for queries that do not meet the equivalence
> classes conditions, filter conditions, etc. We could implement all the
> preconditions mentioned in the paper, and we could add our own additional
> ones. I also think that in a second version, we might need to maybe add
> some kind of ranking/limit as many views might meet the preconditions for
> a given query.
>
> It seems you understood how it should work, so if you could help to
> quickstart that work by maybe implementing a first version of the filter
> tree index with a couple of basic conditions (table matching and EC matching?),
> that would be great. I could review any of the contributions you make.
>
> -Jesús
>
>
>
>
>
> On 8/28/17, 3:22 AM, "Christian Beikov" <christian.beikov@gmail.com> wrote:
>
>> If the metadata was cached, that would be awesome, especially because
>> that would also improve the prformance regarding the metadata retrival
>> for the query currently being planned, although I am not sure how the
>> caching would work since the RelNodes are mutable.
>>
>> Have you considered implementing the filter tree index explained in the
>> paper? As far as I understood, the whole thing only works when a
>> redundant table elimination is implemented. Is that the case? If so, or
>> if it can be done easily, I'd propose we initialize all the lookup
>> structures during registration and use them during planning. This will
>> improve planning time drastically and essentially handle the scalability
>> problem you mention.
>>
>> What other MV-related issues are on your personal todo list Jesus? I
>> read the paper now and think I can help you in one place or another if
>> you want.
>>
>>
>> Mit freundlichen Grüßen,
>> ------------------------------------------------------------------------
>> *Christian Beikov*
>> Am 28.08.2017 um 08:13 schrieb Jesus Camacho Rodriguez:
>>> Hive does not use the Calcite SQL parser, thus we follow a different path
>>> and did not experience the problem on the Calcite end. However, FWIW we
>>> avoided reparsing the SQL every time a query was being planned by
>>> creating/managing our own cache too.
>>>
>>> The metadata providers implement some caching, thus I would expect that once
>>> you avoid reparsing every MV, the retrieval time of predicates, lineage, etc.
>>> would improve (at least after using the MV for the first time). However,
>>> I agree that the information should be inferred when the MV is loaded.
>>> In fact, maybe just making some calls to the metadata providers while the MVs
>>> are being loaded would do the trick (Julian should confirm this).
>>>
>>> Btw, probably you will find another scalability issue as the number of MVs
>>> grows large with the current implementation of the rewriting, since the´
>>> pre-filtering implementation in place does not discard many of the views that
>>> are not valid to rewrite a given query, and rewriting is attempted with all
>>> of them.
>>> This last bit is work that I would like to tackle shortly, but I have not
>>> created the corresponding JIRA yet.
>>>
>>> -Jesús
>>>    
>>>
>>>
>>>
>>> On 8/27/17, 10:43 PM, "Rajat Venkatesh" <rvenkatesh@qubole.com> wrote:
>>>
>>>> Thread Safety and repeated parsing is a problem. We have experience with
>>>> managing 10s of materialized views. Repeated parsing takes more time than
>>>> execution of the query itself. We also have a similar problem where
>>>> concurrent queries (with a different set of materialized views potentailly)
>>>> maybe planned at the same time. We solved it through maintaining a cache
>>>> and carefully setting the cache in a thread local.
>>>> Relevant code for inspiration:
>>>> https://github.com/qubole/quark/blob/master/optimizer/src/main/java/org/apache/calcite/prepare/Materializer.java
>>>> https://github.com/qubole/quark/blob/master/optimizer/src/main/java/org/apache/calcite/plan/QuarkMaterializeCluster.java
>>>>
>>>>
>>>>
>>>> On Sun, Aug 27, 2017 at 6:50 PM Christian Beikov <christian.beikov@gmail.com>
>>>> wrote:
>>>>
>>>>> Hey, I have been looking a bit into how materialized views perform
>>>>> during the planning because of a very long test
>>>>> run(MaterializationTest#testJoinMaterializationUKFK6) and the current
>>>>> state is problematic.
>>>>>
>>>>> CalcitePrepareImpl#getMaterializations always reparses the SQL and down
>>>>> the line, there is a lot of expensive work(e.g. predicate and lineage
>>>>> determination) done during planning that could easily be pre-calculated
>>>>> and cached during materialization creation.
>>>>>
>>>>> There is also a bit of a thread safety problem with the current
>>>>> implementation. Unless there is a different safety mechanism that I
>>>>> don't see, the sharing of the MaterializationService and thus also the
>>>>> maps in MaterializationActor via a static instance between multiple
>>>>> threads is problematic.
>>>>>
>>>>> Since I mentioned thread safety, how is Calcite supposed to be used in
a
>>>>> multi-threaded environment? Currently I use a connection pool that
>>>>> initializes the schema on new connections, but that is not really nice.
>>>>> I suppose caches are also bound to the connection? A thread safe context
>>>>> that can be shared between connections would be nice to avoid all that
>>>>> repetitive work.
>>>>>
>>>>> Are these known issues which you have thought about how to fix or should
>>>>> I log JIRAs for these and fix them to the best of my knowledge? I'd more
>>>>> or less keep the service shared but would implement it using a copy on
>>>>> write strategy since I'd expect seldom schema changes after startup.
>>>>>
>>>>> Regarding the repetitive work that partly happens during planning, I'd
>>>>> suggest doing that during materialization registration instead like it
>>>>> is already mentioned CalcitePrepareImpl#populateMaterializations. Would
>>>>> that be ok?
>>>>>
>>>>> --
>>>>>
>>>>> Mit freundlichen Grüßen,
>>>>> ------------------------------------------------------------------------
>>>>> *Christian Beikov*
>>>>>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message