calcite-dev mailing list archives

From Christian Beikov <christian.bei...@gmail.com>
Subject Re: Materialization performance
Date Tue, 29 Aug 2017 10:28:52 GMT
If it were a proper actor like you described, concurrency wouldn't be 
a problem, but right now it is just a global holder for non-concurrent 
hash maps, which is the problem. Currently, it's simply not thread safe.

I don't see a benefit in having request and response queues; I'd rather 
make registration and retrieval synchronous. Maybe you could explain to 
me why you were favoring that model? Having to go through concurrent 
queues for every planner invocation seems like overkill to me. I'd 
rather have immutable state that is CASed (compare-and-swapped) to make 
querying cheap and do updates in an optimistic concurrency control manner.
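
Roughly what I have in mind, as a minimal sketch (the class and member 
names are made up, not the existing MaterializationActor fields): planner 
threads read an immutable snapshot without locking, and registrations 
copy, modify and publish a new snapshot via compare-and-swap.

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.atomic.AtomicReference;

    /** Sketch only: an immutable snapshot of registered materializations,
     * replaced atomically on updates so that readers never need a lock. */
    final class MaterializationSnapshotHolder {
      private final AtomicReference<Map<String, Object>> snapshot =
          new AtomicReference<Map<String, Object>>(Collections.<String, Object>emptyMap());

      /** Cheap, lock-free read path for the planner. */
      Map<String, Object> current() {
        return snapshot.get();
      }

      /** Optimistic update: copy, modify, CAS; retry if another writer won the race. */
      void register(String key, Object materialization) {
        for (;;) {
          Map<String, Object> old = snapshot.get();
          Map<String, Object> next = new HashMap<String, Object>(old);
          next.put(key, materialization);
          if (snapshot.compareAndSet(old, Collections.unmodifiableMap(next))) {
            return;
          }
        }
      }
    }

Since schema changes should be rare after startup, the copy on the write 
path would hardly matter.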

What do you say? Would that be a possibility?


Kind regards,
------------------------------------------------------------------------
*Christian Beikov*
On 28.08.2017 at 21:31, Julian Hyde wrote:
> I gave some thought to performance and thread safety when I added materialized view support.
> I didn’t follow through and test at high load and parallelism because at that point, functionality
> was more important. I’m glad we’re having the discussion now.
>
> The solution I settled on is the actor model[1]. That means that one thread is responsible
> for accessing a critical data structure (in this case, the set of valid materialized views).
> Other threads are not allowed to modify, or even see, mutable state. They can send immutable
> requests, and get immutable objects back in response.
>
> This is manifested as the MaterializationService and MaterializationActor; see the comment
> in the latter:
>
>    // Not an actor yet -- TODO make members private and add request/response
>    // queues
>   
> If we did that, I think we would be well on the way to a thread-safe architecture. We
> can improve performance further, if necessary, by reducing the work that has to be done by
> the actor, as long as it alone is responsible for the mutable state.
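>
> For illustration only, the rough shape I mean (made-up names, not the actual
> MaterializationService API): a single actor thread owns the mutable map, drains a queue of
> immutable requests, and completes each one with an immutable response.
>
>     import java.util.ArrayList;
>     import java.util.Collections;
>     import java.util.HashMap;
>     import java.util.List;
>     import java.util.Map;
>     import java.util.concurrent.BlockingQueue;
>     import java.util.concurrent.CompletableFuture;
>     import java.util.concurrent.LinkedBlockingQueue;
>
>     /** Sketch only: the actor thread is the only code that touches the mutable state. */
>     final class ActorSketch {
>       /** Immutable request; the actor completes the future with an immutable answer. */
>       static final class Lookup {
>         final String table;
>         final CompletableFuture<List<String>> response = new CompletableFuture<>();
>         Lookup(String table) { this.table = table; }
>       }
>
>       private final BlockingQueue<Lookup> requests = new LinkedBlockingQueue<>();
>       private final Map<String, List<String>> materializationsByTable = new HashMap<>(); // actor thread only
>
>       /** Run this on the single actor thread. */
>       void run() throws InterruptedException {
>         for (;;) {
>           Lookup r = requests.take();
>           List<String> views = materializationsByTable.get(r.table);
>           r.response.complete(views == null
>               ? Collections.<String>emptyList()
>               : Collections.unmodifiableList(new ArrayList<>(views)));
>         }
>       }
>
>       /** Called from planner threads; blocks until the actor answers. */
>       List<String> lookup(String table) throws Exception {
>         Lookup r = new Lookup(table);
>         requests.put(r);
>         return r.response.get();
>       }
>     }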
>
> Julian
>
> [1] https://en.wikipedia.org/wiki/Actor_model
>
>> On Aug 28, 2017, at 11:01 AM, Jesus Camacho Rodriguez <jcamachorodriguez@hortonworks.com> wrote:
>>
>> Christian,
>>
>> The implementation of the filter tree index is indeed what I was referring to.
>> In the initial implementation I focused on the rewriting coverage,
>> but now that the first part is finished, it is at the top of my list, as
>> I think it is critical to make the whole query rewriting algorithm work
>> at scale. However, I have not started yet.
>>
>> The filter tree index will help us filter views not only based on the tables used
>> by a given query, but also for queries that do not meet the equivalence-class
>> conditions, filter conditions, etc. We could implement all the
>> preconditions mentioned in the paper, and we could add our own additional
>> ones. I also think that in a second version we might need to add
>> some kind of ranking/limit, as many views might meet the preconditions for
>> a given query.
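>>
>> Just to illustrate the simplest of those preconditions (made-up names, not the
>> eventual design): index each view by the tables it references and keep it as a
>> candidate only if all of its tables appear in the query.
>>
>>     import java.util.HashMap;
>>     import java.util.HashSet;
>>     import java.util.Map;
>>     import java.util.Set;
>>
>>     /** Sketch of the table-matching precondition only. */
>>     final class TableFilterIndex {
>>       private final Map<String, Set<String>> tablesByView = new HashMap<>();
>>
>>       void addView(String viewName, Set<String> referencedTables) {
>>         tablesByView.put(viewName, new HashSet<>(referencedTables));
>>       }
>>
>>       /** Views whose referenced tables are all present in the query. */
>>       Set<String> candidates(Set<String> queryTables) {
>>         Set<String> result = new HashSet<>();
>>         for (Map.Entry<String, Set<String>> e : tablesByView.entrySet()) {
>>           if (queryTables.containsAll(e.getValue())) {
>>             result.add(e.getKey());
>>           }
>>         }
>>         return result;
>>       }
>>     }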
>>
>> It seems you understood how it should work, so if you could help kick-start
>> that work by implementing a first version of the filter
>> tree index with a couple of basic conditions (table matching and EC matching?),
>> that would be great. I could review any contributions you make.
>>
>> -Jesús
>>
>>
>>
>>
>>
>> On 8/28/17, 3:22 AM, "Christian Beikov" <christian.beikov@gmail.com> wrote:
>>
>>> If the metadata were cached, that would be awesome, especially because
>>> that would also improve the performance of metadata retrieval
>>> for the query currently being planned, although I am not sure how the
>>> caching would work since the RelNodes are mutable.
>>>
>>> Have you considered implementing the filter tree index explained in the
>>> paper? As far as I understood, the whole thing only works when redundant
>>> table elimination is implemented. Is that the case? If so, or
>>> if it can be done easily, I'd propose we initialize all the lookup
>>> structures during registration and use them during planning. This would
>>> improve planning time drastically and essentially handle the scalability
>>> problem you mention.
>>>
>>> What other MV-related issues are on your personal to-do list, Jesus? I
>>> have read the paper now and think I can help you in one place or another if
>>> you want.
>>>
>>>
>>> Kind regards,
>>> ------------------------------------------------------------------------
>>> *Christian Beikov*
>>> On 28.08.2017 at 08:13, Jesus Camacho Rodriguez wrote:
>>>> Hive does not use the Calcite SQL parser, thus we follow a different path
>>>> and did not experience the problem on the Calcite end. However, FWIW we
>>>> avoided reparsing the SQL every time a query was being planned by
>>>> creating/managing our own cache too.
>>>>
>>>> The metadata providers implement some caching, thus I would expect that once
>>>> you avoid reparsing every MV, the retrieval time of predicates, lineage, etc.
>>>> would improve (at least after using the MV for the first time). However,
>>>> I agree that the information should be inferred when the MV is loaded.
>>>> In fact, maybe just making some calls to the metadata providers while the MVs
>>>> are being loaded would do the trick (Julian should confirm this).
>>>>
>>>> Btw, you will probably find another scalability issue as the number of MVs
>>>> grows large with the current implementation of the rewriting, since the
>>>> pre-filtering implementation in place does not discard many of the views that
>>>> are not valid to rewrite a given query, and rewriting is attempted with all
>>>> of them.
>>>> This last bit is work that I would like to tackle shortly, but I have not
>>>> created the corresponding JIRA yet.
>>>>
>>>> -Jesús
>>>>
>>>>
>>>>
>>>>
>>>> On 8/27/17, 10:43 PM, "Rajat Venkatesh" <rvenkatesh@qubole.com> wrote:
>>>>
>>>>> Thread safety and repeated parsing are problems. We have experience with
>>>>> managing tens of materialized views. Repeated parsing takes more time than
>>>>> execution of the query itself. We also have a similar problem where
>>>>> concurrent queries (with a different set of materialized views, potentially)
>>>>> may be planned at the same time. We solved it by maintaining a cache
>>>>> and carefully setting the cache in a thread local.
>>>>> Relevant code for inspiration:
>>>>> https://github.com/qubole/quark/blob/master/optimizer/src/main/java/org/apache/calcite/prepare/Materializer.java
>>>>> https://github.com/qubole/quark/blob/master/optimizer/src/main/java/org/apache/calcite/plan/QuarkMaterializeCluster.java
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Aug 27, 2017 at 6:50 PM Christian Beikov <christian.beikov@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hey, I have been looking a bit into how materialized views perform
>>>>>> during planning because of a very long test
>>>>>> run (MaterializationTest#testJoinMaterializationUKFK6), and the current
>>>>>> state is problematic.
>>>>>>
>>>>>> CalcitePrepareImpl#getMaterializations always reparses the SQL, and down
>>>>>> the line there is a lot of expensive work (e.g. predicate and lineage
>>>>>> determination) done during planning that could easily be pre-calculated
>>>>>> and cached during materialization creation.
>>>>>>
>>>>>> There is also a bit of a thread safety problem with the current
>>>>>> implementation. Unless there is a different safety mechanism that I
>>>>>> don't see, the sharing of the MaterializationService and thus also the
>>>>>> maps in MaterializationActor via a static instance between multiple
>>>>>> threads is problematic.
>>>>>>
>>>>>> Since I mentioned thread safety, how is Calcite supposed to be used in a
>>>>>> multi-threaded environment? Currently I use a connection pool that
>>>>>> initializes the schema on new connections, but that is not really nice.
>>>>>> I suppose caches are also bound to the connection? A thread safe context
>>>>>> that can be shared between connections would be nice to avoid all that
>>>>>> repetitive work.
>>>>>>
>>>>>> Are these known issues that you have already thought about fixing, or should
>>>>>> I log JIRAs for them and fix them to the best of my knowledge? I'd more
>>>>>> or less keep the service shared but would implement it using a copy-on-write
>>>>>> strategy, since I'd expect schema changes to be rare after startup.
>>>>>>
>>>>>> Regarding the repetitive work that partly happens during planning, I'd
>>>>>> suggest doing it during materialization registration instead, as is
>>>>>> already mentioned in CalcitePrepareImpl#populateMaterializations. Would
>>>>>> that be ok?
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Kind regards,
>>>>>> ------------------------------------------------------------------------
>>>>>> *Christian Beikov*
>>>>>>
>

