flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-11050) When IntervalJoin, get left or right buffer's entries more quickly by assigning lowerBound
Date Tue, 04 Dec 2018 09:27:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-11050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16708457#comment-16708457
] 

ASF GitHub Bot commented on FLINK-11050:
----------------------------------------

fhueske commented on issue #7226: FLINK-11050 add lowerBound and upperBound for optimizing
RocksDBMapState's entries
URL: https://github.com/apache/flink/pull/7226#issuecomment-444030473
 
 
   Hi @Myracle, thanks for the PR. I think we should either support the new API for all `MapState`
implementations or declare the method as a best-effort filter (which means we have to manually
filter the returned entries).
   
   What do you think about this @StefanRRichter?
   
   Best, Fabian
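
To illustrate the "best-effort filter" option mentioned above, here is a minimal sketch. All names in it (`BestEffortFilterSketch`, `bestEffortEntries`, `valuesInRange`) are hypothetical and are not part of Flink's `MapState` API; a `TreeMap` stands in for the buffer state. It assumes a backend that cannot seek and therefore ignores the bounds, so the caller still has to filter the returned entries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: none of these names are Flink API.
class BestEffortFilterSketch {

    // Stand-in for a backend that cannot seek: it ignores the bounds
    // and returns every entry, which a best-effort contract would allow.
    static Iterable<Map.Entry<Long, String>> bestEffortEntries(
            TreeMap<Long, String> buffer, long lowerBound, long upperBound) {
        return buffer.entrySet();
    }

    // Caller side: because the bounds are only a hint, re-check every
    // returned entry against [lowerBound, upperBound] before using it.
    static List<String> valuesInRange(
            TreeMap<Long, String> buffer, long lowerBound, long upperBound) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<Long, String> e : bestEffortEntries(buffer, lowerBound, upperBound)) {
            long ts = e.getKey();
            if (ts >= lowerBound && ts <= upperBound) {
                result.add(e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> buffer = new TreeMap<>();
        buffer.put(100L, "a");
        buffer.put(200L, "b");
        buffer.put(300L, "c");
        System.out.println(valuesInRange(buffer, 150L, 300L)); // prints [b, c]
    }
}
```

With this contract, a backend that can seek may return fewer entries, and the caller's result stays the same either way.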

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> When IntervalJoin, get left or right buffer's entries more quickly by assigning lowerBound
> ------------------------------------------------------------------------------------------
>
>                 Key: FLINK-11050
>                 URL: https://issues.apache.org/jira/browse/FLINK-11050
>             Project: Flink
>          Issue Type: Improvement
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.6.2, 1.7.0
>            Reporter: Liu
>            Priority: Major
>              Labels: performance, pull-request-available
>
>     With IntervalJoin, fetching the left or right buffer's entries is very slow because
we have to scan all of the buffer's values, including deleted values that are outside the time range.
Processing these deleted values consumes too much time in RocksDB's level 0. Since the lowerBound
is known, the lookup can be optimized by seeking from the timestamp of the lowerBound.
>     Our usage looks like this:
> {code:java}
> labelStream.keyBy(uuid).intervalJoin(adLogStream.keyBy(uuid))
>            .between(Time.milliseconds(0), Time.milliseconds(600000))
>            .process(new processFunction())
>            .addSink(kafkaProducer)
> {code}
>     Our data volume is huge. The job always runs for about an hour and then gets stuck in RocksDB's seek
when getting the buffer's entries. We used RocksDB's data to reproduce the problem and found
that most of the time is spent on deleted values. So we decided to optimize it by seeking from the
lowerBound instead of scanning the whole buffer.
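
The idea of seeking from the lowerBound instead of scanning the whole buffer can be sketched as follows. This is only an illustration under stated assumptions: a `TreeMap` keyed by timestamp stands in for RocksDB's sorted keys, the class and method names are hypothetical, and the sketch does not model RocksDB's deleted values (tombstones), which are the actual cost described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch: a sorted in-memory map stands in for the buffer state.
class LowerBoundSeekSketch {

    // Global scan: visits every entry, including those far below lowerBound.
    static List<String> fullScan(NavigableMap<Long, String> buffer, long lowerBound, long upperBound) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<Long, String> e : buffer.entrySet()) {
            if (e.getKey() >= lowerBound && e.getKey() <= upperBound) {
                result.add(e.getValue());
            }
        }
        return result;
    }

    // Bounded scan: positions at lowerBound first and stops at upperBound,
    // so entries outside the range are never visited.
    // Requires lowerBound <= upperBound (subMap throws otherwise).
    static List<String> boundedScan(NavigableMap<Long, String> buffer, long lowerBound, long upperBound) {
        return new ArrayList<>(buffer.subMap(lowerBound, true, upperBound, true).values());
    }

    public static void main(String[] args) {
        NavigableMap<Long, String> buffer = new TreeMap<>();
        buffer.put(1L, "expired");
        buffer.put(5L, "y");
        buffer.put(9L, "z");
        System.out.println(fullScan(buffer, 5L, 9L));    // prints [y, z]
        System.out.println(boundedScan(buffer, 5L, 9L)); // prints [y, z]
    }
}
```

Both methods return the same entries; the difference is how many keys have to be visited to produce them.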



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
