flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-2240) Use BloomFilter to minimize probe side records which are spilled to disk in Hybrid-Hash-Join
Date Mon, 06 Jul 2015 08:52:04 GMT

    [ https://issues.apache.org/jira/browse/FLINK-2240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614711#comment-14614711 ]

ASF GitHub Bot commented on FLINK-2240:
---------------------------------------

GitHub user ChengXiangLi opened a pull request:

    https://github.com/apache/flink/pull/888

    [FLINK-2240] Use BloomFilter to filter probe records in Hybrid-Hash-Join

    In Hybrid-Hash-Join, when the small table does not fit into memory, part of the
    small table data is spilled to disk, and the corresponding partitions of the big
    table are spilled to disk during the probe phase as well. If we build a BloomFilter
    while spilling the small table to disk during the build phase, and use it to filter
    the big table records that would otherwise be spilled to disk, this may greatly
    reduce the size of the spilled big table files and save the disk IO cost of writing
    and later re-reading them.
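
To illustrate the idea described above, here is a minimal sketch of a bloom filter keyed on
record hash codes. It is not the code from this pull request: the class name
SpillBloomFilterSketch, its methods, and the double-hashing scheme are all assumptions made
for illustration only.

```java
import java.util.BitSet;

/**
 * Hypothetical sketch, not the pull request's implementation: a small bloom
 * filter over the key hashes of build-side records that were spilled to disk.
 */
final class SpillBloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SpillBloomFilterSketch(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("numBits and numHashes must be positive");
        }
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Derives the i-th bit position from a single key hash (double hashing). */
    private int position(int keyHash, int i) {
        // Second hash is a mixed copy of the first, forced odd so it is never zero.
        int h2 = Integer.rotateLeft(keyHash * 0x9E3779B9, 15) | 1;
        return Math.floorMod(keyHash + i * h2, numBits);
    }

    /** Build phase: call for every small-table record written to a spilled partition. */
    void add(int keyHash) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(keyHash, i));
        }
    }

    /**
     * Probe phase: false means no spilled build record has this key hash, so the
     * big-table record cannot match and need not be written to disk. True means
     * it may match (false positives are possible), so it is spilled as before.
     */
    boolean mightContain(int keyHash) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(keyHash, i))) {
                return false;
            }
        }
        return true;
    }
}
```

In this sketch, a probe-side record whose partition was spilled would only be written to
disk when mightContain returns true. Because a bloom filter has no false negatives, dropping
the other records does not change the result of an inner join.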

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ChengXiangLi/flink hj-bloomfilter

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/888.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #888
    
----
commit 78c59d6ee52a00fd4964001cbce81437c38d86cb
Author: chengxiang li <chengxiang.li@intel.com>
Date:   2015-07-03T15:53:47Z

    add bloom filter for spilled partitions in hashtable.

commit cacaa9a15a5330c6130306841ef73958490cf69d
Author: chengxiang li <chengxiang.li@intel.com>
Date:   2015-07-06T07:15:39Z

    fix previous get buckets method

commit 6bbbb27d4935da72ae44ec404f884a74de7bbc4c
Author: chengxiang li <chengxiang.li@intel.com>
Date:   2015-07-06T08:07:30Z

    fix  some format issues.

commit b7fee8d26445db4bba7928bfff8a9dd5ada8cd03
Author: chengxiang li <chengxiang.li@intel.com>
Date:   2015-07-06T08:08:52Z

    Merge remote-tracking branch 'upstream/master' into hj-bloomfilter

commit d352c090b9c06baf701235809f7dfd0b4e9b87af
Author: Li <chengxiang.li@intel.com>
Date:   2015-07-06T08:44:13Z

    add tab as indent of blank line.

commit edacfb3ae17beeb84630d73f8452629d3e19b66b
Author: Li <chengxiang.li@intel.com>
Date:   2015-07-06T08:48:56Z

    fix tab indent for blank lines.

----


> Use BloomFilter to minimize probe side records which are spilled to disk in Hybrid-Hash-Join
> --------------------------------------------------------------------------------------------
>
>                 Key: FLINK-2240
>                 URL: https://issues.apache.org/jira/browse/FLINK-2240
>             Project: Flink
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Chengxiang Li
>            Assignee: Chengxiang Li
>            Priority: Minor
>
> In Hybrid-Hash-Join, when the small table does not fit into memory, part of the
> small table data is spilled to disk, and the corresponding partitions of the big
> table are spilled to disk during the probe phase as well. If we build a BloomFilter
> while spilling the small table to disk during the build phase, and use it to filter
> the big table records that would otherwise be spilled to disk, this may greatly
> reduce the size of the spilled big table files and save the disk IO cost of writing
> and later re-reading them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
