kylin-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KYLIN-3925) Add reduce step for FilterRecommendCuboidDataJob & UpdateOldCuboidShardJob to avoid generating small hdfs files
Date Wed, 03 Apr 2019 10:27:00 GMT

    [ https://issues.apache.org/jira/browse/KYLIN-3925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808587#comment-16808587 ]

ASF GitHub Bot commented on KYLIN-3925:
---------------------------------------

kyotoYaho commented on pull request #580: KYLIN-3925 Add reduce step for FilterRecommendCuboidDataJob & UpdateOldCuboidShardJob to avoid generating small hdfs files
URL: https://github.com/apache/kylin/pull/580
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Add reduce step for FilterRecommendCuboidDataJob & UpdateOldCuboidShardJob to avoid generating small hdfs files
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: KYLIN-3925
>                 URL: https://issues.apache.org/jira/browse/KYLIN-3925
>             Project: Kylin
>          Issue Type: Improvement
>            Reporter: Zhong Yanghong
>            Assignee: Zhong Yanghong
>            Priority: Major
>
> Previously, when doing cube optimization, there are two map-only MR jobs: *FilterRecommendCuboidDataJob* & *UpdateOldCuboidShardJob*. The benefit of a map-only job is that it avoids shuffling. However, this benefit brings a more severe issue: too many small HDFS files.
> Suppose there are 10 HDFS files for the current cuboid data, each 500 MB. If the block size is 100 MB, there will be 10 * (500/100) = 50 mappers for the map-only job *FilterRecommendCuboidDataJob*. Each mapper generates one HDFS file, so there will be 50 output files. Since *FilterRecommendCuboidDataJob* filters out the cuboid data that is still needed in the future, each file will be smaller than 100 MB, in some cases even smaller than 50 MB.
> To avoid this kind of small-file issue, it's better to add a reduce step to control the final number of output HDFS files.
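The file-count arithmetic in the description above can be sketched as follows. This is a minimal illustration of the problem, not Kylin code; the function names are hypothetical, and the numbers match the example in the issue (10 input files of 500 MB, 100 MB block size):

```python
import math

def output_files_map_only(num_files, file_size_mb, block_size_mb):
    # Each input file is split into ceil(size / block_size) input splits,
    # with one mapper per split; a map-only MR job writes one output file
    # per mapper, so the output file count equals the mapper count.
    splits_per_file = math.ceil(file_size_mb / block_size_mb)
    return num_files * splits_per_file

def output_files_with_reduce(num_reducers):
    # With a reduce step, the output file count equals the reducer count,
    # regardless of how many mappers ran.
    return num_reducers

# Example from the issue: 10 files x 500 MB, 100 MB blocks -> 50 mappers,
# hence 50 small output files from the map-only job.
print(output_files_map_only(10, 500, 100))
```

In Hadoop MapReduce, the reducer count (and therefore the output file count after adding the reduce step) is set via `Job.setNumReduceTasks(n)`, so a small fixed number of reducers yields a correspondingly small number of larger output files.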



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
