drill-dev mailing list archives

From kkhatua <...@git.apache.org>
Subject [GitHub] drill issue #826: DRILL-5379: Set Hdfs Block Size based on Parquet Block Siz...
Date Thu, 29 Jun 2017 01:25:59 GMT
Github user kkhatua commented on the issue:

    https://github.com/apache/drill/pull/826
  
    @ppadma, Khurram [~khfaraaz] and I were looking at the details in the PR, and it's not very clear what new behavior the PR allows. If you need to specify the block-size as described in the [comment](https://issues.apache.org/jira/browse/DRILL-5379?focusedCommentId=15981366&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15981366) by @fmethot, doesn't Drill already do that? I thought Drill implicitly creates files with a single row-group anyway.
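
    For reference, a minimal sketch of pinning the row-group size per session today; `store.parquet.block-size` is the real option, but the workspace and table names below are placeholders:

```sql
-- Sketch: cap the Parquet row-group size for this session, then write with CTAS.
-- dfs.tmp.`events_parquet` / dfs.tmp.`events_source` are hypothetical names.
ALTER SESSION SET `store.parquet.block-size` = 268435456;  -- 256 MB row groups

CREATE TABLE dfs.tmp.`events_parquet` AS
SELECT * FROM dfs.tmp.`events_source`;
```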
    
    My understanding of the JIRA's problem statement was that if the Parquet block-size (i.e. the row-group size) is set to a large value that exceeds the HDFS block size, using the flag would allow Drill to ignore the larger value in the options and write with a parquet block-size that matches the target HDFS location. So, I could have `store.parquet.block-size=1073741824` (i.e. 1GB), but when writing an output worth 512MB, instead of 1 file, Drill would read the HDFS block-size (say 128MB), apply that as the parquet block-size, and write 4 files.
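
    To make the arithmetic concrete, a quick sanity check (runnable as plain Drill SQL; the byte values are just the 512 MB output and the 128 MB HDFS block from the example above):

```sql
-- Expected file count if the 1 GB option were capped at a 128 MB HDFS block
-- size for a 512 MB result: ceil(536870912 / 134217728) = 4 files.
SELECT CAST(CEIL(536870912.0 / 134217728.0) AS INT) AS files_written;
```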

    
    @fmethot is that what you were looking for? An **automatic scaling down** of the parquet
file's size to match (and be contained within) the HDFS block size?


