drill-user mailing list archives

From Khurram Faraaz <kfar...@mapr.com>
Subject Re: Increasing store.parquet.block-size
Date Thu, 15 Jun 2017 18:35:18 GMT
Thanks Padma.

________________________________
From: Padma Penumarthy <ppenumarthy@mapr.com>
Sent: Thursday, June 15, 2017 8:58:44 AM
To: user@drill.apache.org
Subject: Re: Increasing store.parquet.block-size

Sure. I will check and try to fix them as well.

Thanks,
Padma

> On Jun 14, 2017, at 3:12 AM, Khurram Faraaz <kfaraaz@mapr.com> wrote:
>
> Thanks Padma. There are some more related failures reported in DRILL-2478.
> Do you think we should fix them too, if it is an easy fix?
>
>
> Regards,
>
> Khurram
>
> ________________________________
> From: Padma Penumarthy <ppenumarthy@mapr.com>
> Sent: Wednesday, June 14, 2017 11:43:16 AM
> To: user@drill.apache.org
> Subject: Re: Increasing store.parquet.block-size
>
> I think you meant MB (not GB) below.
> HDFS allows creation of very large files (theoretically, there is no limit).
> I am wondering why a file larger than 2GB is a problem. Maybe it is a
> block size larger than 2GB that is not recommended.
>
> Anyway, we should not let the user set an arbitrary value and only throw an error later.
> I opened a PR to fix this.
> https://github.com/apache/drill/pull/852
>
> Thanks,
> Padma
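
The "reject it when it is set, not when it is used" idea can be sketched as follows. This is a minimal illustration only: it is not the content of the pull request above, and the function name, exception class, and bounds are all hypothetical.

```python
INT32_MAX = 2**31 - 1  # 2147483647, largest signed 32-bit value


class OptionValidationError(ValueError):
    """Raised when an option value is rejected at set time (hypothetical)."""


def validate_block_size(value: int, lo: int = 1, hi: int = INT32_MAX) -> int:
    """Return value unchanged if lo <= value <= hi, otherwise raise.

    The default bounds are an assumption chosen for illustration.
    """
    if not lo <= value <= hi:
        raise OptionValidationError(
            f"store.parquet.block-size must be in [{lo}, {hi}], got {value}"
        )
    return value


# Accepted: 512 MiB is well inside the assumed range.
accepted = validate_block_size(512 * 1024**2)

# Rejected immediately, instead of failing later inside a query.
try:
    validate_block_size(4294967296)
except OptionValidationError as err:
    print("rejected at set time:", err)
```

With a check like this, the failure surfaces at the ALTER SYSTEM step rather than in the first query that reads the option.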
>
>
> On Jun 9, 2017, at 11:36 AM, Kunal Khatua <kkhatua@mapr.com> wrote:
>
> The ideal size depends on which engine is consuming the Parquet files (Drill, I'm guessing)
> and on the storage layer. For HDFS, where the block size is usually 128-256GB, we recommend
> bumping it to about 512GB (with the underlying HDFS block size set to match).
>
>
> You'll probably need to experiment a little with different block sizes stored on S3
> to see which works best.
>
> ________________________________
> From: Shuporno Choudhury <shuporno.choudhury@manthan.com>
> Sent: Friday, June 9, 2017 11:23:37 AM
> To: user@drill.apache.org
> Subject: Re: Increasing store.parquet.block-size
>
> Thanks for the information, Kunal.
> After the conversion, the file size drops to half if I use gzip
> compression.
> A 10 GB gzipped CSV source file becomes 5 GB of Parquet output (2+2+1)
> with gzip compression.
> So, if I have to make multiple Parquet files, what block size would be
> optimal, given that I have to read the files later?
>
> On 09-Jun-2017 11:28 PM, "Kunal Khatua" <kkhatua@mapr.com> wrote:
>
>
> If you're storing this in S3... you might want to selectively read the
> files as well.
>
>
> I'm only speculating, but if you want to download the data, downloading as
> a queue of files might be more reliable than one massive file. Similarly,
> within AWS, it *might* be faster to have an EC2 instance access a couple of
> large Parquet files versus one massive Parquet file.
>
>
> Remember that when you set a large block size, Drill tries to write
> everything into a single row group per file. So there is no chance of
> parallelizing the read (i.e. reading parts in parallel). The defaults
> should work well for S3 as well, and with compression (e.g. Snappy),
> you should get a reasonably smaller file size.
>
>
> With the current default settings... have you seen what Parquet file sizes
> you get with Drill when converting your 10GB CSV source files?
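
To make the parallelism point concrete, here is a rough back-of-the-envelope sketch. It assumes a simplified model in which a file is split into row groups of roughly the configured block size, so the row-group count is an upper bound on how many parts of one file can be read in parallel. Real row-group sizes are approximate, and the function name is hypothetical.

```python
MIB = 1024**2
GIB = 1024**3


def row_group_count(file_bytes: int, block_bytes: int) -> int:
    """Approximate row groups in one file: ceil(file_bytes / block_bytes)."""
    if file_bytes <= 0 or block_bytes <= 0:
        raise ValueError("sizes must be positive")
    return -(-file_bytes // block_bytes)  # integer ceiling division


file_size = 5 * GIB  # the 5 GB Parquet output mentioned in this thread

for block in (256 * MIB, 512 * MIB, 2 * GIB - 1):
    print(f"block={block:>11} bytes -> {row_group_count(file_size, block)} row group(s)")
```

Under this model a file smaller than the block size always ends up as one row group, which is the case described above where no part of the read can be parallelized.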
>
>
> ________________________________
> From: Shuporno Choudhury <shuporno.choudhury@manthan.com>
> Sent: Friday, June 9, 2017 10:50:06 AM
> To: user@drill.apache.org
> Subject: Re: Increasing store.parquet.block-size
>
> Thanks Kunal for your insight.
> I am actually converting some .csv files and storing them in parquet format
> in s3, not in HDFS.
> The size of the individual .csv source files can be quite huge (around
> 10GB).
> So, is there a way to overcome this and create one parquet file or do I
> have to go ahead with multiple parquet files?
>
> On 09-Jun-2017 11:04 PM, "Kunal Khatua" <kkhatua@mapr.com> wrote:
>
> Shuporno
>
>
> There are some interesting problems when using Parquet files > 2GB on
> HDFS.
>
>
> If I'm not mistaken, the HDFS APIs that allow you to read offsets (oddly
> enough) return an int value. A large Parquet block size also means you'll
> end up having a row group span multiple HDFS blocks, and that would make
> reading of row groups inefficient.
>
>
> Is there a reason you want to create such a large parquet file?
>
>
> ~ Kunal
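
The spanning concern can be illustrated with a small calculation. This is a simplified sketch that treats a row group as a contiguous byte range in the file; the function name and the 128 MiB HDFS block size are assumptions chosen for the example.

```python
MIB = 1024**2


def hdfs_blocks_spanned(offset: int, length: int, hdfs_block: int) -> int:
    """Number of HDFS blocks touched by the byte range [offset, offset + length)."""
    if length <= 0:
        return 0
    first = offset // hdfs_block
    last = (offset + length - 1) // hdfs_block
    return last - first + 1


hdfs_block = 128 * MIB  # assumed HDFS block size

# Row group that matches the HDFS block size and is aligned: one block.
print(hdfs_blocks_spanned(0, 128 * MIB, hdfs_block))

# Row group four times the HDFS block size: four blocks.
print(hdfs_blocks_spanned(0, 512 * MIB, hdfs_block))

# Same-size row group, but misaligned by half a block: two blocks.
print(hdfs_blocks_spanned(64 * MIB, 128 * MIB, hdfs_block))
```

A row group that touches several HDFS blocks may need data from several nodes, which is one reason to keep the Parquet block size and the HDFS block size matched.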
>
> ________________________________
> From: Vitalii Diravka <vitalii.diravka@gmail.com>
> Sent: Friday, June 9, 2017 4:49:02 AM
> To: user@drill.apache.org
> Subject: Re: Increasing store.parquet.block-size
>
> Khurram,
>
> DRILL-2478 is a good placeholder for the LongValidator issue; it really
> does work incorrectly.
>
> But there is another issue: it is impossible to use long values for the
> Parquet block size.
> That issue can be an independent task or a sub-task of updating the Drill
> project to the latest Parquet library.
>
> Kind regards
> Vitalii
>
> On Fri, Jun 9, 2017 at 10:25 AM, Khurram Faraaz <kfaraaz@mapr.com> wrote:
>
> 1.  DRILL-2478 (https://issues.apache.org/jira/browse/DRILL-2478) is
> open for this issue.
> 2.  I have added more details into the comments.
>
> Thanks,
> Khurram
>
> ________________________________
> From: Shuporno Choudhury <shuporno.choudhury@manthan.com>
> Sent: Friday, June 9, 2017 12:48:41 PM
> To: user@drill.apache.org
> Subject: Increasing store.parquet.block-size
>
> The max value that can be assigned to *store.parquet.block-size* is
> *2147483647*, as the value kind of this configuration parameter is LONG.
> This basically translates to 2GB of block size.
> How do I increase it to 3/4/5 GB?
> Trying to set this parameter to a higher value using the following command
> actually succeeds:
>   ALTER SYSTEM SET `store.parquet.block-size` = 4294967296;
> But when I try to run a query that uses this config, it throws the
> following error:
>  Error: SYSTEM ERROR: NumberFormatException: For input string: "4294967296"
> So, is it possible to assign a higher value to this parameter?
> --
> Regards,
> Shuporno Choudhury
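
One plausible reading of the error in the original question (an assumption, since the thread does not show the failing code path) is that the option is stored as a 64-bit value but later parsed as a signed 32-bit integer, whose maximum is 2147483647. The sketch below mimics such a range-checked parse in Python; `parse_int32` is a hypothetical stand-in, not Drill's code.

```python
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1  # 2147483647


def parse_int32(text: str) -> int:
    """Parse a decimal string, rejecting values outside the signed 32-bit range.

    Hypothetical stand-in for a 32-bit integer parser.
    """
    value = int(text)
    if not INT32_MIN <= value <= INT32_MAX:
        raise ValueError(f'For input string: "{text}"')
    return value


requested = 4 * 1024**3  # 4 GiB in bytes, i.e. 4294967296 == 2**32

# The largest value that still fits parses fine.
largest_ok = parse_int32(str(INT32_MAX))

# The 4 GiB value from the question does not fit in 32 bits.
try:
    parse_int32(str(requested))
except ValueError as err:
    print("parse failed:", err)
```

Under that reading, setting the option succeeds because the stored value fits in 64 bits, while the query fails because the consumer of the value expects 32 bits.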
>
>
>
>

