drill-user mailing list archives

From John Omernik <j...@omernik.com>
Subject Re: Memory Settings for a Non-Sorted Failed Query
Date Mon, 13 Jun 2016 23:10:30 GMT
The 512m block size worked.  My issue with the 1024m block size was on the
write side, using a CTAS; that's where my nodes got into a bad state. So I
am wondering which setting in Drill would be the right one to relieve node
memory pressure on a CTAS using a 1024m block size.
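
(For concreteness, the knobs I have been looking at are along these lines;
the option names are standard Drill session options, but the values are
just placeholders, not recommendations:)

  ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 8589934592;
  -- 8 GB per node for this query; placeholder value
  ALTER SESSION SET `planner.width.max_per_node` = 8;
  -- cap parallel fragments per node to lower peak memory; placeholder value
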
On Jun 13, 2016 6:06 PM, "Parth Chandra" <pchandra@maprtech.com> wrote:

In general, you want to make the Parquet block size and the HDFS block size
the same. A Parquet block size that is larger than the HDFS block size can
split a Parquet block (i.e., a row group) across nodes, and that will
severely affect performance as data reads will no longer be local. 512 MB
is a pretty good setting.
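
(On the Drill side, that is something along these lines; 536870912 bytes =
512 MB, per the discussion above:)

  ALTER SESSION SET `store.parquet.block-size` = 536870912;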

Note that you also need to check the Parquet block size in the source file,
which may have been produced outside of Drill. In that case you will need
to make the change in the application that was used to write the Parquet
file.

If you're using Drill to write the source file as well, then of course the
block size setting will be used by the writer.
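
(For example, with that option set, a CTAS along these lines will write
512 MB row groups; the table and path names are made up for illustration:)

  CREATE TABLE dfs.tmp.`events_parquet` AS
  SELECT * FROM dfs.`/data/events/2016-06-12`;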

If you're using the new reader, then there is really no knob you have to
tweak. Is parquet-tools able to read the file(s)?
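
(A quick sanity check with the parquet-tools CLI; the path below is a
placeholder:)

  parquet-tools meta /path/to/file.parquet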



On Mon, Jun 13, 2016 at 1:59 PM, John Omernik <john@omernik.com> wrote:

> I am doing some performance testing, and per the Impala documentation, I
> am trying to use a block size of 1024m in both Drill and MapR FS.  When I
> set the MFS block size to 512m and left the Drill block size at the
> default, I saw some performance improvements, and wanted to try 1024m to
> see how it worked. However, my query hung and I got into that "bad state"
> where the nodes are not responding right and I have to restart my whole
> cluster. (It really bothers me that a query can make the cluster
> unresponsive.)
>
> That said, what memory settings can I tweak to help the query work? This
> is quite a bit of data: a CTAS from Parquet to Parquet, 100-130G of data
> per day (I am doing a day at a time), 103 columns.  I have to use the
> "use_new_reader" option due to my other issues, but other than that I am
> just setting the block size on MFS and then updating the block size in
> Drill, and it's dying. Since this is a simple CTAS (no sort), which
> settings could be beneficial for what is happening here?
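>
> (For reference, the MFS side of that looks roughly like this; the path is
> a placeholder and 1073741824 bytes = 1024m:)
>
>   hadoop mfs -setchunksize 1073741824 /mapr/my.cluster/data/events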
>
> Thanks
>
> John
>
