drill-user mailing list archives

From Andries Engelbrecht <aengelbre...@maprtech.com>
Subject Re: Monitoring long / stuck CTAS
Date Thu, 28 May 2015 16:01:37 GMT
I have used a single CTAS to create Parquet tables with 1.5B rows.

It did consume a lot of heap memory on the Drillbits and I had to increase the heap size.
Check your logs to see if you are running out of heap memory.
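
If it helps, one quick way to check heap headroom per Drillbit (besides grepping drillbit.log for OutOfMemoryError) is the sys.memory system table. A minimal sketch; the column names follow the sys.memory documentation and may differ between Drill versions:

~~~
-- Heap usage per Drillbit: compare heap_current against heap_max
SELECT hostname, heap_current, heap_max
FROM sys.memory;
~~~

The heap itself is a JVM setting (typically DRILL_HEAP in conf/drill-env.sh), so raising it requires a Drillbit restart.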

I used a 128MB Parquet block size.
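
The block size is a Drill option set in bytes (store.parquet.block-size); a minimal sketch for 128MB:

~~~
-- 128 MB Parquet row group ("block") size, in bytes
ALTER SYSTEM SET `store.parquet.block-size` = 134217728;
~~~

ALTER SESSION works as well if you only want the change for a single CTAS.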

This was with Drill 0.9, so I’m sure 1.0 will be better in this regard.

—Andries



On May 28, 2015, at 8:43 AM, Matt <bsg075@gmail.com> wrote:

> Is 300MM records too much to do in a single CTAS statement?
> 
> After almost 23 hours I killed the query (^c) and it returned:
> 
> ~~~
> +-----------+----------------------------+
> | Fragment  | Number of records written  |
> +-----------+----------------------------+
> | 1_20      | 13568824                   |
> | 1_15      | 12411822                   |
> | 1_7       | 12470329                   |
> | 1_12      | 13693867                   |
> | 1_5       | 13292136                   |
> | 1_18      | 13874321                   |
> | 1_16      | 13303094                   |
> | 1_9       | 13639049                   |
> | 1_10      | 13698380                   |
> | 1_22      | 13501073                   |
> | 1_8       | 13533736                   |
> | 1_2       | 13549402                   |
> | 1_21      | 13665183                   |
> | 1_0       | 13544745                   |
> | 1_4       | 13532957                   |
> | 1_19      | 12767473                   |
> | 1_17      | 13670687                   |
> | 1_13      | 13469515                   |
> | 1_23      | 12517632                   |
> | 1_6       | 13634338                   |
> | 1_14      | 13611322                   |
> | 1_3       | 13061900                   |
> | 1_11      | 12760978                   |
> +-----------+----------------------------+
> 23 rows selected (82294.854 seconds)
> ~~~
> 
> The sum of those record counts is 306,772,763, which is close to the 320,843,454 in the source file:
> 
> ~~~
> 0: jdbc:drill:zk=es05:2181> select count(*)  FROM root.`sample_201501.dat`;
> +------------+
> |   EXPR$0   |
> +------------+
> | 320843454  |
> +------------+
> 1 row selected (384.665 seconds)
> ~~~
> 
> 
> It represents one month of data, 4 key columns and 38 numeric measure columns, which could also be partitioned daily. The test here was to create monthly Parquet files to see how the min/max stats on Parquet chunks help with range select performance.
> 
> Instead of a small number of large monthly RDBMS tables, I am attempting to determine how many Parquet files should be used with Drill / HDFS.
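
(For the daily split mentioned above, one hypothetical approach is simply one CTAS per day filtered on a date column; the column name and target workspace below are made up for illustration, not taken from the actual source file.)

~~~
-- Hypothetical daily CTAS: `event_date` and `dfs.tmp` are illustrative names only
CREATE TABLE dfs.tmp.`sample_20150101` AS
SELECT *
FROM root.`sample_201501.dat`
WHERE event_date = '2015-01-01';
~~~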
> 
> 
> 
> On 27 May 2015, at 15:17, Matt wrote:
> 
>> Attempting to create a Parquet-backed table with a CTAS from a 44GB tab-delimited file in HDFS. The process seemed to be running, as CPU and IO were seen on all 4 nodes in this cluster, and .parquet files were being created in the expected path.
>> 
>> However, in the last two hours or so, all nodes show near-zero CPU or IO, and the Last Modified dates on the .parquet files have not changed. The same time delay is shown in the Last Progress column in the active fragment profile.
>> 
>> What approach can I take to determine what is happening (or not)?

