drill-user mailing list archives

From Matt <bsg...@gmail.com>
Subject Re: Monitoring long / stuck CTAS
Date Thu, 28 May 2015 15:43:22 GMT
Is 300MM records too much to do in a single CTAS statement?

After almost 23 hours I killed the query (^c) and it returned:

~~~
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 1_20      | 13568824                   |
| 1_15      | 12411822                   |
| 1_7       | 12470329                   |
| 1_12      | 13693867                   |
| 1_5       | 13292136                   |
| 1_18      | 13874321                   |
| 1_16      | 13303094                   |
| 1_9       | 13639049                   |
| 1_10      | 13698380                   |
| 1_22      | 13501073                   |
| 1_8       | 13533736                   |
| 1_2       | 13549402                   |
| 1_21      | 13665183                   |
| 1_0       | 13544745                   |
| 1_4       | 13532957                   |
| 1_19      | 12767473                   |
| 1_17      | 13670687                   |
| 1_13      | 13469515                   |
| 1_23      | 12517632                   |
| 1_6       | 13634338                   |
| 1_14      | 13611322                   |
| 1_3       | 13061900                   |
| 1_11      | 12760978                   |
+-----------+----------------------------+
23 rows selected (82294.854 seconds)
~~~

The sum of those record counts is 306,772,763, which is close to the 
320,843,454 rows in the source file:

~~~
0: jdbc:drill:zk=es05:2181> select count(*) FROM root.`sample_201501.dat`;
+------------+
|   EXPR$0   |
+------------+
| 320843454  |
+------------+
1 row selected (384.665 seconds)
~~~
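
As a quick sanity check, the per-fragment counts from the cancelled CTAS can be summed and compared against the source row count (the numbers below are copied from the output above; the throughput figure assumes the query wrote at a steady rate over its full runtime):

~~~python
# Per-fragment "Number of records written" values from the cancelled CTAS.
fragment_counts = [
    13568824, 12411822, 12470329, 13693867, 13292136, 13874321,
    13303094, 13639049, 13698380, 13501073, 13533736, 13549402,
    13665183, 13544745, 13532957, 12767473, 13670687, 13469515,
    12517632, 13634338, 13611322, 13061900, 12760978,
]
total = sum(fragment_counts)
source_rows = 320843454

print(f"written: {total:,}")                 # 306,772,763
print(f"missing: {source_rows - total:,}")   # 14,070,691

# Average write rate over the 82,294.854 s the query ran before cancel:
rate = total / 82294.854
print(f"~{rate:,.0f} records/s")
~~~

At roughly 3,700 records per second overall, the missing ~14M rows alone would represent about another hour of work at the observed rate.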


It represents one month of data: 4 key columns and 38 numeric measure 
columns, which could also be partitioned daily. The test here was to 
create monthly Parquet files to see how the min/max stats on Parquet 
row groups help with range-select performance.

Coming from a small number of large monthly RDBMS tables, I am trying 
to determine how many Parquet files should be used with Drill / HDFS.
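
One way to test the daily-partition alternative, sketched below with hypothetical workspace, table, and column names (Drill exposes fields of a tab-delimited file as the `columns` array; the casts and the `event_dt` key are assumptions about the schema):

~~~sql
-- Hedged sketch: one CTAS per day instead of a single monthly table.
-- Repeat (or script) this statement per date to produce daily Parquet files.
CREATE TABLE dfs.parquet.`sample_20150101` AS
SELECT CAST(columns[0] AS DATE)    AS event_dt,
       columns[1]                  AS key2,
       CAST(columns[4] AS BIGINT)  AS measure1
FROM root.`sample_201501.dat`
WHERE columns[0] = '2015-01-01';
~~~

Each daily CTAS then writes a far smaller batch per fragment, which also makes a stuck or failed load cheaper to retry.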



On 27 May 2015, at 15:17, Matt wrote:

> Attempting to create a Parquet-backed table with a CTAS from a 44GB 
> tab-delimited file in HDFS. The process seemed to be running, as CPU 
> and IO were seen on all 4 nodes in this cluster, and .parquet files 
> were being created in the expected path.
>
> However, in the last two hours or so, all nodes show near-zero CPU 
> or IO, and the Last Modified dates on the .parquet files have not 
> changed. The same delay is shown in the Last Progress column in the 
> active fragment profile.
>
> What approach can I take to determine what is happening (or not)?
