drill-user mailing list archives

From Matt <bsg...@gmail.com>
Subject Re: Monitoring long / stuck CTAS
Date Thu, 28 May 2015 16:17:47 GMT
To make sure I am adjusting the correct config: these are heap parameters within the Drill
configuration path, not for Hadoop or ZooKeeper?


> On May 28, 2015, at 12:08 PM, Jason Altekruse <altekrusejason@gmail.com> wrote:
> 
> There should be no upper limit on the size of the tables you can create
> with Drill. Be advised that Drill does currently operate entirely
> optimistically in regards to available resources. If a network connection
> between two drillbits fails during a query, we will not currently
> re-schedule the work to make use of remaining nodes and network connections
> that are still live. While we have had a good amount of success using Drill
> for data conversion, be aware that these conditions could cause long
> running queries to fail.
> 
> That being said, it isn't the only possible cause for such a failure. In
> the case of a network failure we would expect to see a message returned to
> you that part of the query was unsuccessful and that it had been cancelled.
> Andries has a good suggestion about checking the heap memory. A heap
> problem should also be detected and reported back to you at the CLI, but we
> may be failing to propagate the error back to the head node for the query. I
> believe writing parquet may still be the most heap-intensive operation in
> Drill, despite our efforts to refactor the write path to use direct memory
> instead of on-heap for large buffers needed in the process of creating
> parquet files.
> 
>> On Thu, May 28, 2015 at 8:43 AM, Matt <bsg075@gmail.com> wrote:
>> 
>> Is 300MM records too much to do in a single CTAS statement?
>> 
>> After almost 23 hours I killed the query (^c) and it returned:
>> 
>> ~~~
>> +-----------+----------------------------+
>> | Fragment  | Number of records written  |
>> +-----------+----------------------------+
>> | 1_20      | 13568824                   |
>> | 1_15      | 12411822                   |
>> | 1_7       | 12470329                   |
>> | 1_12      | 13693867                   |
>> | 1_5       | 13292136                   |
>> | 1_18      | 13874321                   |
>> | 1_16      | 13303094                   |
>> | 1_9       | 13639049                   |
>> | 1_10      | 13698380                   |
>> | 1_22      | 13501073                   |
>> | 1_8       | 13533736                   |
>> | 1_2       | 13549402                   |
>> | 1_21      | 13665183                   |
>> | 1_0       | 13544745                   |
>> | 1_4       | 13532957                   |
>> | 1_19      | 12767473                   |
>> | 1_17      | 13670687                   |
>> | 1_13      | 13469515                   |
>> | 1_23      | 12517632                   |
>> | 1_6       | 13634338                   |
>> | 1_14      | 13611322                   |
>> | 1_3       | 13061900                   |
>> | 1_11      | 12760978                   |
>> +-----------+----------------------------+
>> 23 rows selected (82294.854 seconds)
>> ~~~
>> 
>> The sum of those record counts is 306,772,763, which is close to the
>> 320,843,454 in the source file:
>> 
>> ~~~
>> 0: jdbc:drill:zk=es05:2181> select count(*)  FROM root.`sample_201501.dat`;
>> +------------+
>> |   EXPR$0   |
>> +------------+
>> | 320843454  |
>> +------------+
>> 1 row selected (384.665 seconds)
>> ~~~
>> 
>> 
>> It represents one month of data, 4 key columns and 38 numeric measure
>> columns, which could also be partitioned daily. The test here was to create
>> monthly Parquet files to see how the min/max stats on Parquet chunks help
>> with range select performance.
>> 
>> Instead of a small number of large monthly RDBMS tables, I am attempting
>> to determine how many Parquet files should be used with Drill / HDFS.
>> 
>> 
>> 
>> 
>> On 27 May 2015, at 15:17, Matt wrote:
>> 
>>> Attempting to create a Parquet-backed table with a CTAS from a 44GB
>>> tab-delimited file in HDFS. The process seemed to be running, as CPU and IO
>>> were seen on all 4 nodes in this cluster, and .parquet files were being
>>> created in the expected path.
>>> 
>>> However, in the last two hours or so, all nodes show near zero CPU or
>>> IO, and the Last Modified dates on the .parquet files have not changed. The
>>> same time delay is shown in the Last Progress column in the active fragment
>>> profile.
>>> 
>>> What approach can I take to determine what is happening (or not)?
>> 
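
The record counts quoted above can be cross-checked with a few lines of Python. This is a minimal sketch, assuming the fragment table is pasted in as text and that minor fragment IDs are numbered contiguously from 0 (an assumption about how the IDs are assigned, not something stated in this thread). The names `TABLE`, `SOURCE_ROWS` and `parse_counts` are illustrative only.

```python
import re

# Per-fragment output of the cancelled CTAS, copied from the message above.
TABLE = """\
| 1_20      | 13568824                   |
| 1_15      | 12411822                   |
| 1_7       | 12470329                   |
| 1_12      | 13693867                   |
| 1_5       | 13292136                   |
| 1_18      | 13874321                   |
| 1_16      | 13303094                   |
| 1_9       | 13639049                   |
| 1_10      | 13698380                   |
| 1_22      | 13501073                   |
| 1_8       | 13533736                   |
| 1_2       | 13549402                   |
| 1_21      | 13665183                   |
| 1_0       | 13544745                   |
| 1_4       | 13532957                   |
| 1_19      | 12767473                   |
| 1_17      | 13670687                   |
| 1_13      | 13469515                   |
| 1_23      | 12517632                   |
| 1_6       | 13634338                   |
| 1_14      | 13611322                   |
| 1_3       | 13061900                   |
| 1_11      | 12760978                   |
"""

# Row count reported by select count(*) on the source file.
SOURCE_ROWS = 320_843_454


def parse_counts(table):
    """Map minor fragment id -> records written, from rows like '| 1_20 | 123 |'."""
    counts = {}
    for line in table.splitlines():
        match = re.match(r"\|\s*\d+_(\d+)\s*\|\s*(\d+)\s*\|", line)
        if match:
            counts[int(match.group(1))] = int(match.group(2))
    return counts


counts = parse_counts(TABLE)
written = sum(counts.values())
shortfall = SOURCE_ROWS - written

# Assumption: minor fragment ids run 0..max with no intentional gaps.
missing = sorted(set(range(max(counts) + 1)) - set(counts))

print(f"fragments reporting: {len(counts)}")
print(f"records written:     {written:,}")
print(f"shortfall vs source: {shortfall:,}")
print(f"ids not reported:    {missing}")
```

Under that numbering assumption, the sketch reports 23 fragments, a shortfall of 14,070,691 rows, and minor fragment 1 as the only ID absent from the list. The shortfall is close to a single fragment's share, so fragment 1_1 in the query profile may be a reasonable place to start looking.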
