The time seems pretty long for that file size. What type of file is it?
Is the CTAS running single threaded?
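
If it is running single threaded, it could be worth checking the per-node parallelism before re-running the CTAS. Just a sketch; the value is only an example, not a recommendation:

~~~
-- check the current setting
SELECT * FROM sys.options WHERE name = 'planner.width.max_per_node';

-- raise it for this session before re-running the CTAS
ALTER SESSION SET `planner.width.max_per_node` = 8;
~~~
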
—Andries
On May 28, 2015, at 9:37 AM, Matt <bsg075@gmail.com> wrote:
>> How large is the data set you are working with, and your cluster/nodes?
>
> Just testing with that single 44GB source file currently, and my test cluster is made
> from 4 nodes, each with 8 CPU cores, 32GB RAM, a 6TB Ext4 volume (RAID-10).
>
> Drill defaults were left as they come in v1.0. I will be adjusting memory and retrying the CTAS.
>
> I know I can / should assign individual disks to HDFS, but as this is a test cluster there are
> other apps that expect the data volumes to be available. A dedicated Hadoop production cluster
> would have a disk layout specific to the task.
>
>
> On 28 May 2015, at 12:26, Andries Engelbrecht wrote:
>
>> Just check the drillbit.log and drillbit.out files in the log directory.
>> Before adjusting memory, see if that is an issue first. It was for me, but as Jason
>> mentioned there can be other causes as well.
>>
>> You adjust memory allocation in the drill-env.sh files, and have to restart the
>> drillbits.
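>>
>> The relevant entries are the heap and direct memory variables; something along these
>> lines (the sizes are only an example for a 32GB node, not a recommendation):
>>
>> ~~~
>> # conf/drill-env.sh on every node; restart the drillbits afterwards
>> DRILL_HEAP="8G"                   # JVM heap for the drillbit
>> DRILL_MAX_DIRECT_MEMORY="16G"     # off-heap (direct) memory for query execution
>> ~~~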
>>
>> How large is the data set you are working with, and your cluster/nodes?
>>
>> —Andries
>>
>>
>> On May 28, 2015, at 9:17 AM, Matt <bsg075@gmail.com> wrote:
>>
>>> To make sure I am adjusting the correct config: these are heap parameters within
>>> the Drill config directory, not for Hadoop or ZooKeeper?
>>>
>>>
>>>> On May 28, 2015, at 12:08 PM, Jason Altekruse <altekrusejason@gmail.com>
>>>> wrote:
>>>>
>>>> There should be no upper limit on the size of the tables you can create
>>>> with Drill. Be advised that Drill does currently operate entirely
>>>> optimistically with regard to available resources. If a network connection
>>>> between two drillbits fails during a query, we will not currently
>>>> re-schedule the work to make use of remaining nodes and network connections
>>>> that are still live. While we have had a good amount of success using Drill
>>>> for data conversion, be aware that these conditions could cause long-running
>>>> queries to fail.
>>>>
>>>> That being said, a network failure isn't the only possible cause. In
>>>> the case of a network failure we would expect to see a message returned to
>>>> you that part of the query was unsuccessful and that it had been cancelled.
>>>> Andries has a good suggestion with regard to checking the heap memory; this
>>>> should also be detected and reported back to you at the CLI, but we may be
>>>> failing to propagate the error back to the head node for the query. I
>>>> believe writing Parquet may still be the most heap-intensive operation in
>>>> Drill, despite our efforts to refactor the write path to use direct memory
>>>> instead of on-heap memory for the large buffers needed in the process of
>>>> creating Parquet files.
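>>>>
>>>> If heap does turn out to be the limit, one thing you could experiment with is a
>>>> smaller parquet block size, so less data is buffered per writer before a row group
>>>> is flushed. Just a sketch, the value is arbitrary:
>>>>
>>>> ~~~
>>>> -- 256MB row groups instead of the 512MB default (value is in bytes)
>>>> ALTER SESSION SET `store.parquet.block-size` = 268435456;
>>>> ~~~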
>>>>
>>>>> On Thu, May 28, 2015 at 8:43 AM, Matt <bsg075@gmail.com> wrote:
>>>>>
>>>>> Is 300MM records too much to do in a single CTAS statement?
>>>>>
>>>>> After almost 23 hours I killed the query (^c) and it returned:
>>>>>
>>>>> ~~~
>>>>> +-----------+----------------------------+
>>>>> | Fragment  | Number of records written  |
>>>>> +-----------+----------------------------+
>>>>> | 1_20      | 13568824                   |
>>>>> | 1_15      | 12411822                   |
>>>>> | 1_7       | 12470329                   |
>>>>> | 1_12      | 13693867                   |
>>>>> | 1_5       | 13292136                   |
>>>>> | 1_18      | 13874321                   |
>>>>> | 1_16      | 13303094                   |
>>>>> | 1_9       | 13639049                   |
>>>>> | 1_10      | 13698380                   |
>>>>> | 1_22      | 13501073                   |
>>>>> | 1_8       | 13533736                   |
>>>>> | 1_2       | 13549402                   |
>>>>> | 1_21      | 13665183                   |
>>>>> | 1_0       | 13544745                   |
>>>>> | 1_4       | 13532957                   |
>>>>> | 1_19      | 12767473                   |
>>>>> | 1_17      | 13670687                   |
>>>>> | 1_13      | 13469515                   |
>>>>> | 1_23      | 12517632                   |
>>>>> | 1_6       | 13634338                   |
>>>>> | 1_14      | 13611322                   |
>>>>> | 1_3       | 13061900                   |
>>>>> | 1_11      | 12760978                   |
>>>>> +-----------+----------------------------+
>>>>> 23 rows selected (82294.854 seconds)
>>>>> ~~~
>>>>>
>>>>> The sum of those record counts is 306,772,763, which is close to the
>>>>> 320,843,454 rows in the source file:
>>>>>
>>>>> ~~~
>>>>> 0: jdbc:drill:zk=es05:2181> select count(*) FROM root.`sample_201501.dat`;
>>>>> +------------+
>>>>> | EXPR$0     |
>>>>> +------------+
>>>>> | 320843454  |
>>>>> +------------+
>>>>> 1 row selected (384.665 seconds)
>>>>> ~~~
>>>>>
>>>>>
>>>>> It represents one month of data, 4 key columns and 38 numeric measure
>>>>> columns, which could also be partitioned daily. The test here was to create
>>>>> monthly Parquet files to see how the min/max stats on Parquet chunks help
>>>>> with range select performance.
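>>>>>
>>>>> For reference, the range selects I want to test against the monthly file look
>>>>> roughly like this (column names are placeholders, not the real schema):
>>>>>
>>>>> ~~~
>>>>> -- the hope is that row groups whose min/max fall outside the date range get skipped
>>>>> SELECT key1, SUM(measure01) AS m01
>>>>> FROM dfs.`/parquet/sample_201501`
>>>>> WHERE event_date >= DATE '2015-01-10' AND event_date < DATE '2015-01-17'
>>>>> GROUP BY key1;
>>>>> ~~~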
>>>>>
>>>>> Instead of a small number of large monthly RDBMS tables, I am attempting
>>>>> to determine how many Parquet files should be used with Drill / HDFS.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 27 May 2015, at 15:17, Matt wrote:
>>>>>
>>>>>> Attempting to create a Parquet-backed table with a CTAS from a 44GB
>>>>>> tab-delimited file in HDFS. The process seemed to be running, as CPU and IO
>>>>>> were seen on all 4 nodes in this cluster, and .parquet files were being
>>>>>> created in the expected path.
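>>>>>>
>>>>>> For context, the CTAS is shaped roughly like this (the target workspace and
>>>>>> column names are placeholders, and only a few of the 42 columns are shown):
>>>>>>
>>>>>> ~~~
>>>>>> CREATE TABLE dfs.tmp.`sample_201501` AS
>>>>>> SELECT
>>>>>>   columns[0]                 AS key1,
>>>>>>   columns[1]                 AS key2,
>>>>>>   CAST(columns[4] AS BIGINT) AS measure01,
>>>>>>   CAST(columns[5] AS DOUBLE) AS measure02
>>>>>> FROM root.`sample_201501.dat`;
>>>>>> ~~~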
>>>>>>
>>>>>> However, in the last two hours or so, all nodes show near zero CPU or
>>>>>> IO, and the Last Modified times on the .parquet files have not changed. The
>>>>>> same delay is shown in the Last Progress column in the active fragment profile.
>>>>>>
>>>>>> What approach can I take to determine what is happening (or not)?
>>>>>