drill-user mailing list archives

From Jinfeng Ni <...@apache.org>
Subject Re: Explain Plan for Parquet data is taking a lot of time
Date Fri, 24 Feb 2017 00:53:34 GMT
The reason the plan shows only a single parquet file is that
"LIMIT 100" is applied, which filters out the rest of them.

Agreed that parquet metadata caching might help reduce planning time
when there is a large number of parquet files.
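As a sketch, the metadata cache Rahul mentions can be generated with Drill's REFRESH TABLE METADATA command, run once against the same path used in the query (`<plugin>` is the placeholder from Jeena's original query, not a real plugin name):

```sql
-- Build (or rebuild) the parquet metadata cache for the directory.
-- After this, planning reads one cache file instead of opening the
-- footer of each of the ~2144 small parquet files.
REFRESH TABLE METADATA <plugin>.root.`testdata`;

-- Subsequent queries against the same directory should plan faster:
SELECT * FROM <plugin>.root.`testdata` LIMIT 100;
```

Note the cache must be refreshed after new files are added to the directory.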

On Thu, Feb 23, 2017 at 4:44 PM, rahul challapalli
<challapallirahul@gmail.com> wrote:
> You said there are 2144 parquet files, but the plan suggests that you only
> have a single parquet file. In any case, it's a long time to plan the query.
> Did you try the metadata caching feature [1]?
>
> Also how many rowgroups and columns are present in the parquet file?
>
> [1] https://drill.apache.org/docs/optimizing-parquet-metadata-reading/
>
> - Rahul
>
> On Thu, Feb 23, 2017 at 4:24 PM, Jeena Vinod <jeena.vinod@oracle.com> wrote:
>
>> Hi,
>>
>>
>>
>> Drill is taking 23 minutes for a simple select * query with limit 100 on
>> 1GB of uncompressed parquet data. EXPLAIN PLAN for this query is also taking
>> that long (~23 minutes).
>>
>> Query: select * from <plugin>.root.`testdata` limit 100;
>>
>> Query  Plan:
>>
>> 00-00    Screen : rowType = RecordType(ANY *): rowcount = 100.0,
>> cumulative cost = {32810.0 rows, 33110.0 cpu, 0.0 io, 0.0 network, 0.0
>> memory}, id = 1429
>>
>> 00-01      Project(*=[$0]) : rowType = RecordType(ANY *): rowcount =
>> 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0 network,
>> 0.0 memory}, id = 1428
>>
>> 00-02        SelectionVectorRemover : rowType = (DrillRecordRow[*]):
>> rowcount = 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0
>> network, 0.0 memory}, id = 1427
>>
>> 00-03          Limit(fetch=[100]) : rowType = (DrillRecordRow[*]):
>> rowcount = 100.0, cumulative cost = {32700.0 rows, 33000.0 cpu, 0.0 io, 0.0
>> network, 0.0 memory}, id = 1426
>>
>> 00-04            Scan(groupscan=[ParquetGroupScan
>> [entries=[ReadEntryWithPath [path=/testdata/part-r-00000-
>> 097f7399-7bfb-4e93-b883-3348655fc658.parquet]], selectionRoot=/testdata,
>> numFiles=1, usedMetadataFile=true, cacheFileRoot=/testdata,
>> columns=[`*`]]]) : rowType = (DrillRecordRow[*]): rowcount = 32600.0,
>> cumulative cost = {32600.0 rows, 32600.0 cpu, 0.0 io, 0.0 network, 0.0
>> memory}, id = 1425
>>
>>
>>
>> I am using Drill 1.8, set up on a 5-node 32GB cluster, and the data is
>> in Oracle Storage Cloud Service. When I run the same query on a 1GB TSV file
>> in this location, it takes only 38 seconds.
>>
>> Also, testdata contains around 2144 .parquet files, each around 500KB.
>>
>>
>>
>> Is there any additional configuration required for parquet?
>>
>> Kindly suggest how to improve the response time here.
>>
>>
>>
>> Regards
>> Jeena
