drill-user mailing list archives

From Jeena Vinod <jeena.vi...@oracle.com>
Subject RE: Explain Plan for Parquet data is taking a lot of time
Date Fri, 24 Feb 2017 21:56:23 GMT
Thanks for the suggestions.

I did run the REFRESH TABLE METADATA command on this path before firing the select query.
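For reference, the refresh command was roughly of this form (same plugin and path placeholders as in the original query):

REFRESH TABLE METADATA <plugin>.root.`testdata`;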

In Drill 1.9 there is an improvement in performance. I have 1.9 set up on a 2-node 16GB cluster,
and there select * with limit 100 takes less time than on 1.8, though the number of rows in the
ParquetGroupScan remains unchanged. The select query takes around 8 minutes and the explain plan
took around 7 minutes. Also, in the Web console profile, the query stays in the STARTING state
for almost 7 minutes.
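The explain timing above is from an EXPLAIN PLAN statement of roughly this form:

EXPLAIN PLAN FOR select * from <plugin>.root.`testdata` limit 100;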

Query Plan for 1.9:
00-00    Screen : rowType = RecordType(ANY *): rowcount = 100.0, cumulative cost = {32810.0 rows, 33110.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1721
00-01      Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1720
00-02        SelectionVectorRemover : rowType = (DrillRecordRow[*]): rowcount = 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1719
00-03          Limit(fetch=[100]) : rowType = (DrillRecordRow[*]): rowcount = 100.0, cumulative cost = {32700.0 rows, 33000.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1718
00-04            Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/testdata/part-r-00000-097f7399-7bfb-4e93-b883-3348655fc658.parquet]], selectionRoot=/testdata, numFiles=1, usedMetadataFile=true, cacheFileRoot=/testdata, columns=[`*`]]]) : rowType = (DrillRecordRow[*]): rowcount = 32600.0, cumulative cost = {32600.0 rows, 32600.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 1717

And from the query profile, it looks like most of the time is spent in PARQUET_ROW_GROUP_SCAN.
I can attach the full JSON profile if it helps.
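(If useful, the same profile should also be retrievable from the Web console as JSON, e.g. http://<drillbit-host>:8047/profiles/<query-id>.json, assuming the default Web UI port.)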

Can the performance be improved further with 1.9?

Regards
Jeena
 

-----Original Message-----
From: Padma Penumarthy [mailto:ppenumarthy@mapr.com] 
Sent: Friday, February 24, 2017 11:22 PM
To: user@drill.apache.org
Subject: Re: Explain Plan for Parquet data is taking a lot of time

Yes, limit is pushed down to the parquet reader in 1.9. But that will not help with planning time.
It is definitely worth trying 1.9 though.

Thanks,
Padma


> On Feb 24, 2017, at 7:26 AM, Andries Engelbrecht <aengelbrecht@mapr.com> wrote:
> 
> Looks like the metadata cache is being used ("usedMetadataFile=true"). But to be sure, did
> you perform a REFRESH TABLE METADATA <path to table> on the parquet data?
> 
> 
> However, it looks like it is reading a full batch: "rowcount = 32600.0, cumulative cost
> = {32600.0 rows, 32600.0"
> 
> 
> Didn't the limit operator get pushed down to the parquet reader in 1.9?
> 
> Perhaps try 1.9 and see if the number of rows in the ParquetGroupScan gets reduced to
> 100.
> 
> 
> Can you look in the query profile to see where time is spent, and also how long it takes before
> the query starts to run in the WebUI profile?
> 
> 
> Best Regards
> 
> 
> Andries Engelbrecht
> 
> 
> Senior Solutions Architect
> 
> MapR Alliances and Channels Engineering
> 
> 
> aengelbrecht@mapr.com
> 
> 
> 
> 
> ________________________________
> From: Jinfeng Ni <jni@apache.org>
> Sent: Thursday, February 23, 2017 4:53:34 PM
> To: user
> Subject: Re: Explain Plan for Parquet data is taking a lot of time
> 
> The reason the plan shows only a single parquet file is that "LIMIT 100" is applied and
> filters out the rest of them.
> 
> Agreed that parquet metadata caching might help reduce planning time
> when there is a large number of parquet files.
> 
> On Thu, Feb 23, 2017 at 4:44 PM, rahul challapalli 
> <challapallirahul@gmail.com> wrote:
>> You said there are 2144 parquet files, but the plan suggests that you
>> only have a single parquet file. In any case it's a long time to plan the query.
>> Did you try the metadata caching feature [1]?
>> 
>> Also how many rowgroups and columns are present in the parquet file?
>> 
>> [1] 
>> https://drill.apache.org/docs/optimizing-parquet-metadata-reading/
>> 
>> - Rahul
>> 
>> On Thu, Feb 23, 2017 at 4:24 PM, Jeena Vinod <jeena.vinod@oracle.com> wrote:
>> 
>>> Hi,
>>> 
>>> 
>>> 
>>> Drill is taking 23 minutes for a simple select * query with limit
>>> 100 on 1GB of uncompressed parquet data. EXPLAIN PLAN for this query is
>>> also taking that long (~23 minutes).
>>> 
>>> Query: select * from <plugin>.root.`testdata` limit 100;
>>> 
>>> Query  Plan:
>>> 
>>> 00-00    Screen : rowType = RecordType(ANY *): rowcount = 100.0,
>>> cumulative cost = {32810.0 rows, 33110.0 cpu, 0.0 io, 0.0 network, 
>>> 0.0 memory}, id = 1429
>>> 
>>> 00-01      Project(*=[$0]) : rowType = RecordType(ANY *): rowcount =
>>> 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0 
>>> network,
>>> 0.0 memory}, id = 1428
>>> 
>>> 00-02        SelectionVectorRemover : rowType = (DrillRecordRow[*]):
>>> rowcount = 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 
>>> io, 0.0 network, 0.0 memory}, id = 1427
>>> 
>>> 00-03          Limit(fetch=[100]) : rowType = (DrillRecordRow[*]):
>>> rowcount = 100.0, cumulative cost = {32700.0 rows, 33000.0 cpu, 0.0 
>>> io, 0.0 network, 0.0 memory}, id = 1426
>>> 
>>> 00-04            Scan(groupscan=[ParquetGroupScan
>>> [entries=[ReadEntryWithPath [path=/testdata/part-r-00000- 
>>> 097f7399-7bfb-4e93-b883-3348655fc658.parquet]], 
>>> selectionRoot=/testdata, numFiles=1, usedMetadataFile=true, 
>>> cacheFileRoot=/testdata,
>>> columns=[`*`]]]) : rowType = (DrillRecordRow[*]): rowcount = 
>>> 32600.0, cumulative cost = {32600.0 rows, 32600.0 cpu, 0.0 io, 0.0 
>>> network, 0.0 memory}, id = 1425
>>> 
>>> 
>>> 
>>> I am using Drill 1.8 and it is set up on a 5-node 32GB cluster, and the
>>> data is in Oracle Storage Cloud Service. When I run the same query
>>> on a 1GB TSV file in this location it takes only 38 seconds.
>>> 
>>> Also, testdata contains around 2144 .parquet files, each around 500KB.
>>> 
>>> 
>>> 
>>> Is there any additional configuration required for parquet?
>>> 
>>> Kindly suggest how to improve the response time here.
>>> 
>>> 
>>> 
>>> Regards
>>> Jeena

