drill-user mailing list archives

From rahul challapalli <challapallira...@gmail.com>
Subject Re: Explain Plan for Parquet data is taking a lot of time
Date Fri, 24 Feb 2017 00:44:47 GMT
You said there are 2144 parquet files, but the plan suggests that you have
only a single parquet file. In any case, that is a long time to plan the
query. Did you try the metadata caching feature [1]?

Also, how many row groups and columns are present in the parquet file?

[1] https://drill.apache.org/docs/optimizing-parquet-metadata-reading/
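For reference, the metadata cache in [1] is built with a single command run
once against the table directory (a sketch; `<plugin>` is a placeholder for
your storage plugin name, as written in your query):

```sql
-- Generate (or refresh) the Parquet metadata cache for the directory.
-- After this, planning reads the cache file instead of opening the
-- footer of every one of the ~2144 small parquet files.
REFRESH TABLE METADATA <plugin>.root.`testdata`;
```

The cache must be refreshed again after new files are added to the directory.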

- Rahul

On Thu, Feb 23, 2017 at 4:24 PM, Jeena Vinod <jeena.vinod@oracle.com> wrote:

> Hi,
>
>
>
> Drill is taking 23 minutes for a simple select * query with limit 100 on
> 1 GB of uncompressed parquet data. EXPLAIN PLAN for this query also takes
> about as long (~23 minutes).
>
> Query: select * from <plugin>.root.`testdata` limit 100;
>
> Query  Plan:
>
> 00-00    Screen : rowType = RecordType(ANY *): rowcount = 100.0,
> cumulative cost = {32810.0 rows, 33110.0 cpu, 0.0 io, 0.0 network, 0.0
> memory}, id = 1429
>
> 00-01      Project(*=[$0]) : rowType = RecordType(ANY *): rowcount =
> 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0 network,
> 0.0 memory}, id = 1428
>
> 00-02        SelectionVectorRemover : rowType = (DrillRecordRow[*]):
> rowcount = 100.0, cumulative cost = {32800.0 rows, 33100.0 cpu, 0.0 io, 0.0
> network, 0.0 memory}, id = 1427
>
> 00-03          Limit(fetch=[100]) : rowType = (DrillRecordRow[*]):
> rowcount = 100.0, cumulative cost = {32700.0 rows, 33000.0 cpu, 0.0 io, 0.0
> network, 0.0 memory}, id = 1426
>
> 00-04            Scan(groupscan=[ParquetGroupScan
> [entries=[ReadEntryWithPath [path=/testdata/part-r-00000-
> 097f7399-7bfb-4e93-b883-3348655fc658.parquet]], selectionRoot=/testdata,
> numFiles=1, usedMetadataFile=true, cacheFileRoot=/testdata,
> columns=[`*`]]]) : rowType = (DrillRecordRow[*]): rowcount = 32600.0,
> cumulative cost = {32600.0 rows, 32600.0 cpu, 0.0 io, 0.0 network, 0.0
> memory}, id = 1425
>
>
>
> I am using Drill 1.8, set up on a 5-node, 32 GB cluster, and the data is
> in Oracle Storage Cloud Service. When I run the same query on a 1 GB TSV
> file in the same location, it takes only 38 seconds.
>
> Also, testdata contains around 2144 .parquet files, each around 500 KB.
>
>
>
> Is there any additional configuration required for parquet?
>
> Kindly suggest how to improve the response time here.
>
>
>
> Regards
> Jeena
>
