drill-user mailing list archives

From Rafael Jaimes III <rafjai...@gmail.com>
Subject Re: Planning times
Date Sat, 06 Jun 2020 20:44:33 GMT
Hi Avner,

What do you mean by metastore? Are you running it through a Hive metastore
and plugin?

I would try to query against the dfs directly. I'm seeing much shorter
planning times than you with Drill 1.17 and no metastore. I don't usually
query a single file but I imagine that would be even faster.

What program made the parquet file? Do you know what the row group size is,
and is it the same as your HDFS block size? The two should match for best
performance. Is the schema consistent within the file or do you have nested
fields?
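
A minimal sketch of how the row group sizes could be checked. It assumes
pyarrow is installed and a 128 MiB block size; neither comes from this
thread, and the function names are made up for illustration:

```python
# Sketch only: pyarrow and the 128 MiB block size are assumptions.
BLOCK_SIZE = 128 * 1024 * 1024  # assumed HDFS block size in bytes


def read_row_group_sizes(path):
    """Return the uncompressed byte size of each row group in a Parquet file."""
    import pyarrow.parquet as pq  # third-party, imported only when needed

    meta = pq.ParquetFile(path).metadata
    return [meta.row_group(i).total_byte_size for i in range(meta.num_row_groups)]


def oversized_row_groups(sizes, block_size=BLOCK_SIZE):
    """Return the indexes of row groups that are larger than one block."""
    return [i for i, size in enumerate(sizes) if size > block_size]


# Example use (needs pyarrow and a local copy of the file):
# sizes = read_row_group_sizes("parquet/data/data.parquet")
# print(len(sizes), oversized_row_groups(sizes))
```

Note that total_byte_size is the uncompressed size, so the on-disk size of a
row group will usually be smaller than what this reports.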

On Sat, Jun 6, 2020, 3:51 PM Avner Levy <avner.levy@gmail.com> wrote:

> Hi Charles,
> I'm using the master branch (1.18.0-SNAPSHOT) Docker image.
> I've enabled the metastore at the session level and ran the same query
> twice, but still got the times below.
> Is there a way to pre-define the table's schema in a way that will reduce
> the query time?
> The query is:
> *select name from `parquet/data.parquet` limit 1*
>
> Any idea why planning takes so long on such a trivial query?
> Does it include accessing the file for schema discovery?
> I'm providing the specific filename in the queries in order to reduce the
> file listing part.
> Thanks for your help,
>   Avner
>
> Duration
> Planning    Queued      Execution   Total
> 0.683 sec   0.000 sec   0.090 sec   0.773 sec
>
> Options Overview: Session Options
> Name                Value
> metastore.enabled   true
>
>
> On Thu, Jun 4, 2020 at 9:09 PM Charles Givre <cgivre@gmail.com> wrote:
>
> > Hi Avner,
> > Maybe you said this already but what version of Drill are you using and
> > do you have the metastore enabled?
> > --C
> >
> >
> >
> > > On Jun 4, 2020, at 9:02 PM, Avner Levy <avner.levy@gmail.com> wrote:
> > >
> > > Thanks Rafael for your answer.
> > > As I wrote in the previous email, these planning times occur even when
> > > selecting one field from one tiny file (60k) that I pass directly by full
> > > path (select name from `parquet/data/data.parquet` limit 1).
> > > Any idea what can influence the time in such a trivial scenario?
> > > In addition, doesn't Drill cache execution plans between executions of
> > > similar queries?
> > > Best regards,
> > > Avner
> > >
> > >
> > > On Thu, Jun 4, 2020 at 2:55 PM Rafael Jaimes III <rafjaimes@gmail.com>
> > > wrote:
> > >
> > >> Hi Avner,
> > >>
> > >> One way you might be able to optimize this is by modifying the size
> > >> and number of the parquet files. How many files do you have and how
> > >> big are they? Do you know what the row group size is? What is the HDFS
> > >> block size on your storage?
> > >>
> > >> There are probably many more intricate ways to improve performance with
> > >> the Drill settings, but I have not modified them.
> > >>
> > >> - Rafael
> > >>
> > >> On Thu, Jun 4, 2020 at 2:43 PM Avner Levy <avner.levy@gmail.com> wrote:
> > >>>
> > >>> I'm running Apache Drill (1.18 master branch) in a Docker container
> > >>> with data stored in Parquet files on S3.
> > >>> When I run queries, even the most simple ones such as:
> > >>>
> > >>> select name from `parquet/data/data.parquet` limit 1
> > >>>
> > >>> The "Planning" time is 0.7-1.5 sec while the "Execution" is only 0.112
> > >>> sec.
> > >>> These proportions are maintained even if I run the same query multiple
> > >>> times in a row.
> > >>> Since I'm trying to keep query times to a minimum, I was wondering
> > >>> whether such planning times (compared to execution) make sense and
> > >>> whether there is any way to reduce them (some plan caching mechanism)?
> > >>> Thanks,
> > >>>  Avner
> > >>
> >
> >
>
