drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Avner Levy <avner.l...@gmail.com>
Subject Re: Planning times
Date Mon, 15 Jun 2020 17:35:28 GMT
Thanks Rafael!
I've done as you suggested and planning on local files takes 0.3 second
compared to 1.5 second on the same query & files on S3.
I'll try to use some predefined schema and see if it helps to reduce query
planning times.
Thank you all for your help.
Avner

On Wed, Jun 10, 2020 at 12:04 PM Rafael Jaimes III <rafjaimes@gmail.com>
wrote:

> Parquet uses AVRO schema file, but I think that might be going down
> the wrong path. Drill's schema discovery should be really quick unless
> you have a really weird schema or non-consistent schema.
>
> It may indeed have something to do with S3 or data/network access
> latency. It's very straightforward to deploy a local embedded instance
> of Drill by using the embedded drillbit on a Linux machine. You can
> also use the Docker image from dockerhub if you are more comfortable
> with that. It takes less than 5 minutes to go from zero to querying a
> local parquet file you can use to test. You'll see the planning times
> significantly less on your local machine. Although the plugin is
> called "dfs" for HDFS, it works just fine on a regular drive and
> filesystem.
>
> Let us know if you see the improvement with a local query. If not, it
> may indicate something sub-optimal with your Parquet file. If yes, it
> would narrow down something with S3 as the limitation.
>
> I also recommend 1.17, unless there is something specific about 1.18
> that you are using.
>
> On Sun, Jun 7, 2020 at 10:53 PM Charles Givre <cgivre@gmail.com> wrote:
> >
> > Hi Avner,
> > So there is a way to define a schema both at query time and for a group
> of files.
> >
> > To add a schema at query time you can do the following:
> >
> > SELECT *
> > FROM table(dfs.`hdf5/non-scalar.h5` (type => 'hdf5', defaultPath =>
> '/compound', schema => 'inline=(field_1 int not null, field_2 double not
> null, field_3 varchar not null, fixed_field int not null default `20`)'))
> > WHERE field_1 < 5
> >
> > I think this will work for parquet files, but I'm not 100% certain.
> > Secondly, there is also the Drill metastore.  You can read about that
> here:  http://drill.apache.org/docs/using-drill-metastore/ <
> http://drill.apache.org/docs/using-drill-metastore/>
> >
> > Lastly, there is a way to provide an actual schema file for your parquet
> files but I'll have to dig that up.  However, with all that said, I didn't
> realize that you were doing this on S3. I'd concur with Rafael,  that you
> should try on your local filesystem first and see what the query times look
> like.  Generally speaking, for a small file size like that, querying a
> parquet file should be nearly instantaneous, with or without the schema or
> metastore.
> > Good luck!
> > -- C
> >
> >
> > > On Jun 7, 2020, at 11:08 AM, Avner Levy <avner.levy@gmail.com> wrote:
> > >
> > > Thanks Rafael,
> > > I'm doing my first steps with Drill so I'm still trying to figure out
> all
> > > the moving parts.
> > > I started with the master branch (1.18) Drill docker and just added
> the S3
> > > access plugin.
> > > Then I just submit the trivial query of asking one field from one file
> > > (trivial parquet file with few int/string fields) which has only 13
> rows
> > > (written with Parquet.Net) and has one row group.
> > > So I think I'm working as you suggested directly with the dfs directly.
> > > I've run the query several times.
> > > That is why I'm surprised from the high planning times for such a
> > > trivial scenario.
> > > Does the planning phase include accessing the file and analyzing the
> > > fields? (perhaps accessing S3 is taking the time?)
> > > If drill had the exact schema of the file (I thought this was the idea
> of
> > > the metastore), would it reduce the planning phase?
> > > Thanks,
> > > Avner
> > >
> > > On Sat, Jun 6, 2020 at 4:44 PM Rafael Jaimes III <rafjaimes@gmail.com>
> > > wrote:
> > >
> > >> Hi Avner,
> > >>
> > >> What do you mean by metastore, are you running it through a Hive
> metastore
> > >> and plugin?
> > >>
> > >> I would try to query against the dfs directly. I'm seeing much shorter
> > >> planning times than you with Drill 1.17 and no metastore. I don't
> usually
> > >> query a single file but I imagine that would be even faster.
> > >>
> > >> What program made the parquet file? Do you know what the row group
> size is
> > >> and is it the same as your HDFS block size? They should be for best
> > >> performance. Is the schema consistent within the file or do you have
> nested
> > >> fields?
> > >>
> > >> On Sat, Jun 6, 2020, 3:51 PM Avner Levy <avner.levy@gmail.com> wrote:
> > >>
> > >>> Hi Charles,
> > >>> I'm using master branch (1.18.0-SNAPSHOT) docker.
> > >>> I've enabled the metastore, session wise and run the same query
> twice but
> > >>> still got the following times.
> > >>> Is there a way to pre-define the table's schema in a way that will
> reduce
> > >>> the query time?
> > >>> The query is:
> > >>> *select name from `parquet/data.parquet` limit 1*
> > >>>
> > >>> Any idea why planning takes so long on such trivial query?
> > >>> Does it include accessing the file for schema discovery?
> > >>> I'm providing the specific filename in the queries in order to
> reduce the
> > >>> file listing part.
> > >>> Thanks for your help,
> > >>>  Avner
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> *DurationPlanning  Queued   Execution Total0.683 sec 0.000 sec 0.090
> sec
> > >>> 0.773 secOptions Overview Session OptionsName Valuemetastore.enabled
> > >> true*
> > >>>
> > >>>
> > >>> On Thu, Jun 4, 2020 at 9:09 PM Charles Givre <cgivre@gmail.com>
> wrote:
> > >>>
> > >>>> Hi Avner,
> > >>>> Maybe you said this already but what version of Drill are you using
> and
> > >>> do
> > >>>> you have the metastore enabled?
> > >>>> --C
> > >>>>
> > >>>>
> > >>>>
> > >>>>> On Jun 4, 2020, at 9:02 PM, Avner Levy <avner.levy@gmail.com>
> wrote:
> > >>>>>
> > >>>>> Thanks Rafael for your answer.
> > >>>>> As I wrote in the previous email these planning times occur
even
> when
> > >>>>> selecting one fields from one tiny file (60k) that I pass directly
> by
> > >>>> full
> > >>>>> path (select name from `parquet/data/data.parquet` limit 1).
> > >>>>> Any idea what can influence the time in such a trivial scenario?
> > >>>>> In addition, doesn't Drill cache execution plans between similar
> > >>> queries
> > >>>>> executions?
> > >>>>> Best regards,
> > >>>>> Avner
> > >>>>>
> > >>>>>
> > >>>>> On Thu, Jun 4, 2020 at 2:55 PM Rafael Jaimes III <
> > >> rafjaimes@gmail.com>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Hi Avner,
> > >>>>>>
> > >>>>>> One way you might be able to optimize this is by modifying
the
> size
> > >>>>>> and number of the parquet files. How many files do you
have and
> how
> > >>>>>> big are they? Do you know what the row group size is? What
is the
> > >> HDFS
> > >>>>>> block size is on your storage?
> > >>>>>>
> > >>>>>> There's probably a lot more intricate ways to improve performance
> > >> with
> > >>>>>> the Drill settings, but I have not modified them.
> > >>>>>>
> > >>>>>> - Rafael
> > >>>>>>
> > >>>>>> On Thu, Jun 4, 2020 at 2:43 PM Avner Levy <avner.levy@gmail.com>
> > >>> wrote:
> > >>>>>>>
> > >>>>>>> I'm running Apache Drill (1.18 master branch) in a
docker with
> data
> > >>>>>> stored
> > >>>>>>> in Parquet files on S3.
> > >>>>>>> When I run queries, even the most simple ones such
as:
> > >>>>>>>
> > >>>>>>> select name from `parquet/data/data.parquet` limit
1
> > >>>>>>>
> > >>>>>>> The "Planning" time is 0.7-1.5 sec while the "Execution"
is only
> > >>> 0.112
> > >>>>>> sec.
> > >>>>>>> These proportions are maintained even if I run the
same query
> > >>> multiple
> > >>>>>>> times in a row.
> > >>>>>>> Since I'm trying to minimize query times to a minimum,
I was
> > >>> wondering
> > >>>> if
> > >>>>>>> such planning times (compared to execution) make sense
and is
> there
> > >>> any
> > >>>>>> way
> > >>>>>>> to reduce it? (some plan caching mechanism)
> > >>>>>>> Thanks,
> > >>>>>>> Avner
> > >>>>>>
> > >>>>
> > >>>>
> > >>>
> > >>
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message