drill-user mailing list archives

From Rafael Jaimes III <rafjai...@gmail.com>
Subject Re: Planning times
Date Wed, 10 Jun 2020 16:03:56 GMT
Parquet can carry an Avro schema, but I think that might be going down
the wrong path. Drill's schema discovery should be really quick unless
you have a really unusual or inconsistent schema.

It may indeed have something to do with S3 or data/network access
latency. It's very straightforward to deploy a local embedded instance
of Drill by running the embedded drillbit on a Linux machine. You can
also use the Docker image from Docker Hub if you are more comfortable
with that. It takes less than 5 minutes to go from zero to querying a
local Parquet file you can use to test. You should see significantly
lower planning times on your local machine. Although the plugin is
called "dfs", it works just fine on a regular drive and filesystem,
not only HDFS.
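
A minimal local test might look like the following (the mount path and
file name are placeholders, not from your setup; the apache/drill image
on Docker Hub starts an embedded Drill with a SQLLine prompt):

```shell
# Hypothetical quick local test; /tmp/drill-data and data.parquet are
# placeholders -- point them at a directory holding a real Parquet file.
docker run --rm -i -t \
  -p 8047:8047 \
  -v /tmp/drill-data:/data \
  apache/drill
# At the sqlline prompt that opens, query the local file directly:
#   SELECT name FROM dfs.`/data/data.parquet` LIMIT 1;
# Then compare the "Planning" time in the query profile
# (http://localhost:8047/profiles) against the S3 run.
```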

Let us know if you see the improvement with a local query. If not, it
may indicate something sub-optimal in your Parquet file. If you do,
that narrows the problem down to S3.

I also recommend 1.17, unless there is something specific about 1.18
that you are using.

On Sun, Jun 7, 2020 at 10:53 PM Charles Givre <cgivre@gmail.com> wrote:
>
> Hi Avner,
> So there is a way to define a schema both at query time and for a group of files.
>
> To add a schema at query time you can do the following:
>
> SELECT *
> FROM table(dfs.`hdf5/non-scalar.h5` (type => 'hdf5', defaultPath => '/compound',
> schema => 'inline=(field_1 int not null, field_2 double not null, field_3 varchar not null,
> fixed_field int not null default `20`)'))
> WHERE field_1 < 5
>
> I think this will work for parquet files, but I'm not 100% certain.
> Secondly, there is also the Drill metastore.  You can read about that here:  http://drill.apache.org/docs/using-drill-metastore/
>
> Lastly, there is a way to provide an actual schema file for your parquet files, but I'll
> have to dig that up.  However, with all that said, I didn't realize that you were doing this
> on S3. I'd concur with Rafael that you should try your local filesystem first and see
> what the query times look like.  Generally speaking, for a small file like that, querying
> a parquet file should be nearly instantaneous, with or without the schema or metastore.
> Good luck!
> -- C
>
>
> > On Jun 7, 2020, at 11:08 AM, Avner Levy <avner.levy@gmail.com> wrote:
> >
> > Thanks Rafael,
> > I'm doing my first steps with Drill so I'm still trying to figure out all
> > the moving parts.
> > I started with the master branch (1.18) Drill docker and just added the S3
> > access plugin.
> > Then I just submit the trivial query of asking one field from one file
> > (trivial parquet file with few int/string fields) which has only 13 rows
> > (written with Parquet.Net) and has one row group.
> > So I think I'm working, as you suggested, directly with the dfs.
> > I've run the query several times.
> > That is why I'm surprised by the high planning times for such a
> > trivial scenario.
> > Does the planning phase include accessing the file and analyzing the
> > fields? (perhaps accessing S3 is taking the time?)
> > If drill had the exact schema of the file (I thought this was the idea of
> > the metastore), would it reduce the planning phase?
> > Thanks,
> > Avner
> >
> > On Sat, Jun 6, 2020 at 4:44 PM Rafael Jaimes III <rafjaimes@gmail.com>
> > wrote:
> >
> >> Hi Avner,
> >>
> >> What do you mean by metastore, are you running it through a Hive metastore
> >> and plugin?
> >>
> >> I would try to query against the dfs directly. I'm seeing much shorter
> >> planning times than you with Drill 1.17 and no metastore. I don't usually
> >> query a single file but I imagine that would be even faster.
> >>
> >> What program made the parquet file? Do you know what the row group size is
> >> and is it the same as your HDFS block size? They should be for best
> >> performance. Is the schema consistent within the file or do you have nested
> >> fields?
> >>
> >> On Sat, Jun 6, 2020, 3:51 PM Avner Levy <avner.levy@gmail.com> wrote:
> >>
> >>> Hi Charles,
> >>> I'm using master branch (1.18.0-SNAPSHOT) docker.
> >>> I've enabled the metastore, session wise and run the same query twice but
> >>> still got the following times.
> >>> Is there a way to pre-define the table's schema in a way that will reduce
> >>> the query time?
> >>> The query is:
> >>> *select name from `parquet/data.parquet` limit 1*
> >>>
> >>> Any idea why planning takes so long on such a trivial query?
> >>> Does it include accessing the file for schema discovery?
> >>> I'm providing the specific filename in the queries in order to reduce the
> >>> file listing part.
> >>> Thanks for your help,
> >>>  Avner
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> Duration: Planning 0.683 sec | Queued 0.000 sec | Execution 0.090 sec |
> >>> Total 0.773 sec
> >>> Session options: metastore.enabled = true
> >>>
> >>>
> >>> On Thu, Jun 4, 2020 at 9:09 PM Charles Givre <cgivre@gmail.com> wrote:
> >>>
> >>>> Hi Avner,
> >>>> Maybe you said this already, but what version of Drill are you using
> >>>> and do you have the metastore enabled?
> >>>> --C
> >>>>
> >>>>
> >>>>
> >>>>> On Jun 4, 2020, at 9:02 PM, Avner Levy <avner.levy@gmail.com> wrote:
> >>>>>
> >>>>> Thanks Rafael for your answer.
> >>>>> As I wrote in the previous email, these planning times occur even when
> >>>>> selecting one field from one tiny file (60k) that I pass directly by
> >>>>> full path (select name from `parquet/data/data.parquet` limit 1).
> >>>>> Any idea what can influence the time in such a trivial scenario?
> >>>>> In addition, doesn't Drill cache execution plans between similar
> >>>>> query executions?
> >>>>> Best regards,
> >>>>> Avner
> >>>>>
> >>>>>
> >>>>> On Thu, Jun 4, 2020 at 2:55 PM Rafael Jaimes III <rafjaimes@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi Avner,
> >>>>>>
> >>>>>> One way you might be able to optimize this is by modifying the size
> >>>>>> and number of the parquet files. How many files do you have and how
> >>>>>> big are they? Do you know what the row group size is? What is the
> >>>>>> HDFS block size on your storage?
> >>>>>>
> >>>>>> There are probably more intricate ways to improve performance with
> >>>>>> the Drill settings, but I have not modified them.
> >>>>>>
> >>>>>> - Rafael
> >>>>>>
> >>>>>> On Thu, Jun 4, 2020 at 2:43 PM Avner Levy <avner.levy@gmail.com> wrote:
> >>>>>>>
> >>>>>>> I'm running Apache Drill (1.18 master branch) in a docker with data
> >>>>>>> stored in Parquet files on S3.
> >>>>>>> When I run queries, even the most simple ones such as:
> >>>>>>>
> >>>>>>> select name from `parquet/data/data.parquet` limit 1
> >>>>>>>
> >>>>>>> The "Planning" time is 0.7-1.5 sec while the "Execution" is only
> >>>>>>> 0.112 sec.
> >>>>>>> These proportions are maintained even if I run the same query
> >>>>>>> multiple times in a row.
> >>>>>>> Since I'm trying to reduce query times to a minimum, I was wondering
> >>>>>>> whether such planning times (compared to execution) make sense, and
> >>>>>>> whether there is any way to reduce them (some plan caching mechanism)?
> >>>>>>> Thanks,
> >>>>>>> Avner
> >>>>>>
> >>>>
> >>>>
> >>>
> >>
>
