drill-user mailing list archives

From Charles Givre <cgi...@gmail.com>
Subject Re: Planning times
Date Mon, 08 Jun 2020 02:52:20 GMT
Hi Avner, 
So there is a way to define a schema both at query time and for a group of files. 

To add a schema at query time you can do the following:

SELECT *
FROM table(dfs.`hdf5/non-scalar.h5` (type => 'hdf5', defaultPath => '/compound',
    schema => 'inline=(field_1 int not null, field_2 double not null,
                       field_3 varchar not null, fixed_field int not null default `20`)'))
WHERE field_1 < 5

I think this will work for parquet files, but I'm not 100% certain.  
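Adapted to your query it would look something like the following — an untested sketch: the column name `name` comes from your example, and whether the parquet reader actually honors an inline schema is exactly the part I'm not certain about:

```sql
-- Untested sketch: same table-function syntax applied to a parquet file.
-- Whether the parquet format plugin honors a provided inline schema is unconfirmed.
SELECT name
FROM table(dfs.`parquet/data.parquet` (type => 'parquet',
    schema => 'inline=(name varchar not null)'))
LIMIT 1
```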
Secondly, there is also the Drill metastore.  You can read about that here:  http://drill.apache.org/docs/using-drill-metastore/
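Once the metastore is enabled, you populate it per table with ANALYZE TABLE ... REFRESH METADATA — roughly like this, with your own path substituted (I haven't run it against your data):

```sql
-- Collect table metadata (schema, row counts, etc.) into the Drill metastore,
-- so subsequent queries can plan from stored metadata instead of re-reading footers.
ANALYZE TABLE dfs.`parquet/data.parquet` REFRESH METADATA;
```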

Lastly, there is a way to provide an actual schema file for your parquet files, but I'll
have to dig that up.  However, with all that said, I didn't realize that you were doing this
on S3.  I'd concur with Rafael that you should try on your local filesystem first and see
what the query times look like.  Generally speaking, for a small file like that, querying
a parquet file should be nearly instantaneous, with or without the schema or metastore.
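For the schema-file approach, I believe what I'm thinking of is the CREATE SCHEMA command (Drill 1.16+), which stores a provided schema alongside the table — a rough sketch, with placeholder column names and not verified against parquet specifically:

```sql
-- Sketch only: defines a provided schema for a workspace table.
-- Column names/types here are placeholders, not Avner's actual schema.
CREATE OR REPLACE SCHEMA (name VARCHAR NOT NULL, id INT NOT NULL)
FOR TABLE dfs.`parquet/data`;
```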
Good luck!
-- C 


> On Jun 7, 2020, at 11:08 AM, Avner Levy <avner.levy@gmail.com> wrote:
> 
> Thanks Rafael,
> I'm taking my first steps with Drill, so I'm still trying to figure out all
> the moving parts.
> I started with the master branch (1.18) Drill docker and just added the S3
> access plugin.
> Then I just submitted a trivial query asking for one field from one file
> (a trivial parquet file with a few int/string fields) which has only 13 rows
> (written with Parquet.Net) and one row group.
> So I think I'm working directly with the dfs, as you suggested.
> I've run the query several times.
> That is why I'm surprised by the high planning times for such a
> trivial scenario.
> Does the planning phase include accessing the file and analyzing the
> fields? (perhaps accessing S3 is taking the time?)
> If drill had the exact schema of the file (I thought this was the idea of
> the metastore), would it reduce the planning phase?
> Thanks,
> Avner
> 
> On Sat, Jun 6, 2020 at 4:44 PM Rafael Jaimes III <rafjaimes@gmail.com>
> wrote:
> 
>> Hi Avner,
>> 
>> What do you mean by metastore, are you running it through a Hive metastore
>> and plugin?
>> 
>> I would try to query against the dfs directly. I'm seeing much shorter
>> planning times than you with Drill 1.17 and no metastore. I don't usually
>> query a single file but I imagine that would be even faster.
>> 
>> What program made the parquet file? Do you know what the row group size is
>> and is it the same as your HDFS block size? They should be for best
>> performance. Is the schema consistent within the file or do you have nested
>> fields?
>> 
>> On Sat, Jun 6, 2020, 3:51 PM Avner Levy <avner.levy@gmail.com> wrote:
>> 
>>> Hi Charles,
>>> I'm using master branch (1.18.0-SNAPSHOT) docker.
>>> I've enabled the metastore session-wise and ran the same query twice, but
>>> still got the following times.
>>> Is there a way to pre-define the table's schema in a way that will reduce
>>> the query time?
>>> The query is:
>>> *select name from `parquet/data.parquet` limit 1*
>>> 
>>> Any idea why planning takes so long on such a trivial query?
>>> Does it include accessing the file for schema discovery?
>>> I'm providing the specific filename in the queries in order to reduce the
>>> file listing part.
>>> Thanks for your help,
>>>  Avner
>>> 
>>> 
>>> *Duration:
>>>   Planning:  0.683 sec
>>>   Queued:    0.000 sec
>>>   Execution: 0.090 sec
>>>   Total:     0.773 sec
>>> Session Options: metastore.enabled = true*
>>> 
>>> 
>>> On Thu, Jun 4, 2020 at 9:09 PM Charles Givre <cgivre@gmail.com> wrote:
>>> 
>>>> Hi Avner,
>>>> Maybe you said this already but what version of Drill are you using and
>>> do
>>>> you have the metastore enabled?
>>>> --C
>>>> 
>>>> 
>>>> 
>>>>> On Jun 4, 2020, at 9:02 PM, Avner Levy <avner.levy@gmail.com> wrote:
>>>>> 
>>>>> Thanks Rafael for your answer.
>>>>> As I wrote in the previous email, these planning times occur even when
>>>>> selecting one field from one tiny file (60k) that I pass directly by full
>>>>> path (select name from `parquet/data/data.parquet` limit 1).
>>>>> Any idea what can influence the time in such a trivial scenario?
>>>>> In addition, doesn't Drill cache execution plans between similar query
>>>>> executions?
>>>>> Best regards,
>>>>> Avner
>>>>> 
>>>>> 
>>>>> On Thu, Jun 4, 2020 at 2:55 PM Rafael Jaimes III <rafjaimes@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> Hi Avner,
>>>>>> 
>>>>>> One way you might be able to optimize this is by modifying the size
>>>>>> and number of the parquet files. How many files do you have and how
>>>>>> big are they? Do you know what the row group size is? What is the HDFS
>>>>>> block size on your storage?
>>>>>> 
>>>>>> There are probably more intricate ways to improve performance with the
>>>>>> Drill settings, but I have not modified them.
>>>>>> 
>>>>>> - Rafael
>>>>>> 
>>>>>> On Thu, Jun 4, 2020 at 2:43 PM Avner Levy <avner.levy@gmail.com>
>>>>>> wrote:
>>>>>>> 
>>>>>>> I'm running Apache Drill (1.18 master branch) in a docker with data
>>>>>>> stored in Parquet files on S3.
>>>>>>> When I run queries, even the most simple ones such as:
>>>>>>> 
>>>>>>> select name from `parquet/data/data.parquet` limit 1
>>>>>>> 
>>>>>>> The "Planning" time is 0.7-1.5 sec while the "Execution" is only
>>> 0.112
>>>>>> sec.
>>>>>>> These proportions are maintained even if I run the same query
>>>>>>> multiple times in a row.
>>>>>>> Since I'm trying to reduce query times to a minimum, I was wondering
>>>>>>> whether such planning times (compared to execution) make sense and
>>>>>>> whether there is any way to reduce them (some plan caching mechanism)?
>>>>>>> Thanks,
>>>>>>> Avner
>>>>>> 
>>>> 
>>>> 
>>> 
>> 

