drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jinfeng Ni <jinfengn...@gmail.com>
Subject Re: Query Planning and Directory Pruning
Date Tue, 09 Feb 2016 19:32:26 GMT
Hi John,

I think the patch for DRILL-2517 [1] would help a bit, if you use
parquet files. DRILL-2517 would save the overhead of reading parquet
metadata from parquet footer, by pruning the directory first.  (It
list some preliminary performance results, similar to the setup you
had )

However, there are at least two follow-up tasks that has to be
completed, before we could say that this performance problem is
solved. The major parts of planning is spent in three components: 1)
get the list of file status under the directory 2) get the parquet
metadata, 3) evaluating the filter in planning rule. DRILL-2517
address the 2nd part partially.  In my local run, the 2nd is the
biggest consumer of planning time. We still have to improve 3rd part,
by doing something related to DRILL-3759. Only after that's done, we
should see no big difference between your two queries.


[1] https://issues.apache.org/jira/browse/DRILL-2517

On Tue, Feb 9, 2016 at 9:38 AM, John Omernik <john@omernik.com> wrote:
> I am sorry that wasn't clearer, I am not sure why I used those examples,
> but if have a table
>
> table
> table/2016-02-04
> table/2016-02-03
> ...
> table/2015-07-01
>
>
> (so a subdirectory for each day for half a year)
>
>
> then I make a new empty directory of
>
> table1
> table1/2016-02-04
> table1/2016-02-03
>
> (only two subdirectories that I copied EXACTLY from table)
>
> select count(1) from table1 where dir0 = '2016-02-04'
> returns the count in 0.313 seconds
>
> select count(1) from table where dir0 = '2016-02-04'
> returns the same exact count in 9.879  seconds
>
> So yes, this applies to the dir pruning issue.
>
> Thanks!
>
> John
>
> On Tue, Feb 9, 2016 at 11:02 AM, Aman Sinha <amansinha@apache.org> wrote:
>
>> At a glance, John's query does not have a WHERE clause..it is querying the
>> subdirectory directly in the FROM clause..in this case Drill will only look
>> at the files within that subdirectory.  Directory pruning only comes into
>> the picture when there is a WHERE condition on dir0, dir1 etc.
>>
>>
>> On Tue, Feb 9, 2016 at 8:25 AM, Neeraja Rentachintala <
>> nrentachintala@maprtech.com> wrote:
>>
>> > Yes, Drill-3759 covers it.
>> > This is a high priority enhancement that we are trying to get to in the
>> > next couple of releases.
>> >
>> > -Neeraja
>> >
>> > On Tue, Feb 9, 2016 at 7:32 AM, John Omernik <john@omernik.com> wrote:
>> >
>> > > This one seems to cover it:
>> > >
>> > > https://issues.apache.org/jira/browse/DRILL-3759
>> > >
>> > >
>> > >
>> > > On Tue, Feb 9, 2016 at 9:25 AM, Abdel Hakim Deneche <
>> > adeneche@maprtech.com
>> > > >
>> > > wrote:
>> > >
>> > > > Hi John,
>> > > >
>> > > > Sorry I didn't get back to you (I thought I did).
>> > > >
>> > > > No, I don't need the plan, I just wanted to confirm what was taking
>> > most
>> > > of
>> > > > the time and you already confirmed it's the planning.
>> > > >
>> > > > Can you open a JIRA for this ? this may be a known issue, but I'm
not
>> > > sure.
>> > > >
>> > > > Thanks
>> > > >
>> > > > On Tue, Feb 9, 2016 at 6:08 AM, John Omernik <john@omernik.com>
>> wrote:
>> > > >
>> > > > > Abdel, do you still need the plans, as I said, if your table
has
>> any
>> > > > decent
>> > > > > amount of directories and files, it looks like the planning is
>> > touching
>> > > > all
>> > > > > the directories even though you are pruning.  I can post plans,
>> > > however,
>> > > > I
>> > > > > think in this case you'll find they are exactly the same, and
the
>> > only
>> > > > > difference is that the longer queries is planning much more because
>> > it
>> > > > has
>> > > > > more files to read.
>> > > > >
>> > > > >
>> > > > > On Thu, Feb 4, 2016 at 10:46 AM, John Omernik <john@omernik.com>
>> > > wrote:
>> > > > >
>> > > > > > I can package up both plans for you if you need them (let
me know
>> > if
>> > > > you
>> > > > > > still want them) but I can tell you the plans were EXACTLY
the
>> > same,
>> > > > > > however the data-sum table took 0.932 seconds to plan the
query,
>> > and
>> > > > the
>> > > > > > data table (the one with the all the extra data) took 11.379
>> > seconds
>> > > to
>> > > > > > plan the query. Indicating to me the issue isn't in the
plan that
>> > was
>> > > > > > created, but the actual planning process. (Let me know if
you
>> > > disagree
>> > > > or
>> > > > > > still need to see the plan, like I said, the actual plans
were
>> > > exactly
>> > > > > the
>> > > > > > same)
>> > > > > >
>> > > > > >
>> > > > > > John.
>> > > > > >
>> > > > > >
>> > > > > > On Thu, Feb 4, 2016 at 10:31 AM, Abdel Hakim Deneche <
>> > > > > > adeneche@maprtech.com> wrote:
>> > > > > >
>> > > > > >> Hey John, can you try an explain plan for both queries
and see
>> how
>> > > > much
>> > > > > >> times it takes ?
>> > > > > >>
>> > > > > >> for example, for the first query you would run:
>> > > > > >>
>> > > > > >> *explain plan for* select count(1) from `data/2016-02-03`;
>> > > > > >>
>> > > > > >> It can also be helpful if you could share the query
profiles for
>> > > both
>> > > > > >> queries.
>> > > > > >>
>> > > > > >> Thanks
>> > > > > >>
>> > > > > >> On Thu, Feb 4, 2016 at 8:15 AM, John Omernik <john@omernik.com>
>> > > > wrote:
>> > > > > >>
>> > > > > >> > Hey all, I think am I seeing an issue related to
>> > > > > >> > https://issues.apache.org/jira/browse/DRILL-3759
but I want
>> to
>> > > > > >> describe it
>> > > > > >> > out here, see if it's really the case, and then
determine what
>> > the
>> > > > > >> blockers
>> > > > > >> > may be to resolution.
>> > > > > >> >
>> > > > > >> > I am using the MapR Developer Release 1.4, and
I have a
>> > directory
>> > > > with
>> > > > > >> > subdirectories by data.
>> > > > > >> >
>> > > > > >> > data/2015-01-01
>> > > > > >> > data/2015-01-02
>> > > > > >> > data/2015-01-03
>> > > > > >> >
>> > > > > >> > These are stored as Parquet files.  At this point
Each data
>> > > averages
>> > > > > >> about
>> > > > > >> > 1 GB of data, and has roughly 75 parquet files
in it.
>> > > > > >> >
>> > > > > >> > When I run
>> > > > > >> >
>> > > > > >> > select count(1) from `data/2016-02-03` it takes
roughly 11
>> > > seconds.
>> > > > > >> >
>> > > > > >> > If I copy the 2016-02-03 directory to a new base
(date-sum)
>> and
>> > > run
>> > > > > >> >
>> > > > > >> > select count(1) from `data_sum/2016-02-03` it runs
in 0.874
>> > > seconds.
>> > > > > >> >
>> > > > > >> > Same data, same structure, only difference is the
data_sum
>> > > directory
>> > > > > >> only
>> > > > > >> > has a few directories, iand data has dates going
back to Nov
>> > 2015.
>> > > > It
>> > > > > >> > seems like it is getting files name for all files
in each
>> > > directory
>> > > > > >> prior
>> > > > > >> > to pruning which seems to me to be adding a lot
of latency to
>> > > > queries
>> > > > > >> that
>> > > > > >> > doesn't need to be there.  (thus I think I am seeing
3759)
>> but I
>> > > > > wanted
>> > > > > >> to
>> > > > > >> > confirm, and then I wanted to see how we can address
this in
>> > that
>> > > > the
>> > > > > >> > directory prune should be fast, and on large data
sets its
>> just
>> > > > going
>> > > > > to
>> > > > > >> > get worse and worse.
>> > > > > >> >
>> > > > > >> >
>> > > > > >> >
>> > > > > >> > John
>> > > > > >> >
>> > > > > >>
>> > > > > >>
>> > > > > >>
>> > > > > >> --
>> > > > > >>
>> > > > > >> Abdelhakim Deneche
>> > > > > >>
>> > > > > >> Software Engineer
>> > > > > >>
>> > > > > >>   <http://www.mapr.com/>
>> > > > > >>
>> > > > > >>
>> > > > > >> Now Available - Free Hadoop On-Demand Training
>> > > > > >> <
>> > > > > >>
>> > > > >
>> > > >
>> > >
>> >
>> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
>> > > > > >> >
>> > > > > >>
>> > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > >
>> > > > Abdelhakim Deneche
>> > > >
>> > > > Software Engineer
>> > > >
>> > > >   <http://www.mapr.com/>
>> > > >
>> > > >
>> > > > Now Available - Free Hadoop On-Demand Training
>> > > > <
>> > > >
>> > >
>> >
>> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
>> > > > >
>> > > >
>> > >
>> >
>>

Mime
View raw message