drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From John Omernik <j...@omernik.com>
Subject Re: Query Planning and Directory Pruning
Date Tue, 09 Feb 2016 17:38:12 GMT
I am sorry that wasn't clearer, I am not sure why I used those examples,
but if have a table

table
table/2016-02-04
table/2016-02-03
...
table/2015-07-01


(so a subdirectory for each day for half a year)


then I make a new empty directory of

table1
table1/2016-02-04
table1/2016-02-03

(only two subdirectories that I copied EXACTLY from table)

select count(1) from table1 where dir0 = '2016-02-04'
returns the count in 0.313 seconds

select count(1) from table where dir0 = '2016-02-04'
returns the same exact count in 9.879  seconds

So yes, this applies to the dir pruning issue.

Thanks!

John

On Tue, Feb 9, 2016 at 11:02 AM, Aman Sinha <amansinha@apache.org> wrote:

> At a glance, John's query does not have a WHERE clause..it is querying the
> subdirectory directly in the FROM clause..in this case Drill will only look
> at the files within that subdirectory.  Directory pruning only comes into
> the picture when there is a WHERE condition on dir0, dir1 etc.
>
>
> On Tue, Feb 9, 2016 at 8:25 AM, Neeraja Rentachintala <
> nrentachintala@maprtech.com> wrote:
>
> > Yes, Drill-3759 covers it.
> > This is a high priority enhancement that we are trying to get to in the
> > next couple of releases.
> >
> > -Neeraja
> >
> > On Tue, Feb 9, 2016 at 7:32 AM, John Omernik <john@omernik.com> wrote:
> >
> > > This one seems to cover it:
> > >
> > > https://issues.apache.org/jira/browse/DRILL-3759
> > >
> > >
> > >
> > > On Tue, Feb 9, 2016 at 9:25 AM, Abdel Hakim Deneche <
> > adeneche@maprtech.com
> > > >
> > > wrote:
> > >
> > > > Hi John,
> > > >
> > > > Sorry I didn't get back to you (I thought I did).
> > > >
> > > > No, I don't need the plan, I just wanted to confirm what was taking
> > most
> > > of
> > > > the time and you already confirmed it's the planning.
> > > >
> > > > Can you open a JIRA for this ? this may be a known issue, but I'm not
> > > sure.
> > > >
> > > > Thanks
> > > >
> > > > On Tue, Feb 9, 2016 at 6:08 AM, John Omernik <john@omernik.com>
> wrote:
> > > >
> > > > > Abdel, do you still need the plans, as I said, if your table has
> any
> > > > decent
> > > > > amount of directories and files, it looks like the planning is
> > touching
> > > > all
> > > > > the directories even though you are pruning.  I can post plans,
> > > however,
> > > > I
> > > > > think in this case you'll find they are exactly the same, and the
> > only
> > > > > difference is that the longer queries is planning much more because
> > it
> > > > has
> > > > > more files to read.
> > > > >
> > > > >
> > > > > On Thu, Feb 4, 2016 at 10:46 AM, John Omernik <john@omernik.com>
> > > wrote:
> > > > >
> > > > > > I can package up both plans for you if you need them (let me
know
> > if
> > > > you
> > > > > > still want them) but I can tell you the plans were EXACTLY the
> > same,
> > > > > > however the data-sum table took 0.932 seconds to plan the query,
> > and
> > > > the
> > > > > > data table (the one with the all the extra data) took 11.379
> > seconds
> > > to
> > > > > > plan the query. Indicating to me the issue isn't in the plan
that
> > was
> > > > > > created, but the actual planning process. (Let me know if you
> > > disagree
> > > > or
> > > > > > still need to see the plan, like I said, the actual plans were
> > > exactly
> > > > > the
> > > > > > same)
> > > > > >
> > > > > >
> > > > > > John.
> > > > > >
> > > > > >
> > > > > > On Thu, Feb 4, 2016 at 10:31 AM, Abdel Hakim Deneche <
> > > > > > adeneche@maprtech.com> wrote:
> > > > > >
> > > > > >> Hey John, can you try an explain plan for both queries and
see
> how
> > > > much
> > > > > >> times it takes ?
> > > > > >>
> > > > > >> for example, for the first query you would run:
> > > > > >>
> > > > > >> *explain plan for* select count(1) from `data/2016-02-03`;
> > > > > >>
> > > > > >> It can also be helpful if you could share the query profiles
for
> > > both
> > > > > >> queries.
> > > > > >>
> > > > > >> Thanks
> > > > > >>
> > > > > >> On Thu, Feb 4, 2016 at 8:15 AM, John Omernik <john@omernik.com>
> > > > wrote:
> > > > > >>
> > > > > >> > Hey all, I think am I seeing an issue related to
> > > > > >> > https://issues.apache.org/jira/browse/DRILL-3759 but
I want
> to
> > > > > >> describe it
> > > > > >> > out here, see if it's really the case, and then determine
what
> > the
> > > > > >> blockers
> > > > > >> > may be to resolution.
> > > > > >> >
> > > > > >> > I am using the MapR Developer Release 1.4, and I have
a
> > directory
> > > > with
> > > > > >> > subdirectories by data.
> > > > > >> >
> > > > > >> > data/2015-01-01
> > > > > >> > data/2015-01-02
> > > > > >> > data/2015-01-03
> > > > > >> >
> > > > > >> > These are stored as Parquet files.  At this point Each
data
> > > averages
> > > > > >> about
> > > > > >> > 1 GB of data, and has roughly 75 parquet files in it.
> > > > > >> >
> > > > > >> > When I run
> > > > > >> >
> > > > > >> > select count(1) from `data/2016-02-03` it takes roughly
11
> > > seconds.
> > > > > >> >
> > > > > >> > If I copy the 2016-02-03 directory to a new base (date-sum)
> and
> > > run
> > > > > >> >
> > > > > >> > select count(1) from `data_sum/2016-02-03` it runs
in 0.874
> > > seconds.
> > > > > >> >
> > > > > >> > Same data, same structure, only difference is the data_sum
> > > directory
> > > > > >> only
> > > > > >> > has a few directories, iand data has dates going back
to Nov
> > 2015.
> > > > It
> > > > > >> > seems like it is getting files name for all files in
each
> > > directory
> > > > > >> prior
> > > > > >> > to pruning which seems to me to be adding a lot of
latency to
> > > > queries
> > > > > >> that
> > > > > >> > doesn't need to be there.  (thus I think I am seeing
3759)
> but I
> > > > > wanted
> > > > > >> to
> > > > > >> > confirm, and then I wanted to see how we can address
this in
> > that
> > > > the
> > > > > >> > directory prune should be fast, and on large data sets
its
> just
> > > > going
> > > > > to
> > > > > >> > get worse and worse.
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> > John
> > > > > >> >
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> --
> > > > > >>
> > > > > >> Abdelhakim Deneche
> > > > > >>
> > > > > >> Software Engineer
> > > > > >>
> > > > > >>   <http://www.mapr.com/>
> > > > > >>
> > > > > >>
> > > > > >> Now Available - Free Hadoop On-Demand Training
> > > > > >> <
> > > > > >>
> > > > >
> > > >
> > >
> >
> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
> > > > > >> >
> > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > >
> > > > Abdelhakim Deneche
> > > >
> > > > Software Engineer
> > > >
> > > >   <http://www.mapr.com/>
> > > >
> > > >
> > > > Now Available - Free Hadoop On-Demand Training
> > > > <
> > > >
> > >
> >
> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message