drill-user mailing list archives

From John Omernik <j...@omernik.com>
Subject Re: Query Planning and Directory Pruning
Date Tue, 09 Feb 2016 14:08:11 GMT
Abdel, do you still need the plans? As I said, if your table has a decent
number of directories and files, it looks like planning touches all the
directories even though you are pruning.  I can post the plans; however, I
think in this case you'll find they are exactly the same, and the only
difference is that the longer query spends much more time planning because
it has more files to read.
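
To make the suspected behavior concrete, here is a toy Python sketch. It is
not Drill's planner code: the function names and the temporary directory
layout are made up for illustration. It compares listing every file under
the base directory before filtering with going straight to the one requested
subdirectory. If planning does something like the first approach, its cost
would grow with the total number of files under the base rather than with
the files the query actually reads.

```python
import os
import tempfile


def build_tree(root, days, files_per_day):
    """Create root/<day>/part-N.parquet as empty placeholder files."""
    for day in days:
        day_dir = os.path.join(root, day)
        os.makedirs(day_dir)
        for i in range(files_per_day):
            open(os.path.join(day_dir, f"part-{i}.parquet"), "w").close()


def list_then_prune(root, wanted):
    """Visit every file under root, then keep only those in `wanted`."""
    touched = 0
    kept = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            touched += 1  # every file is visited, wanted or not
            if os.path.basename(dirpath) == wanted:
                kept.append(os.path.join(dirpath, name))
    return kept, touched


def prune_then_list(root, wanted):
    """Go directly to the wanted subdirectory and list only its files."""
    target = os.path.join(root, wanted)
    names = os.listdir(target)
    return [os.path.join(target, n) for n in names], len(names)


with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: 10 day directories with 5 files each.
    days = [f"2016-01-{d:02d}" for d in range(1, 11)]
    build_tree(root, days, files_per_day=5)
    kept_a, touched_a = list_then_prune(root, "2016-01-03")
    kept_b, touched_b = prune_then_list(root, "2016-01-03")
    print("list then prune touched:", touched_a)
    print("prune then list touched:", touched_b)
```

Both functions return the same set of files; only the number of files
visited along the way differs.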


On Thu, Feb 4, 2016 at 10:46 AM, John Omernik <john@omernik.com> wrote:

> I can package up both plans for you if you need them (let me know if you
> still want them), but I can tell you the plans were EXACTLY the same.
> However, the data_sum table took 0.932 seconds to plan the query, and the
> data table (the one with all the extra data) took 11.379 seconds to plan
> the query. That indicates to me the issue isn't in the plan that was
> created, but in the actual planning process. (Let me know if you disagree
> or still need to see the plan; like I said, the actual plans were exactly
> the same.)
>
>
> John.
>
>
> On Thu, Feb 4, 2016 at 10:31 AM, Abdel Hakim Deneche <adeneche@maprtech.com> wrote:
>
>> Hey John, can you try an explain plan for both queries and see how much
>> time each one takes?
>>
>> for example, for the first query you would run:
>>
>> explain plan for select count(1) from `data/2016-02-03`;
>>
>> It can also be helpful if you could share the query profiles for both
>> queries.
>>
>> Thanks
>>
>> On Thu, Feb 4, 2016 at 8:15 AM, John Omernik <john@omernik.com> wrote:
>>
>> > Hey all, I think I am seeing an issue related to
>> > https://issues.apache.org/jira/browse/DRILL-3759, but I want to describe
>> > it here, see if that's really the case, and then determine what the
>> > blockers to resolution may be.
>> >
>> > I am using the MapR Developer Release 1.4, and I have a directory with
>> > subdirectories by date.
>> >
>> > data/2015-01-01
>> > data/2015-01-02
>> > data/2015-01-03
>> >
>> > These are stored as Parquet files.  At this point each date averages
>> > about 1 GB of data and has roughly 75 Parquet files in it.
>> >
>> > When I run
>> >
>> > select count(1) from `data/2016-02-03` it takes roughly 11 seconds.
>> >
>> > If I copy the 2016-02-03 directory to a new base (data_sum) and run
>> >
>> > select count(1) from `data_sum/2016-02-03` it runs in 0.874 seconds.
>> >
>> > Same data, same structure; the only difference is that the data_sum
>> > directory has only a few directories, and data has dates going back to
>> > Nov 2015.  It seems like it is getting the file names for all files in
>> > each directory prior to pruning, which seems to me to be adding a lot of
>> > latency to queries that doesn't need to be there (thus I think I am
>> > seeing 3759).  I wanted to confirm, and then see how we can address this,
>> > since the directory prune should be fast, and on large data sets it's
>> > just going to get worse and worse.
>> >
>> >
>> >
>> > John
>> >
>>
>>
>>
>> --
>>
>> Abdelhakim Deneche
>>
>> Software Engineer
>
>
