Hey John, can you try an explain plan for both queries and see how much
time each one takes?
For example, for the first query you would run:
*explain plan for* select count(1) from `data/2016-02-03`;
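Written out for both directories so they can be pasted as-is (the table paths are copied from your message below, and I am assuming the same workspace is selected for both runs):

```sql
-- Plan only, for the slow case: 2016-02-03 under the large `data` base
explain plan for select count(1) from `data/2016-02-03`;

-- Plan only, for the fast case: the copy under the small `data_sum` base
explain plan for select count(1) from `data_sum/2016-02-03`;
```

If the first statement alone takes most of the 11 seconds, the time is going into planning rather than into reading the Parquet files.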
It would also help if you could share the query profiles for both
queries.
Thanks
On Thu, Feb 4, 2016 at 8:15 AM, John Omernik <john@omernik.com> wrote:
> Hey all, I think I am seeing an issue related to
> https://issues.apache.org/jira/browse/DRILL-3759 but I want to describe it
> out here, see if it's really the case, and then determine what the blockers
> may be to resolution.
>
> I am using the MapR Developer Release 1.4, and I have a directory with
> subdirectories by date.
>
> data/2015-01-01
> data/2015-01-02
> data/2015-01-03
>
> These are stored as Parquet files. At this point each date averages about
> 1 GB of data and has roughly 75 Parquet files in it.
>
> When I run
>
> select count(1) from `data/2016-02-03` it takes roughly 11 seconds.
>
> If I copy the 2016-02-03 directory to a new base (data_sum) and run
>
> select count(1) from `data_sum/2016-02-03` it runs in 0.874 seconds.
>
> Same data, same structure; the only difference is that the data_sum
> directory has only a few subdirectories, while data has dates going back
> to Nov 2015. It seems to be getting the file names for all files in every
> directory prior to pruning, which seems to me to be adding a lot of latency
> to queries that doesn't need to be there (thus I think I am seeing 3759).
> I wanted to confirm that, and then see how we can address it: the
> directory prune should be fast, and on large data sets it's just going to
> get worse and worse.
>
>
>
> John
>
--
Abdelhakim Deneche
Software Engineer