drill-user mailing list archives

From Jinfeng Ni <...@apache.org>
Subject Re: Partitioning for parquet
Date Thu, 01 Jun 2017 17:31:01 GMT
Looks like the default parquet block size is 512 MB [1]. There is an ongoing
patch which may put a single parquet block into a single file system
block [2].
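
As an aside, the block size referenced in [1] can be tuned before a CTAS;
this is a minimal sketch assuming the standard `store.parquet.block-size`
option (the value is in bytes, so 256 MB here):

```sql
-- Reduce the parquet block size for files written by subsequent CTAS
-- statements in this session. 268435456 bytes = 256 MB.
ALTER SESSION SET `store.parquet.block-size` = 268435456;
```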

In general, if you are doing pruning over a large number of small parquet
files, the filter evaluation during pruning may become a bottleneck. Unlike
filter evaluation at execution time, which uses run-time generated code and
is more efficient, filter evaluation at planning time uses an interpreter and
is less efficient. Partition pruning only makes sense when each partition
contains a large number of rows.
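
To make the pruning point concrete, here is a sketch (the table and column
names are hypothetical) contrasting a filter the planner can prune on with
one it cannot:

```sql
-- events_by_year is a hypothetical CTAS table partitioned by yr.
-- A filter on the partition column lets the planner skip whole
-- groups of files at planning time:
SELECT * FROM events_by_year WHERE yr = 2016;

-- A filter on a non-partition column gives the planner nothing to
-- prune, so metadata for every file is still examined during planning:
SELECT * FROM events_by_year WHERE message LIKE '%zookeeper%';
```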

1. https://drill.apache.org/docs/parquet-format/#configuring-the-size-of-parquet-files
2. https://issues.apache.org/jira/browse/DRILL-5379


On Thu, Jun 1, 2017 at 6:55 AM, Raz Baluchi <raz.baluchi@gmail.com> wrote:

> I guess there is such a thing as over partitioning...
>
> The query on the table partitioned by date spends most of the elapsed time
> on the 'planning' phase, with the execution being roughly equal to the one
> on the table partitioned by year and month.
>
> Based on these results, I've added a third table which is partitioned
> simply by year. I've also added an ORDER BY to the CTAS in an attempt to
> sort the table by date.
>
> This third table seems to have the fastest query times so far with the
> least amount of 'planning'.  My takeaway from this exercise is to limit
> the partitioning to the minimum required to obtain parquet files in the
> range of 100 MB or so. Is that a valid lesson learned?
>
> On Thu, Jun 1, 2017 at 1:05 AM, Jinfeng Ni <jni@apache.org> wrote:
>
> > You may want to check whether the query on the second table is slower
> > because of planning time or execution time. That can be determined by
> > looking at the query profile in the web UI.
> >
> > Two factors might impact the planning time for the second table, which
> > has 11837 files:
> > 1. Reading parquet metadata from those parquet files. A parquet metadata
> > cache file might help in the case of a large number of small files.
> > 2. Filter expression evaluation cost: the query on the second table would
> > evaluate the expression 11837 times, vs. just 410 times for the first
> > table.
> >
> > In general, if you have 100M rows in 11837 files, that's about 8500
> > rows per file. Performance-wise, this does not seem to be a good choice
> > for parquet format.
> >
> >
> >
> > On Wed, May 31, 2017 at 9:33 PM, Padma Penumarthy <ppenumarthy@mapr.com>
> > wrote:
> >
> > > Are you running the same query on both tables? What is the filter
> > > condition? Since they are partitioned differently, the same filter may
> > > prune the files differently.
> > > If possible, can you share the query profiles?
> > > You can check the query profiles to see how many rows are being read
> > > from disk in both cases.
> > >
> > > Thanks,
> > > Padma
> > >
> > >
> > > > On May 31, 2017, at 6:15 PM, Raz Baluchi <raz.baluchi@gmail.com> wrote:
> > > >
> > > > As an experiment, I created an event file with 100 million entries
> > > > spanning 25 years. I then created tables both ways, one partitioned
> > > > by year and month and the other by date. The first table created 410
> > > > parquet files and the second 11837.
> > > >
> > > > Querying the first table is consistently faster, by a factor of 2x
> > > > to 10x.
> > > >
> > > > Is this because drill is not very efficient at querying a large
> > > > number of small(ish) parquet files?
> > > >
> > > > On Wed, May 31, 2017 at 6:42 PM, rahul challapalli <challapallirahul@gmail.com> wrote:
> > > >
> > > >> If most of your queries use the date column in the filter condition,
> > > >> I would partition the data on the date column. Then you can simply say
> > > >>
> > > >> select * from events where `date` between '2016-11-11' and '2017-01-23';
> > > >>
> > > >> - Rahul
> > > >>
> > > >> On Wed, May 31, 2017 at 3:22 PM, Raz Baluchi <raz.baluchi@gmail.com> wrote:
> > > >>
> > > >>> So, if I understand you correctly, I would have to include the 'yr'
> > > >>> and 'mnth' columns in addition to the 'date' column in the query?
> > > >>>
> > > >>> e.g.
> > > >>>
> > > >>> select * from events where yr in (2016, 2017) and mnth in (11, 12, 1)
> > > >>> and date between '2016-11-11' and '2017-01-23';
> > > >>>
> > > >>> Is that correct?
> > > >>>
> > > >>> On Wed, May 31, 2017 at 4:49 PM, rahul challapalli <challapallirahul@gmail.com> wrote:
> > > >>>
> > > >>>> How to partition data depends on how you want to access your data.
> > > >>>> If you can foresee that most of the queries use year and month,
> > > >>>> then go ahead and partition the data on those 2 columns. You can
> > > >>>> do that like below:
> > > >>>>
> > > >>>> create table partitioned_data partition by (yr, mnth) as select
> > > >>>> extract(year from `date`) yr, extract(month from `date`) mnth,
> > > >>>> `date`, ........ from mydata;
> > > >>>>
> > > >>>> For partitioning to have any benefit, your queries should have
> > > >>>> filters on the month and year columns.
> > > >>>>
> > > >>>> - Rahul
> > > >>>>
> > > >>>> On Wed, May 31, 2017 at 1:28 PM, Raz Baluchi <raz.baluchi@gmail.com> wrote:
> > > >>>>
> > > >>>>> Hi all,
> > > >>>>>
> > > >>>>> Trying to understand how parquet partitioning works.
> > > >>>>>
> > > >>>>> What is the recommended partitioning scheme for event data that
> > > >>>>> will be queried primarily by date? I assume that partitioning by
> > > >>>>> year and month would be optimal?
> > > >>>>>
> > > >>>>> Let's say I have data that looks like:
> > > >>>>>
> > > >>>>> application,status,date,message
> > > >>>>> kafka,down,2017-03-23 04:53,zookeeper is not available
> > > >>>>>
> > > >>>>>
> > > >>>>> Would I have to create new columns for year and month?
> > > >>>>>
> > > >>>>> e.g.
> > > >>>>> application,status,date,message,year,month
> > > >>>>> kafka,down,2017-03-23 04:53,zookeeper is not available,2017,03
> > > >>>>>
> > > >>>>> and then perform a CTAS using the year and month columns as the
> > > >>>>> 'partition by'?
> > > >>>>>
> > > >>>>> Thanks
> > > >>>>>
> > > >>>>
> > > >>>
> > > >>
> > >
> > >
> >
>
