drill-user mailing list archives

From rahul challapalli <challapallira...@gmail.com>
Subject Re: Drill where clause vs Hive on non-partition column
Date Wed, 16 Nov 2016 19:26:24 GMT
Assume you have a hive table "revenue" partitioned by year. Then the folder
structure for the table on maprfs/hdfs looks something like below


revenue
|----year=2015
|----year=2016
|----year=2017

Now if you want to leverage partition pruning, you can use queries like
the ones below:
Hive plugin : select count(*) from hive.revenue where `year` = 2015
DFS plugin : select count(*) from dfs.`/user/hive/warehouse/revenue`
             where dir0='year=2015'
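
If you want to double check that pruning actually kicked in, you can look
at the plan. A minimal sketch, assuming the same warehouse path as above:

explain plan for
select count(*) from dfs.`/user/hive/warehouse/revenue` where dir0='year=2015';

The scan in the resulting plan should only list files under the year=2015
sub-directory.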

I am not sure if we have a jira tracking parquet filter pushdown when
using the hive plugin + native parquet reader.

- Rahul

On Wed, Nov 16, 2016 at 7:24 AM, Sonny Heer <sonnyheer@gmail.com> wrote:

> that's a lot of good information Rahul!! - thanks.
>
> "modify the query to take advantage of drill's directory based
> partitioning"
>
> What does this entail?  Do you have to tell it which column the
> directories are partitioned by?
>
> I think option 3 is probably the way to go.  Is there a ticket tracking
> work on this?
>
> Thanks again
>
> On Tue, Nov 15, 2016 at 10:25 AM, rahul challapalli <
> challapallirahul@gmail.com> wrote:
>
> > Robert's suggestion is to use the DFS plugin. If you directly use DFS
> > instead of the hive plugin then
> >
> > 1. DFS plugin has to determine the underlying data format on the fly.
> > 2. DFS plugin does not know the schema in advance. But in the case of
> > parquet, drill would get this information from the parquet metadata.
> > However if the hive table is backed by a csv file, then you have to
> > cast the columns appropriately in the query or create a view (see the
> > sketch after this list).
> > 3. If the underlying hive table is partitioned, then drill does not know
> > anything about partitions. However since hive partitions are just
> > sub-directories, you can still modify the query to take advantage of
> > drill's directory based partitioning.
> > 4. In terms of performance, I am not aware of any published benchmarks
> > comparing hive plugin and dfs plugin for parquet format. But from my
> > general experience it appears as though DFS plugin is faster.
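> >
> > For the csv case in #2, a minimal sketch of the cast/view approach
> > (the view name, file path and column positions below are made up for
> > illustration):
> >
> > create view dfs.tmp.revenue_vw as
> > select cast(columns[0] as int) as `year`,
> >        cast(columns[1] as double) as amount
> > from dfs.`/user/hive/warehouse/revenue_csv`;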
> >
> > Also do not forget the 3rd option in my first response (Hive Plugin +
> > Drill native parquet reader). We do have plans to support filter
> > pushdown for this scenario in the future.
> >
> > - Rahul
> >
> > On Tue, Nov 15, 2016 at 8:01 AM, Sonny Heer <sonnyheer@gmail.com> wrote:
> >
> > > Thanks Robert.
> > >
> > > "You can then use Drill to query the Hive table and get predicate
> > pushdown"
> > >
> > > This is using the DFS plugin and going directly to the hive table
> folder?
> > >
> > > Can someone speak to what advantages there are to use the hive plugin
> vs
> > > going directly to dfs
> > >
> > > On Tue, Nov 15, 2016 at 12:32 AM, Robert Hou <rhou@maprtech.com> wrote:
> > >
> > > > I have used Hive 1.2 and I have found that the stats in parquet
> > > > files are populated for some data types.  Integer, bigint, float,
> > > > double, and date work.  String does not seem to work.
> > > >
> > > > You can then use Drill to query the Hive table and get predicate
> > > > pushdown for simple compare filters.  This has the form
> > > > "where col = value".  Other standard operators are !=, <, <=, >, >=.
> > > > Compound filters can use "and/or" logic.  This will be supported in
> > > > Drill 1.9.
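> > > >
> > > > For example, a filter of this shape should qualify (the path and
> > > > column names here are just placeholders):
> > > >
> > > > select count(*) from dfs.`/user/hive/warehouse/revenue`
> > > > where amount > 1000 and qty <= 5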
> > > >
> > > > In the future, we will add expressions and functions.
> > > >
> > > > Thanks.
> > > >
> > > > --Robert
> > > >
> > > >
> > > > On Mon, Nov 14, 2016 at 3:53 PM, Sonny Heer <sonnyheer@gmail.com> wrote:
> > > >
> > > > > Is there a way to do that during the creation of the parquet
> > > > > table?  Might be a hive question but all we do is 'STORED AS
> > > > > parquet' and then during insert set the parquet.* properties.
> > > > > I'm just trying to see if #2 is an option for us to utilize
> > > > > filter pushdown via dfs.
> > > > >
> > > > > On Mon, Nov 14, 2016 at 3:43 PM, rahul challapalli <
> > > > > challapallirahul@gmail.com> wrote:
> > > > >
> > > > > > I do not know of any plans to support filter pushdown when
> > > > > > using the hive plugin.
> > > > > > If you run analyze stats then hive computes the table stats and
> > > > > > stores them in the hive metastore for the relevant table. I
> > > > > > believe drill uses some of these stats. However running the
> > > > > > analyze stats command does not alter (or add) the metadata in
> > > > > > the parquet files themselves. The parquet level metadata should
> > > > > > be written when the parquet file itself is created in the first
> > > > > > place.
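> > > > > >
> > > > > > For reference, the analyze command I mean is along these lines
> > > > > > (table name is just an example; a partitioned table also needs
> > > > > > a partition spec):
> > > > > >
> > > > > > analyze table revenue compute statistics;
> > > > > > analyze table revenue compute statistics for columns;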
> > > > > >
> > > > > > - Rahul
> > > > > >
> > > > > > On Mon, Nov 14, 2016 at 3:32 PM, Sonny Heer <sonnyheer@gmail.com> wrote:
> > > > > >
> > > > > > > Rahul,
> > > > > > >
> > > > > > > Thanks for the details.  Are there any plans to support filter
> > > > > > > pushdown for #1?  Do you know if we run analyze stats through
> > > > > > > hive on a parquet file, whether that will have enough info to
> > > > > > > do the pushdown?
> > > > > > >
> > > > > > > Thanks again.
> > > > > > >
> > > > > > > On Mon, Nov 14, 2016 at 9:50 AM, rahul challapalli <
> > > > > > > challapallirahul@gmail.com> wrote:
> > > > > > >
> > > > > > > > Sonny,
> > > > > > > >
> > > > > > > > If the underlying data in the hive table is in parquet
> > > > > > > > format, there are 3 ways to query it from drill :
> > > > > > > >
> > > > > > > > 1. Using the hive plugin : This does not support filter
> > > > > > > > pushdown for any formats (ORC, Parquet, Text...etc)
> > > > > > > > 2. Directly querying the folder in maprfs/hdfs which
> > > > > > > > contains the parquet files using the DFS plugin : With
> > > > > > > > DRILL-1950, we can now do a filter pushdown into the
> > > > > > > > parquet files. In order to take advantage of this feature,
> > > > > > > > the underlying parquet files should have the relevant
> > > > > > > > stats. This feature will only be available with the 1.9.0
> > > > > > > > release.
> > > > > > > > 3. Using drill's native parquet reader in conjunction with
> > > > > > > > the hive plugin (see
> > > > > > > > store.hive.optimize_scan_with_native_readers; example after
> > > > > > > > this list) : This allows drill to fetch all the metadata
> > > > > > > > about the hive table from the metastore and then drill uses
> > > > > > > > its own parquet reader for actually reading the files. This
> > > > > > > > approach currently does not support parquet filter pushdown
> > > > > > > > but this might be added in the next release after 1.9.0.
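> > > > > > > >
> > > > > > > > To try option 3, the reader can be toggled per session,
> > > > > > > > something like:
> > > > > > > >
> > > > > > > > alter session set `store.hive.optimize_scan_with_native_readers` = true;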
> > > > > > > >
> > > > > > > > - Rahul
> > > > > > > >
> > > > > > > > On Sun, Nov 13, 2016 at 11:06 AM, Sonny Heer <sonnyheer@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > I'm running a drill query with a where clause on a
> > > > > > > > > non-partitioned column via the hive storage plugin.  This
> > > > > > > > > query inspects all partitions (kind of expected), but when
> > > > > > > > > I run the same query in Hive I can see a predicate passed
> > > > > > > > > down to the query plan.  This particular query is much
> > > > > > > > > faster in Hive vs Drill.  BTW these are parquet files.
> > > > > > > > >
> > > > > > > > > Hive:
> > > > > > > > >
> > > > > > > > > Stage-0
> > > > > > > > >
> > > > > > > > > Fetch Operator
> > > > > > > > >
> > > > > > > > > limit:-1
> > > > > > > > >
> > > > > > > > > Select Operator [SEL_2]
> > > > > > > > >
> > > > > > > > > outputColumnNames:["_col0"]
> > > > > > > > >
> > > > > > > > > Filter Operator [FIL_4]
> > > > > > > > >
> > > > > > > > > predicate:(my_column = 123) (type: boolean)
> > > > > > > > >
> > > > > > > > > TableScan [TS_0]
> > > > > > > > >
> > > > > > > > > alias:my_table
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Any idea on why this is?  My guess is Hive is storing
> > > > > > > > > hive-specific info in the parquet file since it was
> > > > > > > > > created through Hive.  Although it seems the drill-hive
> > > > > > > > > plugin should honor this too.  Not sure, but willing to
> > > > > > > > > look through code if someone can point me in the right
> > > > > > > > > direction.  Thanks!
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > >
> > > > > > >
> > > > > > > Pushpinder S. Heer
> > > > > > > Senior Software Engineer
> > > > > > > m: 360-434-4354 h: 509-884-2574
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > >
> > > > > Pushpinder S. Heer
> > > > > Senior Software Engineer
> > > > > m: 360-434-4354 h: 509-884-2574
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > >
> > >
> > > Pushpinder S. Heer
> > > Senior Software Engineer
> > > m: 360-434-4354 h: 509-884-2574
> > >
> >
>
>
>
> --
>
>
> Pushpinder S. Heer
> Senior Software Engineer
> m: 360-434-4354 h: 509-884-2574
>
