drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sonny Heer <sonnyh...@gmail.com>
Subject Re: Drill where clause vs Hive on non-partition column
Date Thu, 17 Nov 2016 21:31:53 GMT
Ok thanks Rahul.  Yeah we have to use the hive plugin I believe since we
don't control the sql generated.

The reason I ask if its tracked is because you mentioned it was going to be
part of 1.9 ...?

On Wed, Nov 16, 2016 at 11:26 AM, rahul challapalli <
challapallirahul@gmail.com> wrote:

> Assume you have a hive table "revenue" partitioned by year. Then the folder
> structure for the table on maprfs/hdfs looks something like below
>
>
> *revenue*
> *|----year=2015*
> *|----year=2016*
> *|----year=2107*
>
> Now if you want to leverage partition pruning, you can use something like
> below
> Hive plugin : select count(*) from hive.revenue where `year` = 2015
> DFS plugin : select count(*) from dfs.`/user/hive/warehouse/revenue` where
> dir0='year=2015'
>
> I am not sure if we have a jira for tracking parquet filter pushdown when
> using hive + native parquet reader
>
> - Rahul
>
> On Wed, Nov 16, 2016 at 7:24 AM, Sonny Heer <sonnyheer@gmail.com> wrote:
>
> > thats a lot of good information Rahul!! - thanks.
> >
> > "modify the query to take advantage of drill's directory based
> > partitioning"
> >
> > What does this entail?  Do you have to tell it on which column the
> > directories are partitioned by?
> >
> > I think option 3 is probably the way to go.  Is there a ticket tracking
> > work on this?
> >
> > Thanks again
> >
> > On Tue, Nov 15, 2016 at 10:25 AM, rahul challapalli <
> > challapallirahul@gmail.com> wrote:
> >
> > > Robert's suggestion is with using the DFS plugin. If you directly use
> DFS
> > > instead of hive plugin then
> > >
> > > 1. DFS plugin has to determine the underlying data format on the fly.
> > > 2. DFS plugin does not know the schema in advance. But in the case
> > parquet
> > > drill would get this information from the parquet metadata. However if
> > the
> > > hive table is backed by a csv file, then you cast the columns
> > appropriately
> > > in the query or create a view.
> > > 3. If the underlying hive table is partitioned, then drill does not
> know
> > > anything about partitions. However since hive partitions are just
> > > sub-directories, you can still modify the query to take advantage of
> > > drill's directory based partitioning
> > > 4. In terms of performance, I am not aware of any published benchmarks
> > > comparing hive plugin and dfs plugin for parquet format. But from my
> > > general experience it appears as though DFS plugin is faster.
> > >
> > > Also do not forget the 3rd option in my first response (Hive Plugin +
> > Drill
> > > native parquet reader). We do have plans to support filter pushdown for
> > > this scenario in the future.
> > >
> > > - Rahul
> > >
> > > On Tue, Nov 15, 2016 at 8:01 AM, Sonny Heer <sonnyheer@gmail.com>
> wrote:
> > >
> > > > Thanks Robert.
> > > >
> > > > "You can then use Drill to query the Hive table and get predicate
> > > pushdown"
> > > >
> > > > This is using the DFS plugin and going directly to the hive table
> > folder?
> > > >
> > > > Can someone speak to what advantages there are to use the hive plugin
> > vs
> > > > going directly to dfs
> > > >
> > > > On Tue, Nov 15, 2016 at 12:32 AM, Robert Hou <rhou@maprtech.com>
> > wrote:
> > > >
> > > > > I have used Hive 1.2 and I have found that the stats in parquet
> files
> > > are
> > > > > populated for some data types.  Integer, bigint, float, double,
> date
> > > > work.
> > > > > String does not seem to work.
> > > > >
> > > > > You can then use Drill to query the Hive table and get predicate
> > > pushdown
> > > > > for simple compare filters.  This has the form "where col = value".
> > > > Other
> > > > > standard operators are !=, <, <=, >, >=.  Compound filters
can use
> > > > "and/or"
> > > > > logic.  This will be supported in Drill 1.9.
> > > > >
> > > > > In the future, we will add expressions and functions.
> > > > >
> > > > > Thanks.
> > > > >
> > > > > --Robert
> > > > >
> > > > >
> > > > > On Mon, Nov 14, 2016 at 3:53 PM, Sonny Heer <sonnyheer@gmail.com>
> > > wrote:
> > > > >
> > > > > > Is there a way to do that during the creation of the parquet
> table?
> > > > > Might
> > > > > > be a hive question but all we do is 'STORED AS parquet' and
then
> > > during
> > > > > > insert set the parquet.* properties.  I'm just trying to see
if
> #2
> > is
> > > > an
> > > > > > option for us to utilize filter pushdown via dfs
> > > > > >
> > > > > > On Mon, Nov 14, 2016 at 3:43 PM, rahul challapalli <
> > > > > > challapallirahul@gmail.com> wrote:
> > > > > >
> > > > > > > I do not know of any plans to support filter pushdown when
> using
> > > the
> > > > > hive
> > > > > > > plugin.
> > > > > > > If you run analyze stats then hive computes the table stats
and
> > > > stores
> > > > > > them
> > > > > > > in the hive metastore for the relevant table. I believe
drill
> > uses
> > > > some
> > > > > > of
> > > > > > > these stats. However running analyze stats command does
not
> > > alter(or
> > > > > add)
> > > > > > > the metadata in the parquet files themselves. The parquet
level
> > > > > metadata
> > > > > > > should be written when the parquet file itself is created
in
> the
> > > > first
> > > > > > > place.
> > > > > > >
> > > > > > > - Rahul
> > > > > > >
> > > > > > > On Mon, Nov 14, 2016 at 3:32 PM, Sonny Heer <
> sonnyheer@gmail.com
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > > Rahul,
> > > > > > > >
> > > > > > > > Thanks for the details.  Is there any plans to support
filter
> > > > > pushdown
> > > > > > > for
> > > > > > > > #1?  Do you know if we run analyze stats through hive
on a
> > > parquet
> > > > > file
> > > > > > > if
> > > > > > > > that will have enough info to do the pushdown?
> > > > > > > >
> > > > > > > > Thanks again.
> > > > > > > >
> > > > > > > > On Mon, Nov 14, 2016 at 9:50 AM, rahul challapalli
<
> > > > > > > > challapallirahul@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Sonny,
> > > > > > > > >
> > > > > > > > > If the underlying data in the hive table is in
parquet
> > format,
> > > > > there
> > > > > > > are
> > > > > > > > 3
> > > > > > > > > ways to query from drill :
> > > > > > > > >
> > > > > > > > > 1. Using the hive plugin : This does not support
filter
> > > pushdown
> > > > > for
> > > > > > > any
> > > > > > > > > formats (ORC, Parquet, Text...etc)
> > > > > > > > > 2. Directly Querying the folder in maprfs/hdfs
which
> contains
> > > the
> > > > > > > parquet
> > > > > > > > > files using DFS plugin: With DRILL-1950, we can
now do a
> > filter
> > > > > > > pushdown
> > > > > > > > > into the parquet files. In order to take advantage
of this
> > > > feature,
> > > > > > the
> > > > > > > > > underlying parquet files should have the relevant
stats.
> This
> > > > > feature
> > > > > > > > will
> > > > > > > > > only be available with the 1.9.0 release
> > > > > > > > > 3. Using the drill's native parquet reader in
conjunction
> > with
> > > > the
> > > > > > hive
> > > > > > > > > plugin (See store.hive.optimize_scan_with_native_readers)
> :
> > > This
> > > > > > > allows
> > > > > > > > > drill to fetch all the metadata about the hive
table from
> the
> > > > > > metastore
> > > > > > > > and
> > > > > > > > > then drill uses its own parquet reader for actually
reading
> > the
> > > > > > files.
> > > > > > > > This
> > > > > > > > > approach currently does not support parquet filter
pushdown
> > but
> > > > > this
> > > > > > > > might
> > > > > > > > > be added in the next release after 1.9.0.
> > > > > > > > >
> > > > > > > > > - Rahul
> > > > > > > > >
> > > > > > > > > On Sun, Nov 13, 2016 at 11:06 AM, Sonny Heer
<
> > > > sonnyheer@gmail.com>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > I'm running a drill query with a where clause
on a
> > > > > non-partitioned
> > > > > > > > column
> > > > > > > > > > via hive storage plugin.  This query inspects
all
> > partitions
> > > > > (kind
> > > > > > of
> > > > > > > > > > expected), but when i run the same query
in Hive I can
> see
> > a
> > > > > > > predicate
> > > > > > > > > > passed down to the query plan.  This particular
query is
> > much
> > > > > > faster
> > > > > > > in
> > > > > > > > > > Hive vs Drill.  BTW these are parquet files.
> > > > > > > > > >
> > > > > > > > > > Hive:
> > > > > > > > > >
> > > > > > > > > > Stage-0
> > > > > > > > > >
> > > > > > > > > > Fetch Operator
> > > > > > > > > >
> > > > > > > > > > limit:-1
> > > > > > > > > >
> > > > > > > > > > Select Operator [SEL_2]
> > > > > > > > > >
> > > > > > > > > > outputColumnNames:["_col0"]
> > > > > > > > > >
> > > > > > > > > > Filter Operator [FIL_4]
> > > > > > > > > >
> > > > > > > > > > predicate:(my_column = 123) (type: boolean)
> > > > > > > > > >
> > > > > > > > > > TableScan [TS_0]
> > > > > > > > > >
> > > > > > > > > > alias:my_table
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Any idea on why this is?  My guess is Hive
is storing
> hive
> > > > > specific
> > > > > > > > info
> > > > > > > > > in
> > > > > > > > > > the parquet file since it was created through
Hive.
> > Although
> > > > it
> > > > > > > seems
> > > > > > > > > > drill-hive plugin should honor this to.
 Not sure, but
> > > willing
> > > > to
> > > > > > > look
> > > > > > > > > > through code if someone can point me in
the right
> > direction.
> > > > > > Thanks!
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > >
> > > > > > > >
> > > > > > > > Pushpinder S. Heer
> > > > > > > > Senior Software Engineer
> > > > > > > > m: 360-434-4354 h: 509-884-2574
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > >
> > > > > >
> > > > > > Pushpinder S. Heer
> > > > > > Senior Software Engineer
> > > > > > m: 360-434-4354 h: 509-884-2574
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > >
> > > >
> > > > Pushpinder S. Heer
> > > > Senior Software Engineer
> > > > m: 360-434-4354 h: 509-884-2574
> > > >
> > >
> >
> >
> >
> > --
> >
> >
> > Pushpinder S. Heer
> > Senior Software Engineer
> > m: 360-434-4354 h: 509-884-2574
> >
>



-- 


Pushpinder S. Heer
Senior Software Engineer
m: 360-434-4354 h: 509-884-2574

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message