drill-user mailing list archives

From Robert Hou <r...@maprtech.com>
Subject Re: Drill where clause vs Hive on non-partition column
Date Thu, 17 Nov 2016 23:56:38 GMT
Let me clarify my comment regarding the Hive plugin.  We plan to support
parquet filter pushdown with the Hive plugin when using our native reader
(there is an option you can set).  This will be at some point in the future.

Thanks.

--Robert
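The option Robert refers to can be toggled per session in Drill. A minimal sketch, assuming the "revenue" table used later in this thread (the table name is illustrative):

```sql
-- Enable Drill's native parquet reader for Hive tables (the option
-- Robert mentions above):
ALTER SESSION SET `store.hive.optimize_scan_with_native_readers` = true;

-- Queries through the hive plugin now read the parquet files with
-- Drill's own reader:
SELECT COUNT(*) FROM hive.revenue WHERE `year` = 2015;
```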

On Thu, Nov 17, 2016 at 3:46 PM, Robert Hou <rhou@maprtech.com> wrote:

> There will be some support for Hive tables in 1.9.  It will use the dfs
> plugin but not the Hive plugin.  Support for the Hive plugin is planned for
> the future, but it is not committed at this time.  There will be a ticket
> soon.
>
> Thanks.
>
> --Robert
>
> On Thu, Nov 17, 2016 at 1:31 PM, Sonny Heer <sonnyheer@gmail.com> wrote:
>
>> Ok thanks Rahul.  Yeah we have to use the hive plugin I believe since we
>> don't control the sql generated.
>>
>> The reason I ask if it's tracked is because you mentioned it was going
>> to be part of 1.9 ...?
>>
>> On Wed, Nov 16, 2016 at 11:26 AM, rahul challapalli <
>> challapallirahul@gmail.com> wrote:
>>
>> > Assume you have a hive table "revenue" partitioned by year. Then the
>> > folder structure for the table on maprfs/hdfs looks something like
>> > below:
>> >
>> > *revenue*
>> > *|----year=2015*
>> > *|----year=2016*
>> > *|----year=2017*
>> >
>> > Now if you want to leverage partition pruning, you can use something
>> > like below:
>> > Hive plugin : select count(*) from hive.revenue where `year` = 2015
>> > DFS plugin : select count(*) from dfs.`/user/hive/warehouse/revenue`
>> > where dir0='year=2015'
>> >
>> > I am not sure if we have a jira for tracking parquet filter pushdown
>> > when using hive + native parquet reader.
>> >
>> > - Rahul
>> >
>> > On Wed, Nov 16, 2016 at 7:24 AM, Sonny Heer <sonnyheer@gmail.com>
>> wrote:
>> >
>> > > That's a lot of good information Rahul!! - thanks.
>> > >
>> > > "modify the query to take advantage of drill's directory based
>> > > partitioning"
>> > >
>> > > What does this entail?  Do you have to tell it which column the
>> > > directories are partitioned by?
>> > >
>> > > I think option 3 is probably the way to go.  Is there a ticket
>> > > tracking work on this?
>> > >
>> > > Thanks again
>> > >
>> > > On Tue, Nov 15, 2016 at 10:25 AM, rahul challapalli <
>> > > challapallirahul@gmail.com> wrote:
>> > >
>> > > > Robert's suggestion is with using the DFS plugin. If you directly
>> > > > use DFS instead of the hive plugin, then:
>> > > >
>> > > > 1. The DFS plugin has to determine the underlying data format on
>> > > > the fly.
>> > > > 2. The DFS plugin does not know the schema in advance. In the case
>> > > > of parquet, drill would get this information from the parquet
>> > > > metadata. However, if the hive table is backed by a csv file, then
>> > > > you need to cast the columns appropriately in the query or create
>> > > > a view.
>> > > > 3. If the underlying hive table is partitioned, then drill does
>> > > > not know anything about the partitions. However, since hive
>> > > > partitions are just sub-directories, you can still modify the
>> > > > query to take advantage of drill's directory based partitioning.
>> > > > 4. In terms of performance, I am not aware of any published
>> > > > benchmarks comparing the hive plugin and the dfs plugin for
>> > > > parquet format. But from my general experience it appears as
>> > > > though the DFS plugin is faster.
>> > > >
>> > > > Also do not forget the 3rd option in my first response (Hive
>> > > > Plugin + Drill native parquet reader). We do have plans to support
>> > > > filter pushdown for this scenario in the future.
>> > > >
>> > > > - Rahul
>> > > >
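The casting Rahul describes for csv-backed tables might look like this in Drill, which exposes text-file fields as the `columns` array (the path, view name, and column layout below are illustrative):

```sql
-- Hypothetical csv layout: columns[0] = year, columns[1] = amount.
CREATE VIEW dfs.tmp.revenue_v AS
SELECT CAST(columns[0] AS INT)    AS `year`,
       CAST(columns[1] AS DOUBLE) AS amount
FROM dfs.`/user/hive/warehouse/revenue_text`;
```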
>> > > > On Tue, Nov 15, 2016 at 8:01 AM, Sonny Heer <sonnyheer@gmail.com>
>> > wrote:
>> > > >
>> > > > > Thanks Robert.
>> > > > >
>> > > > > "You can then use Drill to query the Hive table and get predicate
>> > > > pushdown"
>> > > > >
>> > > > > This is using the DFS plugin and going directly to the hive table
>> > > folder?
>> > > > >
>> > > > > Can someone speak to what advantages there are to using the
>> > > > > hive plugin vs going directly to dfs?
>> > > > >
>> > > > > On Tue, Nov 15, 2016 at 12:32 AM, Robert Hou <rhou@maprtech.com>
>> > > wrote:
>> > > > >
>> > > > > > I have used Hive 1.2 and I have found that the stats in
>> > > > > > parquet files are populated for some data types.  Integer,
>> > > > > > bigint, float, double, date work.  String does not seem to
>> > > > > > work.
>> > > > > >
>> > > > > > You can then use Drill to query the Hive table and get
>> > > > > > predicate pushdown for simple compare filters.  This has the
>> > > > > > form "where col = value".  Other standard operators are !=,
>> > > > > > <, <=, >, >=.  Compound filters can use "and/or" logic.  This
>> > > > > > will be supported in Drill 1.9.
>> > > > > >
>> > > > > > In the future, we will add expressions and functions.
>> > > > > >
>> > > > > > Thanks.
>> > > > > >
>> > > > > > --Robert
>> > > > > >
>> > > > > >
>> > > > > > On Mon, Nov 14, 2016 at 3:53 PM, Sonny Heer <
>> > > > > > sonnyheer@gmail.com> wrote:
>> > > > > >
>> > > > > > > Is there a way to do that during the creation of the parquet
>> > > > > > > table?  Might be a hive question but all we do is 'STORED AS
>> > > > > > > parquet' and then during insert set the parquet.*
>> > > > > > > properties.  I'm just trying to see if #2 is an option for
>> > > > > > > us to utilize filter pushdown via dfs.
>> > > > > > >
>> > > > > > > On Mon, Nov 14, 2016 at 3:43 PM, rahul challapalli <
>> > > > > > > challapallirahul@gmail.com> wrote:
>> > > > > > >
>> > > > > > > > I do not know of any plans to support filter pushdown
>> > > > > > > > when using the hive plugin.
>> > > > > > > > If you run analyze stats then hive computes the table
>> > > > > > > > stats and stores them in the hive metastore for the
>> > > > > > > > relevant table.  I believe drill uses some of these stats.
>> > > > > > > > However, running the analyze stats command does not alter
>> > > > > > > > (or add) the metadata in the parquet files themselves.
>> > > > > > > > The parquet level metadata should be written when the
>> > > > > > > > parquet file itself is created in the first place.
>> > > > > > > >
>> > > > > > > > - Rahul
>> > > > > > > >
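In HiveQL terms, the distinction Rahul draws between parquet footer metadata and metastore stats might look like this (table names are illustrative):

```sql
-- Parquet footer stats (min/max per column) are written when the files
-- are created, e.g. by an INSERT into a parquet-backed table:
CREATE TABLE revenue_parquet (my_column INT) STORED AS PARQUET;
INSERT INTO TABLE revenue_parquet SELECT my_column FROM revenue_text;

-- ANALYZE updates stats in the hive metastore only; it does not
-- rewrite the parquet footers:
ANALYZE TABLE revenue_parquet COMPUTE STATISTICS;
```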
>> > > > > > > > > On Mon, Nov 14, 2016 at 3:32 PM, Sonny Heer <
>> > > > > > > > > sonnyheer@gmail.com> wrote:
>> > > > > > > >
>> > > > > > > > > Rahul,
>> > > > > > > > >
>> > > > > > > > > Thanks for the details.  Are there any plans to support
>> > > > > > > > > filter pushdown for #1?  Do you know if we run analyze
>> > > > > > > > > stats through hive on a parquet file if that will have
>> > > > > > > > > enough info to do the pushdown?
>> > > > > > > > >
>> > > > > > > > > Thanks again.
>> > > > > > > > >
>> > > > > > > > > On Mon, Nov 14, 2016 at 9:50 AM, rahul challapalli <
>> > > > > > > > > challapallirahul@gmail.com> wrote:
>> > > > > > > > >
>> > > > > > > > > > Sonny,
>> > > > > > > > > >
>> > > > > > > > > > If the underlying data in the hive table is in parquet
>> > > > > > > > > > format, there are 3 ways to query it from drill:
>> > > > > > > > > >
>> > > > > > > > > > 1. Using the hive plugin : This does not support
>> > > > > > > > > > filter pushdown for any format (ORC, Parquet,
>> > > > > > > > > > Text...etc).
>> > > > > > > > > > 2. Directly querying the folder in maprfs/hdfs which
>> > > > > > > > > > contains the parquet files using the DFS plugin : With
>> > > > > > > > > > DRILL-1950, we can now do a filter pushdown into the
>> > > > > > > > > > parquet files. In order to take advantage of this
>> > > > > > > > > > feature, the underlying parquet files should have the
>> > > > > > > > > > relevant stats. This feature will only be available
>> > > > > > > > > > with the 1.9.0 release.
>> > > > > > > > > > 3. Using drill's native parquet reader in conjunction
>> > > > > > > > > > with the hive plugin (see
>> > > > > > > > > > store.hive.optimize_scan_with_native_readers) : This
>> > > > > > > > > > allows drill to fetch all the metadata about the hive
>> > > > > > > > > > table from the metastore, and then drill uses its own
>> > > > > > > > > > parquet reader for actually reading the files. This
>> > > > > > > > > > approach currently does not support parquet filter
>> > > > > > > > > > pushdown, but this might be added in the next release
>> > > > > > > > > > after 1.9.0.
>> > > > > > > > > >
>> > > > > > > > > > - Rahul
>> > > > > > > > > >
>> > > > > > > > > > On Sun, Nov 13, 2016 at 11:06 AM, Sonny Heer <
>> > > > > > > > > > sonnyheer@gmail.com> wrote:
>> > > > > > > > > >
>> > > > > > > > > > > I'm running a drill query with a where clause on a
>> > > > > > > > > > > non-partitioned column via the hive storage plugin.
>> > > > > > > > > > > This query inspects all partitions (kind of
>> > > > > > > > > > > expected), but when I run the same query in Hive I
>> > > > > > > > > > > can see a predicate pushed down in the query plan.
>> > > > > > > > > > > This particular query is much faster in Hive vs
>> > > > > > > > > > > Drill.  BTW these are parquet files.
>> > > > > > > > > > >
>> > > > > > > > > > > Hive:
>> > > > > > > > > > >
>> > > > > > > > > > > Stage-0
>> > > > > > > > > > >   Fetch Operator
>> > > > > > > > > > >     limit:-1
>> > > > > > > > > > >     Select Operator [SEL_2]
>> > > > > > > > > > >       outputColumnNames:["_col0"]
>> > > > > > > > > > >       Filter Operator [FIL_4]
>> > > > > > > > > > >         predicate:(my_column = 123) (type: boolean)
>> > > > > > > > > > >         TableScan [TS_0]
>> > > > > > > > > > >           alias:my_table
>> > > > > > > > > > >
>> > > > > > > > > > > Any idea on why this is?  My guess is Hive is
>> > > > > > > > > > > storing hive specific info in the parquet file
>> > > > > > > > > > > since it was created through Hive.  Although it
>> > > > > > > > > > > seems the drill-hive plugin should honor this too.
>> > > > > > > > > > > Not sure, but willing to look through code if
>> > > > > > > > > > > someone can point me in the right direction.
>> > > > > > > > > > > Thanks!
>> > > > > > > > > > >
>> > > > > > > > > > > --
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > --
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > Pushpinder S. Heer
>> > > > > > > > > Senior Software Engineer
>> > > > > > > > > m: 360-434-4354 h: 509-884-2574
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > >
>> > >
>> > >
>> > >
>> > >
>> >
>>
>>
>>
>>
>
>
