drill-user mailing list archives

From Sonny Heer <sonnyh...@gmail.com>
Subject Re: Drill where clause vs Hive on non-partition column
Date Mon, 14 Nov 2016 23:32:48 GMT
Rahul,

Thanks for the details.  Are there any plans to support filter pushdown for
#1?  Also, if we run analyze stats through Hive on a parquet file, do you
know whether that would provide enough info to do the pushdown?

Thanks again.

On Mon, Nov 14, 2016 at 9:50 AM, rahul challapalli <
challapallirahul@gmail.com> wrote:

> Sonny,
>
> If the underlying data in the hive table is in parquet format, there are 3
> ways to query from drill :
>
> 1. Using the hive plugin : This does not support filter pushdown for any
> format (ORC, Parquet, Text, etc.)
> 2. Directly Querying the folder in maprfs/hdfs which contains the parquet
> files using DFS plugin: With DRILL-1950, we can now do a filter pushdown
> into the parquet files. In order to take advantage of this feature, the
> underlying parquet files should have the relevant stats. This feature will
> only be available with the 1.9.0 release
> 3. Using drill's native parquet reader in conjunction with the hive
> plugin (See store.hive.optimize_scan_with_native_readers) : This allows
> drill to fetch all the metadata about the hive table from the metastore and
> then drill uses its own parquet reader for actually reading the files. This
> approach currently does not support parquet filter pushdown but this might
> be added in the next release after 1.9.0.
>
> - Rahul
>
> On Sun, Nov 13, 2016 at 11:06 AM, Sonny Heer <sonnyheer@gmail.com> wrote:
>
> > I'm running a drill query with a where clause on a non-partitioned column
> > via the hive storage plugin.  This query inspects all partitions (kind of
> > expected), but when I run the same query in Hive I can see a predicate
> > passed down in the query plan.  This particular query is much faster in
> > Hive than in Drill.  BTW these are parquet files.
> >
> > Hive:
> >
> > Stage-0
> > Fetch Operator
> > limit:-1
> > Select Operator [SEL_2]
> > outputColumnNames:["_col0"]
> > Filter Operator [FIL_4]
> > predicate:(my_column = 123) (type: boolean)
> > TableScan [TS_0]
> > alias:my_table
> >
> > Any idea why this is?  My guess is that Hive is storing Hive-specific info
> > in the parquet file, since it was created through Hive.  Although it seems
> > the drill-hive plugin should honor this too.  Not sure, but I'm willing to
> > look through the code if someone can point me in the right direction.
> > Thanks!
> >
> > --
> >
>
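
For context on option 2 in the quoted reply: pushing a filter into parquet files depends on the files carrying the relevant stats. Below is a minimal sketch, in plain Python, of how per-row-group min/max stats can let a reader skip data for a predicate like my_column = 123. This is not Drill's code. The RowGroupStats class, the function names, and the sample values are all hypothetical and only illustrate the general idea.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for the min/max stats of one column in one row group."""
    min_value: int
    max_value: int


def may_contain(stats: RowGroupStats, target: int) -> bool:
    # An equality predicate can only match inside the [min, max] range.
    return stats.min_value <= target <= stats.max_value


def row_groups_to_scan(all_stats: List[RowGroupStats], target: int) -> List[int]:
    # Keep the indexes of row groups that cannot be ruled out by their stats.
    return [i for i, s in enumerate(all_stats) if may_contain(s, target)]


# Made-up stats for three row groups of my_column.
stats = [RowGroupStats(1, 100), RowGroupStats(101, 200), RowGroupStats(150, 300)]
print(row_groups_to_scan(stats, 123))  # [1]
```

Without such stats every row group has to be read and filtered afterwards, which would be consistent with the query inspecting all partitions.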



-- 


Pushpinder S. Heer
Senior Software Engineer
m: 360-434-4354 h: 509-884-2574
