drill-user mailing list archives

From rahul challapalli <challapallira...@gmail.com>
Subject Re: Drill where clause vs Hive on non-partition column
Date Mon, 14 Nov 2016 23:43:16 GMT
I do not know of any plans to support filter pushdown when using the hive
plugin.
If you run analyze stats, Hive computes the table stats and stores them
in the Hive metastore for the relevant table. I believe Drill uses some of
these stats. However, running the analyze stats command does not alter (or
add to) the metadata in the parquet files themselves. The parquet-level
metadata should be written when the parquet file itself is created in the
first place.
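
To illustrate why the parquet-level stats matter for pushdown, here is a minimal Python sketch. It is not Drill's implementation; the RowGroupStats class, the row_groups_to_scan function, and the sample values are all hypothetical. It assumes each row group carries optional min/max values for the filtered column, and shows that a row group can only be skipped when those stats are present in the file.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for per-row-group min/max stats of one column."""
    name: str
    min_value: Optional[int]  # None models a file written without stats
    max_value: Optional[int]


def row_groups_to_scan(stats: List[RowGroupStats], wanted: int) -> List[str]:
    """Return the row groups that may contain rows where column = wanted."""
    selected = []
    for rg in stats:
        if rg.min_value is None or rg.max_value is None:
            # No stats in the file: the reader cannot rule this group out.
            selected.append(rg.name)
        elif rg.min_value <= wanted <= rg.max_value:
            # The value falls inside the [min, max] range: must be read.
            selected.append(rg.name)
        # Otherwise the whole row group is skipped without reading it.
    return selected


# Made-up row groups for illustration only.
row_groups = [
    RowGroupStats("rg0", 1, 100),
    RowGroupStats("rg1", 101, 200),
    RowGroupStats("rg2", None, None),
]

selected = row_groups_to_scan(row_groups, 123)
print(selected)
```

In this sketch, rg0 is pruned for my_column = 123, rg1 is read because its range covers the value, and rg2 is read because it has no stats to prune on. That last case is why stats computed later in the metastore do not help here.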

- Rahul

On Mon, Nov 14, 2016 at 3:32 PM, Sonny Heer <sonnyheer@gmail.com> wrote:

> Rahul,
>
> Thanks for the details.  Are there any plans to support filter pushdown for
> #1?  Do you know whether running analyze stats through hive on a parquet
> file would provide enough info to do the pushdown?
>
> Thanks again.
>
> On Mon, Nov 14, 2016 at 9:50 AM, rahul challapalli <
> challapallirahul@gmail.com> wrote:
>
> > Sonny,
> >
> > If the underlying data in the hive table is in parquet format, there are
> > 3 ways to query from drill:
> >
> > 1. Using the hive plugin: This does not support filter pushdown for any
> > formats (ORC, Parquet, Text, etc.)
> > 2. Directly querying the folder in maprfs/hdfs which contains the parquet
> > files using the DFS plugin: With DRILL-1950, we can now do a filter
> > pushdown into the parquet files. In order to take advantage of this
> > feature, the underlying parquet files should have the relevant stats.
> > This feature will only be available with the 1.9.0 release.
> > 3. Using drill's native parquet reader in conjunction with the hive
> > plugin (see store.hive.optimize_scan_with_native_readers): This allows
> > drill to fetch all the metadata about the hive table from the metastore,
> > and then drill uses its own parquet reader for actually reading the
> > files. This approach currently does not support parquet filter pushdown,
> > but this might be added in the next release after 1.9.0.
> >
> > - Rahul
> >
> > On Sun, Nov 13, 2016 at 11:06 AM, Sonny Heer <sonnyheer@gmail.com> wrote:
> >
> > > I'm running a drill query with a where clause on a non-partitioned
> > > column via the hive storage plugin.  This query inspects all partitions
> > > (kind of expected), but when I run the same query in Hive I can see a
> > > predicate passed down to the query plan.  This particular query is much
> > > faster in Hive vs Drill.  BTW these are parquet files.
> > >
> > > Hive:
> > >
> > > Stage-0
> > > Fetch Operator
> > > limit:-1
> > > Select Operator [SEL_2]
> > > outputColumnNames:["_col0"]
> > > Filter Operator [FIL_4]
> > > predicate:(my_column = 123) (type: boolean)
> > > TableScan [TS_0]
> > > alias:my_table
> > >
> > > Any idea on why this is?  My guess is Hive is storing hive-specific
> > > info in the parquet file since it was created through Hive.  Although
> > > it seems the drill-hive plugin should honor this too.  Not sure, but
> > > willing to look through code if someone can point me in the right
> > > direction.  Thanks!
> > >
> --
>
>
> Pushpinder S. Heer
> Senior Software Engineer
> m: 360-434-4354 h: 509-884-2574
>
