drill-user mailing list archives

From rahul challapalli <challapallira...@gmail.com>
Subject Re: Drill where clause vs Hive on non-partition column
Date Tue, 15 Nov 2016 18:25:52 GMT
Robert's suggestion involves using the DFS plugin. If you use the DFS plugin
directly instead of the hive plugin, then:

1. The DFS plugin has to determine the underlying data format on the fly.
2. The DFS plugin does not know the schema in advance. In the case of parquet,
drill gets this information from the parquet metadata. However, if the hive
table is backed by a csv file, then you need to cast the columns appropriately
in the query or create a view (as sketched below).
3. If the underlying hive table is partitioned, then drill does not know
anything about the partitions. However, since hive partitions are just
sub-directories, you can still modify the query to take advantage of drill's
directory-based partitioning (see the sketch after this list).
4. In terms of performance, I am not aware of any published benchmarks
comparing the hive plugin and the dfs plugin for the parquet format. But from
my general experience it appears as though the DFS plugin is faster.
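
For example, here is a rough sketch of what points 2 and 3 can look like with
the dfs plugin. The warehouse path, partition directory name, and column names
below are made-up placeholders, so adjust them to your own layout:

  -- point 2: cast text columns when the hive table is backed by csv
  SELECT CAST(columns[0] AS INT) AS my_column
  FROM dfs.`/user/hive/warehouse/my_table`;

  -- point 3: hive partition sub-directories (e.g. part_col=2016) show up to
  -- drill as dir0, dir1, ... and can be pruned in the where clause
  SELECT my_column
  FROM dfs.`/user/hive/warehouse/my_table`
  WHERE dir0 = 'part_col=2016';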

Also do not forget the 3rd option in my first response (Hive Plugin + Drill
native parquet reader). We do have plans to support filter pushdown for
this scenario in the future.
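
If you want to experiment with that option, it is just a session setting. As a
rough sketch, reusing the my_table/my_column names from the Hive plan you
posted (substitute your actual table and column):

  ALTER SESSION SET `store.hive.optimize_scan_with_native_readers` = true;

  SELECT my_column FROM hive.my_table WHERE my_column = 123;

  -- you can compare drill's plan against the Hive plan you posted
  EXPLAIN PLAN FOR SELECT my_column FROM hive.my_table WHERE my_column = 123;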

- Rahul

On Tue, Nov 15, 2016 at 8:01 AM, Sonny Heer <sonnyheer@gmail.com> wrote:

> Thanks Robert.
>
> "You can then use Drill to query the Hive table and get predicate pushdown"
>
> This is using the DFS plugin and going directly to the hive table folder?
>
> Can someone speak to what advantages there are to using the hive plugin vs
> going directly to dfs?
>
> On Tue, Nov 15, 2016 at 12:32 AM, Robert Hou <rhou@maprtech.com> wrote:
>
> > I have used Hive 1.2 and I have found that the stats in parquet files are
> > populated for some data types.  Integer, bigint, float, double, date work.
> > String does not seem to work.
> >
> > You can then use Drill to query the Hive table and get predicate pushdown
> > for simple compare filters.  This has the form "where col = value".  Other
> > standard operators are !=, <, <=, >, >=.  Compound filters can use "and/or"
> > logic.  This will be supported in Drill 1.9.
> >
> > In the future, we will add expressions and functions.
> >
> > Thanks.
> >
> > --Robert
> >
> >
> > On Mon, Nov 14, 2016 at 3:53 PM, Sonny Heer <sonnyheer@gmail.com> wrote:
> >
> > > Is there a way to do that during the creation of the parquet table?  Might
> > > be a hive question but all we do is 'STORED AS parquet' and then during
> > > insert set the parquet.* properties.  I'm just trying to see if #2 is an
> > > option for us to utilize filter pushdown via dfs.
> > >
> > > On Mon, Nov 14, 2016 at 3:43 PM, rahul challapalli <
> > > challapallirahul@gmail.com> wrote:
> > >
> > > > I do not know of any plans to support filter pushdown when using the
> > > > hive plugin.
> > > > If you run analyze stats then hive computes the table stats and stores
> > > > them in the hive metastore for the relevant table. I believe drill uses
> > > > some of these stats. However, running the analyze stats command does not
> > > > alter (or add) the metadata in the parquet files themselves. The
> > > > parquet-level metadata should be written when the parquet file itself is
> > > > created in the first place.
> > > >
> > > > - Rahul
> > > >
> > > > On Mon, Nov 14, 2016 at 3:32 PM, Sonny Heer <sonnyheer@gmail.com> wrote:
> > > >
> > > > > Rahul,
> > > > >
> > > > > Thanks for the details.  Are there any plans to support filter
> > > > > pushdown for #1?  Do you know if we run analyze stats through hive on
> > > > > a parquet file if that will have enough info to do the pushdown?
> > > > >
> > > > > Thanks again.
> > > > >
> > > > > On Mon, Nov 14, 2016 at 9:50 AM, rahul challapalli <
> > > > > challapallirahul@gmail.com> wrote:
> > > > >
> > > > > > Sonny,
> > > > > >
> > > > > > If the underlying data in the hive table is in parquet format, there
> > > > > > are 3 ways to query it from drill:
> > > > > >
> > > > > > 1. Using the hive plugin: This does not support filter pushdown for
> > > > > > any format (ORC, Parquet, Text, etc.)
> > > > > > 2. Directly querying the folder in maprfs/hdfs which contains the
> > > > > > parquet files using the DFS plugin: With DRILL-1950, we can now do a
> > > > > > filter pushdown into the parquet files. In order to take advantage of
> > > > > > this feature, the underlying parquet files should have the relevant
> > > > > > stats. This feature will only be available with the 1.9.0 release.
> > > > > > 3. Using drill's native parquet reader in conjunction with the hive
> > > > > > plugin (see store.hive.optimize_scan_with_native_readers): This allows
> > > > > > drill to fetch all the metadata about the hive table from the
> > > > > > metastore and then drill uses its own parquet reader for actually
> > > > > > reading the files. This approach currently does not support parquet
> > > > > > filter pushdown but this might be added in the next release after
> > > > > > 1.9.0.
> > > > > >
> > > > > > - Rahul
> > > > > >
> > > > > > On Sun, Nov 13, 2016 at 11:06 AM, Sonny Heer <sonnyheer@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > I'm running a drill query with a where clause on a non-partitioned
> > > > > > > column via the hive storage plugin.  This query inspects all
> > > > > > > partitions (kind of expected), but when I run the same query in
> > > > > > > Hive I can see a predicate passed down to the query plan.  This
> > > > > > > particular query is much faster in Hive vs Drill.  BTW these are
> > > > > > > parquet files.
> > > > > > >
> > > > > > > Hive:
> > > > > > >
> > > > > > > Stage-0
> > > > > > >   Fetch Operator
> > > > > > >     limit:-1
> > > > > > >     Select Operator [SEL_2]
> > > > > > >       outputColumnNames:["_col0"]
> > > > > > >       Filter Operator [FIL_4]
> > > > > > >         predicate:(my_column = 123) (type: boolean)
> > > > > > >         TableScan [TS_0]
> > > > > > >           alias:my_table
> > > > > > >
> > > > > > > Any idea on why this is?  My guess is Hive is storing hive-specific
> > > > > > > info in the parquet file since it was created through Hive.
> > > > > > > Although it seems the drill-hive plugin should honor this too.  Not
> > > > > > > sure, but willing to look through code if someone can point me in
> > > > > > > the right direction.  Thanks!
