spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: HiveContext
Date Fri, 01 Jul 2016 14:35:53 GMT
Hi,

In general, if your ORC table is not bucketed, predicate pushdown is not
going to do much.

The idea is that with predicate pushdown you only read the data from the
partition concerned and avoid expensive table scans.

ORC provides what is known as a storage index at file, stripe and row-group
levels (a row group is 10,000 rows by default). That is just statistics,
such as count, min, max and sum, for each column.
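
As a rough illustration of how such min/max statistics let a reader skip
data, here is a toy sketch in plain Python. It is not ORC's actual reader:
RowGroup, make_group and scan_age_less_than are made-up names, and the
groups hold 3 rows each rather than 10,000.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """A block of rows plus the min/max statistics stored alongside it."""
    min_age: int
    max_age: int
    ages: List[int]


def make_group(ages: List[int]) -> RowGroup:
    # Statistics are computed once, at write time, and stored with the data.
    return RowGroup(min(ages), max(ages), list(ages))


def scan_age_less_than(groups: List[RowGroup], limit: int) -> Tuple[List[int], int]:
    """Return (matching ages, number of row groups actually read)."""
    matches: List[int] = []
    groups_read = 0
    for group in groups:
        # If even the smallest value fails "age < limit", no row in this
        # group can match, so the group is skipped without reading its rows.
        if group.min_age >= limit:
            continue
        groups_read += 1
        matches.extend(age for age in group.ages if age < limit)
    return matches, groups_read


groups = [make_group([3, 9, 14]), make_group([20, 35, 41]), make_group([12, 50, 99])]
matches, groups_read = scan_age_less_than(groups, 15)
print(matches, groups_read)
```

Only two of the three groups are read here. The third group still has to be
read in full even though a single row matches, which is why the layout of
the data matters for how much the statistics can save.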

Now going back to practicality, you can do a simple test. Log in to Hive
and run your query with EXPLAIN EXTENDED select ... and see what the plan
shows.

Then try it from Spark. As far as I am aware, Spark will not rely on
anything Hive-wise except the metadata. It will use its DAG and in-memory
capability to run the query.

Just try it and see.
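
A minimal PySpark sketch of the Spark side of that test, assuming a Spark
1.x installation with Hive support and an ORC dataset that has an age
column. The function name explain_age_filter and its parameters are
hypothetical. Whether the printed physical plan lists the filter as pushed
down depends on your Spark version and on how the table is read.

```python
def explain_age_filter(orc_path, max_age=15):
    """Print Spark's extended plan for 'age < max_age' over an ORC path."""
    # Imported lazily so the definition loads even where pyspark is absent.
    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext.getOrCreate()
    sqlContext = HiveContext(sc)
    # Same setting as in the original question.
    sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

    df = sqlContext.read.orc(orc_path)
    # explain(True) prints the logical and physical plans to stdout.
    df.filter(df["age"] < max_age).explain(True)
```

Call it with the location of your ORC data and compare the plan with what
Hive's EXPLAIN EXTENDED showed.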

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 1 July 2016 at 11:00, manish jaiswal <manishsrm14@gmail.com> wrote:

> Hi,
>
> Using Spark's HiveContext, we read all rows where age was between 0 and
> 100, even though we requested only rows where age was less than 15. Such
> full table scanning is an expensive operation.
>
> ORC avoids this type of overhead by using predicate push-down with three
> levels of built-in indexes within each file: file level, stripe level, and
> row level:
>
>    - File and stripe level statistics are in the file footer, making it
>      easy to determine if the rest of the file needs to be read.
>    - Row level indexes include column statistics for each row group and
>      position, for seeking to the start of the row group.
>
> ORC utilizes these indexes to move the filter operation to the data
> loading phase, by reading only data that potentially includes required rows.
>
>
> My question is: when we run a query through hiveContext against an ORC
> table using Spark with
>
> sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
>
> how will it perform? Will it
>
> 1. fetch only those records from the ORC file that match the query, or
>
> 2. load the ORC file into Spark and then run a Spark job using predicate
>    push-down to give you the records?
>
> (I am aware that hiveContext gives Spark only the metadata and the
> location of the data.)
>
>
> Thanks
>
> Manish
>
>
