spark-user mailing list archives

From Anand Mohan Tumuluri <>
Subject Re: Spark SQL HiveContext Projection Pushdown
Date Thu, 09 Oct 2014 15:58:46 GMT
Thanks a lot. It is working like a charm now. Even predicate pushdown on
optional fields is working.
Is there any plan to support windowing queries? I know that Shark supported
them in its last release, so I expected them to be included already.

Best regards,
Anand Mohan
On Oct 8, 2014 11:57 AM, "Michael Armbrust" <> wrote:

> We are working to improve the integration here, but I can recommend the
> following when running Spark 1.1: create an external table and
> set spark.sql.hive.convertMetastoreParquet=true
> Note that even with a HiveContext we don't support window functions yet.
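For reference, a minimal sketch of the setup Michael describes, in Spark 1.1
Scala; the table name, schema and HDFS path are made-up placeholders, and the
DDL assumes a Hive version that understands STORED AS PARQUET (older Hive
versions need the Parquet SerDe/input/output format classes spelled out):

    import org.apache.spark.sql.hive.HiveContext

    // sc is an existing SparkContext (e.g. the one provided by spark-shell).
    val hiveContext = new HiveContext(sc)

    // Scan Metastore Parquet tables with Spark SQL's native Parquet support
    // (instead of the Hive SerDe path) so that column pruning applies.
    hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")

    // External table over the existing Parquet files (hypothetical schema/path).
    hiveContext.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS events (id BIGINT, ts BIGINT, payload STRING)
      STORED AS PARQUET
      LOCATION 'hdfs:///data/events'
    """)

    // Only the referenced columns should now be read from the Parquet files.
    hiveContext.sql("SELECT id, ts FROM events WHERE id > 0").take(10)
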
> On Wed, Oct 8, 2014 at 10:41 AM, Anand Mohan <> wrote:
>> We have our analytics infra built on Spark and Parquet.
>> We are trying to replace some of our queries that use the direct Spark
>> RDD API with SQL, based on either Spark SQL or HiveQL.
>> Our motivation was to take advantage of the transparent projection and
>> predicate pushdown offered by Spark SQL and to eliminate the need to persist
>> the RDD in memory. (Cache invalidation turned out to be a big problem for
>> us.)
>> The tests below were done with Spark 1.1.0 on CDH 5.1.0.
>> 1. Spark SQL's (SQLContext) Parquet support was excellent for our case.
>> The ability to query in SQL and apply Scala functions as UDFs within the SQL is
>> extremely convenient. Projection pushdown works flawlessly; we are not so sure
>> about predicate pushdown
>> (about 90% of the fields in our dataset are optional, and I remember Michael
>> Armbrust telling me that this is a bug in Parquet in that it doesn't allow
>> predicate pushdown for optional fields).
>> However, we do timestamp-based duplicate removal, which requires
>> windowing queries, and these do not work in SQLContext.sql's parsing mode.
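A rough sketch of the SQLContext usage described in point 1 above; the path,
table name and UDF are illustrative placeholders, not the poster's actual code:

    import org.apache.spark.sql.SQLContext

    // sc is an existing SparkContext (e.g. the one provided by spark-shell).
    val sqlContext = new SQLContext(sc)

    // Load the Parquet data (placeholder path) and register it as a table.
    val events = sqlContext.parquetFile("hdfs:///data/events")
    events.registerTempTable("events")

    // Register a plain Scala function as a SQL UDF.
    sqlContext.registerFunction("toHour", (ts: Long) => ts / 3600000L)

    // Only the `id` and `ts` columns should be read from the Parquet files.
    sqlContext.sql("SELECT id, toHour(ts) AS hour FROM events").take(10)
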
>> 2. We then tried HiveQL using HiveContext by creating a Hive external
>> table backed by the same Parquet data. However, in this mode, projection
>> pushdown doesn't seem to work, and it ends up reading the whole Parquet dataset
>> for each query (which slows things down a lot).
>> Please see the attached screenshot of this.
>> Hive itself doesn't seem to have any issues with projection pushdown,
>> so this is weird. Is this due to a configuration problem?
>> Thanks in advance,
>> Anand Mohan
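One way to check whether column pruning actually kicks in on the HiveContext
side is to inspect the physical plan; a sketch, reusing the hypothetical
events table from the earlier example:

    // A ParquetTableScan listing only the requested columns means projection
    // pushdown is in effect; a HiveTableScan over all columns means whole
    // rows are being read from Parquet.
    val plan = hiveContext.sql("SELECT id FROM events").queryExecution.executedPlan
    println(plan)
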
>> (Attachment: SparkSQLHiveParquet.png, 316K)
