On 6/6/15 9:06 AM, James Pirz wrote:
I am pretty new to Spark and am using Spark 1.3.1. I am trying to use Spark SQL to run some SQL scripts on the cluster. I realized that, for better performance, it is a good idea to use Parquet files. I have two questions about that:

1) If I want to use Spark SQL against *partitioned and bucketed* tables stored in Parquet format in Hive, does the Spark binary provided on the Apache website support that, or do I need to build a new Spark binary with some additional flags? (I found a note in the documentation about enabling Hive support, but I could not work out the correct way to build it, if I need to build at all.)
Yes, Hive support is now enabled by default in the binaries on the website. However, Spark SQL doesn't support buckets yet.
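
As a rough illustration of the partitioned (but not bucketed) case, a table like the following could be declared and queried through Spark SQL with Hive support. The table name `events` and its columns are hypothetical placeholders, and `STORED AS PARQUET` assumes a metastore running Hive 0.13 or later. There is deliberately no `CLUSTERED BY` clause, since buckets are not supported.

```sql
-- Hypothetical table: the name and columns are placeholders, not taken from the question.
CREATE TABLE IF NOT EXISTS events (
  user_id BIGINT,
  action  STRING
)
PARTITIONED BY (dt STRING)   -- partition column; no CLUSTERED BY (bucketing) clause
STORED AS PARQUET;

-- Filtering on the partition column should let the scan skip the other partitions.
SELECT action, COUNT(*) AS n
FROM events
WHERE dt = '2015-06-01'
GROUP BY action;
```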

2) Does running Spark SQL against tables in Hive degrade performance, so that it is better to load the Parquet files directly into HDFS, or is having Hive in the picture harmless?
If you're using Parquet, it should be fine, since by default Spark SQL uses its own native Parquet support to read Parquet Hive tables.
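
The native Parquet read path is controlled, as far as I know, by the `spark.sql.hive.convertMetastoreParquet` property, which defaults to `true`. A minimal sketch of toggling it from a SQL script to compare the two read paths (the table name `events` is a hypothetical placeholder):

```sql
-- Default: read Parquet Hive tables with Spark SQL's built-in Parquet support.
SET spark.sql.hive.convertMetastoreParquet=true;
SELECT COUNT(*) FROM events;   -- hypothetical table name

-- For comparison only: fall back to the Hive SerDe for Parquet tables.
SET spark.sql.hive.convertMetastoreParquet=false;
SELECT COUNT(*) FROM events;
```

Leaving the property at its default is the configuration the answer above describes; the second setting is only useful for checking whether the Hive SerDe path behaves differently on your data.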