spark-user mailing list archives

From Cheng Lian <lian.cs....@gmail.com>
Subject Re: Running SparkSql against Hive tables
Date Tue, 09 Jun 2015 01:56:51 GMT

On 6/9/15 8:42 AM, James Pirz wrote:
> Thanks for the help!
> I am actually trying Spark SQL to run queries against tables that I've 
> defined in Hive.
>
> I follow these steps:
> - I start hiveserver2 and in Spark, I start Spark's Thrift server by:
> $SPARK_HOME/sbin/start-thriftserver.sh --master 
> spark://spark-master-node-ip:7077
>
> - and I start beeline:
> $SPARK_HOME/bin/beeline
>
> - In my beeline session, I connect to my running hiveserver2
> !connect jdbc:hive2://hive-node-ip:10000
>
> and I can run queries successfully. But based on the hiveserver2 logs, it 
> seems it actually uses Hadoop MR to run queries, *not* Spark's 
> workers. My goal is to access the data in Hive's tables, but run queries 
> through Spark SQL using Spark workers (not Hadoop).
Hm, interesting. HiveThriftServer2 should never issue MR jobs to execute 
queries. I have received two reports in the past that also said MR jobs 
rather than Spark jobs were issued to execute SQL queries. However, I 
could only reproduce this issue in a rare corner case: using HTTP mode 
to connect to Hive 0.12.0. Apparently that isn't your case. Would you 
mind providing more details so that I can dig in? The following 
information would be very helpful:

1. Hive version
2. A copy of your hive-site.xml
3. Hadoop version
4. Full HiveThriftServer2 log (which can be found in $SPARK_HOME/logs)

Thanks in advance!
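
For reference, queries run on Spark workers only when beeline is 
connected to Spark's own HiveThriftServer2 rather than to hiveserver2. 
A minimal sketch, assuming the Thrift server listens on its default 
port 10000 on the node where start-thriftserver.sh was run (host names 
are placeholders):

$SPARK_HOME/sbin/start-thriftserver.sh --master spark://spark-master-node-ip:7077
$SPARK_HOME/bin/beeline
beeline> !connect jdbc:hive2://spark-master-node-ip:10000

With this connection, submitted statements should show up as jobs in 
the Spark web UI rather than as MR jobs in the Hadoop logs.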
>
> Is it possible to do that via Spark SQL (its CLI) or through its 
> Thrift server? (I tried to find some basic examples in the 
> documentation, but was not able to.) Any suggestion or hint on how 
> I can do that would be highly appreciated.
>
> Thnx
>
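
Regarding the CLI question: yes, the Spark SQL CLI can query Hive 
tables directly. A minimal sketch, assuming hive-site.xml has been 
copied into $SPARK_HOME/conf so the CLI can find your Hive metastore 
(the table name is a placeholder):

$SPARK_HOME/bin/spark-sql --master spark://spark-master-node-ip:7077 \
  -e "SELECT COUNT(*) FROM my_hive_table"

Unlike going through hiveserver2, this executes the query with Spark 
SQL on Spark workers.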
> On Sun, Jun 7, 2015 at 6:39 AM, Cheng Lian <lian.cs.zju@gmail.com> wrote:
>
>
>
>     On 6/6/15 9:06 AM, James Pirz wrote:
>>     I am pretty new to Spark. Using Spark 1.3.1, I am trying to
>>     use Spark SQL to run some SQL scripts on the cluster. I
>>     realized that for better performance, it is a good idea to use
>>     Parquet files. I have 2 questions regarding that:
>>
>>     1) If I want to use Spark SQL against *partitioned & bucketed*
>>     tables in Parquet format in Hive, does the provided Spark
>>     binary on the Apache website support that, or do I need to build a
>>     new Spark binary with some additional flags? (I found a note
>>     <https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables>
>>     in the documentation about enabling Hive support, but I could not
>>     fully work out what the correct way of building is, if I need to
>>     build.)
>     Yes, Hive support is now enabled by default in the binaries on
>     the website. However, Spark SQL doesn't support buckets yet.
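>
>     (If you do need to build from source, Hive support comes from the
>     Hive build profiles. A minimal sketch, assuming a Maven build of
>     Spark 1.3.x:
>
>     mvn -Phive -Phive-thriftserver -DskipTests clean package
>
>     The pre-built binaries on the website are already built this way.)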
>>
>>     2) Does running Spark SQL against tables in Hive degrade
>>     performance? Is it better to load Parquet files directly
>>     into HDFS, or is having Hive in the picture harmless?
>     If you're using Parquet, then it should be fine since by default
>     Spark SQL uses its own native Parquet support to read Parquet Hive
>     tables.
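>
>     (The setting behind this behavior is
>     spark.sql.hive.convertMetastoreParquet, which defaults to true.
>     A minimal sketch from a beeline session; the table name is a
>     placeholder:
>
>     SET spark.sql.hive.convertMetastoreParquet=true;
>     SELECT COUNT(*) FROM my_parquet_table;
>
>     With the flag on, scans of Parquet Hive tables go through Spark
>     SQL's native Parquet reader instead of the Hive SerDe.)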
>>
>>     Thnx
>>
>
>

