spark-user mailing list archives

From: Cheng Lian <lian.cs....@gmail.com>
Subject: Re: Running SparkSql against Hive tables
Date: Wed, 10 Jun 2015 07:47:35 GMT


On 6/10/15 1:55 AM, James Pirz wrote:
> I am trying to use Spark 1.3 (Standalone) against Hive 1.2 running on
> Hadoop 2.6.
> I looked at the ThriftServer2 logs and realized that the server was
> not starting properly, because of a failure in creating a server socket.
> In fact, I had passed the URI of my HiveServer2 service, launched from
> Hive, so the Beeline in Spark was talking directly to Hive's
> HiveServer2 and was just using it as a plain Hive service.
Good to know it's not a bug :)
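(For anyone who hits the same port clash: HiveThriftServer2 can be moved to
a free port when it is started. A minimal sketch, assuming standalone mode
and that 10001 happens to be unused on the machine running the server:

    # spark-master-node-ip and port 10001 are placeholders
    $SPARK_HOME/sbin/start-thriftserver.sh \
      --master spark://spark-master-node-ip:7077 \
      --hiveconf hive.server2.thrift.port=10001

Beeline should then be pointed at that port, e.g.
!connect jdbc:hive2://thrift-server-host:10001, rather than at Hive's own
HiveServer2 endpoint.)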
>
> I could fix starting the ThriftServer2 in Spark (by changing the port),
> but I guess the missing puzzle piece for me is: how does Spark SQL
> re-use the tables already created in Hive? I mean, do I have to write
> an application that uses HiveContext and submit it to Spark for
> execution, or is there a way to run SQL scripts directly from the
> command line (in distributed mode and on the cluster), similar to the
> way one would use the Hive (or Shark) command line by passing a query
> file with the -f flag? Looking at the Spark SQL documentation, it seems
> that this is possible. Please correct me if I am wrong.
Yes, Spark SQL can access Hive tables by communicating with the Hive 
metastore to retrieve the metadata of those tables. After starting 
HiveThriftServer2, you should be able to use Beeline to run SQL scripts.
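A rough sketch of what that looks like end to end (the paths, the host name
and my_queries.sql below are placeholders):

    # Let Spark SQL see the tables registered in the existing Hive metastore
    # by sharing Hive's configuration with Spark
    cp /path/to/hive/conf/hive-site.xml $SPARK_HOME/conf/

    # Run a SQL script non-interactively through Beeline against the Spark
    # HiveThriftServer2, much like `hive -f`; use whatever port the server
    # is actually listening on
    $SPARK_HOME/bin/beeline -u jdbc:hive2://thrift-server-host:10001 -f my_queries.sql

    # The Spark SQL CLI also accepts a script file via -f
    $SPARK_HOME/bin/spark-sql --master spark://spark-master-node-ip:7077 -f my_queries.sql

So you don't have to write a HiveContext application just for this, although
that would work as well.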
>
> On Mon, Jun 8, 2015 at 6:56 PM, Cheng Lian <lian.cs.zju@gmail.com> wrote:
>
>
>     On 6/9/15 8:42 AM, James Pirz wrote:
>>     Thanks for the help!
>>     I am actually trying Spark SQL to run queries against tables that
>>     I've defined in Hive.
>>
>>     I follow these steps:
>>     - I start hiveserver2, and in Spark I start Spark's Thrift server
>>     with:
>>     $SPARK_HOME/sbin/start-thriftserver.sh --master
>>     spark://spark-master-node-ip:7077
>>
>>     - and I start beeline:
>>     $SPARK_HOME/bin/beeline
>>
>>     - In my beeline session, I connect to my running hiveserver2
>>     !connect jdbc:hive2://hive-node-ip:10000
>>
>>     and I can run queries successfully. But based on the hiveserver2
>>     logs, it seems it actually uses Hadoop's MR to run queries,
>>     *not* Spark's workers. My goal is to access the data in Hive's
>>     tables, but run queries through Spark SQL using Spark workers (not
>>     Hadoop).
>     Hm, interesting. HiveThriftServer2 should never issue MR jobs to
>     perform queries. I did receive two reports in the past which also
>     said that MR jobs instead of Spark jobs were issued to perform the
>     SQL query. However, I only reproduced this issue in a rare corner
>     case, which uses HTTP mode to connect to Hive 0.12.0. Apparently
>     this isn't your case. Would you mind providing more details so
>     that I can dig in? The following information would be very helpful:
>
>     1. Hive version
>     2. A copy of your hive-site.xml
>     3. Hadoop version
>     4. Full HiveThriftServer2 log (which can be found in $SPARK_HOME/logs)
>
>     Thanks in advance!
>>
>>     Is it possible to do that via Spark SQL (its CLI) or through its
>>     Thrift server? (I tried to find some basic examples in the
>>     documentation, but was not able to.) Any suggestion or hint on
>>     how I can do that would be highly appreciated.
>>
>>     Thnx
>>
>>     On Sun, Jun 7, 2015 at 6:39 AM, Cheng Lian <lian.cs.zju@gmail.com> wrote:
>>
>>
>>
>>         On 6/6/15 9:06 AM, James Pirz wrote:
>>>         I am pretty new to Spark. Using Spark 1.3.1, I am trying
>>>         to use Spark SQL to run some SQL scripts on the cluster.
>>>         I realized that for better performance it is a good idea
>>>         to use Parquet files. I have 2 questions regarding that:
>>>
>>>         1) If I want to use Spark SQL against *partitioned &
>>>         bucketed* tables in Parquet format in Hive, does the
>>>         Spark binary provided on the Apache website support that, or
>>>         do I need to build a new Spark binary with some additional
>>>         flags? (I found a note
>>>         <https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables>
>>>         in the documentation about enabling Hive support, but I could
>>>         not fully work out what the correct way of building is, if
>>>         I need to build.)
>>         Yes, Hive support is enabled by default now for the binaries
>>         on the website. However, Spark SQL doesn't support buckets yet.
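>>         (One quick sanity check, assuming the standard pre-built
>>         layout, is to look for the DataNucleus jars that the Hive
>>         metastore client needs; the pre-built packages on the website
>>         ship them:
>>
>>             ls $SPARK_HOME/lib/datanucleus-*.jar
>>
>>         Not something you normally need to do, just a way to confirm
>>         a binary has Hive support.)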
>>>
>>>         2) Does running Spark SQL against tables in Hive degrade
>>>         performance, and is it better to load Parquet files
>>>         directly into HDFS, or is having Hive in the picture harmless?
>>         If you're using Parquet, then it should be fine since by
>>         default Spark SQL uses its own native Parquet support to read
>>         Parquet Hive tables.
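>>         (For reference, that native code path is controlled by the
>>         spark.sql.hive.convertMetastoreParquet option, which defaults
>>         to true; it can be toggled per session from Beeline if you
>>         ever want to compare against Hive's SerDe path:
>>
>>             -- default is true; false falls back to Hive's SerDe
>>             SET spark.sql.hive.convertMetastoreParquet=false;
>>         )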
>>>
>>>         Thnx
>>>
>>
>>
>
>

