spark-user mailing list archives

From James Pirz <james.p...@gmail.com>
Subject Re: Running SparkSql against Hive tables
Date Tue, 09 Jun 2015 17:55:57 GMT
I am trying to use Spark 1.3 (Standalone) against Hive 1.2 running on
Hadoop 2.6.
I looked at the ThriftServer2 logs and realized that the server was not
starting properly because it failed to create a server socket. In fact,
I had passed beeline the URI of my HiveServer2 service, launched from Hive,
so the beeline in Spark was talking directly to Hive's HiveServer2 and was
just using it as a Hive service.
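
One way to avoid the mix-up described above is to give Spark's Thrift server
its own port and point beeline at that port instead of at Hive's HiveServer2.
The sketch below is only an illustration: the port 10001, the host name, and
the /opt/spark fallback are assumptions, not values from this thread, and the
small run helper is a hypothetical wrapper that only prints the commands
unless DRY_RUN=0 is set.

```shell
# Assumed values (not from the thread); adjust to your cluster.
SPARK_MASTER="spark://spark-master-node-ip:7077"
THRIFT_HOST="spark-master-node-ip"   # hypothetical host running Spark's Thrift server
THRIFT_PORT=10001                    # any free port other than HiveServer2's 10000
SPARK_DIR="${SPARK_HOME:-/opt/spark}" # fallback path is an assumption

# Hypothetical helper: print the command by default, execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Start Spark's Thrift server on its own port so it does not collide with Hive's.
run "$SPARK_DIR/sbin/start-thriftserver.sh" \
  --master "$SPARK_MASTER" \
  --hiveconf hive.server2.thrift.port="$THRIFT_PORT"

# Connect beeline to Spark's Thrift server, not to Hive's HiveServer2.
run "$SPARK_DIR/bin/beeline" -u "jdbc:hive2://$THRIFT_HOST:$THRIFT_PORT"
```

Queries sent through this connection should then show up as jobs in the Spark
master UI rather than as MapReduce jobs in the HiveServer2 logs.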

I could fix starting the Thrift server in Spark (by changing its port), but I
guess the missing puzzle piece for me is: how does Spark SQL reuse the
tables already created in Hive? Do I have to write an application
that uses HiveContext and submit it to Spark for execution, or
is there a way to run SQL scripts directly from the command line (in
distributed mode, on the cluster), similar to the way one would use the Hive
(or Shark) command line by passing a query file with the -f flag? Looking at
the Spark SQL documentation, it seems that this is possible. Please correct
me if I am wrong.
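
For reference, the command-line route asked about above might look like the
sketch below, which uses the bin/spark-sql CLI with a query file. It assumes
that hive-site.xml has been placed in Spark's conf directory so that Spark SQL
can see the Hive metastore; the file name query.sql and the /opt/spark
fallback are hypothetical, and the run helper only prints the command unless
DRY_RUN=0 is set.

```shell
# Assumed values (not from the thread); adjust to your cluster.
SPARK_MASTER="spark://spark-master-node-ip:7077"
QUERY_FILE="query.sql"                # hypothetical SQL script
SPARK_DIR="${SPARK_HOME:-/opt/spark}" # fallback path is an assumption

# Hypothetical helper: print the command by default, execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Assumption: hive-site.xml is already in $SPARK_DIR/conf, so tables defined
# in Hive are resolved through the Hive metastore.
# Run the script on the cluster, much like `hive -f query.sql`.
run "$SPARK_DIR/bin/spark-sql" --master "$SPARK_MASTER" -f "$QUERY_FILE"
```

The spark-sql CLI talks to the metastore directly and does not go through the
Thrift server, so no beeline session is needed for this route.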

On Mon, Jun 8, 2015 at 6:56 PM, Cheng Lian <lian.cs.zju@gmail.com> wrote:

>
> On 6/9/15 8:42 AM, James Pirz wrote:
>
> Thanks for the help!
> I am actually trying Spark SQL to run queries against tables that I've
> defined in Hive.
>
>  I follow these steps:
> - I start hiveserver2 and in Spark, I start Spark's Thrift server by:
> $SPARK_HOME/sbin/start-thriftserver.sh --master
> spark://spark-master-node-ip:7077
>
>  - and I start beeline:
> $SPARK_HOME/bin/beeline
>
>  - In my beeline session, I connect to my running hiveserver2:
> !connect jdbc:hive2://hive-node-ip:10000
>
>  and I can run queries successfully. But based on the hiveserver2 logs, it
> seems it actually uses "Hadoop's MR" to run queries,  *not* Spark's
> workers. My goal is to access Hive's tables' data, but run queries through
> Spark SQL using Spark workers (not Hadoop).
>
> Hm, interesting. HiveThriftServer2 should never issue MR jobs to perform
> queries. I did receive two reports in the past which also say MR jobs
> instead of Spark jobs were issued to perform the SQL query. However, I only
> reproduced this issue in a rare corner case, which uses HTTP mode to
> connect to Hive 0.12.0. Apparently this isn't your case. Would you mind
> providing more details so that I can dig in?  The following information would
> be very helpful:
>
> 1. Hive version
> 2. A copy of your hive-site.xml
> 3. Hadoop version
> 4. Full HiveThriftServer2 log (which can be found in $SPARK_HOME/logs)
>
> Thanks in advance!
>
>
>  Is it possible to do that via Spark SQL (its CLI) or through its thrift
> server ? (I tried to find some basic examples in the documentation, but I
> was not able to) - Any suggestion or hint on how I can do that would be
> highly appreciated.
>
>  Thnx
>
> On Sun, Jun 7, 2015 at 6:39 AM, Cheng Lian <lian.cs.zju@gmail.com> wrote:
>
>>
>>
>> On 6/6/15 9:06 AM, James Pirz wrote:
>>
>> I am pretty new to Spark, and using Spark 1.3.1, I am trying to use
>> 'Spark SQL' to run some SQL scripts, on the cluster. I realized that for a
>> better performance, it is a good idea to use Parquet files. I have 2
>> questions regarding that:
>>
>>  1) If I wanna use Spark SQL against  *partitioned & bucketed* tables
>> with Parquet format in Hive, does the provided spark binary on the apache
>> website support that or do I need to build a new spark binary with some
>> additional flags ? (I found a note
>> <https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables>
>> in the documentation about enabling Hive support, but I could not fully
>> work out what the correct way of building is, if I need to build)
>>
>>  Yes, Hive support is now enabled by default for the binaries on the
>> website. However, Spark SQL doesn't support buckets yet.
>>
>>
>>  2) Does running Spark SQL against tables in Hive degrade
>> performance, so that it is better to load Parquet files directly to HDFS,
>> or is having Hive in the picture harmless ?
>>
>>  If you're using Parquet, then it should be fine since by default Spark
>> SQL uses its own native Parquet support to read Parquet Hive tables.
>>
>>
>>  Thnx
>>
>>
>>
>
>
