spark-user mailing list archives

From Mich Talebzadeh <mich.talebza...@gmail.com>
Subject Re: Hive on Spark vs Spark on Hive(HiveContext)
Date Thu, 01 Jul 2021 11:07:27 GMT
Hi Pralabh,

You need to check which Spark versions are currently compatible with your
Hive version and can successfully work as the Hive execution engine.

This is from an old settings file of mine that uses spark-1.3.1 as the
execution engine:

set spark.home=/data6/hduser/spark-1.3.1-bin-hadoop2.6;
--set spark.home=/usr/lib/spark-1.6.2-bin-hadoop2.6;
set spark.master=yarn-client;
set hive.execution.engine=spark;


Hive is great as a data warehouse, but the default MapReduce execution
engine it uses is Jurassic Park.

On the other hand, Spark has a performant built-in API for Hive. Otherwise
you can connect to Hive on a remote cluster through JDBC.

In Python you can do:

from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext


And use it like this:


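# Create a session that can see the Hive metastore (uses the SparkSession
# import above).
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Assumed value: the snippet as posted did not define this name.
fullyQualifiedTableName = "test.randomDataPy"

if spark.sql("SHOW TABLES IN test LIKE 'randomDataPy'").count() == 1:
  # Table exists: report its row count.
  rows = spark.sql(
      f"SELECT COUNT(1) FROM {fullyQualifiedTableName}"
  ).collect()[0][0]
  print("number of rows is ", rows)
else:
  # Table is missing: create it as a Parquet-backed Hive table.
  print("\nTable test.randomDataPy does not exist, creating table ")
  sqltext = """
     CREATE TABLE test.randomDataPy(
       ID INT
     , CLUSTERED INT
     , SCATTERED INT
     , RANDOMISED INT
     , RANDOM_STRING VARCHAR(50)
     , SMALL_VC VARCHAR(50)
     , PADDING  VARCHAR(4000)
    )
    STORED AS PARQUET
    """
  spark.sql(sqltext)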
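
For the JDBC route to a remote Hive cluster mentioned above, a minimal sketch could look like the following. It is not taken from a tested setup: the host name, port, and the helper names `hive_jdbc_options` and `read_remote_hive_table` are assumptions made for illustration, and it presumes a Hive JDBC driver jar is on the Spark classpath. Results through the generic JDBC source can depend on the driver version and on how Spark quotes identifiers for Hive, so treat it as a starting point to test rather than a recipe.

```python
def hive_jdbc_options(host, port, database, table, user=None, password=None):
    """Build the option map for Spark's generic JDBC data source.

    All argument values are supplied by the caller; nothing here is
    specific to any particular cluster.
    """
    options = {
        # HiveServer2 JDBC URL of the form jdbc:hive2://host:port/database
        "url": f"jdbc:hive2://{host}:{port}/{database}",
        # Driver class shipped in the Hive JDBC jar
        "driver": "org.apache.hive.jdbc.HiveDriver",
        "dbtable": f"{database}.{table}",
    }
    if user is not None:
        options["user"] = user
    if password is not None:
        options["password"] = password
    return options


def read_remote_hive_table(spark, **kwargs):
    """Return a DataFrame backed by a table on a remote HiveServer2.

    `spark` is an existing SparkSession; kwargs are passed straight to
    hive_jdbc_options().
    """
    return spark.read.format("jdbc").options(**hive_jdbc_options(**kwargs)).load()


# Hypothetical usage (host and port are placeholders):
# df = read_remote_hive_table(spark, host="hive-host.example.com", port=10000,
#                             database="test", table="randomDataPy")
# df.show(5)
```

Keeping the option map in its own function makes it easy to inspect or log the connection settings before Spark attempts to connect.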

HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 1 Jul 2021 at 11:50, Pralabh Kumar <pralabhkumar@gmail.com> wrote:

> Hi Mich,
>
> Thanks for replying. Your answer really helps. The comparison was done in
> 2016. I would like to know the latest comparison with Spark 3.0.
>
> Also, what you are suggesting is to migrate queries to Spark, which is
> HiveContext or Hive on Spark, which is what Facebook also did.
> Is that understanding correct?
>
> Regards
> Pralabh
>
> On Thu, 1 Jul 2021, 15:44 Mich Talebzadeh, <mich.talebzadeh@gmail.com>
> wrote:
>
>> Hi Pralabh,
>>
>> This question has been asked before :)
>>
>> A few years ago (late 2016), I gave a presentation on running Hive queries
>> on the Spark execution engine for Hortonworks.
>>
>>
>> https://www.slideshare.net/MichTalebzadeh1/query-engines-for-hive-mr-spark-tez-with-llap-considerations
>>
>> The issue you will face is version compatibility between Hive and Spark.
>>
>> My suggestion would be to use Spark as the massively parallel processing
>> engine and Hive as the storage layer. However, you need to test what can
>> and cannot be migrated.
>>
>> HTH
>>
>>
>> Mich
>>
>>
>>
>> On Thu, 1 Jul 2021 at 10:52, Pralabh Kumar <pralabhkumar@gmail.com>
>> wrote:
>>
>>> Hi Dev
>>>
>>> I have thousands of legacy Hive queries. As part of a plan to move to
>>> Spark, we are planning to migrate these Hive queries to Spark. There are
>>> two approaches:
>>>
>>>
>>>    1. One is Hive on Spark, which is similar to changing the
>>>    execution engine for Hive queries, as with Tez.
>>>    2. The other is migrating the Hive queries to HiveContext/Spark SQL,
>>>    an approach used by Facebook and presented at a Spark conference.
>>>    https://databricks.com/session/experiences-migrating-hive-workload-to-sparksql#:~:text=Spark%20SQL%20in%20Apache%20Spark,SQL%20with%20minimal%20user%20intervention
>>>    .
>>>
>>>
>>> Can you please advise which option to go for? I am personally
>>> inclined to go for option 2. It also allows the use of the latest Spark.
>>>
>>> Please help me with this, as there are not many comparisons available
>>> online that take Spark 3.0 into account.
>>>
>>> Regards
>>> Pralabh Kumar
>>>
>>>
>>>
