spark-user mailing list archives

From Subhash Sriram <subhash.sri...@gmail.com>
Subject Re: spark-sql use case beginner question
Date Thu, 09 Mar 2017 14:48:17 GMT
We have a similar use case. We use the DataFrame API to cache data out of
Hive tables, and then run pretty complex scripts on them. You can register
your Hive UDFs to be used within Spark SQL statements if you want.

Something like this:

sqlContext.sql("CREATE TEMPORARY FUNCTION <udf_name> as '<udf class>'")

If you had a table called Prices in the Stocks Hive db, you could do this:

val pricesDf = sqlContext.table("Stocks.Prices")
pricesDf.createOrReplaceTempView("tmp_prices")

Then you can run whatever SQL you want against the tmp_prices view:

sqlContext.sql("select udf_name(), ..... from tmp_prices")
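
Putting those steps together, here is a minimal sketch, assuming Spark 2.x with Hive support and a UDF jar that is already on the classpath (for example via --jars). The function name normalize_price, the class com.example.udf.NormalizePrice, and the columns symbol and close are made up for illustration; substitute your own.

```scala
import org.apache.spark.sql.SparkSession

object HiveUdfSketch {
  def main(args: Array[String]): Unit = {
    // Hive support is needed to read Hive tables and to register Hive UDF classes.
    val spark = SparkSession.builder()
      .appName("hive-udf-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical UDF name and class: replace with your own Hive UDF.
    spark.sql(
      "CREATE TEMPORARY FUNCTION normalize_price AS 'com.example.udf.NormalizePrice'")

    // Load the Hive table as a DataFrame, mark it for caching, and expose it to SQL.
    val pricesDf = spark.table("Stocks.Prices")
    pricesDf.cache()
    pricesDf.createOrReplaceTempView("tmp_prices")

    // Column names (symbol, close) are assumptions about the table's schema.
    val result = spark.sql(
      "SELECT symbol, normalize_price(close) AS norm_close FROM tmp_prices")
    result.show(10)

    spark.stop()
  }
}
```

In spark-shell the session already exists as `spark`, so only the statements inside main would be needed there.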

There are a lot of SQL functions available:

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
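
If you would rather call those functions from the DataFrame API than from a SQL string, a small sketch could look like the one below. The data and the column names (symbol, close) are invented so the example runs locally without Hive.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, max, upper}

object BuiltinFunctionsSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; no Hive metastore is required here.
    val spark = SparkSession.builder()
      .appName("builtin-functions-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows standing in for a prices table.
    val prices = Seq(("aapl", 138.5), ("aapl", 139.0), ("msft", 64.7))
      .toDF("symbol", "close")

    // upper, avg and max all come from org.apache.spark.sql.functions.
    val summary = prices
      .groupBy(upper(col("symbol")).as("symbol"))
      .agg(avg("close").as("avg_close"), max("close").as("max_close"))

    summary.show()
    spark.stop()
  }
}
```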

I hope that helps.

Thanks,
Subhash

On Thu, Mar 9, 2017 at 2:28 AM, nancy henry <nancyhenry6542@gmail.com>
wrote:

> Okay, what is the difference between setting hive.execution.engine=spark
> and running the script through hivecontext.sql?
>
>
> On Mar 9, 2017 8:52 AM, "ayan guha" <guha.ayan@gmail.com> wrote:
>
>> Hi
>>
>> Subject to your version of Hive & Spark, you may want to set
>> hive.execution.engine=spark as beeline command line parameter, assuming you
>> are running hive scripts using beeline command line (which is suggested
>> practice for security purposes).
>>
>>
>>
>> On Thu, Mar 9, 2017 at 2:09 PM, nancy henry <nancyhenry6542@gmail.com>
>> wrote:
>>
>>>
>>> Hi Team,
>>>
>>> Basically, all our data is in Hive tables, and until now we have been
>>> processing it with Hive on MR. Now that we have HiveContext, which can run
>>> Hive queries on Spark, we are running all these complex Hive scripts with a
>>> hivecontext.sql(sc.textfile(hivescript)) kind of approach, i.e. running
>>> Hive queries on Spark without coding anything in Scala yet. Even so, just
>>> running the Hive queries on Spark shows a big difference in time compared
>>> with running them on MR.
>>>
>>> So, since we already have the Hive scripts, we could just run those
>>> complex scripts using hc.sql, as hc.sql is able to do it.
>>>
>>> Or is this not best practice? Even though Spark can do it, is it still
>>> better to load all those individual Hive tables in Spark, make RDDs, and
>>> write Scala code to get the same functionality we have in Hive?
>>>
>>> It's becoming difficult for us to choose whether to leave the work of
>>> running the complex scripts to hc.sql, or to code it in Scala. Will the
>>> manual effort be worth it in terms of performance?
>>>
>>> Example of our scripts:
>>> use db;
>>> create tempfunction1 as com.fgh.jkl.TestFunction;
>>>
>>> create desttable in hive;
>>> insert overwrite desttable select (big complex transformations and
>>> usage of hive udf)
>>> from table1,table2,table3 join table4 on some complex condition and join
>>> table7 on another complex condition where complex filtering
>>>
>>> So please help: what would be the best approach, and why should I not give
>>> the entire script to HiveContext to make its own RDDs and run on Spark, if
>>> we are able to do it?
>>>
>>> Because all the examples I see online only show hc.sql("select * from
>>> table1") and nothing more complex than that.
>>>
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
