We have a similar use case. We use the DataFrame API to cache data out of Hive tables, and then run pretty complex scripts on them. You can register your Hive UDFs to be used within Spark SQL statements if you want.

Something like this:

sqlContext.sql("CREATE TEMPORARY FUNCTION <udf_name> as '<udf class>'")

If you had a table called Prices in the Stocks Hive db, you could do this:

val pricesDf = sqlContext.table("Stocks.Prices")

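Then register the DataFrame as a temporary table so that SQL can refer to it by name (registerTempTable is the Spark 1.x call; on Spark 2.x, createOrReplaceTempView does the same job):

pricesDf.registerTempTable("tmp_prices")

After that, you can run whatever SQL you want against it: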

sqlContext.sql("select udf_name(), ..... from tmp_prices")
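Putting the steps above together, a minimal end-to-end sketch might look like the one below. It assumes a Spark 1.6-style HiveContext (with the spark-hive module on the classpath) and a reachable Hive metastore. The UDF name my_udf, the class com.example.MyUdf, and the column price are hypothetical placeholders, not names from our actual jobs. Note that the DataFrame has to be registered under the name tmp_prices before a SQL statement can refer to it.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object PricesExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("prices-example"))
    val sqlContext = new HiveContext(sc)

    // Hypothetical UDF name and class; substitute your own Hive UDF.
    sqlContext.sql("CREATE TEMPORARY FUNCTION my_udf AS 'com.example.MyUdf'")

    // Load the Hive table as a DataFrame and mark it for caching,
    // so that repeated queries do not re-read it from Hive.
    val pricesDf = sqlContext.table("Stocks.Prices")
    pricesDf.cache()

    // Make the DataFrame visible to SQL under the name tmp_prices.
    pricesDf.registerTempTable("tmp_prices")

    // "price" is an assumed column name, used only for illustration.
    val result = sqlContext.sql("SELECT my_udf(price) AS adjusted FROM tmp_prices")
    result.show()

    sc.stop()
  }
}
```

The cache is lazy: the table is only materialized in memory the first time an action such as show() runs against it.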

There are a lot of built-in SQL functions available as well.

I hope that helps.


On Thu, Mar 9, 2017 at 2:28 AM, nancy henry <nancyhenry6542@gmail.com> wrote:
Okay, what is the difference between setting hive.execution.engine=spark
and running the script through hivecontext.sql?

On Mar 9, 2017 8:52 AM, "ayan guha" <guha.ayan@gmail.com> wrote:

Subject to your version of Hive & Spark, you may want to set hive.execution.engine=spark as a beeline command-line parameter, assuming you are running Hive scripts using the beeline command line (which is the suggested practice for security purposes).


On Thu, Mar 9, 2017 at 2:09 PM, nancy henry <nancyhenry6542@gmail.com> wrote:

Hi Team,

Basically, we have all our data in Hive tables and have been processing it in Hive on MR until now. Now that we have HiveContext, which can run Hive queries on Spark, we are making all these complex Hive scripts run with a hivecontext.sql(sc.textfile(hivescript)) kind of approach, i.e. basically running Hive queries on Spark without coding anything in Scala yet. Even so, just running the Hive queries on Spark shows a big difference in time compared with running them on MR.

So, since we already have the Hive scripts, should we just run those complex scripts using hc.sql, given that hc.sql is able to do it?

Or is this not best practice? Even though Spark can do it, is it still better to load all those individual Hive tables into Spark, make RDDs, and write Scala code to get the same functionality we have in Hive?

It is becoming difficult for us to choose between leaving it to hc.sql to run the complex scripts and coding them in Scala. Will the manual effort be worth it in terms of performance?

Example of our scripts:
use db;
create tempfunction1 as com.fgh.jkl.TestFunction;

create desttable in hive;
insert overwrite desttable select (big complex transformations and usage of hive udf)
from table1,table2,table3 join table4 on some condition complex and join table 7 on another complex condition where complex filtering
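One practical detail when handing a whole script like this to hc.sql: sql() takes a single statement as a String, while sc.textFile returns an RDD of lines, so the script has to be read as text and split into statements first. A rough sketch of that step is below. HiveScriptRunner, splitStatements, and runScript are hypothetical helper names, and the splitting is deliberately naive: it assumes no ';' appears inside string literals and that comments sit on their own lines starting with "--".

```scala
object HiveScriptRunner {
  // Splits a script into individual statements on ';'.
  // Naive by design: drops whole-line "--" comments and blank statements,
  // and does not handle ';' inside quoted strings.
  def splitStatements(script: String): Seq[String] =
    script
      .split("\n")
      .filterNot(_.trim.startsWith("--"))
      .mkString("\n")
      .split(";")
      .map(_.trim)
      .filter(_.nonEmpty)
      .toSeq

  // Runs each statement in order through the supplied function,
  // for example a wrapper around hc.sql.
  def runScript(script: String)(runSql: String => Unit): Unit =
    splitStatements(script).foreach(runSql)
}
```

With a HiveContext named hc, usage could look like HiveScriptRunner.runScript(scala.io.Source.fromFile(path).mkString) { stmt => hc.sql(stmt); () }, where path is the location of the script on the driver.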

So please help: what would be the best approach, and why should I not give the entire script to HiveContext to make its own RDDs and run on Spark, if we are able to do that?

All the examples I see online only show hc.sql("select * from table1") and nothing more complex than that.

Best Regards,
Ayan Guha