spark-user mailing list archives

From Philip Ogren <>
Subject Re: Writing an RDD to Hive
Date Wed, 11 Dec 2013 00:26:28 GMT
I uncovered a fairly simple solution that I thought I would share for
the curious.  Hive provides a JDBC driver/client which can be used to
execute Hive statements (in my case, to drop and create tables) from
Java/Scala code.  So, I execute a create table statement and then write
my RDD in tab-delimited form to the HDFS directory specified in the
create table statement.  It was really easy to code up after I
connected the dots (it seems obvious now!).  The only hiccup I ran into
was caused by trying to use the wrong Hive dependency.  In my case we
have a CDH4 cluster, so it worked once I added the CDH4 version of the
Hive JDBC dependency to my pom file.
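
For anyone who wants to see the moving parts, here is a rough sketch of
the approach.  Everything concrete in it is made up for illustration:
the host name, table name, schema, and HDFS paths are placeholders, and
it assumes a HiveServer1-style endpoint (the
org.apache.hadoop.hive.jdbc.HiveDriver class and jdbc:hive:// URL that
the CDH4-era Hive JDBC client exposes).  It also writes the RDD out
first and then creates an external table over the output directory,
which is the same idea in a slightly different order.

import java.sql.DriverManager
import org.apache.spark.SparkContext

object RddToHiveExample {
  def main(args: Array[String]): Unit = {
    // All names below are placeholders -- adjust to your cluster.
    // (A real job would clear or version outputDir between runs.)
    val outputDir = "/user/hive/external/log_summary"

    // 1) Transform the log with Spark and write it out tab-delimited,
    //    one row per line, into the directory the table will point at.
    val sc = new SparkContext("local[2]", "rdd-to-hive")
    sc.textFile("/logs/access.log")
      .map(_.split("\\s+"))
      .filter(_.length >= 2)
      .map(fields => fields(0) + "\t" + fields(1))
      .saveAsTextFile(outputDir)
    sc.stop()

    // 2) Use the Hive JDBC client to (re)create an external table over
    //    that directory.  Driver class and URL assume HiveServer1 on port 10000.
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive://hive-host:10000/default", "", "")
    val stmt = conn.createStatement()
    stmt.execute("DROP TABLE IF EXISTS log_summary")
    stmt.execute(
      "CREATE EXTERNAL TABLE log_summary (event_date STRING, url STRING) " +
      "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' " +
      "STORED AS TEXTFILE LOCATION '" + outputDir + "'")
    stmt.close()
    conn.close()
  }
}

Once this runs, the table can be queried right away from Hive (via Hue)
or Shark, since the data is already sitting in the table's location and
no separate load data step is needed.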


On 12/6/2013 6:06 PM, Philip Ogren wrote:
> I have a simple scenario that I'm struggling to implement.  I would 
> like to take a fairly simple RDD generated from a large log file, 
> perform some transformations on it, and write the results out such 
> that I can perform a Hive query either from Hive (via Hue) or Shark.  
> I'm having trouble with the last step.  I am able to write my data 
> out to HDFS and then execute a Hive create table statement followed by 
> a load data statement as a separate step.  I really dislike this 
> separate manual step and would like to be able to have it all 
> accomplished in my Spark application.  To this end, I have 
> investigated two possible approaches as detailed below - it's probably 
> too much information so I'll ask my more basic question first:
> Does anyone have a basic recipe/approach for loading data in an RDD to 
> a Hive table from a Spark application?
