spark-user mailing list archives

From Philip Ogren <philip.og...@oracle.com>
Subject Re: Writing an RDD to Hive
Date Wed, 11 Dec 2013 00:26:28 GMT
I uncovered a fairly simple solution that I thought I would share for
the curious.  Hive provides a JDBC driver/client
<https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-JDBC>
which can be used to execute Hive statements (in my case, to drop and
create tables) from Java/Scala code.  So, I execute a create table
statement and then write my RDD in tab-delimited form to the HDFS
directory specified in the create table statement.  It was really easy
to code up once I connected the dots (it seems obvious now!).  The only
hiccup I ran into was caused by trying to use the wrong Hive
dependency.  In my case we have a CDH4 cluster, so it worked once I
added the following to my pom file:

         <repository>
             <id>cloudera</id>
             <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
         </repository>
...
         <dependency>
             <groupId>org.apache.hive</groupId>
             <artifactId>hive-jdbc</artifactId>
             <version>0.10.0-cdh4.3.2</version>
         </dependency>


On 12/6/2013 6:06 PM, Philip Ogren wrote:
> I have a simple scenario that I'm struggling to implement.  I would 
> like to take a fairly simple RDD generated from a large log file, 
> perform some transformations on it, and write the results out such 
> that I can perform a Hive query either from Hive (via Hue) or Shark.  
> I'm having trouble with the last step.  I am able to write my data 
> out to HDFS and then execute a Hive create table statement followed by 
> a load data statement as a separate step.  I really dislike this 
> separate manual step and would like to be able to have it all 
> accomplished in my Spark application.  To this end, I have 
> investigated two possible approaches as detailed below - it's probably 
> too much information so I'll ask my more basic question first:
>
> Does anyone have a basic recipe/approach for loading data in an RDD to 
> a Hive table from a Spark application?

