spark-user mailing list archives

From Philip Ogren <philip.og...@oracle.com>
Subject Writing an RDD to Hive
Date Sat, 07 Dec 2013 01:06:00 GMT
I have a simple scenario that I'm struggling to implement.  I would like 
to take a fairly simple RDD generated from a large log file, perform 
some transformations on it, and write the results out such that I can 
perform a Hive query either from Hive (via Hue) or from Shark.  I'm 
having trouble with the last step.  I am able to write my data out to 
HDFS and
then execute a Hive create table statement followed by a load data 
statement as a separate step.  I really dislike this manual step and 
would like it all to be accomplished within my Spark application.  To 
this end, I have investigated two possible approaches, as detailed 
below - it's probably too much information, so I'll ask my more basic 
question first:

Does anyone have a basic recipe/approach for loading data in an RDD to a 
Hive table from a Spark application?
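
For concreteness, here is roughly what the manual version looks like 
today (the paths, table name, and schema below are stand-ins for the 
real ones):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    val sc = new SparkContext("spark://master:7077", "LogToHive")

    // Parse the log and write tab-delimited rows to HDFS.  (The real
    // transformations are more involved; this is a stand-in.)
    sc.textFile("hdfs:///logs/app.log")
      .map(_.split(" "))
      .map(fields => fields.take(2).mkString("\t"))
      .saveAsTextFile("hdfs:///warehouse/staging/log_events")

    // Then, as a separate manual step in Hive (via Hue):
    //   CREATE TABLE log_events (ts STRING, msg STRING)
    //     ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
    //   LOAD DATA INPATH '/warehouse/staging/log_events'
    //     INTO TABLE log_events;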

1) Load it into HBase via PairRDDFunctions.saveAsHadoopDataset. There is 
a nice detailed email on how to do this here 
<http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3CCACyZca3ASKwD-tuJHQi1805BN7ScTguAoRuHd5xTxCSUL1aNvQ@mail.gmail.com%3E>.
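
If I understand that thread correctly, the write itself would look 
something like the sketch below (the table name "log_events", the 
column family "f", and the shape of the RDD are all hypothetical):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapred.TableOutputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.hadoop.mapred.JobConf
    import org.apache.spark.SparkContext._

    val jobConf = new JobConf(HBaseConfiguration.create())
    jobConf.setOutputFormat(classOf[TableOutputFormat])
    jobConf.set(TableOutputFormat.OUTPUT_TABLE, "log_events")

    // keyedRdd is assumed to be an RDD[(String, String)] of
    // (row key, value) pairs produced by the earlier transformations.
    val puts = keyedRdd.map { case (rowKey, value) =>
      val put = new Put(Bytes.toBytes(rowKey))
      put.add(Bytes.toBytes("f"), Bytes.toBytes("msg"), Bytes.toBytes(value))
      (new ImmutableBytesWritable(Bytes.toBytes(rowKey)), put)
    }

    puts.saveAsHadoopDataset(jobConf)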

I didn't get very far, though, because as soon as I added an HBase 
dependency (corresponding to the version of HBase we are running) to my 
pom.xml file, I hit an slf4j dependency conflict that broke my current 
application.  I tried the latest released version of HBase and the 
slf4j conflict went away, but then the deprecated class 
TableOutputFormat no longer existed.  Even if loading the data into 
HBase were trivially easy (and the detailed email suggests otherwise), 
I would then need to query HBase from Hive, which seems a little clunky.

2) So, I decided that Shark might be an easier option.  All the examples 
provided in their documentation seem to assume that you are using Shark 
as an interactive application from a shell.  Various threads I've seen 
seem to indicate that Shark isn't really intended to be used as a 
dependency in your Spark code (see this 
<https://groups.google.com/forum/#%21topic/shark-users/DHhslaOGPLg/discussion> 
and that 
<https://groups.google.com/forum/#%21topic/shark-users/2_Ww1xlIgvo/discussion>).  
It follows, then, that one can't add a Shark dependency to a pom.xml 
file, because Shark isn't published to Maven Central (as far as I can 
tell... perhaps it's in some other repo?).  Of course, there are ways 
of declaring a local dependency in Maven, but it starts to feel very hacky.
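
For what it's worth, if Shark could be embedded, I would expect the 
code to look something like this, extrapolating from the shell 
examples (whether SharkEnv.initWithSharkContext and sql actually 
behave this way inside a standalone application is exactly what I 
don't know):

    import shark.{SharkContext, SharkEnv}

    // Hypothetical: drive Shark from application code rather than
    // from the interactive shell.
    val sc: SharkContext = SharkEnv.initWithSharkContext("LogToHive")

    sc.sql("CREATE TABLE IF NOT EXISTS log_events (ts STRING, msg STRING)")
    sc.sql("LOAD DATA INPATH '/warehouse/staging/log_events' " +
           "INTO TABLE log_events")

    // sql(...) returns result rows as strings, so a quick sanity check:
    sc.sql("SELECT COUNT(*) FROM log_events").foreach(println)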

I realize that I've given sufficient detail to expose my ignorance in a 
myriad of ways.  Please feel free to shine light on any of my 
misconceptions!

Thanks,
Philip

