You can also call rdd.saveAsHadoopDataset and use the DBOutputFormat that Hadoop provides:
http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapred/lib/db/DBOutputFormat.html
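Roughly, something along these lines should work (untested sketch; the driver class, connection string, and table/column names are placeholders for your own setup):

import java.sql.{PreparedStatement, ResultSet}

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.mapred.lib.db.{DBConfiguration, DBOutputFormat, DBWritable}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

// Placeholder record type matching a hypothetical "users" table.
class UserRecord(var id: Int = 0, var name: String = "")
    extends DBWritable with Serializable {
  override def write(stmt: PreparedStatement): Unit = {
    stmt.setInt(1, id)
    stmt.setString(2, name)
  }
  override def readFields(rs: ResultSet): Unit = {
    id = rs.getInt(1)
    name = rs.getString(2)
  }
}

def saveToDatabase(sc: SparkContext): Unit = {
  val conf = new JobConf()
  DBConfiguration.configureDB(conf, "org.postgresql.Driver",
    "jdbc:postgresql://dbhost:5432/mydb", "user", "password")
  DBOutputFormat.setOutput(conf, "users", "id", "name")
  conf.setOutputFormat(classOf[DBOutputFormat[UserRecord, NullWritable]])

  val records = sc.parallelize(Seq(new UserRecord(1, "a"), new UserRecord(2, "b")))
  // DBOutputFormat writes the key to the table and ignores the value.
  records.map(r => (r, NullWritable.get())).saveAsHadoopDataset(conf)
}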


On Thu, Mar 13, 2014 at 4:17 PM, Patrick Wendell <pwendell@gmail.com> wrote:
Hey Nicholas,

The best way to do this is to use rdd.mapPartitions() and pass it a
function that opens a JDBC connection to your database and writes
out the records in each partition.
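
In code that looks roughly like this (untested sketch; foreachPartition is the action variant of mapPartitions, and the connection string and table are placeholders):

import java.sql.DriverManager

import org.apache.spark.rdd.RDD

// Placeholder record type; adjust the INSERT statement to your schema.
case class Record(id: Int, name: String)

def writeToDatabase(rdd: RDD[Record]): Unit = {
  rdd.foreachPartition { records =>
    // One connection per partition, opened on the executor that holds it.
    val conn = DriverManager.getConnection(
      "jdbc:postgresql://dbhost:5432/mydb", "user", "password")
    val stmt = conn.prepareStatement("INSERT INTO users (id, name) VALUES (?, ?)")
    try {
      records.foreach { r =>
        stmt.setInt(1, r.id)
        stmt.setString(2, r.name)
        stmt.addBatch()
      }
      stmt.executeBatch()
    } finally {
      stmt.close()
      conn.close()
    }
  }
}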

On the input side there is something called JdbcRDD that is relevant:
http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.JdbcRDD
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala#L73
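
Usage looks roughly like this (untested sketch with placeholder connection details; the query needs two '?' bound parameters, which JdbcRDD fills in per partition):

import java.sql.{DriverManager, ResultSet}

import org.apache.spark.SparkContext
import org.apache.spark.rdd.JdbcRDD

def readFromDatabase(sc: SparkContext): JdbcRDD[(Int, String)] = {
  new JdbcRDD(
    sc,
    () => DriverManager.getConnection(
      "jdbc:postgresql://dbhost:5432/mydb", "user", "password"),
    // The two '?' placeholders are filled with each partition's key range.
    "SELECT id, name FROM users WHERE id >= ? AND id <= ?",
    1, 1000000, 10,
    (rs: ResultSet) => (rs.getInt("id"), rs.getString("name")))
}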

- Patrick

On Thu, Mar 13, 2014 at 2:05 PM, Nicholas Chammas
<nicholas.chammas@gmail.com> wrote:
> My fellow welders,
>
> (Can we make that a thing? Let's make that a thing. :)
>
> I'm trying to wedge Spark into an existing model where we process and
> transform some data and then load it into an MPP database. I know that part
> of the sell of Spark and Shark is that you shouldn't have to copy data
> around like this, so please bear with me. :)
>
> Say I have an RDD of about 10GB in size that's cached in memory. What is the
> best/fastest way to push that data into an MPP database like Redshift? Has
> anyone done something like this?
>
> I'm assuming that pushing the data straight from memory into the database is
> much faster than writing the RDD to HDFS and then COPY-ing it from there
> into the database.
>
> Is there, for example, a way to perform a bulk load into the database that
> runs on each partition of the in-memory RDD in parallel?
>
> Nick
>
>