The best way to do this is to use rdd.mapPartitions() and pass a
function that opens a JDBC connection to your database and writes out
the rows in its partition.
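To make the per-partition write pattern concrete, here is a minimal Python sketch. The function name, table, and columns are illustrative, and the standard-library sqlite3 module stands in for a real JDBC / MPP connection; in Spark you would hand the function to rdd.foreachPartition so each partition opens exactly one connection rather than one per row.

```python
import sqlite3

DB_PATH = "warehouse.db"  # stand-in for the real database DSN (illustrative)


def write_partition(rows):
    """Write one partition's rows over a single connection.

    `rows` is the iterator Spark hands to each partition. Opening the
    connection inside this function is the point of the pattern: one
    connection per partition, opened on the executor, not on the driver.
    """
    conn = sqlite3.connect(DB_PATH)
    try:
        # Batch the whole partition through one statement.
        conn.executemany(
            "INSERT INTO events (id, payload) VALUES (?, ?)", rows
        )
        conn.commit()
    finally:
        conn.close()


# In Spark (not run here): rdd.foreachPartition(write_partition)
```

Calling it directly with an iterator of tuples shows the same behavior Spark would trigger once per partition.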
On the input path there is something called JdbcRDD
(org.apache.spark.rdd.JdbcRDD) that is relevant.
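The core idea behind JdbcRDD on the read side is range partitioning: a numeric key range is split into numPartitions contiguous sub-ranges, and each partition runs its own bounded query over its own connection. A rough Python sketch of that splitting (function names are illustrative, not Spark's API, and sqlite3 again stands in for JDBC):

```python
import sqlite3


def range_partitions(lower, upper, num_partitions):
    """Split the inclusive key range [lower, upper] into contiguous
    (lo, hi) bounds, one pair per partition, roughly the way JdbcRDD
    derives per-partition query bounds from lowerBound/upperBound."""
    length = upper - lower + 1
    bounds = []
    for i in range(num_partitions):
        lo = lower + (i * length) // num_partitions
        hi = lower + ((i + 1) * length) // num_partitions - 1
        bounds.append((lo, hi))
    return bounds


def read_partition(db_path, lo, hi):
    """Each partition opens its own connection and reads only its range."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(
            "SELECT id, payload FROM events WHERE id BETWEEN ? AND ?",
            (lo, hi),
        ).fetchall()
    finally:
        conn.close()
```

In Scala this corresponds roughly to constructing a JdbcRDD with a parameterized query plus lowerBound, upperBound, and numPartitions, which Spark then evaluates one bounded query per partition.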
On Thu, Mar 13, 2014 at 2:05 PM, Nicholas Chammas wrote:
> My fellow welders,
> (Can we make that a thing? Let's make that a thing. :)
> I'm trying to wedge Spark into an existing model where we process and
> transform some data and then load it into an MPP database. I know that part
> of the sell of Spark and Shark is that you shouldn't have to copy data
> around like this, so please bear with me. :)
> Say I have an RDD of about 10GB in size that's cached in memory. What is the
> best/fastest way to push that data into an MPP database like Redshift? Has
> anyone done something like this?
> I'm assuming that pushing the data straight from memory into the database is
> much faster than writing the RDD to HDFS and then COPY-ing it from there
> into the database.
> Is there, for example, a way to perform a bulk load into the database that
> runs on each partition of the in-memory RDD in parallel?
> Sent from the Apache Spark User List mailing list archive at Nabble.com.