spark-user mailing list archives

From Nicholas Chammas <>
Subject best practices for pushing an RDD into a database
Date Thu, 13 Mar 2014 21:05:13 GMT
My fellow welders <>,

(Can we make that a thing? Let's make that a thing. :)

I'm trying to wedge Spark into an existing model where we process and
transform some data and then load it into an MPP database. I know that part
of the sell of Spark and Shark is that you shouldn't have to copy data
around like this, so please bear with me. :)

Say I have an RDD of about 10GB in size that's cached in memory. What is
the best/fastest way to push that data into an MPP database?
Has anyone done something like this?

I'm assuming that pushing the data straight from memory into the database
is much faster than writing the RDD to HDFS and then COPY-ing it from there
into the database.

Is there, for example, a way to perform a bulk load into the database that
runs on each partition of the in-memory RDD in parallel?
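In PySpark terms, the pattern I have in mind is roughly rdd.foreachPartition(save_partition), where each partition opens its own connection and bulk-loads its rows. A minimal sketch of that idea, with plain Python lists standing in for the RDD's partitions and sqlite3 standing in for the MPP database's bulk-load API (both are stand-ins, not the actual target):

```python
import sqlite3

# Stand-in for an RDD: a list of partitions, each an iterable of rows.
partitions = [
    [(1, "a"), (2, "b")],
    [(3, "c"), (4, "d"), (5, "e")],
]

DB_PATH = "demo.db"  # on a real cluster, each executor would connect to the MPP database


def save_partition(rows):
    """Open one connection per partition and batch-insert its rows.

    In Spark this function would be passed to rdd.foreachPartition, so it
    runs once per partition, in parallel across the executors.
    """
    conn = sqlite3.connect(DB_PATH)
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER, payload TEXT)")
        # executemany batches the inserts; a real MPP database would expose
        # a COPY-style bulk-load API that should be faster still.
        conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
        conn.commit()
    finally:
        conn.close()


# Spark equivalent (not run here): rdd.foreachPartition(save_partition)
for part in partitions:
    save_partition(part)
```

The key point is that the connection is opened inside the per-partition function, not on the driver, so nothing unserializable ships across the cluster and the loads run concurrently.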
