spark-user mailing list archives

From Nicholas Chammas <nicholas.cham...@gmail.com>
Subject best practices for pushing an RDD into a database
Date Thu, 13 Mar 2014 21:05:13 GMT
My fellow welders <https://www.google.com/search?q=welding+sparks&tbm=isch>,

(Can we make that a thing? Let's make that a thing. :)

I'm trying to wedge Spark into an existing model where we process and
transform some data and then load it into an MPP database. I know that part
of the sell of Spark and Shark is that you shouldn't have to copy data
around like this, so please bear with me. :)

Say I have an RDD of about 10GB in size that's cached in memory. What is
the best/fastest way to push that data into an MPP database like
Redshift <http://aws.amazon.com/redshift/>?
Has anyone done something like this?

I'm assuming that pushing the data straight from memory into the database
is much faster than writing the RDD to HDFS and then COPY-ing it from there
into the database.

Is there, for example, a way to perform a bulk load into the database that
runs on each partition of the in-memory RDD in parallel?
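
To make the question concrete, here is a rough sketch of the shape I have in mind: a function that receives one partition's rows, opens its own database connection, and writes the rows in batches. Everything named here is an assumption for illustration only: the `events` table, its columns, the batch size, and the use of `sqlite3` as a stand-in for a real database driver. In PySpark the loader would presumably be handed to `rdd.foreachPartition(...)`; below, plain Python lists play the role of partitions so the sketch runs without a cluster. Row-by-row inserts may well not be the fastest path into an MPP database, so this only illustrates the one-connection-per-partition pattern, not a recommended Redshift load.

```python
import os
import sqlite3
import tempfile


def make_partition_loader(db_path, batch_size=1000):
    """Return a function that loads one partition's rows into the database.

    db_path and batch_size are hypothetical parameters; sqlite3 stands in
    for whatever driver the target database actually needs.
    """
    def load_partition(rows):
        # One connection per partition, opened where the partition is processed.
        conn = sqlite3.connect(db_path)
        try:
            batch = []
            for row in rows:
                batch.append(row)
                if len(batch) >= batch_size:
                    conn.executemany(
                        "INSERT INTO events (id, payload) VALUES (?, ?)", batch
                    )
                    batch = []
            if batch:
                # Flush the final, possibly short, batch.
                conn.executemany(
                    "INSERT INTO events (id, payload) VALUES (?, ?)", batch
                )
            conn.commit()
        finally:
            conn.close()

    return load_partition


def demo():
    """Load three stand-in 'partitions' and return the resulting row count."""
    partitions = [
        [(1, "a"), (2, "b")],
        [(3, "c")],
        [(4, "d"), (5, "e")],
    ]
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "demo.db")

        # Create the (hypothetical) target table once, up front.
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
        conn.commit()
        conn.close()

        loader = make_partition_loader(db_path, batch_size=2)
        # With Spark this loop would become: rdd.foreachPartition(loader)
        for part in partitions:
            loader(iter(part))

        conn = sqlite3.connect(db_path)
        count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
        conn.close()
        return count


if __name__ == "__main__":
    print(demo())
```

The point of building the loader through `make_partition_loader` is that the connection is created inside the per-partition function rather than on the driver, since a live database connection generally cannot be shipped to workers.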

Nick




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/best-practices-for-pushing-an-RDD-into-a-database-tp2681.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.