spark-user mailing list archives

From Nicholas Chammas <nicholas.cham...@gmail.com>
Subject best practices for pushing an RDD into a database
Date Fri, 14 Mar 2014 02:57:06 GMT
Thank you for the suggestions. I will look into both and report back.

I'm looking at a potential third option: Redshift's ability to COPY
from SSH:

http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html

Is there some relatively straightforward way a command sent via SSH to a
worker node can yield all the data in the partition of an RDD that is
resident on that node? (Sounds unlikely.)
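
If not, a speculative workaround might be for each partition to spill
itself to a known local file on its worker first, so that the command in
the COPY-from-SSH manifest could simply cat that file. A rough, untested
sketch (paths are made up; rows are assumed to format themselves as CSV
lines):

    rdd.mapPartitionsWithIndex { (i, rows) =>
      new java.io.File("/tmp/rdd-export").mkdirs()
      val out = new java.io.PrintWriter(s"/tmp/rdd-export/part-$i.csv")
      try rows.foreach(r => out.println(r)) finally out.close()
      Iterator(s"/tmp/rdd-export/part-$i.csv")
    }.collect()  // forces the writes and returns the file paths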

Nick


On Thursday, March 13, 2014, Sandy Ryza <sandy.ryza@cloudera.com> wrote:

> You can also call rdd.saveAsHadoopDataset and use the DBOutputFormat that
> Hadoop provides:
>
> http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapred/lib/db/DBOutputFormat.html
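>
> For example, roughly (untested; the table, columns, connection details,
> and the assumed RDD element type with `id` and `name` fields are all
> made up):
>
>   import java.sql.{PreparedStatement, ResultSet}
>   import org.apache.hadoop.io.NullWritable
>   import org.apache.hadoop.mapred.JobConf
>   import org.apache.hadoop.mapred.lib.db.{DBConfiguration, DBOutputFormat, DBWritable}
>
>   // Hadoop instantiates this reflectively, hence the no-arg constructor;
>   // DBOutputFormat writes the *key* of each pair via DBWritable.
>   class UserRecord(var id: Int, var name: String) extends DBWritable {
>     def this() = this(0, "")
>     override def write(st: PreparedStatement): Unit = {
>       st.setInt(1, id); st.setString(2, name)
>     }
>     override def readFields(rs: ResultSet): Unit = {
>       id = rs.getInt(1); name = rs.getString(2)
>     }
>   }
>
>   val conf = new JobConf()
>   DBConfiguration.configureDB(conf, "org.postgresql.Driver",
>     "jdbc:postgresql://host:5439/mydb", "user", "password")
>   DBOutputFormat.setOutput(conf, "users", "id", "name")  // also sets the output format
>   conf.setOutputKeyClass(classOf[UserRecord])
>   conf.setOutputValueClass(classOf[NullWritable])
>
>   rdd.map(u => (new UserRecord(u.id, u.name), NullWritable.get()))
>      .saveAsHadoopDataset(conf)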
>
>
> On Thu, Mar 13, 2014 at 4:17 PM, Patrick Wendell <pwendell@gmail.com> wrote:
>
>> Hey Nicholas,
>>
>> The best way to do this is to use rdd.mapPartitions() and pass a
>> function that opens a JDBC connection to your database and writes
>> out the records in each partition.
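>>
>> Roughly like this (untested; schema and connection details are made
>> up, and foreachPartition is just the action counterpart of
>> mapPartitions for side-effecting writes):
>>
>>   import java.sql.DriverManager
>>
>>   rdd.foreachPartition { rows =>
>>     // One connection per partition, not per record.
>>     val conn = DriverManager.getConnection(
>>       "jdbc:postgresql://host:5439/mydb", "user", "password")
>>     val st = conn.prepareStatement(
>>       "INSERT INTO users (id, name) VALUES (?, ?)")
>>     try {
>>       rows.foreach { u =>
>>         st.setInt(1, u.id)
>>         st.setString(2, u.name)
>>         st.addBatch()
>>       }
>>       st.executeBatch()
>>     } finally {
>>       st.close()
>>       conn.close()
>>     }
>>   }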
>>
>> On the input path there is something called JdbcRDD that is relevant:
>>
>> http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.JdbcRDD
>>
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala#L73
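>>
>> Usage is roughly (note the two '?' placeholders JdbcRDD uses to bound
>> each partition's query; connection details here are made up):
>>
>>   import java.sql.DriverManager
>>   import org.apache.spark.rdd.JdbcRDD
>>
>>   val users = new JdbcRDD(
>>     sc,
>>     () => DriverManager.getConnection(
>>       "jdbc:postgresql://host:5439/mydb", "user", "password"),
>>     "SELECT id, name FROM users WHERE id >= ? AND id <= ?",
>>     1L, 1000000L, 10,  // lower bound, upper bound, partitions
>>     rs => (rs.getInt(1), rs.getString(2)))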
>>
>> - Patrick
>>
>> On Thu, Mar 13, 2014 at 2:05 PM, Nicholas Chammas
>> <nicholas.chammas@gmail.com> wrote:
>> > My fellow welders,
>> >
>> > (Can we make that a thing? Let's make that a thing. :)
>> >
>> > I'm trying to wedge Spark into an existing model where we process and
>> > transform some data and then load it into an MPP database. I know that
>> > part of the sell of Spark and Shark is that you shouldn't have to copy
>> > data around like this, so please bear with me. :)
>> >
>> > Say I have an RDD of about 10GB in size that's cached in memory. What
>> > is the best/fastest way to push that data into an MPP database like
>> > Redshift? Has anyone done something like this?
>> >
>> > I'm assuming that pushing the data straight from memory into the
>> > database is much faster than writing the RDD to HDFS and then COPY-ing
>> > it from there into the database.
>> >
>> > Is there, for example, a way to perform a bulk load into the database
>> > that runs on each partition of the in-memory RDD in parallel?
>> >
>> > Nick
>> >
>> >
>>
>
>
