spark-user mailing list archives

From Tathagata Das <t...@databricks.com>
Subject Re: foreachRDD vs. forearchPartition ?
Date Wed, 08 Jul 2015 23:23:22 GMT
This is also discussed in the programming guide.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
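
As a rough illustration of that pattern, here is a minimal sketch in plain Python rather than Spark. It only simulates partitions as lists, and FakeConnection, for_each_element, for_each_partition and count_connections are hypothetical names made up for this example. It shows why per-partition processing helps when the resource is expensive to set up: one connection per partition instead of one per record.

```python
# Plain-Python stand-in for the pattern; no Spark is used here.
# All names below are hypothetical and exist only for this illustration.

class FakeConnection:
    """Stands in for an expensive resource such as a Solr client or a socket."""
    opened = 0  # class-level counter of how many connections were created

    def __init__(self):
        FakeConnection.opened += 1
        self.sent = []

    def send(self, record):
        self.sent.append(record)

    def close(self):
        pass


def for_each_element(partitions):
    """Analogue of rdd.foreach: a new connection for every record."""
    for partition in partitions:
        for record in partition:
            conn = FakeConnection()
            conn.send(record)
            conn.close()


def for_each_partition(partitions):
    """Analogue of rdd.foreachPartition: one connection shared per partition."""
    for partition in partitions:
        conn = FakeConnection()
        for record in partition:
            conn.send(record)
        conn.close()


def count_connections(fn, partitions):
    """Reset the counter, run fn, and report how many connections it opened."""
    FakeConnection.opened = 0
    fn(partitions)
    return FakeConnection.opened


partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # 3 partitions, 9 records
per_element = count_connections(for_each_element, partitions)
per_partition = count_connections(for_each_partition, partitions)
print(per_element, per_partition)  # prints "9 3"
```

In real Spark Streaming code the analogous structure is dstream.foreachRDD with rdd.foreachPartition called inside it, so that the connection is created on the worker once per partition. The degree of parallelism is the same either way; only the setup cost changes.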

On Wed, Jul 8, 2015 at 8:25 AM, Dmitry Goldenberg <dgoldenberg123@gmail.com> wrote:

> Thanks, Sean.
>
> "are you asking about foreach vs foreachPartition? that's quite
> different. foreachPartition does not give more parallelism but lets
> you operate on a whole batch of data at once, which is nice if you
> need to allocate some expensive resource to do the processing"
>
> This is basically what I was looking for.
>
>
> On Wed, Jul 8, 2015 at 11:15 AM, Sean Owen <sowen@cloudera.com> wrote:
>
>> @Evo There is no foreachRDD operation on RDDs; it is a method of
>> DStream. It gives each RDD in the stream. RDD has a foreach, and
>> foreachPartition. These give elements of an RDD. What do you mean it
>> 'works' to call foreachRDD on an RDD?
>>
>> @Dmitry are you asking about foreach vs foreachPartition? that's quite
>> different. foreachPartition does not give more parallelism but lets
>> you operate on a whole batch of data at once, which is nice if you
>> need to allocate some expensive resource to do the processing.
>>
>> On Wed, Jul 8, 2015 at 3:18 PM, Dmitry Goldenberg
>> <dgoldenberg123@gmail.com> wrote:
>> > "These are quite different operations. One operates on RDDs in DStream
>> > and one operates on partitions of an RDD. They are not alternatives."
>> >
>> > Sean, different operations as they are, they can certainly be used on the
>> > same data set.  In that sense, they are alternatives. Code can be written
>> > using one or the other which reaches the same effect - likely at a different
>> > efficiency cost.
>> >
>> > The question is, what are the effects of applying one vs. the other?
>> >
>> > My specific scenario is, I'm streaming data out of Kafka.  I want to perform
>> > a few transformations then apply an action which results in e.g. writing
>> > this data to Solr.  According to Evo, my best bet is foreachPartition
>> > because of increased parallelism (which I'd need to grok to understand the
>> > details of what that means).
>> >
>> > Another scenario is, I've done a few transformations and send a result
>> > somewhere, e.g. I write a message into a socket.  Let's say I have one
>> > socket per client of my streaming app and I get a host:port of that socket
>> > as part of the message and want to send the response via that socket. Is
>> > foreachPartition still a better choice?
>> >
>> > On Wed, Jul 8, 2015 at 9:51 AM, Sean Owen <sowen@cloudera.com> wrote:
>> >>
>> >> These are quite different operations. One operates on RDDs in DStream
>> >> and one operates on partitions of an RDD. They are not alternatives.
>> >>
>> >>
>> >> On Wed, Jul 8, 2015, 2:43 PM dgoldenberg <dgoldenberg123@gmail.com> wrote:
>> >>>
>> >>> Is there a set of best practices for when to use foreachPartition vs.
>> >>> foreachRDD?
>> >>>
>> >>> Is it generally true that using foreachPartition avoids some of the
>> >>> over-network data shuffling overhead?
>> >>>
>> >>> When would I definitely want to use one method vs. the other?
>> >>>
>> >>> Thanks.
>> >
>>
>
>
