spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: Spark streaming and executor object reusage
Date Sat, 07 Mar 2015 12:45:00 GMT
In the example with "createNewConnection()", a connection is created
for every partition of every batch of input. You could take the idea
further and share connections across partitions or batches. This
requires them to have a lifecycle beyond foreachRDD. That's
achievable with some kind of static / singleton connection,
presumably a connection pool.
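A minimal sketch of that idea (in Python rather than the guide's Scala; the `ConnectionPool` class and its contents are illustrative placeholders, not any real Spark or pooling API): the pool is created lazily on first use, and every later partition or batch processed in the same process reuses the same instance.

```python
# Hedged sketch, not Spark API: a process-level singleton connection pool,
# standing in for the static per-JVM pool each executor would hold.

class ConnectionPool:
    _instance = None
    created = 0  # counts how many times the expensive setup actually ran

    def __init__(self):
        ConnectionPool.created += 1
        self.connections = ["conn-1", "conn-2"]  # stand-ins for real sockets

    @classmethod
    def get(cls):
        # Lazily build the pool once per process (i.e. once per executor JVM);
        # every subsequent call returns the same instance.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

def send_partition(records):
    # Runs once per partition. Because the pool lives outside foreachRDD,
    # it is reused across partitions and across batches on this executor.
    pool = ConnectionPool.get()
    for record in records:
        pass  # e.g. pool.connections[0].send(record) in a real job

# Two partitions (possibly from different batches) share one pool:
send_partition(range(3))
send_partition(range(3))
```

The expensive setup runs once no matter how many partitions the executor processes, which is the whole point of the pattern.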

The pool would be per JVM, which means per executor. Although you're
not guaranteed that the same executor will process many partitions of
an RDD, or a number of batches over time, in practice both are true.
So a pool can effectively be shared across partitions and batches.

Spark has no way to police, and therefore can't and doesn't reset, any
state that you happen to create and use in your code.

An executor is per application, though, so it would not be shared with
another streaming job, no.

On Sat, Mar 7, 2015 at 1:32 AM, Jean-Pascal Billaud <jp@tellapart.com> wrote:
> Hi,
>
> Reading through the Spark Streaming Programming Guide, I read in the "Design
> Patterns for using foreachRDD":
>
> "Finally, this can be further optimized by reusing connection objects across
> multiple RDDs/batches.
> One can maintain a static pool of connection objects that can be reused as
> RDDs of multiple batches are pushed to the external system"
>
> I have this connection pool that might be more or less heavy to instantiate.
> I don't use it as part of a foreachRDD but as part of regular map operations
> to query some api service. I'd like to understand what "multiple batches"
> means here. Is this across RDDs on a single DStream? Across multiple
> DStreams?
>
> I'd like to understand the shareability of context across DStreams over
> time. Is it expected that the executor initializing my factory will keep
> getting batches from my streaming job while using the same singleton
> connection pool over and over? Or does Spark reset executor state after
> each DStream completes, potentially allocating executors to other
> streaming jobs?
>
> Thanks,


