spark-user mailing list archives

From Tathagata Das <>
Subject Re: How to make spark partition sticky, i.e. stay with node?
Date Fri, 23 Jan 2015 18:47:24 GMT
Hello mingyu,

That is a reasonable way of doing this. Spark Streaming does not
natively support sticky partitioning because Spark launches tasks
based on data locality. When a task has no locality preference (for
example, a reduce task can run anywhere), its location is assigned
randomly. A cogroup or join introduces a locality preference, which
forces the Spark scheduler to be sticky. Another way to achieve this
is "updateStateByKey", which internally uses cogroup but presents a
nicer streaming-style API for per-key stateful operations.
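As a rough illustration of the "updateStateByKey" approach, here is a
minimal Scala sketch, assuming an existing StreamingContext `ssc` and a
DStream[(String, Int)] named `events` (both names and the checkpoint
path are assumptions, not part of the original thread):

```scala
import org.apache.spark.streaming.StreamingContext._

// Stateful operations require checkpointing (path is an assumption).
ssc.checkpoint("hdfs:///tmp/checkpoints")

// For each key, fold the batch's new values into the running state.
// The state RDD stays partitioned by key, so successive batches for a
// key tend to be scheduled on the node holding that key's partition.
val updateFunc = (newValues: Seq[Int], state: Option[Int]) =>
  Some(state.getOrElse(0) + newValues.sum)

val runningCounts = events.updateStateByKey[Int](updateFunc)
runningCounts.print()
```

The update function returns Option so that returning None drops a
key's state entirely.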


On Fri, Jan 23, 2015 at 8:23 AM, mingyu <> wrote:
> I found a workaround.
> I can make my auxiliary data an RDD, partition it, and cache it.
> Later, I can cogroup it with other RDDs, and Spark will try to keep the
> cached RDD partitions where they are rather than shuffle them.
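The workaround above can be sketched as follows in Scala, assuming an
existing SparkContext `sc` and a DStream[(String, String)] named
`stream` (names and the input path are illustrative assumptions):

```scala
import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(8)

// Partition the auxiliary data once and cache it; afterwards the
// scheduler prefers the nodes that hold the cached partitions.
val auxiliary = sc.textFile("hdfs:///aux/data")  // path is an assumption
  .map(line => (line.split(",")(0), line))
  .partitionBy(partitioner)
  .cache()

// Cogrouping with the same partitioner gives a narrow dependency on
// the cached RDD, so it is not shuffled and the joined tasks run
// where the cached partitions already live.
stream.transform { batch =>
  batch.partitionBy(partitioner).cogroup(auxiliary)
}
```

Using the same partitioner on both sides is what makes the cogroup
shuffle-free for the cached side.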

