spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Manish Malhotra <manish.malhotra.w...@gmail.com>
Subject Re: Spark Streaming: question on sticky session across batches ?
Date Tue, 15 Nov 2016 21:42:37 GMT
Thanks!
On Tue, Nov 15, 2016 at 1:07 AM Takeshi Yamamuro <linguin.m.s@gmail.com>
wrote:

> - dev
>
> Hi,
>
> AFAIK, if you use RDDs only, you can control the partition mapping to some
> extent
> by using a partition key RDD[(key, data)].
> A defined partitioner distributes data into partitions depending on the
> key.
> As a good example to control partitions, you can see the GraphX code;
>
> https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L291
>
> GraphX holds `PartitionId` in edge RDDs to control the partition where
> edge data are.
>
> // maropu
>
>
> On Tue, Nov 15, 2016 at 5:19 AM, Manish Malhotra <
> manish.malhotra.work@gmail.com> wrote:
>
> sending again.
> any help is appreciated !
>
> thanks in advance.
>
> On Thu, Nov 10, 2016 at 8:42 AM, Manish Malhotra <
> manish.malhotra.work@gmail.com> wrote:
>
> Hello Spark Devs/Users,
>
> Im trying to solve the use case with Spark Streaming 1.6.2 where for every
> batch ( say 2 mins) data needs to go to the same reducer node after
> grouping by key.
> The underlying storage is Cassandra and not HDFS.
>
> This is a map-reduce job, where also trying to use the partitions of the
> Cassandra table to batch the data for the same partition.
>
> The requirement of sticky session/partition across batches is because the
> operations which we need to do, needs to read data for every key and then
> merge this with the current batch aggregate values. So, currently when
> there is no stickyness across batches, we have to read for every key, merge
> and then write back. and reads are very expensive. So, if we have sticky
> session, we can avoid read in every batch and have a cache of till last
> batch aggregates across batches.
>
> So, there are few options, can think of:
>
> 1. to change the TaskSchedulerImpl, as its using Random to identify the
> node for mapper/reducer before starting the batch/phase.
> Not sure if there is a custom scheduler way of achieving it?
>
> 2. Can custom RDD can help to find the node for the key-->node.
> there is a getPreferredLocation() method.
> But not sure, whether this will be persistent or can vary for some edge
> cases?
>
> Thanks in advance for you help and time !
>
> Regards,
> Manish
>
>
>
>
>
> --
> ---
> Takeshi Yamamuro
>

Mime
View raw message