spark-user mailing list archives

From Zsolt Tóth <toth.zsolt....@gmail.com>
Subject Map and MapPartitions with partition-local variable
Date Thu, 17 Nov 2016 20:57:37 GMT
Any comment on this one?

On Nov 16, 2016, at 12:59 PM, "Zsolt Tóth" <toth.zsolt.bme@gmail.com> wrote:

> Hi,
>
> I need to run a map() and then a mapPartitions() on my input DataFrame. As a
> side effect of the map(), a partition-local variable should be updated,
> and that variable is then used in the subsequent mapPartitions().
> I can't use a broadcast variable, because it is shared between all
> partitions on the same executor.
>
> Where can I define this variable?
> I could run a single mapPartitions() that defines the variable, iterates
> over the input (just as the map() would), collects the results into an
> ArrayList, and then uses the list's iterator (together with the updated
> partition-local variable) as the input of the transformation that the
> original mapPartitions() performed.
>
> However, this feels less efficient than running map() followed by
> mapPartitions(), because I need to hold the ArrayList (which is
> basically all the data in the partition) in memory.
>
> Thanks,
> Zsolt
>
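
The single-mapPartitions() idea described in the quoted message can be sketched in plain Java, without any Spark dependency. This is only an illustration under assumptions: the class name PartitionLocalSketch, the LocalState holder, and the toy "map" logic (upper-casing rows and summing their lengths) are all hypothetical and not from the original post. Each method has the iterator-in, iterator-out shape that a mapPartitions() function takes; wiring it into Spark is left out because the exact functional interface depends on the API and version in use. processBuffered() follows the approach from the message and buffers the partition so the second step sees the final state. processLazy() is an alternative that avoids the buffer, but it only applies if the second step can work with the state accumulated so far rather than the final value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class PartitionLocalSketch {

    // Hypothetical partition-local state: a running sum of row lengths.
    static final class LocalState {
        long totalLength = 0;
    }

    // Buffered variant: the "map" step runs over the whole partition first,
    // so the second step sees the FINAL state. Holds the partition in memory.
    static Iterator<String> processBuffered(Iterator<String> partition) {
        LocalState state = new LocalState();      // one instance per call, i.e. per partition
        List<String> mapped = new ArrayList<>();
        while (partition.hasNext()) {
            String row = partition.next();
            state.totalLength += row.length();    // side effect of the "map" step
            mapped.add(row.toUpperCase());        // result of the "map" step
        }
        // Second step: uses the mapped rows together with the final state.
        List<String> out = new ArrayList<>();
        for (String row : mapped) {
            out.add(row + ":" + state.totalLength);
        }
        return out.iterator();
    }

    // Lazy variant: no buffering, but each output row only sees the state
    // accumulated up to and including that row.
    static Iterator<String> processLazy(Iterator<String> partition) {
        LocalState state = new LocalState();
        return new Iterator<String>() {
            @Override
            public boolean hasNext() {
                return partition.hasNext();
            }

            @Override
            public String next() {
                String row = partition.next();
                state.totalLength += row.length();
                return row.toUpperCase() + ":" + state.totalLength;
            }
        };
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a", "bb", "ccc");
        processBuffered(rows.iterator()).forEachRemaining(System.out::println);
        processLazy(rows.iterator()).forEachRemaining(System.out::println);
    }
}
```

Because the state object is created inside the method, every invocation (one per partition) gets its own copy, which is the property a broadcast variable cannot provide. Whether the lazy variant is usable depends on whether the second transformation needs the fully updated variable before it emits its first row.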
