spark-user mailing list archives

From Zsolt Tóth <toth.zsolt....@gmail.com>
Subject Map and MapPartitions with partition-local variable
Date Wed, 16 Nov 2016 11:59:09 GMT
Hi,

I need to run a map() and then a mapPartitions() on my input DataFrame. As a
side effect of the map(), a partition-local variable should be updated, and
that variable is then used in the mapPartitions().
I can't use a broadcast variable, because it is shared between partitions on
the same executor.

Where can I define this variable?
I could run a single mapPartitions() that defines the variable, iterates
over the input (just as the map() would), collects the results into an
ArrayList, and then uses the list's iterator (and the updated
partition-local variable) as the input to the transformation that the
original mapPartitions() performed.
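
To make the single-mapPartitions() idea concrete, here is a minimal sketch in
plain Java with no Spark dependency. processPartition is a hypothetical
stand-in for the function passed to mapPartitions(); the doubling step and the
running sum are made-up placeholders for the real map() logic and the real
partition-local variable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class BufferedPartitionSketch {

    // Hypothetical stand-in for the body passed to mapPartitions():
    // it receives one partition's rows as an Iterator and returns an Iterator.
    static Iterator<String> processPartition(Iterator<Integer> rows) {
        // Partition-local variable: created once per call, i.e. once per partition.
        long runningSum = 0L;

        // Stage 1 (what map() would have done): transform each row and update
        // the partition-local variable as a side effect. Every transformed row
        // is buffered, so the whole partition is held in memory here.
        List<Integer> buffered = new ArrayList<>();
        while (rows.hasNext()) {
            int doubled = rows.next() * 2;
            runningSum += doubled;
            buffered.add(doubled);
        }

        // Stage 2 (the original mapPartitions() logic): iterates over the
        // buffered rows and can read the final value of the variable.
        final long total = runningSum;
        return buffered.stream().map(v -> v + "/" + total).iterator();
    }

    public static void main(String[] args) {
        Iterator<String> result =
                processPartition(Arrays.asList(1, 2, 3).iterator());
        while (result.hasNext()) {
            System.out.println(result.next());
        }
    }
}
```

Because the variable is a local inside processPartition, each partition gets
its own copy, independent of which executor runs it.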

It feels, however, that this is less efficient than running
map()+mapPartitions(), because I need to hold the ArrayList (which is
basically all the data in the partition) in memory.
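
For comparison, a sketch of a variant that avoids the buffer by chaining two
lazy iterators inside one partition function. This is only an illustration
under a strong assumption: the second stage must be able to work with the
variable's value as of the rows consumed so far. If it needs the final value
before emitting its first row, the partition has to be buffered or read twice.
The names processPartition and Counter are hypothetical, and the arithmetic is
a placeholder.

```java
import java.util.Arrays;
import java.util.Iterator;

class LazyPartitionSketch {

    // Hypothetical partition-local state; one instance per partition call.
    static final class Counter {
        long sum = 0L;
    }

    static Iterator<String> processPartition(Iterator<Integer> rows) {
        Counter local = new Counter();

        // Stage 1 as a lazy wrapper: each row is transformed, and the
        // partition-local state updated, only when the row is pulled.
        Iterator<Integer> mapped = new Iterator<Integer>() {
            @Override public boolean hasNext() { return rows.hasNext(); }
            @Override public Integer next() {
                int doubled = rows.next() * 2;
                local.sum += doubled;
                return doubled;
            }
        };

        // Stage 2 pulls from stage 1 one row at a time, so nothing is
        // buffered. It sees the state as updated up to the current row only.
        return new Iterator<String>() {
            @Override public boolean hasNext() { return mapped.hasNext(); }
            @Override public String next() {
                int value = mapped.next();
                return value + "/" + local.sum;
            }
        };
    }

    public static void main(String[] args) {
        Iterator<String> result =
                processPartition(Arrays.asList(1, 2, 3).iterator());
        while (result.hasNext()) {
            System.out.println(result.next());
        }
    }
}
```

Here the second stage observes a running value (2, 6, 12 for the input 1, 2, 3)
rather than the final total, which is the price of not storing the partition.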

Thanks,
Zsolt
