spark-user mailing list archives

From: Tobias Pfeiffer <...@preferred.jp>
Subject: Re: serialization issue with mapPartitions
Date: Fri, 26 Dec 2014 01:08:43 GMT
Hi,

On Fri, Dec 26, 2014 at 1:32 AM, ey-chih chow <eychih@hotmail.com> wrote:
>
> I got some issues with mapPartitions with the following piece of code:
>
>     val sessions = sc
>       .newAPIHadoopFile(
>         "... path to an avro file ...",
>         classOf[org.apache.avro.mapreduce.AvroKeyInputFormat[ByteBuffer]],
>         classOf[AvroKey[ByteBuffer]],
>         classOf[NullWritable],
>         job.getConfiguration())
>       .mapPartitions { valueIterator =>
>         val config = job.getConfiguration()
>         // ...
>       }
>       .collect()
>
> Why does job.getConfiguration() inside the mapPartitions function generate
> the following message?
>
> Cause: java.io.NotSerializableException: org.apache.hadoop.mapreduce.Job
>

The functions passed to mapPartitions() are executed on the Spark
executors, not on the Spark driver. Therefore, the function body, together
with everything it references, needs to be serialized and sent to the
executors over the network. If that is not possible (in your case, `job`
cannot be serialized, so the closure capturing it cannot be either), you
get a NotSerializableException. The same job.getConfiguration() call works
as a parameter to newAPIHadoopFile because that call is evaluated on the
driver.
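
One common workaround is to pull the values you need out of the
Configuration on the driver, before the closure is created, so that only
those (serializable) values are captured instead of `job`. A minimal
sketch, assuming a hypothetical config key "example.key" and an existing
SparkContext `sc`:

    import java.nio.ByteBuffer
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.AvroKeyInputFormat
    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapreduce.Job

    val job = Job.getInstance()
    // Read the setting on the driver; a plain String serializes fine.
    val someSetting = job.getConfiguration().get("example.key")

    val sessions = sc
      .newAPIHadoopFile(
        "... path to an avro file ...",
        classOf[AvroKeyInputFormat[ByteBuffer]],
        classOf[AvroKey[ByteBuffer]],
        classOf[NullWritable],
        job.getConfiguration())
      .mapPartitions { valueIterator =>
        // The closure now captures only `someSetting`, not the
        // non-serializable Job.
        valueIterator.map { case (key, _) =>
          (key.datum().remaining(), someSetting)
        }
      }
      .collect()

If you really need the whole Configuration on the executors, wrapping it
(e.g. in Spark's SerializableWritable) and broadcasting it is another
option, but extracting the plain values you need is usually simpler.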

Tobias
