spark-dev mailing list archives

From "Ulanov, Alexander" <alexander.ula...@hp.com>
Subject RE: Pass parameters to RDD functions
Date Thu, 03 Jul 2014 12:21:52 GMT
Thanks, this works both with Scala and Java Serializable. Which one should I use?

Related question: I would like only that particular val to be serialized instead of the whole class; what should I do?
As far as I understand, the whole class is serialized and transferred between nodes (am I
right?)
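For what it's worth, a common workaround for exactly this is to copy the field into a local val inside the method, so the closure captures only that value and not the enclosing instance. Below is a minimal sketch without Spark; the `Holder` class and method names are hypothetical and only illustrate the capture behavior that plain Java serialization (which Spark uses to ship tasks) would see:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical non-serializable class, standing in for TextToWordVector.
class Holder(data: Array[Array[String]]) {
  val n = 1

  // Referencing the field `n` inside the closure means accessing `this.n`,
  // so the closure captures `this` and serializing it drags in the whole
  // (non-serializable) Holder instance.
  def capturesThis: Array[String] => String = arr => arr(n)

  // Copying the field into a local val first means the closure captures
  // only an Int, which serializes fine.
  def capturesLocal: Array[String] => String = {
    val localN = n
    arr => arr(localN)
  }
}

// Java-serialization round trip, mimicking what Spark does with a task.
def serialize(obj: AnyRef): Unit = {
  val oos = new ObjectOutputStream(new ByteArrayOutputStream())
  try oos.writeObject(obj) finally oos.close()
}
```

With this sketch, `serialize(new Holder(Array.empty).capturesLocal)` succeeds, while `serialize(new Holder(Array.empty).capturesThis)` throws `java.io.NotSerializableException`.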

Alexander

-----Original Message-----
From: Sean Owen [mailto:sowen@cloudera.com] 
Sent: Thursday, July 03, 2014 3:31 PM
To: dev@spark.apache.org
Subject: Re: Pass parameters to RDD functions

Declare this class with "extends Serializable", meaning java.io.Serializable?
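For concreteness, a minimal sketch of that suggestion, using a plain `Array` in place of the `RDD` so it runs outside Spark (class and field names follow the original post; nothing here is Spark API):

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Sketch of the suggested fix: `extends Serializable` (i.e.
// java.io.Serializable) so instances of the class, and closures that
// capture them, can be serialized when Spark ships tasks to executors.
class TextToWordVector(csvData: Array[Array[String]]) extends Serializable {
  val n = 1
  lazy val x: Array[String] = csvData.map(stringArr => stringArr(n))
}

// Java-serialization round trip, mimicking what Spark does with a task.
def roundTrip(obj: AnyRef): Unit = {
  val oos = new ObjectOutputStream(new ByteArrayOutputStream())
  try oos.writeObject(obj) finally oos.close()
}
```

Without `extends Serializable` the round trip fails with `NotSerializableException`; with it, the instance serializes cleanly.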

On Thu, Jul 3, 2014 at 12:24 PM, Ulanov, Alexander <alexander.ulanov@hp.com> wrote:
> Hi,
>
> I wonder how I can pass parameters to RDD functions with closures. If I do it the following way, Spark crashes with a NotSerializableException:
>
> class TextToWordVector(csvData: RDD[Array[String]]) {
>
>   val n = 1
>   lazy val x = csvData.map { stringArr => stringArr(n) }.collect()
> }
>
> Exception:
> Job aborted due to stage failure: Task not serializable: 
> java.io.NotSerializableException: 
> org.apache.spark.mllib.util.TextToWordVector
> org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable:
java.io.NotSerializableException: org.apache.spark.mllib.util.TextToWordVector
>                 at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAG
> Scheduler$$failJobAndIndependentStages(DAGScheduler.scala:1038)
>
>
> This message proposes a workaround, but it didn't work for me:
> http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3CCAA_qdLrxXzwXd5=6SXLOgSmTTorpOADHjnOXn=tMrOLEJM=Frw@mail.gmail.com%3E
>
> What is the best practice?
>
> Best regards, Alexander