spark-issues mailing list archives

From "Oksana Romankova (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-8697) MatchIterator not serializable exception in RegexTokenizer
Date Tue, 26 Jan 2016 19:23:40 GMT

    [ https://issues.apache.org/jira/browse/SPARK-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15117789#comment-15117789 ]

Oksana Romankova commented on SPARK-8697:
-----------------------------------------

Spark 1.4.1

It seems the issue occurs when a DataFrame is created from an existing RDD using toDF() and
RegexTokenizer is used to extract matches with setGaps(false). If the file is loaded with
sqlContext.read.load instead, this doesn't happen.

The exception is:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0.0 in stage 2.0 (TID 2) had a not serializable result: scala.util.matching.Regex$MatchIterator
Serialization stack:

	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
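
The type named in the exception can be examined without Spark. The following is a minimal sketch in plain Scala, not the RegexTokenizer code: the object name, the helper isJavaSerializable, the pattern, and the input text are all hypothetical and chosen for illustration. It shows only that the iterator returned by Regex.findAllIn is rejected by Java serialization, while a strict copy of the matches is accepted. Whether this is the exact path taken inside RegexTokenizer is not established in this thread.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object MatchIteratorSerialization {
  // Hypothetical helper: true if Java serialization accepts the value,
  // false if it is rejected with NotSerializableException.
  def isJavaSerializable(value: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try {
      out.writeObject(value)
      true
    } catch {
      case _: NotSerializableException => false
    } finally {
      out.close()
    }
  }

  def main(args: Array[String]): Unit = {
    val pattern = "\\w+".r          // hypothetical token pattern
    val text = "hello spark world"  // hypothetical input

    // findAllIn returns a Regex.MatchIterator, the type named in the exception.
    val lazyMatches = pattern.findAllIn(text)
    println(isJavaSerializable(lazyMatches))

    // Copying the matches into a strict List leaves only plain Strings.
    val strictMatches = pattern.findAllIn(text).toList
    println(isJavaSerializable(strictMatches))
  }
}
```

If this reflects what happens in the failing job, materializing the matches into a strict collection before they leave the task would be one thing to try.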


> MatchIterator not serializable exception in RegexTokenizer
> ----------------------------------------------------------
>
>                 Key: SPARK-8697
>                 URL: https://issues.apache.org/jira/browse/SPARK-8697
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.4.0
>            Reporter: Xiangrui Meng
>            Priority: Minor
>
> I'm not sure whether this is a real bug or not. In the REPL, I saw a MatchIterator not
> serializable exception in RegexTokenizer during some ad-hoc testing. However, I couldn't
> reproduce this issue. Maybe it is a REPL bug.


