spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Joseph Bradley <jos...@databricks.com>
Subject Re: Spark MLLIB Decision Tree - ArrayIndexOutOfBounds Exception
Date Tue, 21 Oct 2014 22:54:23 GMT
Hi, this sounds like a bug which has been fixed in the current master.
What version of Spark are you using?  Would it be possible to update to the
current master?
If not, it would be helpful to know some more of the problem dimensions
(num examples, num features, feature types, label type).
Thanks,
Joseph


On Tue, Oct 21, 2014 at 2:42 AM, lokeshkumar <lokesh@dataken.net> wrote:

> Hi All,
>
> I am trying to run the spark example JavaDecisionTree code using some
> external data set.
> It works for certain dataset only with specific maxBins and maxDepth
> settings. Even for a working dataset if I add a new data item I get a
> ArrayIndexOutOfBounds Exception, I get the same exception for the first
> case
> as well (changing maxBins and maxDepth). I am not sure what is wrong here,
> can anyone please explain this.
>
> Exception stacktrace:
>
> 14/10/21 13:47:15 ERROR executor.Executor: Exception in task 1.0 in stage
> 7.0 (TID 13)
> java.lang.ArrayIndexOutOfBoundsException: 6301
>         at
>
> org.apache.spark.mllib.tree.DecisionTree$.updateBinForOrderedFeature$1(DecisionTree.scala:648)
>         at
>
> org.apache.spark.mllib.tree.DecisionTree$.binaryOrNotCategoricalBinSeqOp$1(DecisionTree.scala:706)
>         at
>
> org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:798)
>         at
>
> org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830)
>         at
>
> org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830)
>         at
>
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
>         at
>
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>         at
>
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>         at
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
>         at
>
> org.apache.spark.InterruptibleIterator.foldLeft(InterruptibleIterator.scala:28)
>         at
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
>         at
>
> org.apache.spark.InterruptibleIterator.aggregate(InterruptibleIterator.scala:28)
>         at
>
> org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)
>         at
>
> org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)
>         at
>
> org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)
>         at
>
> org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)
>         at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>         at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>         at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>         at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>         at org.apache.spark.scheduler.Task.run(Task.scala:54)
>         at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>         at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> 14/10/21 13:47:15 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 7.0
> (TID 13, localhost): java.lang.ArrayIndexOutOfBoundsException: 6301
>
>
> org.apache.spark.mllib.tree.DecisionTree$.updateBinForOrderedFeature$1(DecisionTree.scala:648)
>
>
> org.apache.spark.mllib.tree.DecisionTree$.binaryOrNotCategoricalBinSeqOp$1(DecisionTree.scala:706)
>
>
> org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:798)
>
>
> org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830)
>
>
> org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830)
>
>
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
>
>
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
>         scala.collection.Iterator$class.foreach(Iterator.scala:727)
>
>
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
>
>
> org.apache.spark.InterruptibleIterator.foldLeft(InterruptibleIterator.scala:28)
>
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
>
>
> org.apache.spark.InterruptibleIterator.aggregate(InterruptibleIterator.scala:28)
>
>
> org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)
>
>
> org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)
>
>
> org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)
>
>
> org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)
>         org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>         org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>         org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>         org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>         org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>         org.apache.spark.scheduler.Task.run(Task.scala:54)
>
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         java.lang.Thread.run(Thread.java:745)
> 14/10/21 13:47:15 ERROR scheduler.TaskSetManager: Task 1 in stage 7.0
> failed
> 1 times; aborting job
> 14/10/21 13:47:15 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 7.0,
> whose tasks have all completed, from pool
> 14/10/21 13:47:15 INFO scheduler.TaskSchedulerImpl: Cancelling stage 7
> 14/10/21 13:47:15 INFO scheduler.DAGScheduler: Failed to run reduce at
> RDDFunctions.scala:111
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
> in
> stage 7.0 failed 1 times, most recent failure: Lost task 1.0 in stage 7.0
> (TID 13, localhost): java.lang.ArrayIndexOutOfBoundsException: 6301
>
>
> org.apache.spark.mllib.tree.DecisionTree$.updateBinForOrderedFeature$1(DecisionTree.scala:648)
>
>
> org.apache.spark.mllib.tree.DecisionTree$.binaryOrNotCategoricalBinSeqOp$1(DecisionTree.scala:706)
>
>
> org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:798)
>
>
> org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830)
>
>
> org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830)
>
>
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
>
>
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
>         scala.collection.Iterator$class.foreach(Iterator.scala:727)
>
>
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
>
>
> org.apache.spark.InterruptibleIterator.foldLeft(InterruptibleIterator.scala:28)
>
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
>
>
> org.apache.spark.InterruptibleIterator.aggregate(InterruptibleIterator.scala:28)
>
>
> org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)
>
>
> org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)
>
>
> org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)
>
>
> org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)
>         org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>         org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>         org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>         org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>         org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>         org.apache.spark.scheduler.Task.run(Task.scala:54)
>
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>         at
> org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
>         at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
>         at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
>         at
>
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>         at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
>         at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
>         at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
>         at scala.Option.foreach(Option.scala:236)
>         at
>
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
>         at
>
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>         at
>
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>         at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at
>
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at
>
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-MLLIB-Decision-Tree-ArrayIndexOutOfBounds-Exception-tp16907.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>

Mime
View raw message