spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aaron Davidson <ilike...@gmail.com>
Subject Re: distinct on huge dataset
Date Sat, 22 Mar 2014 06:42:58 GMT
Which version of spark are you running?


On Fri, Mar 21, 2014 at 10:45 PM, Kane <kane.isturm@gmail.com> wrote:

> I have a huge 2.5TB file. When i run:
> val t = sc.textFile("/user/hdfs/dump.csv")
> t.distinct.count
>
> It fails right away with a lot of:
>
> Loss was due to java.lang.ArrayIndexOutOfBoundsException
> java.lang.ArrayIndexOutOfBoundsException: 1
>         at $line59.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:18)
>         at $line59.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:16)
>         at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>         at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>         at
> org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
>         at
>
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
>         at
>
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:94)
>         at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
>         at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:471)
>         at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>         at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
>         at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
>         at org.apache.spark.scheduler.Task.run(Task.scala:53)
>         at
>
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
>         at
>
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:46)
>         at
>
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:45)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at
>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
>         at
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>         at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>         at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>

Mime
View raw message