spark-user mailing list archives

From AssafMendelson <assaf.mendel...@rsa.com>
Subject IndexOutOfBoundException in catalyst when doing multiple approxDistinctCount
Date Tue, 08 Aug 2017 15:38:25 GMT
Hi,

I am running a large number of aggregations on a DataFrame (without groupBy) to compute some statistics.
As part of this, I call approx_count_distinct(c, 0.01) on each column c.
The first run works fine, but when I run the same aggregation a second time (for each column), I
get the following error:



[Stage 2:>                                                          (0 + 2) / 2]
[WARN] org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: Error calculating stats of compiled class.
java.lang.IndexOutOfBoundsException: Index: 4355, Size: 1
                at java.util.ArrayList.rangeCheck(ArrayList.java:653)
                at java.util.ArrayList.get(ArrayList.java:429)
                at org.codehaus.janino.util.ClassFile.getConstantPoolInfo(ClassFile.java:556)
                at org.codehaus.janino.util.ClassFile.getConstantUtf8(ClassFile.java:572)
                at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1513)
                at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:644)
                at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:623)
                at org.codehaus.janino.util.ClassFile.<init>(ClassFile.java:280)
                at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:996)
                at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:993)
                at scala.collection.Iterator$class.foreach(Iterator.scala:893)
                at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
                at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
                at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
                at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:993)
                at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:961)
                at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1027)
                at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1024)
                at org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
                at org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
                at org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
                at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
                at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
                at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
                at org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
                at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:906)
                at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:412)
                at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:366)
                at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:32)
                at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:890)
                at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:130)
                at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:140)
                at org.apache.spark.sql.execution.aggregate.AggregationIterator.generateResultProjection(AggregationIterator.scala:235)
                at org.apache.spark.sql.execution.aggregate.AggregationIterator.<init>(AggregationIterator.scala:266)
                at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.<init>(SortBasedAggregationIterator.scala:39)
                at org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:86)
                at org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:77)
                at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
                at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
                at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
                at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
                at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
                at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
                at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
                at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
                at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
                at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
                at org.apache.spark.scheduler.Task.run(Task.scala:108)
                at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
                at java.lang.Thread.run(Thread.java:745)
[WARN] org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: Error calculating stats of compiled class.
java.lang.IndexOutOfBoundsException: Index: 768, Size: 1
                at java.util.ArrayList.rangeCheck(ArrayList.java:653)
                at java.util.ArrayList.get(ArrayList.java:429)
                at org.codehaus.janino.util.ClassFile.getConstantPoolInfo(ClassFile.java:556)
                at org.codehaus.janino.util.ClassFile.getConstantUtf8(ClassFile.java:572)
                at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1513)
                at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:644)
                at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:623)
                at org.codehaus.janino.util.ClassFile.<init>(ClassFile.java:280)
                at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:996)
                at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:993)
                at scala.collection.Iterator$class.foreach(Iterator.scala:893)
                at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
                at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
                at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
                at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:993)
                at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:961)
                at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1027)
                at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1024)
                at org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
                at org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
                at org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
                at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
                at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
                at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
                at org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
                at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:906)
                at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:194)
                at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:36)
                at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:890)
                at org.apache.spark.sql.catalyst.expressions.FromUnsafeProjection$.create(Projection.scala:182)
                at org.apache.spark.sql.catalyst.expressions.FromUnsafeProjection$.apply(Projection.scala:175)
                at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.<init>(SortBasedAggregationIterator.scala:98)
                at org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:86)
                at org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:77)
                at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
                at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
                at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
                at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
                at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
                at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
                at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
                at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
                at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
                at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038)
                at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
                at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
                at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
                at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
                at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
                at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
                at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
                at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
                at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
                at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
                at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
                at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
                at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
                at org.apache.spark.scheduler.Task.run(Task.scala:108)
                at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
                at java.lang.Thread.run(Thread.java:745)

Has anyone run into this, or does anyone know how to fix it?
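
For context, the aggregation pattern described above looks roughly like the PySpark sketch below. This is an illustration rather than the actual job: the helper names (distinct_alias, approx_distinct_exprs, column_stats), the sample DataFrame and the local master are all assumptions, and the sketch is not guaranteed to trigger the warning.

```python
def distinct_alias(column):
    """Name of the output field for one column (hypothetical naming scheme)."""
    return f"{column}_approx_distinct"


def approx_distinct_exprs(columns, rsd=0.01):
    """Build one approx_count_distinct expression per column name."""
    # Imported lazily so the module can be loaded without PySpark installed.
    from pyspark.sql import functions as F

    return [
        F.approx_count_distinct(F.col(c), rsd).alias(distinct_alias(c))
        for c in columns
    ]


def column_stats(df, rsd=0.01):
    """Run a single global aggregation (no groupBy) over every column of df."""
    row = df.agg(*approx_distinct_exprs(df.columns, rsd)).first()
    return row.asDict()


if __name__ == "__main__":
    from pyspark.sql import SparkSession

    # Assumed setup: a small local session and a made-up DataFrame.
    spark = SparkSession.builder.master("local[2]").getOrCreate()
    df = spark.range(1000).selectExpr("id % 10 AS a", "id % 100 AS b")

    first_pass = column_stats(df)
    # Running the same aggregation again is the step where the warning appeared.
    second_pass = column_stats(df)
    print(first_pass, second_pass)

    spark.stop()
```

The second call to column_stats repeats the first one unchanged, which corresponds to the "second time" mentioned above.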

Thanks,
Assaf.





--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/IndexOutOfBoundException-in-catalyst-when-doing-multiple-approxDistinctCount-tp29041.html