spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Davies Liu <dav...@databricks.com>
Subject Re: Problems with Tungsten in Spark 1.5.0-rc2
Date Tue, 01 Sep 2015 17:07:57 GMT
Thanks for the confirmation. The tungsten-sort is not the default
ShuffleManager, this fix will not block 1.5 release, it may be in
1.5.1.

BTW, How is the difference between sort and tungsten-sort
ShuffleManager for this large job?

On Tue, Sep 1, 2015 at 8:03 AM, Anders Arpteg <arpteg@spotify.com> wrote:
> A fix submitted less than one hour after my mail, very impressive Davies!
> I've compiled your PR and tested it with the large job that failed before,
> and it seems to work fine now without any exceptions. Awesome, thanks!
>
> Best,
> Anders
>
> On Tue, Sep 1, 2015 at 1:38 AM Davies Liu <davies@databricks.com> wrote:
>>
>> I had sent out a PR [1] to fix 2), could you help to test that?
>>
>> [1]  https://github.com/apache/spark/pull/8543
>>
>> On Mon, Aug 31, 2015 at 12:34 PM, Anders Arpteg <arpteg@spotify.com>
>> wrote:
>> > Was trying out 1.5 rc2 and noticed some issues with the Tungsten shuffle
>> > manager. One problem was when using the com.databricks.spark.avro reader
>> > and
>> > the error(1) was received, see stack trace below. The problem does not
>> > occur
>> > with the "sort" shuffle manager.
>> >
>> > Another problem was in a large complex job with lots of transformations
>> > occurring simultaneously, i.e. 50+ or more maps each shuffling data.
>> > Received error(2) about inability to acquire memory which seems to also
>> > have
>> > to do with Tungsten. Possibly some setting available to increase that
>> > memory, because there's lots of heap memory available.
>> >
>> > Am running on Yarn 2.2 with about 400 executors. Hoping this will give
>> > some
>> > hints for improving the upcoming release, or for me to get some hints to
>> > fix
>> > the problems.
>> >
>> > Thanks,
>> > Anders
>> >
>> > Error(1)
>> >
>> > 15/08/31 18:30:57 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID
>> > 3387,
>> > lon4-hadoopslave-c245.lon4.spotify.net): java.io.EOFException
>> >
>> >        at java.io.DataInputStream.readInt(DataInputStream.java:392)
>> >
>> >        at
>> >
>> > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:121)
>> >
>> >        at
>> >
>> > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:109)
>> >
>> >        at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
>> >
>> >        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> >
>> >        at
>> >
>> > org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
>> >
>> >        at
>> >
>> > org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
>> >
>> >        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> >
>> >        at
>> >
>> > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:366)
>> >
>> >        at
>> >
>> > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
>> >
>> >        at
>> >
>> > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$Tung
>> >
>> > stenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
>> >
>> >        at
>> >
>> > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>> >
>> >        at
>> >
>> > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>> >
>> >        at
>> >
>> > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:47)
>> >
>> >        at
>> > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>> >
>> >        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>> >
>> >        at
>> > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> >
>> >        at
>> > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>> >
>> >        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>> >
>> >        at
>> > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>> >
>> >        at org.apache.spark.scheduler.Task.run(Task.scala:88)
>> >
>> >        at
>> > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>> >
>> >        at
>> >
>> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >
>> >        at
>> >
>> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >
>> >        at java.lang.Thread.run(Thread.java:745)
>> >
>> >
>> > Error(2)
>> >
>> > 5/08/31 18:41:25 WARN TaskSetManager: Lost task 16.1 in stage 316.0 (TID
>> > 32686, lon4-hadoopslave-b925.lon4.spotify.net): java.io.IOException:
>> > Unable
>> > to acquire 67108864 bytes of memory
>> >
>> >        at
>> >
>> > org.apache.spark.shuffle.unsafe.UnsafeShuffleExternalSorter.acquireNewPageIfNecessary(UnsafeShuffleExternalSorter.java:385)
>> >
>> >        at
>> >
>> > org.apache.spark.shuffle.unsafe.UnsafeShuffleExternalSorter.insertRecord(UnsafeShuffleExternalSorter.java:435)
>> >
>> >        at
>> >
>> > org.apache.spark.shuffle.unsafe.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:246)
>> >
>> >        at
>> >
>> > org.apache.spark.shuffle.unsafe.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:174)
>> >
>> >        at
>> >
>> > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>> >
>> >        at
>> >
>> > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> >
>> >        at org.apache.spark.scheduler.Task.run(Task.scala:88)
>> >
>> >        at
>> > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>> >
>> >        at
>> >
>> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >
>> >        at
>> >
>> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >
>> >        at java.lang.Thread.run(Thread.java:745)

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Mime
View raw message