spark-issues mailing list archives

From "Hans van den Bogert (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based
Date Tue, 06 Oct 2015 13:46:27 GMT

    [ https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944925#comment-14944925 ]

Hans van den Bogert edited comment on SPARK-10474 at 10/6/15 1:46 PM:
----------------------------------------------------------------------

One more debug println for the calculated cores (in contrast to numCores):
https://gist.github.com/hansbogert/cc2baf3995d4e37270a2

Relevant output (the output is the same for fine-grained and coarse-grained Mesos):
{noformat}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0)
Type in expressions to have them evaluated.
Type :help for more information.
15/10/06 10:25:04 WARN SparkConf: In Spark 1.0 and later spark.local.dir will be overridden
by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS
in YARN).
numCores:0
cores:12
1048576
15/10/06 10:25:05 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id
is not set.
...
{noformat}

The calculated 'cores' is 12, which is the number of cores on the local driver node; the total Mesos cluster, however, has more than 40 cores. Either way, there is no difference between fine-grained and coarse-grained mode, at least for this method.

/update
I should've read the logs on the Mesos slaves as well; there is indeed a discrepancy between fine-grained mode and coarse-grained mode.
In fine-grained mode:
{noformat}
head /local/vdbogert/var/lib/mesos/slaves/20151006-105432-84120842-5050-17066-S3/frameworks/20151006-105432-84120842-5050-17066-0009/executors/20151006-105432-84120842-5050-17066-S3/runs/latest/stdout
2numCores:1
cores:1
67108864
{noformat}


And in coarse-grained mode:
{noformat}
head /local/vdbogert/var/lib/mesos/slaves/20151006-105432-84120842-5050-17066-S3/frameworks/20151006-105432-84120842-5050-17066-0010/executors/3/runs/latest/stdout
Registered executor on node326.ib.cluster
Starting task 3
sh -c ' "/var/scratch/vdbogert/src/spark-1.5.1/bin/spark-class" org.apache.spark.executor.CoarseGrainedExecutorBackend
--driver-url akka.tcp://sparkDriver@10.141.3.254:56069/user/CoarseGrainedScheduler --executor-id
20151006-105432-84120842-5050-17066-S3 --hostname node326.ib.cluster --cores 8 --app-id 20151006-105432-84120842-5050-17066-0010'
Forked command at 4378
numCores:8
cores:8
16777216
{noformat}
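
These page sizes line up with how 1.5.x derives the default when the executor isn't told its core count. A paraphrased sketch of that derivation (my reading of the ShuffleMemoryManager code, not a verbatim copy; maxMemory below only stands in for the shuffle memory figure Spark actually uses):
{code:scala}
// Paraphrased sketch of the default Tungsten page-size derivation in 1.5.x.
def defaultPageSize(numCores: Int, maxMemory: Long): Long = {
  val minPageSize = 1L * 1024 * 1024   // 1MB  -> the 1048576 on the driver
  val maxPageSize = 64L * minPageSize  // 64MB -> the 67108864 in fine-grained mode
  // numCores == 0 means it wasn't passed in, so fall back to the machine's
  // processors -- which is why the driver reports cores:12. The executors above
  // were handed 1 (fine-grained) and 8 (coarse-grained) directly.
  val cores = if (numCores > 0) numCores else Runtime.getRuntime.availableProcessors()
  val safetyFactor = 16
  def nextPowerOf2(n: Long): Long =
    if (n <= 1) 1L else java.lang.Long.highestOneBit(n - 1) << 1
  math.min(maxPageSize, math.max(minPageSize, nextPowerOf2(maxMemory / cores / safetyFactor)))
}
{code}
If that reading is right, a single-core fine-grained executor hits the 64MB ceiling, the 8-core coarse-grained executor comes out at 16MB (16777216), and the 12-core driver bottoms out at the 1MB floor.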

This is probably a different bug, specific to Mesos fine-grained mode. My current workaround is setting `spark.buffer.pageSize` to 16M, the value that would otherwise have been used automatically in coarse-grained mode.
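
For completeness, this is how I apply that workaround when building the context (a minimal sketch; the app name is just a placeholder, and passing --conf spark.buffer.pageSize=16m to spark-shell does the same):
{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Pin the Tungsten page size explicitly so a fine-grained executor does not
// derive a 64MB page from its single advertised core; 16m is what
// coarse-grained mode ends up with on these 8-core nodes anyway.
val conf = new SparkConf()
  .setAppName("pageSize-workaround") // placeholder name
  .set("spark.buffer.pageSize", "16m")
val sc = new SparkContext(conf)
{code}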

/update2
Even when allocating only 16MB, just like in coarse-grained mode (and even going lower, to 8MB), I'm *still* seeing this pop up. So


> TungstenAggregation cannot acquire memory for pointer array after switching to sort-based
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-10474
>                 URL: https://issues.apache.org/jira/browse/SPARK-10474
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Yi Zhou
>            Assignee: Andrew Or
>            Priority: Blocker
>             Fix For: 1.5.1, 1.6.0
>
>
> In an aggregation case, a lost task failed with the error below.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
>         at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
>         at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
>         at org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:126)
>         at org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
>         at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
>         at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
>         at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
>         at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
>         at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>         at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>         at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>         at org.apache.spark.scheduler.Task.run(Task.scala:88)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> store_sales is a large fact table and item is a small dimension table.


