spark-user mailing list archives

From nmoretto <nic...@tyk.li>
Subject DIMSUM among 550k objects on AWS Elastic Map Reduce fails with OOM errors
Date Fri, 27 May 2016 10:38:28 GMT
Hello everyone, I am trying to compute the similarity between 550k objects
using the DIMSUM algorithm available in Spark 1.6.

The cluster runs on AWS Elastic MapReduce and consists of 6 r3.2xlarge
instances (one master and five core nodes), each with 8 vCPUs and 61 GiB of RAM.

My input data is a 3.5GB CSV file hosted on AWS S3, which I use to build a
RowMatrix with 550k columns and 550k rows, passing sparse vectors as rows to
the RowMatrix constructor.
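For reference, the construction looks roughly like the sketch below (simplified; the S3 path, the CSV layout of "colIndex:value" pairs, and the parseSparseLine helper are illustrative assumptions, not my exact code):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object DimsumSketch {
  // Hypothetical parser: each CSV line holds "colIndex:value" pairs for one row.
  def parseSparseLine(line: String, dim: Int): Vector = {
    val pairs = line.split(",").filter(_.nonEmpty).map { tok =>
      val Array(i, v) = tok.split(":")
      (i.toInt, v.toDouble)
    }
    Vectors.sparse(dim, pairs)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dimsum-550k"))
    val dim = 550000
    val rows = sc.textFile("s3://my-bucket/input.csv") // hypothetical path
      .map(parseSparseLine(_, dim))
    val mat = new RowMatrix(rows)
    // threshold > 0 enables DIMSUM sampling; 0.0 would compute all pairs
    val sims = mat.columnSimilarities(0.5)
    println(sims.entries.count())
    sc.stop()
  }
}
```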

In every attempt so far, the application fails during the
mapPartitionsWithIndex stage of the RowMatrix.columnSimilarities() method
(source code at
https://github.com/apache/spark/blob/v1.6.0/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L587)
with YARN containers either 1) exiting with FAILURE due to an OutOfMemoryError
on the Java heap space (thrown from Spark code, apparently) or 2) being killed
by the ApplicationMaster (and increasing spark.yarn.executor.memoryOverhead as
suggested doesn't seem to help).

I have tried the following approaches, alone and in combination, without
noticing significant improvement:
- setting the AWS EMR maximizeResourceAllocation option to true (details at
https://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-configure.html)
- increasing the number of partitions (via spark.default.parallelism, up
to 8000)
- increasing the driver memory (from the default ~512M up to ~50G) and the
executor memory (from ~5G up to ~15G)
- increasing the YARN memory overhead (from the default 10% up to 40% of
driver and executor memory, respectively)
- setting the DIMSUM threshold to 0.5 and 0.8 to reduce the number of
comparisons
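Expressed as a SparkConf, the combination of settings looks roughly like this sketch (values are illustrative, picked from the ranges above; note that in client mode the driver memory must instead be passed to spark-submit via --driver-memory, since the driver JVM is already running by the time the conf is read):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "15g")
  .set("spark.yarn.executor.memoryOverhead", "6144") // MiB, ~40% of executor memory
  .set("spark.default.parallelism", "8000")
```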

Does anyone have any idea about the possible cause(s) of these errors? Thank you.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/DIMSUM-among-550k-objects-on-AWS-Elastic-Map-Reduce-fails-with-OOM-errors-tp27038.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

