mahout-user mailing list archives

From: Dmitriy Lyubimov <dlie...@gmail.com>
Subject: Re: Exception in task 0.0 in stage 13.0 (TID 13) java.lang.OutOfMemoryError: Java heap space
Date: Wed, 17 Feb 2016 00:09:39 GMT
The original exception definitely happens in the task, when Mahout tries to
build an entire matrix block out of a partition. Use more tasks, smaller in
size, initially: calling par(min=??) will repartition to at least ?? tasks.
The off-HDFS defaults are just too big for matrix processing. Not sure how
to do that with the command line utility; Pat may help.
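
In the mahout spark-shell it would look something like this (just a sketch;
the input path and the minimum of 1000 are only illustrative, pick a minimum
that keeps your partitions small):

    // assumes the DRM imports and the implicit DistributedContext
    // that `mahout spark-shell` already provides
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    val drmA = drmDfsRead("hdfs:/path/to/input-drm")   // hypothetical path
    // repartition to at least 1000 tasks so each partition -- and the
    // matrix block built from it -- stays small enough for the task heap
    val drmSmall = drmA.par(min = 1000).checkpoint()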

On Tue, Feb 16, 2016 at 9:59 AM, Jaume Galí <jgali@konodrac.com> wrote:

> Hi,
>
> I did everything you suggested, but I couldn't solve the problem yet and I
> don't know what else to do.
>
> Now I have a machine with 64 GB of RAM, so physical memory should not be a
> problem any more.
> I attach the input matrix; if anybody could try to execute the command, it
> would be great.
>
> This is what I tried:
>
> - I used this command as Angelo suggested:
>
> /opt/mahout/bin/mahout spark-rowsimilarity -i matrix_country_115k.dat -o
> test_country_115k_output.tmp --maxObservations 500 --maxSimilaritiesPerRow
> 100 --omitStrength --master local --sparkExecutorMem 10g
> -D:spark.dynamicAllocation.enabled=true
> -D:spark.shuffle.service.enabled=true
>
> - I increased *MAHOUT_HEAPSIZE* up to 32 GB in two ways:
>
>
> + Mahout script (MAHOUT_HOME/bin/mahout):
>
> JAVA=$JAVA_HOME/bin/java
>
> JAVA_HEAP_MAX=-Xmx4g
>
> MAHOUT_HEAPSIZE=32768
>
>
> + ~/.profile setting environment variables:
>
> #Global conf JAVA
> export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
> export JAVA_OPTS=-Xmx32g
> export _JAVA_OPTIONS=-Xmx32g
> export HADOOP_PREFIX=/opt/hadoop
> export SPARK_HOME=/opt/spark
> export MAHOUT_HOME=/opt/mahout
> export MAHOUT_HEAPSIZE=32g
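> # (note: the mahout script appends "m" to MAHOUT_HEAPSIZE -- see the trace
> # below, which prints -Xmx32768m -- so it expects a plain megabyte count
> # such as 32768 rather than "32g")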
>
>
> I printed a trace of the memory settings from the mahout script, and this is the output:
>
> run with heapsize 32768
>
> -Xmx32768m
>
> So mahout is reading memory parameters fine.
>
>
> I'd be glad if you could guide me on which parameters I have to tune or
> check in order to solve this issue, because I don't know what else to do.
>
> Thanks in advance.
> Jaume.
>
>
>
> On 13 Feb 2016, at 22:56, Pat Ferrel <pat@occamsmachete.com> wrote:
>
> OK, this makes sense. When people see out-of-memory problems, they
> naturally try to give more memory to the process throwing the exception.
> But what is often happening is that you have given too much to the
> collection of other processes on the machine, so there is not enough to go
> around and the allocation fails in Spark. In that case you need to allocate
> less to Spark, so you can guarantee it will always be able to get that much.
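>
> For example, on one of the 15 GB machines mentioned below, a 15g driver
> plus a 15g executor asks for 30 GB before the OS gets anything, so one of
> those two allocations is bound to fail.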
>
>
> On Feb 13, 2016, at 9:30 AM, Angelo Leto <angleto@gmail.com> wrote:
>
> I was able to make it work by setting the executor memory to 10g
> and with -D:spark.dynamicAllocation.enabled=true:
>
> mahout spark-rowsimilarity --input hdfs:/indata/row-similarity.tsv
> --output rowsim-out --omitStrength --sparkExecutorMem 10g --master
> yarn-client -D:spark.dynamicAllocation.enabled=true
> -D:spark.shuffle.service.enabled=true
>
>
> On Sat, Feb 13, 2016 at 2:42 PM, Angelo Leto <angleto@gmail.com> wrote:
>
> Hello,
> I have the same problem described above using spark-rowsimilarity.
> I have a ~65k-line input file (each row with fewer than 300 items), and I
> run the job on a small cluster with 1 master and 2 workers; each machine
> has 15 GB of RAM.
> I tried to increase executor and driver memory:
> --sparkExecutorMem 15g
> -D:spark.driver.memory=15g
>
> but I get the OutOfMemoryError exception:
>
> 16/02/13 13:00:36 ERROR Executor: Exception in task 0.0 in stage 12.0 (TID 12)
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>       at org.apache.mahout.math.OrderedIntDoubleMapping.growTo(OrderedIntDoubleMapping.java:86)
>       at org.apache.mahout.math.OrderedIntDoubleMapping.set(OrderedIntDoubleMapping.java:118)
> [...]
>
> Thanks for any hint.
> Angelo
>
> On Fri, Feb 12, 2016 at 10:15 PM, Pat Ferrel <pat@occamsmachete.com>
> wrote:
>
> You have to set the executor memory. BTW, you have given the driver all
> the memory on the machine.
>
> On Feb 10, 2016, at 9:30 AM, Jaume Galí <jgali@konodrac.com> wrote:
>
> Hi again,
> (Sorry for my delay, but we didn't have a machine to test your thoughts
> about the memory issue.)
>
> The problem is still happening with an input matrix of 100k rows by 300
> items; I increased memory as you suggested but nothing changed. I attach
> spark-env.sh and the new specs of the machine.
>
> Machine specs:
>
> m3.xlarge AWS (Ivy Bridge, 15 GB RAM, 2x40 GB HD)
>
> This is my spark-env.sh:
>
> #!/usr/bin/env bash
> # Licensed to ...
>
> export SPARK_HOME=${SPARK_HOME:-/usr/lib/spark}
> export SPARK_LOG_DIR=${SPARK_LOG_DIR:-/var/log/spark}
> export HADOOP_HOME=${HADOOP_HOME:-/usr/lib/hadoop}
> export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop/conf}
> export HIVE_CONF_DIR=${HIVE_CONF_DIR:-/etc/hive/conf}
>
> export STANDALONE_SPARK_MASTER_HOST=ip-10-12-17-235.eu-west-1.compute.internal
> export SPARK_MASTER_PORT=7077
> export SPARK_MASTER_IP=$STANDALONE_SPARK_MASTER_HOST
> export SPARK_MASTER_WEBUI_PORT=8080
>
> export SPARK_WORKER_DIR=${SPARK_WORKER_DIR:-/var/run/spark/work}
> export SPARK_WORKER_PORT=7078
> export SPARK_WORKER_WEBUI_PORT=8081
>
> export HIVE_SERVER2_THRIFT_BIND_HOST=0.0.0.0
> export HIVE_SERVER2_THRIFT_PORT=10001
>
> export SPARK_DRIVER_MEMORY=15G
> export SPARK_DAEMON_JAVA_OPTS="$SPARK_DAEMON_JAVA_OPTS -XX:OnOutOfMemoryError='kill -9 %p'"
>
> Log:
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 1 times, most recent failure: Lost task 0.0 in stage 12.0 (TID 24, localhost): java.lang.OutOfMemoryError: GC overhead limit exceeded
> [...]
>
> Driver stacktrace:
> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
> [...]
>
>
> Thanks in advance
>
> On 2 Feb 2016, at 7:48, Pat Ferrel <pat@occamsmachete.com> wrote:
>
> You probably need to increase your driver memory; 8g will not work. 16g is
> probably the smallest standalone machine that will work, since the driver
> and executors both run on it.
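>
> For example (illustrative values, using the -D option that appears earlier
> in this thread to pass spark.driver.memory through to Spark):
>
>   mahout spark-rowsimilarity -i input.dat -o output.tmp --master local --sparkExecutorMem 4g -D:spark.driver.memory=16g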
>
> On Feb 1, 2016, at 1:24 AM, jgali@konodrac.com wrote:
>
> Hello everybody,
>
> We are experiencing problems when we use the "mahout spark-rowsimilarity"
> operation. We have an input matrix with 100k rows and 100 items, and the
> process throws "Exception in task 0.0 in stage 13.0 (TID 13)
> java.lang.OutOfMemoryError: Java heap space". We have tried increasing the
> Java heap memory, the Mahout heap memory, and spark.driver.memory.
>
> Environment versions:
> Mahout: 0.11.1
> Spark: 1.6.0.
>
> Mahout command line:
>   /opt/mahout/bin/mahout spark-rowsimilarity -i 50k_rows__50items.dat -o
> test_output.tmp --maxObservations 500 --maxSimilaritiesPerRow 100
> --omitStrength --master local --sparkExecutorMem 8g
>
> This process is running on a machine with the following specifications:
> RAM: 8 GB
> CPU: 8 cores
>
> .profile file:
> export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
> export HADOOP_HOME=/opt/hadoop-2.6.0
> export SPARK_HOME=/opt/spark
> export MAHOUT_HOME=/opt/mahout
> export MAHOUT_HEAPSIZE=8192
>
> It throws this exception:
>
> 16/01/22 11:45:06 ERROR Executor: Exception in task 0.0 in stage 13.0 (TID 13)
> java.lang.OutOfMemoryError: Java heap space
>    at org.apache.mahout.math.DenseMatrix.<init>(DenseMatrix.java:66)
>    at org.apache.mahout.sparkbindings.drm.package$$anonfun$blockify$1.apply(package.scala:70)
>    at org.apache.mahout.sparkbindings.drm.package$$anonfun$blockify$1.apply(package.scala:59)
>    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>    at org.apache.spark.scheduler.Task.run(Task.scala:89)
>    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>    at java.lang.Thread.run(Thread.java:745)
> 16/01/22 11:45:06 WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@12498227,BlockManagerId(driver, localhost, 42107))] in 1 attempts
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
>    at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
>    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
>    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
>    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
>    at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
>    at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
>    at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:448)
>    at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:468)
>    at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
>    at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
>    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741)
>    at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:468)
>    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>    at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
>    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>    at scala.concurrent.Await$.result(package.scala:107)
>    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>    ...
>
> Can you please advise?
>
>
> Thanks in advance.
> Cheers.
