mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: Exception in task 0.0 in stage 13.0 (TID 13) java.lang.OutOfMemoryError: Java heap space
Date Sat, 13 Feb 2016 21:56:43 GMT
OK, this makes sense. When people see out-of-memory errors, they naturally try to give more
memory to the process throwing the exception. Often, though, the real problem is that the
processes on the machine have collectively been promised more memory than exists, so there
is not enough to go around and the allocation fails on Spark. In that case you need to
allocate less to Spark, so you can guarantee it will always be able to get that much.
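
As a rough sketch of that arithmetic (the function name, the 3 GB reserved for the OS and
other processes, and the 2 GB driver are illustrative assumptions, not measurements from
this thread), you can work out what is left for the executor before picking a value for
--sparkExecutorMem:

```python
def spark_executor_budget_gb(total_gb, reserved_for_others_gb, driver_gb):
    """Return the memory (in GB) left for a Spark executor on one machine.

    All three inputs are assumptions the operator has to supply:
    total physical RAM, what the OS and other daemons need, and the
    driver heap if the driver runs on the same machine.
    """
    executor_gb = total_gb - reserved_for_others_gb - driver_gb
    if executor_gb <= 0:
        raise ValueError(
            "Nothing left for the executor: the machine is over-committed"
        )
    return executor_gb


# Hypothetical 15 GB machine, 3 GB kept for everything else, 2 GB driver.
executor_gb = spark_executor_budget_gb(15, 3, 2)
print(f"--sparkExecutorMem {executor_gb}g")
```

With these assumed numbers, asking for 15g on a 15 GB machine leaves a negative budget,
which is the over-commitment case described above.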


> On Feb 13, 2016, at 9:30 AM, Angelo Leto <angleto@gmail.com> wrote:
> 
> I was able to get it working by setting the executor memory to 10g
> and adding -D:spark.dynamicAllocation.enabled=true:
> 
> mahout spark-rowsimilarity --input hdfs:/indata/row-similarity.tsv
> --output rowsim-out --omitStrength --sparkExecutorMem 10g --master
> yarn-client -D:spark.dynamicAllocation.enabled=true
> -D:spark.shuffle.service.enabled=true
> 
> 
> On Sat, Feb 13, 2016 at 2:42 PM, Angelo Leto <angleto@gmail.com> wrote:
>> Hello,
>> I have the same problem described above using spark-rowsimilarity.
>> I have a ~65k-line input file (each row with fewer than 300 items),
>> and I run the job on a small cluster with 1 master and 2 workers; each
>> machine has 15GB of RAM.
>> I tried to increase executor and driver memory:
>> --sparkExecutorMem 15g
>> -D:spark.driver.memory=15g
>> 
>> but I get the OutOfMemoryError exception:
>> 
>> 16/02/13 13:00:36 ERROR Executor: Exception in task 0.0 in stage 12.0 (TID 12)
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>        at org.apache.mahout.math.OrderedIntDoubleMapping.growTo(OrderedIntDoubleMapping.java:86)
>>        at org.apache.mahout.math.OrderedIntDoubleMapping.set(OrderedIntDoubleMapping.java:118)
>> [...]
>> 
>> Thanks for any hint.
>> Angelo
>> 
>> On Fri, Feb 12, 2016 at 10:15 PM, Pat Ferrel <pat@occamsmachete.com> wrote:
>>> You have to set the executor memory. BTW, you have given the driver all the memory on the machine.
>>> 
>>>> On Feb 10, 2016, at 9:30 AM, Jaume Galí <jgali@konodrac.com> wrote:
>>>> 
>>>> Hi again,
>>>> (Sorry for the delay, but we didn't have a machine to test your thoughts about the memory issue.)
>>>> 
>>>> The problem is still happening when testing with an input matrix of 100k rows by 300 items. I increased the memory as you suggested, but nothing changed. I have attached spark-env.sh and the new specs of the machine.
>>>> 
>>>> Machine specs:
>>>> 
>>>> m3.xlarge AWS (Ivy Bridge, 15 GB RAM, 2x40 GB HD)
>>>> 
>>>> This is my spark-env.sh:
>>>> 
>>>> #!/usr/bin/env bash
>>>> # Licensed to ...
>>>> 
>>>> export SPARK_HOME=${SPARK_HOME:-/usr/lib/spark}
>>>> export SPARK_LOG_DIR=${SPARK_LOG_DIR:-/var/log/spark}
>>>> export HADOOP_HOME=${HADOOP_HOME:-/usr/lib/hadoop}
>>>> export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop/conf}
>>>> export HIVE_CONF_DIR=${HIVE_CONF_DIR:-/etc/hive/conf}
>>>> 
>>>> export STANDALONE_SPARK_MASTER_HOST=ip-10-12-17-235.eu-west-1.compute.internal
>>>> export SPARK_MASTER_PORT=7077
>>>> export SPARK_MASTER_IP=$STANDALONE_SPARK_MASTER_HOST
>>>> export SPARK_MASTER_WEBUI_PORT=8080
>>>> 
>>>> export SPARK_WORKER_DIR=${SPARK_WORKER_DIR:-/var/run/spark/work}
>>>> export SPARK_WORKER_PORT=7078
>>>> export SPARK_WORKER_WEBUI_PORT=8081
>>>> 
>>>> export HIVE_SERVER2_THRIFT_BIND_HOST=0.0.0.0
>>>> export HIVE_SERVER2_THRIFT_PORT=10001
>>>> 
>>>> export SPARK_DRIVER_MEMORY=15G
>>>> export SPARK_DAEMON_JAVA_OPTS="$SPARK_DAEMON_JAVA_OPTS -XX:OnOutOfMemoryError='kill -9 %p'"
>>>> 
>>>> Log:
>>>> 
>>>> Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 1 times, most recent failure: Lost task 0.0 in stage 12.0 (TID 24, localhost): java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>> [...]
>>>> 
>>>> Driver stacktrace:
>>>> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>> [...]
>>>> 
>>>> 
>>>> Thanks in advance.
>>>> 
>>>>> On 2/2/2016, at 7:48, Pat Ferrel <pat@occamsmachete.com> wrote:
>>>>> 
>>>>> You probably need to increase your driver memory, and 8g will not work. 16g is probably the smallest standalone machine that will work, since the driver and executors all run on it.
>>>>> 
>>>>>> On Feb 1, 2016, at 1:24 AM, jgali@konodrac.com wrote:
>>>>>> 
>>>>>> Hello everybody,
>>>>>> 
>>>>>> We are experiencing problems when we use the "mahout spark-rowsimilarity" operation. We have an input matrix with 100k rows and 100 items, and the process throws the exception "Exception in task 0.0 in stage 13.0 (TID 13) java.lang.OutOfMemoryError: Java heap space". We have tried increasing the Java heap memory, the Mahout heap memory, and spark.driver.memory.
>>>>>> 
>>>>>> Environment versions:
>>>>>> Mahout: 0.11.1
>>>>>> Spark: 1.6.0.
>>>>>> 
>>>>>> Mahout command line:
>>>>>>    /opt/mahout/bin/mahout spark-rowsimilarity -i 50k_rows__50items.dat -o test_output.tmp --maxObservations 500 --maxSimilaritiesPerRow 100 --omitStrength --master local --sparkExecutorMem 8g
>>>>>> 
>>>>>> This process is running on a machine with the following specifications:
>>>>>> RAM: 8 GB
>>>>>> CPU with 8 cores
>>>>>> 
>>>>>> .profile file:
>>>>>> export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
>>>>>> export HADOOP_HOME=/opt/hadoop-2.6.0
>>>>>> export SPARK_HOME=/opt/spark
>>>>>> export MAHOUT_HOME=/opt/mahout
>>>>>> export MAHOUT_HEAPSIZE=8192
>>>>>> 
>>>>>> Throws exception:
>>>>>> 
>>>>>> 16/01/22 11:45:06 ERROR Executor: Exception in task 0.0 in stage 13.0 (TID 13)
>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>     at org.apache.mahout.math.DenseMatrix.<init>(DenseMatrix.java:66)
>>>>>>     at org.apache.mahout.sparkbindings.drm.package$$anonfun$blockify$1.apply(package.scala:70)
>>>>>>     at org.apache.mahout.sparkbindings.drm.package$$anonfun$blockify$1.apply(package.scala:59)
>>>>>>     at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>>>>>>     at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>>>>>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>>>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>>>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>>>>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>>>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>>>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>>>>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>>>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>>>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>>>>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>>>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>>>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>>>>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>>>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>>>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>>>>>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>>>>>     at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>>>>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>> 16/01/22 11:45:06 WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@12498227,BlockManagerId(driver, localhost, 42107))] in 1 attempts
>>>>>> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
>>>>>>     at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
>>>>>>     at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
>>>>>>     at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>>>>>>     at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
>>>>>>     at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
>>>>>>     at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
>>>>>>     at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
>>>>>>     at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:448)
>>>>>>     at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:468)
>>>>>>     at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
>>>>>>     at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
>>>>>>     at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741)
>>>>>>     at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:468)
>>>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>>>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>>>>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>>>>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>> 16/01/22 11:45:06 WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@12498227,BlockManagerId(driver, localhost, 42107))] in 1 attempts
>>>>>> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
>>>>>>     at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
>>>>>>     at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
>>>>>>     at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>>>>>>     at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
>>>>>>     at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
>>>>>>     at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
>>>>>>     at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
>>>>>>     at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:448)
>>>>>>     at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:468)
>>>>>>     at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
>>>>>>     at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
>>>>>>     at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741)
>>>>>>     at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:468)
>>>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>>>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>>>>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>>>>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
>>>>>>     at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>>>>>>     at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>>>>>>     at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>>>>>>     at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>>>>>>     at scala.concurrent.Await$.result(package.scala:107)
>>>>>>     at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>>>>>>     ...
>>>>>> 
>>>>>> Can you please advise?
>>>>>> 
>>>>>> 
>>>>>> Thanks in advance.
>>>>>> Cheers.
>>>>> 
>>>> 
>>> 

