spark-user mailing list archives

From Konstantinos Kougios <kostas.koug...@googlemail.com>
Subject Re: spark timesout maybe due to binaryFiles() with more than 1 million files in HDFS
Date Mon, 08 Jun 2015 15:11:32 GMT
It was giving the same error, which made me realize the problem is in the 
driver - but the driver running on Hadoop, not the local one. So I did

     --conf spark.driver.memory=8g

and now it is processing the files!
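
For reference, a full submit command along those lines looks roughly like 
the sketch below (the class, jar and input path are placeholders, not my 
actual job). In yarn-cluster mode the driver heap has to be set this way at 
submit time (either --conf spark.driver.memory=8g or the equivalent 
--driver-memory 8g flag), because it cannot be changed from inside the 
application once the driver JVM is already running:

     # driver heap bumped to 8g; class, jar and input path are placeholders
     spark-submit \
       --master yarn-cluster \
       --conf spark.driver.memory=8g \
       --class com.example.ProcessFiles \
       process-files.jar "hdfs:///path/to/files/*"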

Cheers
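
PS: as for which part actually runs out of memory - the traces below show 
the OOM happening while the file statuses are listed (DFSClient.listPaths 
via FileInputFormat.listStatus, called from BinaryFileRDD.getPartitions), 
and that listing runs in the driver JVM, which is why giving the namenode 
more heap made no difference. Ewan's spark-shell check can be repeated the 
same way with a bigger driver heap; a rough sketch, assuming a yarn-client 
shell so the flag applies to the shell's own JVM:

     # start the shell with a larger driver heap (in client mode the shell JVM is the driver)
     spark-shell --master yarn-client --driver-memory 8g

     // then, inside the shell: planning the job forces the 1M+ file statuses to be listed on the driver
     sc.binaryFiles("hdfs:///path/to/files/*").count()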


On 08/06/15 15:52, Ewan Leith wrote:
> Can you do a simple
>
> sc.binaryFiles("hdfs:///path/to/files/*").count()
>
> in the spark-shell and verify that part works?
>
> Ewan
>
>
>
> -----Original Message-----
> From: Konstantinos Kougios [mailto:kostas.kougios@googlemail.com]
> Sent: 08 June 2015 15:40
> To: Ewan Leith; user@spark.apache.org
> Subject: Re: spark timesout maybe due to binaryFiles() with more than 1 million files in HDFS
>
> No luck I am afraid. After giving the namenode 16GB of RAM, I am still getting an out-of-memory exception, though a slightly different one:
>
> 15/06/08 15:35:52 ERROR yarn.ApplicationMaster: User class threw exception: GC overhead limit exceeded
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>       at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1351)
>       at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1413)
>       at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1524)
>       at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1533)
>       at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:557)
>       at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:606)
>       at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>       at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>       at com.sun.proxy.$Proxy10.getListing(Unknown Source)
>       at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1969)
>       at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1952)
>       at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:724)
>       at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:105)
>       at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:755)
>       at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:751)
>       at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>       at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:751)
>       at org.apache.hadoop.fs.Globber.listStatus(Globber.java:69)
>       at org.apache.hadoop.fs.Globber.glob(Globber.java:217)
>       at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1644)
>       at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:292)
>       at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:264)
>       at org.apache.spark.input.StreamFileInputFormat.setMinPartitions(PortableDataStream.scala:47)
>       at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:43)
>       at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
>       at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
>       at scala.Option.getOrElse(Option.scala:120)
>       at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
>       at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
>       at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
>
>
> and on Spark's 2nd retry, a similar exception:
>
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>       at com.google.protobuf.LiteralByteString.toString(LiteralByteString.java:148)
>       at com.google.protobuf.ByteString.toStringUtf8(ByteString.java:572)
>       at org.apache.hadoop.hdfs.protocol.proto.HdfsProtos$HdfsFileStatusProto.getOwner(HdfsProtos.java:21558)
>       at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1413)
>       at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1524)
>       at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1533)
>       at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:557)
>       at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:606)
>       at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>       at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>       at com.sun.proxy.$Proxy10.getListing(Unknown Source)
>       at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1969)
>       at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1952)
>       at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:724)
>       at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:105)
>       at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:755)
>       at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:751)
>       at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>       at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:751)
>       at org.apache.hadoop.fs.Globber.listStatus(Globber.java:69)
>       at org.apache.hadoop.fs.Globber.glob(Globber.java:217)
>       at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1644)
>       at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:292)
>       at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:264)
>       at org.apache.spark.input.StreamFileInputFormat.setMinPartitions(PortableDataStream.scala:47)
>       at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:43)
>       at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
>       at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
>       at scala.Option.getOrElse(Option.scala:120)
>       at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
>
>
> Any ideas which part of Hadoop is running out of memory?
>

