spark-user mailing list archives

From Ilya Ganelin <ilgan...@gmail.com>
Subject Re: IOException and appcache FileNotFoundException in Spark 1.02
Date Wed, 15 Oct 2014 03:24:53 GMT
Hello all. Does anyone else have any suggestions? Even understanding where
this error comes from would help a lot.
On Oct 11, 2014 12:56 AM, "Ilya Ganelin" <ilganeli@gmail.com> wrote:

> Hi Akhil - I tried your suggestions and tried varying my partition sizes.
> Reducing the number of partitions led to memory errors (presumably - I saw
> IOExceptions much sooner).
>
> With the settings you provided the program ran for longer but ultimately
> crashed in the same way. I would like to understand what is going on
> internally that leads to this.
>
> Could this be related to garbage collection?
> On Oct 10, 2014 3:19 AM, "Akhil Das" <akhil@sigmoidanalytics.com> wrote:
>
>> You could be hitting this issue
>> <https://issues.apache.org/jira/browse/SPARK-3633> (or similar). You can
>> try the following workarounds:
>>
>> sc.set("spark.core.connection.ack.wait.timeout","600")
>> sc.set("spark.akka.frameSize","50")
>> Also reduce the number of partitions; you could be hitting the kernel's
>> ulimit. I faced this issue and it went away when I dropped the partitions
>> from 1600 to 200.
>>
>> Thanks
>> Best Regards
>>
>> On Fri, Oct 10, 2014 at 5:58 AM, Ilya Ganelin <ilganeli@gmail.com> wrote:
>>
>>> Hi all – I could use some help figuring out a couple of exceptions I’ve
>>> been getting regularly.
>>>
>>> I have been running on a fairly large dataset (150 gigs). With smaller
>>> datasets I don't have any issues.
>>>
>>> My sequence of operations is as follows – unless otherwise specified, I
>>> am not caching:
>>>
>>> Map a 30 million row x 70 col string table to approx 30 mil x 5 string
>>> (for the textFile read I am using 1500 partitions)
>>>
>>> From that, map to ((a,b), score) and reduceByKey, numPartitions = 180
>>>
>>> Then, extract distinct values for A and distinct values for B (I cache
>>> the output of distinct), numPartitions = 180
>>>
>>> Zip with index for A and for B (to remap strings to int)
>>>
>>> Join remapped ids with original table
>>>
>>> This is then fed into MLlib's ALS algorithm.
>>>
>>> I am running with:
>>>
>>> Spark version 1.02 with CDH5.1
>>>
>>> numExecutors = 8, numCores = 14
>>>
>>> Memory = 12g
>>>
>>> MemoryFraction = 0.7
>>>
>>> KryoSerialization
>>>
>>> My issue is that the code runs fine for a while but then will
>>> non-deterministically crash with either file IOExceptions or the following
>>> obscure error:
>>>
>>> 14/10/08 13:29:59 INFO TaskSetManager: Loss was due to
>>> java.io.IOException: Filesystem closed [duplicate 10]
>>>
>>> 14/10/08 13:30:08 WARN TaskSetManager: Loss was due to
>>> java.io.FileNotFoundException
>>>
>>> java.io.FileNotFoundException:
>>> /opt/cloudera/hadoop/1/yarn/nm/usercache/zjb238/appcache/application_1412717093951_0024/spark-local-20141008131827-c082/1c/shuffle_3_117_354
>>> (No such file or directory)
>>>
>>> Looking through the logs, I see the IOException in other places but it
>>> appears to be non-catastrophic. The FileNotFoundException, however, is. I
>>> have found the following Stack Overflow question that at least seems to
>>> address the IOException:
>>>
>>>
>>> http://stackoverflow.com/questions/24038908/spark-fails-on-big-shuffle-jobs-with-java-io-ioexception-filesystem-closed
>>>
>>> But I have not found anything useful at all with regard to the appcache
>>> error.
>>>
>>> Any help would be much appreciated.
>>>
>>
>>
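
On what the FileNotFoundException path might be saying: a minimal sketch, assuming the file name follows the pattern shuffle_<shuffleId>_<mapId>_<reduceId> and that the hash-based shuffle writes one file per (map task, reduce partition) pair. Both are assumptions about Spark 1.0.x internals, not something confirmed in this thread. The helper names are hypothetical, and this is plain Python, not Spark code.

```python
import re

# Assumed naming scheme: shuffle_<shuffleId>_<mapId>_<reduceId>
SHUFFLE_FILE = re.compile(r"shuffle_(\d+)_(\d+)_(\d+)$")


def parse_shuffle_file(path):
    """Return (shuffle_id, map_id, reduce_id) parsed from a shuffle file path."""
    match = SHUFFLE_FILE.search(path)
    if match is None:
        raise ValueError("not a shuffle file path: " + path)
    return tuple(int(group) for group in match.groups())


def hash_shuffle_file_count(map_tasks, reduce_partitions):
    """Files written by one shuffle if every map task writes one file per reducer."""
    return map_tasks * reduce_partitions


# Tail of the path reported in the FileNotFoundException.
ids = parse_shuffle_file("spark-local-20141008131827-c082/1c/shuffle_3_117_354")
print(ids)

# 1500 textFile partitions feeding a reduceByKey with numPartitions = 180.
print(hash_shuffle_file_count(1500, 180))
```

If the naming assumption holds, the missing file belongs to shuffle 3, map task 117, reduce partition 354. A reduce index of 354 would mean that shuffle has more than 180 partitions, so it is probably not the reduceByKey or distinct stage. The file count (270,000 for the first shuffle under these assumptions) is one way to see why reducing partitions can ease pressure on open-file limits.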

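
On the two suggested settings: as far as I know SparkContext has no set method, so these properties would normally go on a SparkConf before the context is created, or into conf/spark-defaults.conf. The sketch below only renders the two properties from the thread as whitespace-separated "key value" lines, which is the layout I understand spark-defaults.conf to use. The function name is hypothetical and nothing here touches a real Spark installation.

```python
# Property names and values are the ones suggested in the thread.
suggested = {
    "spark.core.connection.ack.wait.timeout": "600",
    "spark.akka.frameSize": "50",
}


def to_spark_defaults(settings):
    """Render settings as 'key value' lines, one property per line, sorted by key."""
    return "\n".join(f"{key} {value}" for key, value in sorted(settings.items()))


rendered = to_spark_defaults(suggested)
print(rendered)
```

Keeping the values in one place like this makes it easier to vary them between runs and to record which combination was in effect when a crash happened.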