spark-user mailing list archives

From Paweł Szulc <paul.sz...@gmail.com>
Subject Re: Can spark job have sideeffects (write files to FileSystem)
Date Tue, 16 Dec 2014 00:37:00 GMT
Yes, this is what I also found in the Spark documentation: foreach can
have side effects. Nevertheless, I have this weird problem that sometimes
the files are just empty.

"using" is simply a wrapper that takes our code, makes try-catch-finally
and flush & close all resources.
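Roughly this shape (simplified sketch, not our exact code):

  import java.io.{Closeable, Flushable}

  def using[A <: Closeable, B](resource: A)(block: A => B): B = {
    try {
      block(resource)
    } catch {
      case e: Exception =>
        // error handling simplified in this sketch
        throw e
    } finally {
      // flush (where applicable) and close the resource
      resource match {
        case f: Flushable => f.flush()
        case _            =>
      }
      resource.close()
    }
  }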

I honestly have no clue what could possibly be wrong.

No errors in logs.

On Thu, Dec 11, 2014 at 2:29 PM, Daniel Darabos <daniel.darabos@lynxanalytics.com> wrote:
>
> Yes, this is perfectly "legal". This is what RDD.foreach() is for! You may
> be encountering an IO exception while writing, and maybe using() suppresses
> it. (?) I'd try writing the files with java.nio.file.Files.write() -- I'd
> expect there is less that can go wrong with that simple call.
>
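> Something along these lines (just an untested sketch):
>
>   import java.net.URI
>   import java.nio.charset.StandardCharsets
>   import java.nio.file.{Files, Paths}
>
>   // one call opens, writes and closes the file
>   def writeToFile(line: String) = {
>     val path = "file://..."  // same path logic as in your example
>     Files.write(Paths.get(new URI(path).getPath), line.getBytes(StandardCharsets.UTF_8))
>   }
>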
> On Thu, Dec 11, 2014 at 12:50 PM, Paweł Szulc <paul.szulc@gmail.com>
> wrote:
>
>> Imagine a simple Spark job that stores each line of the RDD in a separate file:
>>
>>
>> import java.io.{File, FileOutputStream}
>> import java.net.URI
>>
>> val lines = sc.parallelize(1 to 100).map(n => s"this is line $n")
>> lines.foreach(line => writeToFile(line))
>>
>> def writeToFile(line: String) = {
>>     val path = "file://..."
>>     val file = new File(new URI(path).getPath)
>>     // the using function simply flushes and closes the output stream
>>     using(new FileOutputStream(file)) { output =>
>>       output.write(line.getBytes)
>>     }
>> }
>>
>>
>> Now, the example above works 99.9% of the time: files are generated for each
>> line, and each file contains that particular line.
>>
>> However, when dealing with a large amount of data, we encounter situations
>> where some of the files are empty! The files are generated, but there is no
>> content inside them (0 bytes).
>>
>> Now the question is: can a Spark job have side effects? Is it even legal to
>> write such code?
>> If not, then what other choice do we have when we want to save data from
>> our RDD?
>> If yes, then do you guys see what could be the reason for this job acting
>> in this strange manner 0.1% of the time?
>>
>>
>> Disclaimer: we are fully aware of the .saveAsTextFile method in the API;
>> however, the example above is a simplification of our code - normally we
>> produce PDF files.
>>
>>
>> Best regards,
>> Paweł Szulc
>>
>
