spark-user mailing list archives

From Davies Liu <dav...@databricks.com>
Subject Re: Can spark job have sideeffects (write files to FileSystem)
Date Tue, 16 Dec 2014 00:47:48 GMT
Keep in mind that any task may be launched concurrently on
different nodes, so to make sure the generated files are valid you
need an atomic operation (such as rename). For example, you could
generate a random name for the output file, write the data into it,
and finally rename it to the target name. This is what happens in
saveAsTextFile().
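
A minimal sketch of that write-then-rename pattern, assuming the output goes
to a local (or locally mounted) file system reachable through java.nio. The
names AtomicWrite, writeAtomically and writeLineAtomically are hypothetical
and not part of Spark; for HDFS the rename would have to go through the Hadoop
FileSystem API instead, which is not shown here.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, StandardCopyOption}
import java.util.UUID

// Hypothetical helper, not a Spark API.
object AtomicWrite {
  // Writes `bytes` to a uniquely named temporary file next to `target`,
  // then renames it onto `target`. Readers never observe a half-written
  // target: they see either no file or the complete one.
  def writeAtomically(target: Path, bytes: Array[Byte]): Unit = {
    val dir = target.toAbsolutePath.getParent
    Files.createDirectories(dir)
    // Random name, so concurrent attempts of the same task do not collide.
    val tmp = dir.resolve(s".${target.getFileName}.${UUID.randomUUID()}.tmp")
    try {
      Files.write(tmp, bytes) // throws IOException on failure
      // Whether an existing target is replaced or the move fails is
      // implementation specific when ATOMIC_MOVE is requested.
      Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE)
    } finally {
      // No-op after a successful move; cleans up after a failed write.
      Files.deleteIfExists(tmp)
    }
  }

  // Convenience wrapper for text content, encoded as UTF-8.
  def writeLineAtomically(target: Path, line: String): Unit =
    writeAtomically(target, line.getBytes(StandardCharsets.UTF_8))
}
```

Inside lines.foreach, each record would call
AtomicWrite.writeLineAtomically with its own target path. Because the
temporary file sits in the same directory as the target, the rename stays
within one file system, which is what makes an atomic move possible.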

On Mon, Dec 15, 2014 at 4:37 PM, Paweł Szulc <paul.szulc@gmail.com> wrote:
> Yes, this is what I also found in the Spark documentation: foreach can have
> side effects. Nevertheless, I see this weird error where files are sometimes
> just empty.
>
> "using" is simply a wrapper that takes our code, wraps it in
> try-catch-finally, and flushes and closes all resources.
>
> I honestly have no clue what can possibly be wrong.
>
> No errors in logs.
>
> On Thu, Dec 11, 2014 at 2:29 PM, Daniel Darabos
> <daniel.darabos@lynxanalytics.com> wrote:
>>
>> Yes, this is perfectly "legal". This is what RDD.foreach() is for! You may
>> be encountering an IO exception while writing, and maybe using() suppresses
>> it. (?) I'd try writing the files with java.nio.file.Files.write() -- I'd
>> expect there is less that can go wrong with that simple call.
>>
>> On Thu, Dec 11, 2014 at 12:50 PM, Paweł Szulc <paul.szulc@gmail.com>
>> wrote:
>>>
>>> Imagine a simple Spark job that stores each line of the RDD in a
>>> separate file:
>>>
>>>
>>> import java.io.{File, FileOutputStream}
>>> import java.net.URI
>>>
>>> val lines = sc.parallelize(1 to 100).map(n => s"this is line $n")
>>> lines.foreach(line => writeToFile(line))
>>>
>>> def writeToFile(line: String) = {
>>>     val filePath = "file://..."
>>>     val file = new File(new URI(filePath).getPath)
>>>     // using function simply closes the output stream
>>>     using(new FileOutputStream(file)) { output =>
>>>       // write the line's bytes, not an undefined value
>>>       output.write(line.getBytes("UTF-8"))
>>>     }
>>> }
>>>
>>>
>>> Now, the example above works 99.9% of the time. A file is generated for each
>>> line, and each file contains that particular line.
>>>
>>> However, when dealing with large amounts of data, we encounter situations
>>> where some of the files are empty! The files are generated, but there is no
>>> content inside them (0 bytes).
>>>
>>> Now the question is: can a Spark job have side effects? Is it even legal to
>>> write such code?
>>> If not, then what other choice do we have when we want to save data from
>>> our RDD?
>>> If yes, then do you guys see what could be the reason for this job acting
>>> in this strange manner 0.1% of the time?
>>>
>>>
>>> Disclaimer: we are fully aware of the .saveAsTextFile method in the API.
>>> However, the example above is a simplification of our code; normally we
>>> produce PDF files.
>>>
>>>
>>> Best regards,
>>> Paweł Szulc
>>>
>>>
>>>
>>>
>>>
>>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org