spark-user mailing list archives

From Piotr Smoliński <piotr.smolinski...@gmail.com>
Subject Re: Writing Dataframe to CSV yields blank file called "_SUCCESS"
Date Mon, 26 Sep 2016 13:35:44 GMT
Ideally you should write to HDFS. If you are testing the product with no HDFS
available, just create a shared filesystem (Windows shares, NFS, etc.) and
write the data there.

You'll still end up with many files, but this time there will be only one
directory tree.

You can reduce the number of files by:
* combining partitions on the same executor with a coalesce call
* repartitioning the RDD (DataFrame or Dataset, depending on the API you use)

The latter is useful when you write the data to a partitioned
structure. Note that repartitioning
triggers an explicit shuffle.
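
A minimal sketch of the two options, assuming the PySpark DataFrame API. The
helper names, the output path and the partition column are hypothetical and
only serve as illustration; `df` is any existing DataFrame.

```python
def write_coalesced(df, path, num_files):
    """Merge existing partitions and write CSV.

    coalesce() can only reduce the partition count and avoids a full
    shuffle; each remaining partition becomes one part-* file under path.
    """
    df.coalesce(num_files).write.csv(path)


def write_partitioned(df, path, column):
    """Shuffle rows so equal values of `column` land in the same partition,
    then write one sub-directory per value (column=value/part-*)."""
    df.repartition(column).write.partitionBy(column).csv(path)
```

Neither call changes where the files land: with a local path and several
workers, each worker still writes its own part files, so the shared
filesystem advice above still applies.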

If you want only a single file, you need to repartition the whole RDD
to a single partition.
Depending on the size of the result data, that may or may not be something
you want to do ;-)
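
As a sketch of the single-file case, again assuming the PySpark DataFrame API
(the helper name and path are hypothetical):

```python
def write_single_file(df, path):
    """Move all rows into one partition, then write CSV.

    `path` is still a directory: it will hold a single part-* file next to
    the _SUCCESS marker. All data passes through one task, so this is only
    reasonable for small results.
    """
    df.repartition(1).write.csv(path)
```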

Regards,
Piotr



On Mon, Sep 26, 2016 at 2:30 PM, Peter Figliozzi <pete.figliozzi@gmail.com>
wrote:

> Thank you Piotr, that's what happened.  In fact, there are about 100 files
> on each worker node in a directory corresponding to the write.
>
> Any way to tone that down a bit (maybe 1 file per worker)?  Or, write a
> single file somewhere?
>
>
> On Mon, Sep 26, 2016 at 12:44 AM, Piotr Smoliński <
> piotr.smolinski.77@gmail.com> wrote:
>
>> Hi Peter,
>>
>> The blank _SUCCESS file indicates that the output operation finished
>> properly.
>>
>> What is the topology of your application?
>> I presume you write to the local filesystem and have more than one worker
>> machine.
>> In that case Spark writes the result files for each partition on the
>> worker that holds it, and completes the operation by writing _SUCCESS on
>> the driver node.
>>
>> Cheers,
>> Piotr
>>
>>
>> On Mon, Sep 26, 2016 at 4:56 AM, Peter Figliozzi <
>> pete.figliozzi@gmail.com> wrote:
>>
>>> Both
>>>
>>> df.write.csv("/path/to/foo")
>>>
>>> and
>>>
>>> df.write.format("com.databricks.spark.csv").save("/path/to/foo")
>>>
>>> result in a *blank* file called "_SUCCESS" under /path/to/foo.
>>>
>>> My df has stuff in it.. tried this with both my real df, and a quick df
>>> constructed from literals.
>>>
>>> Why isn't it writing anything?
>>>
>>> Thanks,
>>>
>>> Pete
>>>
>>
>>
>
