spark-user mailing list archives

From Sathish Kumaran Vairavelu <vsathishkuma...@gmail.com>
Subject Re: spark.write.csv is not able write files to specified path, but is writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes
Date Fri, 11 Aug 2017 18:46:38 GMT
I think you can collect the results on the driver through the RDD's
toLocalIterator method and save them from the driver program, rather than
writing files to the local disk of each node and collecting them separately.
If your data is small enough and you have enough cores/memory, try
processing everything in local mode and writing the results locally.
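
A minimal sketch of the toLocalIterator approach, assuming PySpark. The helper name write_rows_locally and the /tmp/result.csv path are hypothetical; the function itself only needs an iterable of rows, so it can be tried without a cluster. In PySpark, DataFrame.toLocalIterator() returns such an iterator and fetches one partition at a time.

```python
import csv
from typing import Iterable, Sequence


def write_rows_locally(rows: Iterable[Sequence], path: str, header: Sequence[str]) -> int:
    """Stream rows into a single CSV file on the local (driver) disk.

    Returns the number of data rows written (the header is not counted).
    """
    count = 0
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for row in rows:  # consumes the iterator one row at a time
            writer.writerow(row)
            count += 1
    return count


# Hypothetical use on the driver (needs a running Spark session and a DataFrame my_df):
#   n = write_rows_locally(my_df.toLocalIterator(), "/tmp/result.csv", my_df.columns)
# The driver then holds at most one partition in memory at a time, not the whole dataset.
```

Unlike collect(), this never materialises the full result on the driver, which is why it may still work when the data is too large to collect in one go. It is slower than a parallel write, since the partitions are pulled to the driver one after another.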

-Sathish

On Fri, Aug 11, 2017 at 1:17 PM Steve Loughran <stevel@hortonworks.com>
wrote:

> On 10 Aug 2017, at 09:51, Hemanth Gudela <hemanth.gudela@qvantel.com>
> wrote:
>
> Yeah, installing HDFS in our environment is unfortunately going to take
> a lot of time (approvals/planning etc). I will have to live with local FS for
> now.
> The other option I had already tried is collect() and send everything to
> the driver node. But my data volume is too large for the driver node to
> handle alone.
>
>
> NFS cross mount.
>
>
> I’m now trying to split the data into multiple datasets, then collect
> each dataset individually and write it to local FS on the driver node (this
> approach slows down the spark job, but I hope it works).
>
>
>
> I doubt it. The job driver is in charge of committing work by renaming data
> under _temporary into the right place. Every operation which calls write()
> to save to an FS must have the same paths visible to all nodes in the spark
> cluster.
>
> A cluster-wide filesystem of some form is mandatory, or you abandon
> write() and implement your own operations to save (partitioned) data.
>
>
> Thank you,
> Hemanth
>
> *From: *Femi Anthony <femibyte@gmail.com>
> *Date: *Thursday, 10 August 2017 at 11.24
> *To: *Hemanth Gudela <hemanth.gudela@qvantel.com>
> *Cc: *"user@spark.apache.org" <user@spark.apache.org>
> *Subject: *Re: spark.write.csv is not able write files to specified path,
> but is writing to unintended subfolder _temporary/0/task_xxx folder on
> worker nodes
>
> Also, why are you trying to write results locally if you're not using a
> distributed file system? Spark is geared towards writing to a distributed
> file system. I would suggest trying collect() so the data is sent to the
> master, and then doing a write if the result set isn't too big, or
> repartitioning before trying to write (though I suspect this won't really
> help). You really should install HDFS if that is possible.
>
>
>
> On Aug 10, 2017, at 3:58 AM, Hemanth Gudela <hemanth.gudela@qvantel.com>
> wrote:
>
> Thanks for the reply, Femi!
>
> I’m writing the file like this:
> myDataFrame.write.mode("overwrite").csv("myFilePath")
> There are absolutely no errors/warnings after the write.
>
> The _SUCCESS file is created on the master node, but the _temporary problem
> is seen only on the worker nodes.
>
> I know spark.write.csv works best with HDFS, but with the current setup I
> have in my environment, I have to make Spark write to each node’s local
> file system and not to HDFS.
>
> Regards,
> Hemanth
>
> *From: *Femi Anthony <femibyte@gmail.com>
> *Date: *Thursday, 10 August 2017 at 10.38
> *To: *Hemanth Gudela <hemanth.gudela@qvantel.com>
> *Cc: *"user@spark.apache.org" <user@spark.apache.org>
> *Subject: *Re: spark.write.csv is not able write files to specified path,
> but is writing to unintended subfolder _temporary/0/task_xxx folder on
> worker nodes
>
> Normally the *_temporary* directory gets deleted as part of the cleanup
> when the write is complete and a _SUCCESS file is created. I suspect that
> the writes are not properly completed. How are you specifying the write?
> Any error messages in the logs?
>
> On Thu, Aug 10, 2017 at 3:17 AM, Hemanth Gudela <
> hemanth.gudela@qvantel.com> wrote:
>
> Hi,
>
> I’m running Spark in cluster mode on 4 nodes, and trying to write
> CSV files to each node’s local path (*not HDFS*).
> I’m using spark.write.csv to write the CSV files.
>
> *On master node*:
> spark.write.csv creates a folder with the csv file name and writes many files
> named part-r-000n. This is okay for me, I can merge them later.
> *But on worker nodes*:
> spark.write.csv creates a folder with the csv file name and writes many
> folders and files under _temporary/0/. This is not okay for me.
> Could someone please suggest what could be wrong in my settings, or how to
> write the csv files to the specified folder and not to subfolders
> (_temporary/0/task_xxx) on the worker machines?
>
> Thank you,
> Hemanth
>
>
>
>
>
> --
> http://www.femibyte.com/twiki5/bin/view/Tech/
> http://www.nextmatrix.com
> "Great spirits have always encountered violent opposition from mediocre
> minds." - Albert Einstein.
>
>
