spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Efficiently write a Dataframe to Text file(Spark Version 1.6.1)
Date Wed, 14 Sep 2016 14:16:36 GMT
One possible explanation: going through .rdd converts the data from Spark's internal format to
Java objects, which need much more memory and may therefore spill to disk. That conversion
alone takes a lot of time. The Java objects then have to be transferred over the network to a
single node (repartition ...). On a 1 Gbit network, 3 GB takes about 25 seconds under optimal
conditions (no other transfers at the same time, jumbo frames enabled, etc.), and because Java
objects are being transferred, the volume may well be larger than 3 GB. On the destination node
the data may spill to disk again. Finally, everything is written to a single disk (potentially
several if you have and use HDFS), which also takes time, even assuming no other process is
using that disk.
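
As a rough check of the 25-second figure, here is a minimal sketch of the arithmetic. The
function name and the efficiency parameter are illustrative assumptions, and it treats 3 GB as
decimal gigabytes; it gives a lower bound, not a measurement.

```python
def transfer_seconds(size_gb, link_gbit_per_s, efficiency=1.0):
    """Lower-bound time to move size_gb gigabytes over a network link.

    efficiency is an assumed fraction of the nominal link rate that is
    actually usable for payload (1.0 = ideal link, no protocol overhead).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    gigabits = size_gb * 8  # 1 byte = 8 bits
    return gigabits / (link_gbit_per_s * efficiency)


# 3 GB over an ideal 1 Gbit/s link
print(transfer_seconds(3, 1))  # 24.0
```

With an assumed 95% usable link rate the same call returns roughly 25 seconds, and any growth
in size from Java object serialization pushes it higher.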

Btw spark-csv can be used with different dataframes.
As said, other options are compression, avoiding the repartition (to avoid network transfer),
avoiding spills to disk (give the executors more memory in YARN, etc.), increasing network
bandwidth ...
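
A minimal sketch of the "avoid repartitioning" option, assuming PySpark 1.6 where
DataFrameWriter.text() is available: build one delimited string column and write it directly,
without going through .rdd or repartition(1). The function names, the "|" delimiter and the
choice to write nulls as empty fields are assumptions for illustration, and nothing here has
been timed against the numbers in this thread.

```python
def expected_line(values, sep="|"):
    """Pure-Python picture of one output line for already-stringified
    values: a null (None) becomes an empty field."""
    return sep.join("" if v is None else v for v in values)


def write_delimited(df, path, sep="|"):
    """Write df as delimited text, one part file per partition."""
    # Imported lazily so expected_line() can be used without Spark installed.
    from pyspark.sql import functions as F

    # concat_ws() skips nulls, which would shift later fields to the left,
    # so every column is cast to string and nulls are replaced by "".
    fields = [F.coalesce(df[c].cast("string"), F.lit("")) for c in df.columns]
    line = F.concat_ws(sep, *fields).alias("value")

    # write.text() needs exactly one string column; no repartition(1) here.
    df.select(line).write.text(path)


print(expected_line(["a", None, "c"]))  # a||c
```

This produces several part files instead of one. If the downstream application needs a single
file, the parts can be merged afterwards, for example with hadoop fs -getmerge.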

> On 14 Sep 2016, at 14:22, sanat kumar Patnaik <patnaik.sanat@gmail.com> wrote:
> 
> These are not CSV files; they are UTF-8 files with a specific delimiter.
> I tried this out with a 3 GB file:
> 
> myDF.write.json("output/myJson")
> Time taken: approximately 60 secs
> 
> myDF.rdd.repartition(1).saveAsTextFile("output/text")
> Time taken: 160 secs
> 
> That is my concern: writing the text file takes almost three times as long as writing JSON.
> 
>> On Wednesday, September 14, 2016, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>> What sort of files are these intermediate files? Are they CSV-type files?
>> 
>> I agree that a DF is more efficient than an RDD as it follows a tabular format (I assume
>> that is what you mean by "columnar" format). So if you read these files in a batch process,
>> you may not need to worry too much about execution time?
>> 
>> Saving as a text file is simply a one-to-one mapping from your DF to HDFS. I think it
>> is pretty efficient.
>> 
>> For myself, I would do something like below
>> 
>> myDF.rdd.repartition(1).cache.saveAsTextFile("mypath/output")
>> 
>> HTH
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage
>> or destruction of data or any other property which may arise from relying on this email's
>> technical content is explicitly disclaimed. The author will in no case be liable for any monetary
>> damages arising from such loss, damage or destruction.
>>  
>> 
>>> On 14 September 2016 at 12:46, sanat kumar Patnaik <patnaik.sanat@gmail.com> wrote:
>>> Hi All,
>>> 
>>> I am writing a batch application using Spark SQL and DataFrames. This application
>>> has a bunch of file joins, and there are intermediate points where I need to drop a file for
>>> downstream applications to consume.
>>> The problem is that all these downstream applications are still legacy systems, so they
>>> still require us to drop them a text file. As you all probably know, a DataFrame stores its
>>> data in a columnar format internally.
>>> The only way I have found to do this, and it looks awfully slow, is:
>>> 
>>> myDF=sc.textFile("inputpath").toDF()
>>> myDF.rdd.repartition(1).saveAsTextFile("mypath/output")
>>>  
>>> Is there any better way to do this?
>>> 
>>> P.S.: The other workaround would be to use RDDs for all my operations. But I am
>>> wary of using them, as the documentation says DataFrames are way faster because of the
>>> Catalyst engine running behind the scenes.
>>> 
>>> Please share any suggestions if you have tried something similar.
>> 
> 
> 
> -- 
> Regards,
> Sanat Patnaik
> Cell->804-882-6424
