spark-dev mailing list archives

From Gil Vernik <G...@il.ibm.com>
Subject Re: saveAsTextFile and tmp files generations in tasks
Date Thu, 16 Apr 2015 03:59:30 GMT
Thanks a lot for the info.
Does this explain the generation of 2 temp files per task (one temp file
that is renamed to another)?
I understand why there is one temp file per task, but I am still not sure
why there were 2 per task.

Thanks
Gil.





From:   Imran Rashid <irashid@cloudera.com>
To:     Gil Vernik/Haifa/IBM@IBMIL
Cc:     dev <dev@spark.apache.org>
Date:   15/04/2015 06:20 PM
Subject:        Re: saveAsTextFile and tmp files generations in tasks



The temp file creation is controlled by a Hadoop OutputCommitter, which is
FileOutputCommitter by default.  It's used in SparkHadoopWriter
(which in turn is used by PairRDDFunctions.saveAsHadoopDataset).
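
To make the staging idea concrete, here is a minimal sketch of a committer-style, rename-based write path using only java.nio. It is not Hadoop's or Spark's code. It assumes, based on how Hadoop 2.x's FileOutputCommitter is generally understood to stage output (a task-attempt location, then a committed-task location, then the final directory), that each task's file is renamed twice; the thread itself does not confirm that this is what produced the two temp files. The object name, directory names, and method names are all hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

// Illustrative model only: the layout mimics, but is not, Hadoop's real layout.
object TwoStepCommitSketch {

  // Step 1: a task attempt writes its records under an attempt-specific temp dir.
  def writeAttempt(out: Path, attemptId: String, part: String, lines: Seq[String]): Path = {
    val dir = out.resolve("_temporary").resolve("attempts").resolve(attemptId)
    Files.createDirectories(dir)
    Files.write(dir.resolve(part), lines.mkString("\n").getBytes(StandardCharsets.UTF_8))
  }

  // Step 2 (task commit): rename the attempt file to a per-task location
  // that is still temporary. This is the "second temp file".
  def commitTask(out: Path, attemptFile: Path, taskId: String): Path = {
    val dir = out.resolve("_temporary").resolve("committed").resolve(taskId)
    Files.createDirectories(dir)
    Files.move(attemptFile, dir.resolve(attemptFile.getFileName))
  }

  // Step 3 (job commit): move the committed file into the final output dir.
  def commitJob(out: Path, committedFile: Path): Path =
    Files.move(committedFile, out.resolve(committedFile.getFileName))
}
```

Under this model a failed or duplicate attempt never leaves a partial part file in the final directory, which is the usual motivation for staging output through temp paths.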

You could change the output committer to one that does not use tmp files
(e.g. use this one from Aaron Davidson:
https://gist.github.com/aarondav/c513916e72101bbe14ec).
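
As a rough illustration of how a different committer might be wired in from spark-shell: the old mapred API reads the committer class from the mapred.output.committer.class property of the job configuration, and saveAsTextFile builds its JobConf from sc.hadoopConfiguration. The class name com.example.DirectOutputCommitter and the output path below are hypothetical; the class stands in for a committer like the one in the gist, compiled and available on both the driver and executor classpaths.

```scala
// Assumes a running spark-shell session, where `sc` is the SparkContext.
// com.example.DirectOutputCommitter is a hypothetical class name; substitute
// whatever committer class you have actually compiled and deployed.
sc.hadoopConfiguration.set(
  "mapred.output.committer.class",
  "com.example.DirectOutputCommitter")

val distd = sc.parallelize(Array(1, 2, 3, 4))
distd.saveAsTextFile("/tmp/direct-output-example") // hypothetical output path
```

Note that writing directly to the final location gives up the protection that temp files provide: a failed or speculative task attempt can leave partial or duplicate output behind.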


On Wed, Apr 15, 2015 at 12:33 AM, Gil Vernik <GILV@il.ibm.com> wrote:

> Hi,
>
> I ran a very simple operation via ./spark-shell (version 1.3.0):
>
> val data = Array(1, 2, 3, 4)
> val distd = sc.parallelize(data)
> distd.saveAsTextFile(.. )
>
> When I executed it, I saw that 4 tasks were created in Spark.  Each task
> created 2 temp files at different stages: a 1st tmp file (with some long
> name) that at some point was renamed to a 2nd tmp file with another name.
> On task completion, the 2nd tmp file was renamed to a PART-XXXX file.  So
> in total, for 4 tasks I had about 8 tmp files.
>
> I have some questions about the generation of those tmp files.
> What is the logic and algorithm that tasks use to generate those tmp
> files?  Can someone explain it to me?  Why were there 2 tmp files (one
> after another) and not a single tmp file?
> Is this something configurable in Spark? I mean, can I run saveAsTextFile
> so that tasks run without creating tmp files? Can this tmp data be
> created in memory?
>
> And the last one: where is the code that is responsible for this?
>
> Thanks a lot,
> Gil Vernik.
>

