spark-dev mailing list archives

From Gil Vernik <G...@il.ibm.com>
Subject saveAsTextFile and tmp files generations in tasks
Date Wed, 15 Apr 2015 05:33:16 GMT
Hi,

I ran a very simple operation via ./spark-shell (version 1.3.0):

val data = Array(1, 2, 3, 4)
val distd = sc.parallelize(data)
distd.saveAsTextFile(.. )

When I executed it, I saw that 4 tasks were created in Spark. Each task
created 2 temp files at different stages: a 1st tmp file (with some long
name) that at some point was renamed to a 2nd tmp file with another name.
On task completion the 2nd tmp file was renamed to a PART-XXXX file. So in
total, for 4 tasks I had about 8 tmp files.
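
To make the observation concrete, here is a minimal sketch of the pattern
I think I am seeing. It uses plain java.nio rather than any Spark or Hadoop
code, and the helper name and the file name prefixes are hypothetical: they
only mimic the write, rename, rename sequence and are not the names Spark
actually uses.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, StandardCopyOption}

// Hypothetical helper: mimics one task writing its output through two
// temporary files before the final part file appears.
def writeWithTwoRenames(outDir: Path, partition: Int, lines: Seq[String]): Path = {
  Files.createDirectories(outDir)
  // Assumed names, chosen only for illustration.
  val firstTmp  = outDir.resolve(f"attempt-tmp-$partition%05d")
  val secondTmp = outDir.resolve(f"task-tmp-$partition%05d")
  val finalFile = outDir.resolve(f"part-$partition%05d")

  // Step 1: the task writes its records to the first temporary file.
  Files.write(firstTmp, lines.mkString("\n").getBytes(StandardCharsets.UTF_8))
  // Step 2: at some point the first temporary file is renamed to a second one.
  Files.move(firstTmp, secondTmp, StandardCopyOption.REPLACE_EXISTING)
  // Step 3: on task completion the second temporary file becomes the part file.
  Files.move(secondTmp, finalFile, StandardCopyOption.REPLACE_EXISTING)
  finalFile
}

// One "task" per element, matching the 4 tasks seen in the shell.
val outDir = Files.createTempDirectory("save-as-text-sketch")
val parts = Seq(1, 2, 3, 4).zipWithIndex.map { case (value, idx) =>
  writeWithTwoRenames(outDir, idx, Seq(value.toString))
}
```

With 4 elements this goes through 8 temporary names in total and leaves
only the 4 part files behind, which matches the count I observed.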

I have some questions about how those tmp files are generated.
What logic and algorithm do tasks use to generate those tmp files? Can
someone explain it to me? Why were there 2 tmp files (one after another)
and not a single tmp file?
Is this something configurable in Spark? I mean, can I run saveAsTextFile
so that tasks run without creating tmp files? Can this tmp data be
created in memory instead?

And the last one: where is the code that is responsible for this?

Thanks a lot,
Gil Vernik.
