spark-user mailing list archives

From Jason Nerothin <jasonnerot...@gmail.com>
Subject Re: Spark2: Deciphering saving text file name
Date Tue, 09 Apr 2019 17:05:38 GMT
Hi Subash,

Short answer: It’s effectively random.
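
To make "effectively random" a bit more concrete, here is a minimal sketch that splits a part-file name into its pieces. The layout it assumes (partition number, a UUID, a `c`-prefixed counter, then the extension) is inferred only from the example name in this thread, not from Spark documentation, and `PartFileName` is a hypothetical helper rather than anything Spark provides. The middle field does have the 8-4-4-4-12 shape of a UUID, and the leading 4 in its third group is what a randomly generated (version 4) UUID carries.

```scala
// Hypothetical helper: splits a Spark part-file name into its fields.
// Assumed layout, taken from the example name in this thread:
//   part-<partition>-<uuid>-c<counter><extension>
final case class PartFileName(
    partition: Int,    // numeric suffix of "part-"
    writeId: String,   // the UUID-shaped field (assumed to identify one write)
    counter: Int,      // digits after "-c" (assumed to be a per-task file counter)
    extension: String  // everything from the first dot on, e.g. ".txt.gz"
)

object PartFileName {
  // A Scala Regex used as an extractor must match the whole string.
  private val Pattern =
    """part-(\d+)-([0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})-c(\d+)(\..+)?""".r

  def parse(name: String): Option[PartFileName] = name match {
    case Pattern(p, id, c, ext) =>
      // The optional extension group is null when the name has no extension.
      Some(PartFileName(p.toInt, id, c.toInt, Option(ext).getOrElse("")))
    case _ => None
  }
}
```

Calling `PartFileName.parse` on the name from the original question yields partition 1, the UUID-shaped field, counter 0 and extension `.txt.gz`; names that do not fit the assumed layout, such as `_SUCCESS`, yield `None`.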

Longer answer: In general, the DataFrameWriter expects to receive data
from multiple partitions. Let’s say you were writing ORC instead of text.

In this case, even when you specify the output path, the writer creates a
directory at the specified path and saves one of those funny-named files
per partition.
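
The directory-per-write behaviour described above can be observed with a small local run. This is only a sketch under stated assumptions: the output path `/tmp/part-files-sketch`, the `local[2]` master, the sample data and the object name are all made up for illustration, and it needs Spark 2.x on the classpath.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{SaveMode, SparkSession}

// Hypothetical demo object; not part of the original thread.
object PartFilesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("part-files-sketch")
      .master("local[2]") // assumption: local run, no cluster manager
      .getOrCreate()
    import spark.implicits._

    val outDir = "/tmp/part-files-sketch" // hypothetical output path

    // Two partitions, so the writer has more than one task's output to save.
    val ds = Seq("a", "b", "c", "d").toDS().repartition(2)

    // `outDir` becomes a directory; the part files are created inside it.
    ds.write
      .mode(SaveMode.Overwrite)
      .option("compression", "gzip")
      .text(outDir)

    // List what the writer produced, skipping _SUCCESS and checksum files.
    val dir = new Path(outDir)
    val fs = dir.getFileSystem(spark.sparkContext.hadoopConfiguration)
    fs.listStatus(dir)
      .map(_.getPath.getName)
      .filter(_.startsWith("part-"))
      .sorted
      .foreach(println)

    spark.stop()
  }
}
```

The names printed will differ from run to run in their middle field, which is the point of the short answer above.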

Even longer: Assume you are running on top of YARN (or Mesos or K8S...). In
this case, the resource manager is responsible for provisioning disk on
request, and it is the programmer’s responsibility to implement the
upstream business logic.

The implication is that it’s probably not a good idea to violate that
responsibility boundary: if you do, you are likely to break some implicit
assumptions that the YARN designers rely upon. For example (just making
this up): YARN might calculate available disk after each write action
completes.

HTH,
Jason



On Mon, Apr 8, 2019 at 19:55 Subash Prabakar <subashprabakar@gmail.com>
wrote:

> Hi,
> While saving in Spark2 as a text file, I see an encoded/hash value attached
> to the part files along with the part number. I am curious to know what
> that value is about.
>
> Example:
> ds.write.mode(SaveMode.Overwrite).option("compression","gzip").text(path)
>
> Produces,
> part-00001-1e4c5369-6694-4012-894a-73b971fe1ab1-c000.txt.gz
>
>
> 1e4c5369-6694-4012-894a-73b971fe1ab1-c000 => what is this value ?
>
> Are there any options available to remove this part, or is it attached for
> some reason?
>
> Thanks,
> Subash
>
