spark-user mailing list archives

From Yanbo Liang <yanboha...@gmail.com>
Subject Re: Spark saveAsText file size
Date Tue, 25 Nov 2014 03:38:10 GMT
Caching an RDD in memory can blow up its size.
It is common for an RDD to take more space in memory than on disk.
There are options to configure and optimize storage space efficiency in
Spark; take a look at https://spark.apache.org/docs/latest/tuning.html
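
For illustration, here is a minimal sketch of the serialized-caching option from that tuning guide. Persisting with MEMORY_ONLY_SER (ideally together with Kryo) stores each partition as a compact byte array instead of deserialized Java objects, which is the main reason an RDD takes more space in memory than on disk. The app name and input path are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object SerializedCacheSketch {
  def main(args: Array[String]): Unit = {
    // Kryo is the serializer the tuning guide recommends; it is
    // considerably more compact than default Java serialization.
    val conf = new SparkConf()
      .setAppName("serialized-cache-sketch")  // placeholder name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    // Placeholder input path.
    val lines = sc.textFile("hdfs:///tmp/input")

    // MEMORY_ONLY_SER keeps partitions serialized in memory: slower to
    // access, but much smaller than deserialized object caching.
    lines.persist(StorageLevel.MEMORY_ONLY_SER)
    println(lines.count())  // the first action materializes the cache

    sc.stop()
  }
}

MEMORY_ONLY_SER trades CPU time for space; if the serialized data still does not fit, MEMORY_AND_DISK_SER spills the remainder to disk.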


2014-11-25 10:38 GMT+08:00 Alan Prando <alan@scanboo.com.br>:

> Hi Folks!
>
> I'm running a Spark job on a cluster with 9 slaves and 1 master (250 GB
> of RAM, 32 cores, and 1 TB of storage each).
>
> This job generates 1.2 TB of data in an RDD with 1200 partitions.
> When I call saveAsTextFile("hdfs://..."), Spark creates 1200 files named
> "part-000*" in the HDFS folder. However, only a few of them have content
> (~450 files of ~2.3 GB each); all the others are empty (0 bytes).
>
> Is there any explanation for this file size (~2.3 GB)?
> Shouldn't Spark save 1200 files of ~1 GB each?
>
> Thanks in advance.
>
> ---
> Regards,
> Alan Vidotti Prando.
>
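
On the empty part files in the question above: saveAsTextFile writes exactly one part file per partition, so partitions that receive no records produce 0-byte files, while skewed partitions produce oversized ones. A minimal sketch of evening out the output with repartition(), again with placeholder paths:

import org.apache.spark.{SparkConf, SparkContext}

object EvenPartFilesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("even-part-files"))

    // Placeholder input path.
    val data = sc.textFile("hdfs:///tmp/input")

    // A full shuffle that spreads records roughly evenly across 1200
    // partitions, so each of the 1200 output part files gets a similar share.
    val even = data.repartition(1200)

    even.saveAsTextFile("hdfs:///tmp/output")  // placeholder output path
    sc.stop()
  }
}

Note that repartition() always shuffles; when only reducing the partition count, coalesce(n) can avoid the full shuffle.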
