I may not be correct (in fact, I may be completely off), but here is my guess:

Assuming 8 bytes per double, 4,000 vectors of dimension 400 for each of 12,000 images would require 153.6 GB (12000 * 4000 * 400 * 8 bytes), which may explain the amount of data being written to disk. Without compression, the job would use roughly that much space. You can also cross-check the storage level of your RDDs; the default is MEMORY_ONLY. If Spark is additionally spilling data to disk, that would further increase the storage needed.
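As a rough sanity check, the arithmetic can be reproduced in plain Scala (no Spark needed; the counts below are the numbers from the thread, not measured values):

```scala
object FeatureSizeEstimate {
  def main(args: Array[String]): Unit = {
    val images          = 12000L // ~12k images, as stated in the question
    val vectorsPerImage = 4000L  // ~4000 feature vectors per image
    val dims            = 400L   // each vector has 400 components
    val bytesPerDouble  = 8L     // raw size of a double

    // Total uncompressed payload, ignoring any per-record overhead
    // (keys, sequence-file headers), which would only add to this.
    val totalBytes = images * vectorsPerImage * dims * bytesPerDouble
    println(s"Estimated uncompressed size: ${totalBytes / 1e9} GB")
  }
}
```

So ~153.6 GB is already in the same ballpark as the 180 GB observed, before counting keys and sequence-file framing.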

On Fri, Nov 28, 2014 at 10:43 PM, Jaonary Rabarisoa <jaonary@gmail.com> wrote:
Dear all,

I have a job that crashes before finishing with a "no space left on device" error, and I noticed that it generates a lot of temporary data on my disk.

To be precise, the job is a simple map job that takes a set of images, extracts local features, and saves these local features as a sequence file. My images are represented as key-value pairs where the key is a string representing the id of the image (the filename) and the value is the base64 encoding of the image.

To extract the features, I use an external C program that I call with RDD.pipe. I stream the base64 image to the C program, and it sends back the extracted feature vectors through stdout. Each line represents one feature vector from the current image. I don't use any serialization library; I just write the feature vector elements to stdout separated by spaces. Once in Spark, I split each line, create a Scala vector from the values, and save my sequence file.

The overall job looks like the following:

val images: RDD[(String, String)] = ...
val features: RDD[(String, Vector)] = images.pipe(...).map(_.split(" ")...)
features.saveAsSequenceFile(...)
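For reference, the per-line parsing step above can be sketched in plain Scala (the helper name `parseFeatureLine` is made up for illustration; the exact key handling around the pipe call is elided here, as in the original snippet):

```scala
// Minimal sketch of parsing one stdout line from the C program:
// each line is one feature vector, elements separated by spaces.
def parseFeatureLine(line: String): Array[Double] =
  line.trim.split("\\s+").map(_.toDouble)
```

In the job, this would appear as something like `images.pipe(cmd).map(parseFeatureLine)`. If disk space is the concern, note that `saveAsSequenceFile` also accepts an optional compression codec, e.g. `features.saveAsSequenceFile(path, Some(classOf[org.apache.hadoop.io.compress.GzipCodec]))`, which can substantially shrink the output of uncompressed doubles.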

The problem is that for about 3 GB of image data (about 12,000 images), this job generates more than 180 GB of temporary data. That seems strange, since for each image I have about 4,000 double feature vectors of dimension 400.

I run the job on my laptop for test purposes, which is why I can't add additional disk space. In any case, I need to understand why this simple job generates so much data and how I can reduce it.


Best,

Jao






--
Regards,
Vikas Agarwal
91 – 9928301411

InfoObjects, Inc. 
Execution Matters
http://www.infoobjects.com 
2041 Mission College Boulevard, #280 
Santa Clara, CA 95054
+1 (408) 988-2000 Work
+1 (408) 716-2726 Fax