spark-user mailing list archives

From Mohammed Guller <moham...@glassbeam.com>
Subject RE: laziness in textFile reading from HDFS?
Date Wed, 30 Sep 2015 02:06:11 GMT
1) You do not need as much memory as the data; Spark processes one partition per task and can spill to disk.
2) By default, the number of partitions equals the number of HDFS blocks.
3) Yes, the read operation is lazy.
4) It is fine to have more partitions than cores; the extra tasks are queued and run as cores free up.
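
To make point 3 concrete, here is a minimal sketch in plain Python (not Spark code) of the lazy model: partitions behave like generators, so nothing is read until a downstream action pulls data, and only the partitions being pulled are materialized. The function and variable names are illustrative, not Spark APIs.

```python
# Conceptual sketch of lazy partition reads (plain Python, not Spark).
def lazy_partitions(num_blocks, read_block):
    """Yield one 'partition' per HDFS block, reading each on demand."""
    for block_id in range(num_blocks):
        yield read_block(block_id)  # the read happens here, not earlier

reads = []                          # records which blocks were actually read

def fake_read(block_id):
    reads.append(block_id)
    return ["record-%d" % block_id]

parts = lazy_partitions(3, fake_read)
assert reads == []                  # defining the pipeline reads nothing
first = next(parts)                 # pulling one partition reads one block
assert reads == [0]
```

Spark's textFile works analogously: building the RDD records the block layout, but bytes are read only when an action runs, task by task.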

Mohammed

-----Original Message-----
From: davidkl [mailto:davidklmlg@hotmail.com] 
Sent: Monday, September 28, 2015 1:40 AM
To: user@spark.apache.org
Subject: laziness in textFile reading from HDFS?

Hello,

I need to process a significant amount of data every day, about 4TB. This will be processed
in batches of about 140GB. The cluster this will be running on doesn't have enough memory
to hold the dataset at once, so I am trying to understand how this works internally.

When using textFile to read an HDFS folder (containing multiple files), I understand that
the number of partitions created is equal to the number of HDFS blocks, correct? Are those
created lazily? That is, if the number of blocks/partitions is larger than the number of
cores/threads the Spark driver was launched with (N), are N partitions created initially
and the rest only when required? Or are all the partitions created up front?

I want to avoid reading the whole dataset into memory just to spill it to disk when there is
not enough memory.
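
For a rough sense of scale, assuming the HDFS default block size of 128 MB (an assumption; check dfs.blocksize on your cluster), a 140 GB batch maps to roughly a thousand partitions:

```python
# Back-of-the-envelope partition count for one 140 GB batch, assuming
# a 128 MB HDFS block size (may differ on your cluster).
batch_bytes = 140 * 1024**3                      # 140 GB
block_bytes = 128 * 1024**2                      # 128 MB
num_partitions = -(-batch_bytes // block_bytes)  # ceiling division
print(num_partitions)                            # 1120 blocks/partitions
```

With one task per partition and only the running tasks' partitions in memory at once, the working set stays far below 140 GB.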

Thanks! 



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/laziness-in-textFile-reading-from-HDFS-tp24837.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


