spark-dev mailing list archives

From Karlson <ksonsp...@siberie.de>
Subject Storage of RDDs created via sc.parallelize
Date Fri, 20 Mar 2015 19:09:07 GMT

Hi all,

Where is the data that is passed to sc.parallelize stored? Or, put 
differently, where is the data for the base RDD fetched from when the 
DAG is executed, if the base RDD is constructed via sc.parallelize?

I am reading a CSV file via the Python csv module and am feeding the 
parsed data in chunks to sc.parallelize, because the whole file would 
not fit into memory on the driver. Reading the file with sc.textFile 
first is not an option, as there might be line breaks inside the CSV 
fields, preventing me from parsing the file line by line.
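
For illustration, the chunked approach described above might look roughly 
like the sketch below. This is not the original code, which was not 
posted: the names iter_csv_chunks and parallelize_chunks, the chunk_size 
parameter and the sample data are all hypothetical. Only csv.reader and 
sc.parallelize are real calls, and the Spark part is only defined here, 
not run.

```python
import csv
import io
from itertools import islice


def iter_csv_chunks(fileobj, chunk_size):
    """Yield lists of at most chunk_size parsed rows (hypothetical helper)."""
    # csv.reader consumes as many physical lines as a record needs, so a
    # quoted field containing a line break still comes back as one row.
    reader = csv.reader(fileobj)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def parallelize_chunks(sc, path, chunk_size=10000):
    """Yield one RDD per chunk; sc is assumed to be an existing SparkContext."""
    # newline="" is what the csv module documentation recommends for files.
    with open(path, newline="") as f:
        for chunk in iter_csv_chunks(f, chunk_size):
            yield sc.parallelize(chunk)


# Small in-memory check of the chunking step, no Spark required.
sample = io.StringIO('id,text\n1,"two\nlines"\n2,plain\n')
chunks = list(iter_csv_chunks(sample, chunk_size=2))
```

In this sketch only one chunk is held as a Python list at a time, but 
each call to sc.parallelize is still handed a complete local list on the 
driver.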

The problem I am facing right now is that, even though I am feeding only 
one chunk at a time to Spark, I eventually run out of memory on the 
driver.

Thanks in advance!
