spark-dev mailing list archives

From Akhil Das <>
Subject Re: Storage of RDDs created via sc.parallelize
Date Mon, 23 Mar 2015 06:52:38 GMT
You can use sc.newAPIHadoopFile
with CSVInputFormat <> so that the
CSV file is read properly, with records split on record
boundaries rather than on every raw newline.
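For context, the core issue is that a CSV field may contain embedded linebreaks, so splitting on newlines produces broken records, while a real CSV parser does not. A minimal standard-library check (no Spark needed; the sample string is made up for illustration):

```python
import csv
import io

# Hypothetical sample: one logical record whose second field contains
# an embedded linebreak, which would break naive line-by-line parsing.
raw = 'id,comment\n1,"first line\nsecond line"\n'

# csv.reader yields 2 logical records, even though raw has 3 newlines.
rows = list(csv.reader(io.StringIO(raw)))
print(rows)
# → [['id', 'comment'], ['1', 'first line\nsecond line']]
```

This is why a record-aware input format (rather than sc.textFile's line-by-line splitting) is needed on the Spark side.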

Best Regards

On Sat, Mar 21, 2015 at 12:39 AM, Karlson <> wrote:

> Hi all,
> where is the data stored that is passed to sc.parallelize? Or put
> differently, where is the data for the base RDD fetched from when the DAG
> is executed, if the base RDD is constructed via sc.parallelize?
> I am reading a CSV file via the Python csv module and feeding the
> parsed data chunk-wise to sc.parallelize, because the whole file would not
> fit into memory on the driver. Reading the file with sc.textFile first is
> not an option, as there might be linebreaks inside the CSV fields, which
> would prevent me from parsing the file line by line.
> The problem I am facing right now is that even though I am feeding only
> one chunk at a time to Spark, I will eventually run out of memory on the
> driver.
> Thanks in advance!
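The chunk-wise feeding described in the question can be sketched with the standard library alone; the helper name and chunk size are illustrative, and the sc.parallelize call is shown only as a comment since it needs a live SparkContext:

```python
import csv
import io
from itertools import islice

def chunked_records(reader, chunk_size):
    """Yield lists of up to chunk_size parsed CSV rows (illustrative helper)."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk

# Small in-memory stand-in for the large CSV file on disk.
raw = 'a,1\nb,2\nc,3\nd,4\ne,5\n'
reader = csv.reader(io.StringIO(raw))

chunks = list(chunked_records(reader, 2))
print(chunks)  # three chunks: two of size 2, one of size 1
# For each chunk one would then call, e.g.:
#   rdd = sc.parallelize(chunk)  # requires a SparkContext 'sc'
```

Note that each sc.parallelize call still materializes its chunk on the driver first, which is consistent with the memory pressure the question reports.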
