spark-dev mailing list archives

From Akhil Das <ak...@sigmoidanalytics.com>
Subject Re: Storage of RDDs created via sc.parallelize
Date Mon, 23 Mar 2015 06:52:38 GMT
You can use sc.newAPIHadoopFile
<http://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.SparkContext>
with CSVInputFormat <https://github.com/mvallebr/CSVInputFormat> so that the
CSV file is read properly.
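
To illustrate why a line-oriented read runs into trouble here, below is a
minimal sketch using only the Python standard library (no Spark). The sample
data and the helper names naive_line_split and read_chunks are hypothetical,
made up for illustration. It only shows that a quoted field with an embedded
line break is split by a plain line reader but kept intact by the csv module,
and how rows can be pulled in fixed-size chunks. It is not the CSVInputFormat
approach itself.

```python
import csv
import io
from itertools import islice

# Hypothetical sample data: the second record has a line break
# inside a quoted field.
SAMPLE = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'


def naive_line_split(text):
    """Split on newlines, as a purely line-oriented reader would."""
    return text.splitlines()


def read_chunks(fileobj, chunk_size):
    """Yield lists of up to chunk_size parsed rows using the csv module."""
    reader = csv.reader(fileobj)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# The line-oriented view breaks the quoted field into two pieces.
lines = naive_line_split(SAMPLE)

# The csv module keeps the quoted field together as a single record.
chunks = list(read_chunks(io.StringIO(SAMPLE, newline=""), chunk_size=2))

if __name__ == "__main__":
    print(len(lines), "physical lines")
    print(sum(len(c) for c in chunks), "CSV records")
```

Each chunk is an ordinary list of rows, so only chunk_size rows are held by
the reader at a time. Whatever the chunks are then handed to decides how long
they stay referenced on the driver.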

Thanks
Best Regards

On Sat, Mar 21, 2015 at 12:39 AM, Karlson <ksonspark@siberie.de> wrote:

>
> Hi all,
>
> where is the data stored that is passed to sc.parallelize? Or put
> differently, where is the data for the base RDD fetched from when the DAG
> is executed, if the base RDD is constructed via sc.parallelize?
>
> I am reading a CSV file via the Python csv module and am feeding the
> parsed data chunk by chunk to sc.parallelize, because the whole file would
> not fit into memory on the driver. Reading the file with sc.textFile first
> is not an option, as there might be line breaks inside the CSV fields,
> preventing me from parsing the file line by line.
>
> The problem I am facing right now is that even though I am feeding only
> one chunk at a time to Spark, I will eventually run out of memory on the
> driver.
>
> Thanks in advance!
>
