spark-user mailing list archives

From "Wang, Ningjun (LNG-NPV)" <ningjun.w...@lexisnexis.com>
Subject RE: Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?
Date Tue, 20 Jan 2015 15:55:20 GMT
Can anybody answer this? Do I have to have HDFS to achieve this?

Regards,

Ningjun Wang
Consulting Software Engineer
LexisNexis
121 Chanlon Road
New Providence, NJ 07974-1541

From: Wang, Ningjun (LNG-NPV) [mailto:ningjun.wang@lexisnexis.com]
Sent: Friday, January 16, 2015 1:15 PM
To: Imran Rashid
Cc: user@spark.apache.org
Subject: RE: Can I save RDD to local file system and then read it back on spark cluster with
multiple nodes?

I need to save an RDD to the file system and then restore it from the file system in the future.
I don't have an HDFS file system and don't want to go through the hassle of setting one up.
So how can I achieve this? The application needs to run on a cluster with multiple nodes.

Regards,

Ningjun

From: imranrashi@gmail.com [mailto:imranrashi@gmail.com]
On Behalf Of Imran Rashid
Sent: Friday, January 16, 2015 12:14 PM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: Can I save RDD to local file system and then read it back on spark cluster with
multiple nodes?


I'm not positive, but I think this is very unlikely to work.

First, when you call sc.objectFile(...), I think the *driver* will need to know something
about the file, e.g. to know how many tasks to create. But it won't even be able to see the
file, since it only lives on the local filesystem of the cluster nodes.

If you really wanted to, you could probably write out some small metadata about the files
and write your own version of objectFile that uses it. But I think there is a bigger conceptual
issue. In general, you can't be sure that you are running on the same nodes when you save
the file as when you read it back in. So the file might not be present on the local filesystem
for the active executors. You might be able to guarantee it for the specific cluster setup
you have now, but it might limit you down the road.
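For illustration only, here is a rough, untested sketch of that metadata idea. saveWithMeta
and loadWithMeta are made-up helpers, not Spark APIs; it just records the partition count in a
sidecar file the driver can read, and it does nothing about the locality problem described above:

    import java.io.{File, PrintWriter}
    import scala.io.Source
    import scala.reflect.ClassTag
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Hypothetical helper, not part of Spark: saves the RDD as object files and
    // records the partition count in a small sidecar file on the driver.
    object LocalObjectFiles {
      def saveWithMeta[T](rdd: RDD[T], path: String): Unit = {
        rdd.saveAsObjectFile(path)
        val meta = new PrintWriter(new File(path + ".meta"))
        try meta.println(rdd.partitions.length) finally meta.close()
      }

      def loadWithMeta[T: ClassTag](sc: SparkContext, path: String): RDD[T] = {
        val src = Source.fromFile(path + ".meta")
        val numPartitions = try src.getLines().next().trim.toInt finally src.close()
        // Pass the recorded count so the driver knows how many tasks to create.
        sc.objectFile[T](path, minPartitions = numPartitions)
      }
    }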

What are you trying to achieve? There might be a better way. I believe writing to HDFS will
usually write one local copy, so you'd still be doing a local read when you reload the data.

Imran
On Jan 16, 2015 6:19 AM, "Wang, Ningjun (LNG-NPV)" <ningjun.wang@lexisnexis.com>
wrote:
I have asked this question before but get no answer. Asking again.

Can I save RDD to the local file system and then read it back on a spark cluster with multiple
nodes?

rdd.saveAsObjectFile("file:///home/data/rdd1")

val rdd2 = sc.objectFile("file:///home/data/rdd1")

This works if the cluster has only one node. But my cluster has 3 nodes and each node
has a local dir called /home/data. Is the RDD saved to the local dir across the 3 nodes? If so,
is sc.objectFile(…) smart enough to read the local dir on all 3 nodes and merge them into a
single RDD?

Ningjun
