spark-user mailing list archives

From Ognen Duzlevski <og...@plainvanillagames.com>
Subject Re: How to use cluster for large set of linux files
Date Wed, 22 Jan 2014 21:03:09 GMT
Manoj,

large is a relative term ;)

NFS is a rather slow solution, at least in my experience. However, it will
work for smaller files.
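
For what it's worth, if every worker mounts the share at the same path,
reading it is just a glob over that path. A rough Scala sketch (the master
URL and mount point below are made up for illustration):

  import org.apache.spark.SparkContext

  // assumes every worker has the NFS share mounted at /mnt/nfs/data
  val sc = new SparkContext("spark://master:7077", "csv-read")
  val lines = sc.textFile("file:///mnt/nfs/data/*.csv")
  val rows = lines.map(_.split(","))
  println(rows.count())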

One way to do it is to put the files in S3 on Amazon; the catch is that your
network then becomes the limiting factor.
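
Reusing the same sc from the snippet above, something along these lines
should read straight from S3 (the bucket name and keys are placeholders; the
s3n:// scheme wants the AWS credentials set in the Hadoop config):

  // hypothetical bucket/prefix - substitute your own credentials
  sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
  sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
  val fromS3 = sc.textFile("s3n://my-bucket/csv/*.csv")
  println(fromS3.count())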

Another way is to replicate all the files on each node, but that can get
tedious and, depending on how much disk space you have, may not be an
option.
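
If you do go that route, the same file:// glob works, again with the sc from
above, as long as the files sit at an identical path on every node (the path
here is just an example):

  // every worker must have the same files under /data/csv
  val replicated = sc.textFile("file:///data/csv/*.csv")
  val firstColumns = replicated.map(_.split(",")(0))
  println(firstColumns.distinct().count())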

Finally, there are things like http://code.google.com/p/mogilefs/, but they
seem to need a special library to read a file, so it would probably take some
patching of Spark to make it work, since it may not expose the usual
filesystem interface. Still, it could be a viable solution; I am just
starting to play with it.

Ognen


On Wed, Jan 22, 2014 at 8:37 PM, Manoj Samel <manojsameltech@gmail.com> wrote:

> I have a set of csv files that I want to read as a single RDD using a
> standalone cluster.
>
> These file reside on one machine right now. If I start a cluster with
> multiple worker nodes, how do I use these worker nodes to read the files
> and do the RDD computation ? Do I have to copy the files on every worker
> node ?
>
> Assume that copying these into an HDFS is not an option for now ..
>
> Thanks,
>



-- 
"Le secret des grandes fortunes sans cause apparente est un crime oublié,
parce qu'il a été proprement fait" - Honore de Balzac
