spark-user mailing list archives

From Matei Zaharia
Subject Re: How to use cluster for large set of linux files
Date Wed, 22 Jan 2014 22:26:32 GMT
When you do foreach(println) on a cluster, that calls println *on the worker nodes*, so the
output goes to the workers' stdout log files instead of to your shell. To make sure it loaded
the file, you should use operations that return the data to your shell, like .first() or .take(n).
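
As a rough sketch of the difference, assuming you are inside spark-shell (where `sc` is already
defined) and that the files sit at a shared path visible on every worker (the path below is
hypothetical, not one from this thread):

```scala
// Run inside spark-shell; `sc` is the SparkContext the shell creates for you.
// Hypothetical path: it must resolve to the same files on every worker (e.g. an NFS mount).
val lines = sc.textFile("/shared/data/*.csv")

// Executes println on the workers, so the text ends up in the workers' logs,
// not in this shell.
lines.foreach(println)

// These actions ship results back to the shell, so the output is visible here.
println(lines.first())           // first line of the RDD
lines.take(5).foreach(println)   // take(5) returns a local Array; foreach runs in the shell
```

Here take(5) returns an ordinary local array, so the foreach that follows it runs in your shell
rather than on the cluster.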


On Jan 22, 2014, at 12:56 PM, Manoj Samel wrote:

> Thanks Matei.
> One thing I noticed after doing this and starting MASTER=spark://xxxx spark-shell is that
everything works, BUT xxx.foreach(println) prints a blank line. All other logic seems to work.
If I do xx.count etc., I can see the value; only the println does not seem to work.
> On Wed, Jan 22, 2014 at 12:39 PM, Matei Zaharia wrote:
> Hi Manoj,
> You’d have to make the files available at the same path on each machine through something
like NFS. You don’t need to copy them, though that would also work.
> Matei
> On Jan 22, 2014, at 12:37 PM, Manoj Samel wrote:
> > I have a set of csv files that I want to read as a single RDD using a stand alone
> >
> > These files reside on one machine right now. If I start a cluster with multiple worker
nodes, how do I use these worker nodes to read the files and do the RDD computation? Do I
have to copy the files to every worker node?
> >
> > Assume that copying these into HDFS is not an option for now.
> >
> > Thanks,
