spark-user mailing list archives

From Sean Owen <>
Subject Re: processing large dataset
Date Fri, 23 Jan 2015 09:49:28 GMT
This is kind of a how-long-is-a-piece-of-string question. There is no
one tuning for 'terabytes of data'. You can easily run a Spark job
that processes hundreds of terabytes with no problem on the defaults,
if it does something trivial like counting. You can also create Spark
jobs that will never complete, for example by trying to pull the
entire data set into a worker.
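
To make the contrast concrete, here is a minimal sketch in plain Python. It is not Spark code and not the poster's job: `records`, `streaming_count`, `reduce_by_key` and `materialize` are made-up names for illustration. The first two functions hold little state regardless of input size, roughly what `count()` and `reduceByKey()` do per partition. The third holds every record in one process, which is roughly what something like `collect()` on the full data set asks for.

```python
def records(n):
    """Hypothetical stand-in for a huge input: yields (key, value) pairs lazily."""
    for i in range(n):
        yield (i % 3, 1)


def streaming_count(pairs):
    # One pass, constant memory: nothing is kept except the running total.
    return sum(1 for _ in pairs)


def reduce_by_key(pairs, fn):
    # Memory grows with the number of distinct keys, not the number of records.
    acc = {}
    for key, value in pairs:
        acc[key] = fn(acc[key], value) if key in acc else value
    return acc


def materialize(pairs):
    # Memory grows with the full input; this is the pattern that cannot
    # finish once the input is far larger than one machine's memory.
    return list(pairs)


n = 100000
total = streaming_count(records(n))
per_key = reduce_by_key(records(n), lambda a, b: a + b)
everything = materialize(records(1000))  # only safe because this sample is tiny
```

Which of these shapes a job has matters far more than any single configuration value, which is why the details of the job are needed before suggesting changes.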

You haven't said exactly what you're doing, although it sounds simple,
and you haven't said what the problem is. Is it out of memory? That
would be essential to know in order to say what, if anything, you need
to change in your program or cluster.

On Fri, Jan 23, 2015 at 4:52 AM, Kane Kim <> wrote:
> I'm trying to process 5TB of data, not doing anything fancy, just
> map/filter and reduceByKey. I spent the whole day today trying to get
> it processed, but never succeeded. I've tried deploying to EC2 with the
> script provided with Spark on pretty beefy machines (100 r3.2xlarge
> nodes). I'm really frustrated that Spark doesn't work out of the box
> for anything bigger than the word count sample. One big problem is that
> the defaults are not suitable for processing big datasets; the provided
> EC2 script could do a better job, knowing the instance type requested.
> Second, it takes hours to figure out what is wrong when a Spark job
> fails after it has almost finished processing. Even after raising all limits as per
> it still fails (now
> with: error communicating with MapOutputTracker).
> After all this, I have only one question: how do I get Spark tuned
> for processing terabytes of data, and is there a way to make this
> configuration easier and more transparent?
> Thanks.
