spark-user mailing list archives

From lukas nalezenec <lukas.naleze...@gmail.com>
Subject Re: Configuring Spark for reduceByKey on on massive data sets
Date Sun, 18 May 2014 10:30:45 GMT
Hi
Try using *reduceByKeyLocally*.
Regards
Lukas Nalezenec
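
A minimal sketch of what this suggestion might look like, assuming a local SparkContext and a toy pair RDD (the object name, app name, master URL, and data are illustrative and not from the thread). Note that reduceByKeyLocally returns the merged result to the driver as a Map, so it is only a fit when the number of distinct keys is small enough to hold in driver memory.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed on Spark 0.9/1.x

// Hypothetical example object; not part of the original thread.
object ReduceLocallySketch {
  // Sums values per key and returns the result to the driver as a Map.
  def countsOnDriver(sc: SparkContext): scala.collection.Map[String, Int] = {
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)
    pairs.reduceByKeyLocally(_ + _)
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption for a quick local run.
    val conf = new SparkConf().setAppName("reduce-locally-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      println(countsOnDriver(sc))
    } finally {
      sc.stop()
    }
  }
}
```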


On Sun, May 18, 2014 at 3:33 AM, Matei Zaharia <matei.zaharia@gmail.com> wrote:

> Make sure you set up enough reduce partitions so you don’t overload them.
> Another thing that may help is checking whether you’ve run out of local
> disk space on the machines, and turning on spark.shuffle.consolidateFiles
> to produce fewer files. Finally, there’s been a recent fix in both branch
> 0.9 and master that reduces the amount of memory used when there are small
> files (due to extra memory that was being taken by mmap()):
> https://issues.apache.org/jira/browse/SPARK-1145. You can find this in
> either the 1.0 release candidates on the dev list, or branch-0.9 in git.
>
> Matei
>
> On May 17, 2014, at 5:45 PM, Madhu <madhu@madhu.com> wrote:
>
> > Daniel,
> >
> > How many partitions do you have?
> > Are they more or less uniformly distributed?
> > We have similar data volume currently running well on Hadoop MapReduce
> > with roughly 30 nodes.
> > I was planning to test it with Spark.
> > I'm very interested in your findings.
> >
> >
> >
> > -----
> > Madhu
> > https://www.linkedin.com/in/msiddalingaiah
> > --
> > View this message in context:
> > http://apache-spark-user-list.1001560.n3.nabble.com/Configuring-Spark-for-reduceByKey-on-on-massive-data-sets-tp5966p5967.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
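
Two of the suggestions quoted above (setting enough reduce partitions and turning on spark.shuffle.consolidateFiles) can be sketched roughly as follows. This is only an illustration under stated assumptions: the object name, the toy data, the local[2] master, and the partition count of 64 are made up for the example, and the right partition count depends on the data volume and cluster. The spark.shuffle.consolidateFiles setting applies to the Spark releases of that period (0.9/1.0).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed on Spark 0.9/1.x
import org.apache.spark.rdd.RDD

// Hypothetical example object; not part of the original thread.
object ShuffleTuningSketch {
  // Builds a conf with shuffle file consolidation turned on.
  def buildConf(): SparkConf =
    new SparkConf()
      .setAppName("shuffle-tuning-sketch")
      .setMaster("local[2]") // assumption: local run for illustration
      .set("spark.shuffle.consolidateFiles", "true")

  // Counts records per key, passing an explicit number of reduce partitions
  // so that no single reducer has to handle too much data.
  def countPerKey(sc: SparkContext, numReducePartitions: Int): RDD[(Int, Long)] = {
    val pairs = sc.parallelize(1 to 100000, 8).map(i => (i % 1000, 1L))
    pairs.reduceByKey(_ + _, numReducePartitions)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(buildConf())
    try {
      val counts = countPerKey(sc, 64) // 64 is an illustrative value
      println(s"reduce partitions: ${counts.partitions.length}")
      println(s"distinct keys: ${counts.count()}")
    } finally {
      sc.stop()
    }
  }
}
```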
