spark-dev mailing list archives

From Antonio Piccolboni <anto...@piccolboni.info>
Subject Re: groupByKey() and keys with many values
Date Mon, 07 Sep 2015 19:11:40 GMT
To expand on what Sean said, I would look into replacing groupByKey with
reduceByKey. Also take a look at this doc
<http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html>.
I happen to have designed a library that drew the same criticism, compared
with the Java MapReduce API, regarding its use of iterables. But neither we
nor the critics could ever find a natural example of a computation that can
be expressed as a single pass through each group using a constant amount of
memory, yet cannot be converted to use a combiner (MapReduce jargon; this
is called a reduce in Spark and in most functional circles). If you have
found such an example, then, while it is an obstacle for you, it would be
of some interest to know what it is.
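
To make the contrast concrete, here is a minimal plain-Python sketch (no
Spark involved; the data and variable names are made up for illustration)
of the memory difference between the two patterns:

```python
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("a", 4), ("b", 5)]

# groupByKey-style: materialize every value for a key before reducing.
# This is the CompactBuffer analogue -- memory grows with the number of
# values per key.
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)
sums_grouped = {k: sum(vs) for k, vs in groups.items()}

# reduceByKey/combiner-style: fold each value into a running accumulator
# in a single pass, so per-key memory stays constant no matter how many
# values a key has.
sums_reduced = {}
for k, v in pairs:
    sums_reduced[k] = sums_reduced.get(k, 0) + v

assert sums_grouped == sums_reduced == {"a": 8, "b": 7}
```

Any aggregation that can be written in the second form is a candidate for
reduceByKey (or aggregateByKey/combineByKey when the result type differs
from the value type), and avoids holding all values for a key in memory.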


On Mon, Sep 7, 2015 at 1:31 AM Sean Owen <sowen@cloudera.com> wrote:

> That's how it's intended to work; if it's a problem, you probably need
> to re-design your computation to not use groupByKey. Usually you can
> do so.
>
> On Mon, Sep 7, 2015 at 9:02 AM, kaklakariada <christoph.pirkl@gmail.com>
> wrote:
> > Hi,
> >
> > I already posted this question on the users mailing list
> > (http://apache-spark-user-list.1001560.n3.nabble.com/Using-groupByKey-with-many-values-per-key-td24538.html)
> > but did not get a reply. Maybe this is the correct forum to ask.
> >
> > My problem is, that doing groupByKey().mapToPair() loads all values for a
> > key into memory which is a problem when the values don't fit into memory.
> > This was not a problem with Hadoop map/reduce, as the Iterable passed
> > to the reducer read from disk.
> >
> > In Spark, the Iterable passed to mapToPair() is backed by a CompactBuffer
> > containing all values.
> >
> > Is it possible to change this behavior without modifying Spark, or is
> > there a plan to change this?
> >
> > Thank you very much for your help!
> > Christoph.
> >
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> > For additional commands, e-mail: dev-help@spark.apache.org
> >
>
