spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: recent join/iterator fix
Date Mon, 29 Dec 2014 09:42:22 GMT
It wasn't so much the cogroup that was optimized here as what is
done to the result of the cogroup. Yes, it was a matter of not
materializing the entire result of a flatMap-like function after the
cogroup, since flatMap will accept just an Iterator (actually, a
TraversableOnce).
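A minimal sketch of that shape, using plain Scala collections rather than Spark (the names and toy data here are illustrative, not the actual patch):

```scala
// Stand-ins for the two grouped value collections cogroup yields per key.

// Before: building a full Seq means the whole cross product for this
// key exists in memory at once.
def pairsEager(vs: Seq[String], ws: Seq[Int]): Seq[(String, Int)] =
  for (v <- vs; w <- ws) yield (v, w)

// After: returning an Iterator (a TraversableOnce) lets the downstream
// consumer pull one pair at a time, so the full product for a key with
// many elements is never materialized at once.
def pairsLazy(vs: Seq[String], ws: Seq[Int]): Iterator[(String, Int)] =
  vs.iterator.flatMap(v => ws.iterator.map(w => (v, w)))
```

Both produce the same elements; only the evaluation strategy differs, which is exactly what matters when one key has a very large number of values.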

I'd say that wherever you flatMap a large-ish value to another one,
you should consider this pattern, yes.
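To see why .iterator.map is the safer habit here: .map on a concrete collection allocates a whole new collection up front, while .iterator.map computes each element only as the consumer pulls it. A small demonstration (a sketch with a side-effect counter, just to make the evaluation order observable):

```scala
// Returns (elements computed before any pull, first pulled value,
// elements computed after one pull).
def demandDriven(): (Int, Int, Int) = {
  var computed = 0
  // .iterator.map builds only a lightweight iterator; the function
  // below has not run for any element yet.
  val it = (1 to 5).iterator.map { x => computed += 1; x * 2 }
  val before = computed   // nothing computed so far
  val first  = it.next()  // pulling one element computes exactly one
  (before, first, computed)
}
```

By contrast, `(1 to 5).map(_ * 2)` would have run the function five times and built a new Seq before anything downstream touched it.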

I think this may also be a case where Scala's lazy collections (with
.view) could be useful?

On Mon, Dec 29, 2014 at 4:28 AM, Stephen Haberman
<stephen.haberman@gmail.com> wrote:
> Hey,
>
> I saw this commit go by, and find it fairly fascinating:
>
> https://github.com/apache/spark/commit/c233ab3d8d75a33495298964fe73dbf7dd8fe305
>
> For two reasons: 1) we have a report that is bogging down in exactly
> this situation, a .join with lots of elements, so I'm glad to see the
> fix; but, more interesting I think:
>
> 2) If such a subtle bug was lurking in spark-core, it leaves me worried
> that every time we use .map in our own cogroup code, we'll be
> committing the same perf error.
>
> Has anyone thought more deeply about whether this is a big deal or not?
> Should ".iterator.map" vs. ".map" be strongly preferred/best practice
> for cogroup code?
>
> Thanks,
> Stephen
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>

