spark-user mailing list archives

From Sean Owen
Subject Re: recent join/iterator fix
Date Mon, 29 Dec 2014 09:42:22 GMT
It wasn't so much the cogroup that was optimized here, but what is
done to the result of cogroup. Yes, it was a matter of not
materializing the entire result of a flatMap-like function after the
cogroup, since this will accept just an Iterator (actually,

I'd say that wherever you flatMap a large-ish value to another one,
you should consider this pattern, yes.
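
To make the pattern concrete, here is a minimal sketch in plain Scala, with no Spark dependency. The names LazyFlatMapSketch, eagerPairs and lazyPairs are hypothetical and are not taken from the commit; they only stand in for a function that expands the two value groups of one key, as a join-like step after cogroup might do.

```scala
object LazyFlatMapSketch {
  // Eager variant: the for-comprehension over Iterables builds the whole
  // cross product as a collection in memory before returning it.
  def eagerPairs[A, B](l: Iterable[A], r: Iterable[B]): Iterable[(A, B)] =
    for (x <- l; y <- r) yield (x, y)

  // Lazy variant: the same comprehension over Iterators returns an
  // Iterator, so each pair is produced only when the consumer asks for it.
  def lazyPairs[A, B](l: Iterable[A], r: Iterable[B]): Iterator[(A, B)] =
    for (x <- l.iterator; y <- r.iterator) yield (x, y)
}
```

Both variants yield the same pairs in the same order. The difference shows up with large groups: LazyFlatMapSketch.lazyPairs(1 to 1000000, 1 to 1000000).take(3).toList returns List((1,1), (1,2), (1,3)) without ever building the full product, whereas the eager variant would try to allocate all of it first.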

I think this may also be a case where Scala's lazy collections (with
.view) could be useful?
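
For what it's worth, a rough sketch of the .view idea in plain Scala might look like the following. ViewSketch and pairsViaView are hypothetical names, and whether this is actually preferable to calling .iterator directly is an open question rather than a recommendation.

```scala
object ViewSketch {
  // Wrapping both sides in .view makes flatMap and map lazy, so no
  // intermediate collection of pairs is built. The result is exposed as
  // an Iterator so callers consume it one element at a time.
  def pairsViaView[A, B](l: Iterable[A], r: Iterable[B]): Iterator[(A, B)] =
    l.view.flatMap(x => r.view.map(y => (x, y))).iterator
}
```

As with an Iterator-based version, only the elements that are actually consumed get computed: ViewSketch.pairsViaView(1 to 1000000, 1 to 1000000).take(2).toList returns List((1,1), (1,2)).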

On Mon, Dec 29, 2014 at 4:28 AM, Stephen Haberman wrote:
> Hey,
> I saw this commit go by, and find it fairly fascinating:
> For two reasons: 1) we have a report that is bogging down exactly in
> a .join with lots of elements, so, glad to see the fix, but, more
> interesting I think:
> 2) If such a subtle bug was lurking in spark-core, it leaves me worried
> that every time we use .map in our own cogroup code, that we'll be
> committing the same perf error.
> Has anyone thought more deeply about whether this is a big deal or not?
> Should "" vs. ".map" be strongly preferred/best practice
> for cogroup code?
> Thanks,
> Stephen
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
