crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Josh Wills <>
Subject Re: map side ("replicated") joins in Crunch
Date Thu, 21 Jun 2012 06:40:07 GMT
On Wed, Jun 20, 2012 at 11:20 PM, Gabriel Reid <> wrote:
> I'd be happy to volunteer to own this, unless there is someone else
> who feels more motivated or better placed to do that.
> In the meantime, I'm certainly interested in going further with the
> discussion of how to actually make this available in the API. Any
> other opinions, or people for or against any of the approaches that
> have already been proposed?

So there's a philosophical issue here: should Crunch ever make
decisions about how to do something itself based on its estimates of
the size of the data sets, or should it always do exactly what the
developer indicates? Pig and Hive make these kinds of decisions all
the time, but do have ways for you to force a particular join strategy
to be used. Cascading uses magic less often (and never with joins) but
still employs some around serialization/deserialization of data.

I can make a case either way, but I think that no matter what, we
would want to have explicit functions for performing a join that reads
one data set into memory, so I think we can proceed w/the
implementation while folks weigh in on what their preferences are for
the default join() behavior (e.g., just do a reduce-side join, or try
to figure out the best join given information about the input data and
some configuration parameters.)

> - Gabriel
> On Thu, Jun 21, 2012 at 12:31 AM, Josh Wills <> wrote:
>> I want to close the loop on this, since it's clearly a popular
>> feature. I know we don't have JIRA yet, but does anyone want to
>> volunteer to own map-side join development? I'm happy to help out with
>> the work, but I have a lot of Crunch documentation to write and I want
>> to make that my primary focus.
>> On Sat, Jun 16, 2012 at 1:49 AM, Gabriel Reid <> wrote:
>>>>> One of the functions that I find most useful in Pig is the map side
>>>>> join; Pig will put a file in the distributed cache, load it into
>>>>> memory, and do a join from the mappers. I'd like to add this to
>>>>> Crunch, but wasn't sure what the best way to do this would be. Do any
>>>>> of you guys have any thoughts on this?
>>>> I have a few, but they're not quite baked yet. We should have some
>>>> other folks weigh in.
>>> Map-side joins is definitely #1 on my wish list of things for Crunch,
>>> and it's also something I've been thinking about a lot lately in terms
>>> of how to implement it.
>>> One of the ideas that I've had about this is adding an overload of the
>>> join method on PTable to allow supplying join settings, for example
>>> something like this:
>>>    JoinSettings joinSettings = new JoinSettings();
>>>    joinSettings.setJoinOperation(JoinSettings.LeftOuterJoin);
>>>    joinSettings.allowMapsideJoin();
>>>    PTable<K, Pair<U,V>> joined = tableA.join(tableB, joinSettings);
>>> The idea is that you could let Crunch decide (at the time of job
>>> creation) if a join would be done in memory or not, depending on the
>>> size of (one of) the incoming tables or any other heuristics. If a
>>> join is performed with a JoinSettings that has allowMapsideJoin set,
>>> then obviously the developer needs to be aware that there is a good
>>> chance that the joined table won't be sorted (which will be the case
>>> if a standard join is used).
>>> Obviously this approach removes some control from the user in terms of
>>> what exactly happens under the covers, so that's something that we
>>> would need to take into account. However, in my day job situations
>>> come up quite often where we're using the same code to deal with both
>>> large joins and small joins depending on the dataset, so it would be
>>> nice to use the same Crunch flow for all cases. Of course, it's also
>>> an option to just write this explicitly instead of baking it directly
>>> into Crunch.
>>> In any case, I'm definitely in favor of having map-side joins be
>>> possible (and easy) with PCollections, and not only with
>>> java.util.Collections. There are definitely use cases where you have a
>>> huge dataset that you want to reduce/aggregate down to a small dataset
>>> and then join with another huge dataset.
>>> Definitely happy to hear that other people are interested in having
>>> map-side joins as well!
>>> - Gabriel
>> --
>> Director of Data Science
>> Cloudera
>> Twitter: @josh_wills

Director of Data Science
Twitter: @josh_wills

View raw message