crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gabriel Reid <gabriel.r...@gmail.com>
Subject Re: map side ("replicated") joins in Crunch
Date Thu, 21 Jun 2012 06:20:48 GMT
I'd be happy to volunteer to own this, unless there is someone else
who feels more motivated or better placed to do that.

In the meantime, I'm certainly interested in going further with the
discussion of how to actually make this available in the API. Any
other opinions, or people for or against any of the approaches that
have already been proposed?

- Gabriel

On Thu, Jun 21, 2012 at 12:31 AM, Josh Wills <jwills@cloudera.com> wrote:
> I want to close the loop on this, since it's clearly a popular
> feature. I know we don't have JIRA yet, but does anyone want to
> volunteer to own map-side join development? I'm happy to help out with
> the work, but I have a lot of Crunch documentation to write and I want
> to make that my primary focus.
>
> On Sat, Jun 16, 2012 at 1:49 AM, Gabriel Reid <gabriel.reid@gmail.com> wrote:
>>>> One of the functions that I find most useful in Pig is the map side
>>>> join; Pig will put a file in the distributed cache, load it into
>>>> memory, and do a join from the mappers. I'd like to add this to
>>>> Crunch, but wasn't sure what the best way to do this would be. Do any
>>>> of you guys have any thoughts on this?
>>>
>>> I have a few, but they're not quite baked yet. We should have some
>>> other folks weigh in.
>>>
>>
>> Map-side joins is definitely #1 on my wish list of things for Crunch,
>> and it's also something I've been thinking about a lot lately in terms
>> of how to implement it.
>>
>> One of the ideas that I've had about this is adding an overload of the
>> join method on PTable to allow supplying join settings, for example
>> something like this:
>>
>>    JoinSettings joinSettings = new JoinSettings();
>>    joinSettings.setJoinOperation(JoinSettings.LeftOuterJoin);
>>    joinSettings.allowMapsideJoin();
>>    PTable<K, Pair<U,V>> joined = tableA.join(tableB, joinSettings);
>>
>> The idea is that you could let Crunch decide (at the time of job
>> creation) if a join would be done in memory or not, depending on the
>> size of (one of) the incoming tables or any other heuristics. If a
>> join is performed with a JoinSettings that has allowMapsideJoin set,
>> then obviously the developer needs to be aware that there is a good
>> chance that the joined table won't be sorted (which will be the case
>> if a standard join is used).
>>
>> Obviously this approach removes some control from the user in terms of
>> what exactly happens under the covers, so that's something that we
>> would need to take into account. However, in my day job situations
>> come up quite often where we're using the same code to deal with both
>> large joins and small joins depending on the dataset, so it would be
>> nice to use the same Crunch flow for all cases. Of course, it's also
>> an option to just write this explicitly instead of baking it directly
>> into Crunch.
>>
>> In any case, I'm definitely in favor of having map-side joins be
>> possible (and easy) with PCollections, and not only with
>> java.util.Collections. There are definitely use cases where you have a
>> huge dataset that you want to reduce/aggregate down to a small dataset
>> and then join with another huge dataset.
>>
>> Definitely happy to hear that other people are interested in having
>> map-side joins as well!
>>
>> - Gabriel
>
>
>
> --
> Director of Data Science
> Cloudera
> Twitter: @josh_wills

Mime
View raw message