crunch-user mailing list archives

From Gabriel Reid <>
Subject Re: confused about the MapsideJoinStrategy, why use LoadLeftSideMapsideJoinStrategy, what if left table is too large to store in memory?
Date Thu, 12 May 2016 19:33:38 GMT

Sorry for taking so long to get back to you on this.

The reason for making the switch (deprecating the original
MapsideJoinStrategy constructor, and adding the create() method which
reverses sides) is that all other join strategies in Crunch are designed to
have the smaller of the two PCollections given as the left side of the
join. The original MapsideJoinStrategy was implemented with the right
side being loaded into memory, which wasn't in line with the other
join strategies.
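
To make the memory trade-off concrete, here is a minimal plain-Java sketch of a map-side (hash) join. It is not Crunch code: MapSideJoinSketch and innerJoin are hypothetical names, and ordinary collections stand in for PCollections. One side is loaded completely into a hash map and the other is streamed past it, so only the loaded side has to fit in memory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Apache Crunch.
class MapSideJoinSketch {

  // Inner join: 'small' is held in memory, 'large' is read one record at a time.
  // Assumes keys in 'small' are unique. Each output row is {key, smallValue, largeValue}.
  static List<String[]> innerJoin(Map<String, String> small, List<String[]> large) {
    // Build the in-memory lookup table from the side assumed to be small.
    Map<String, String> lookup = new HashMap<>(small);
    List<String[]> out = new ArrayList<>();
    for (String[] rec : large) { // rec = {key, value}
      String match = lookup.get(rec[0]);
      if (match != null) {
        out.add(new String[] {rec[0], match, rec[1]});
      }
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, String> small = new HashMap<>();
    small.put("k1", "b1");
    List<String[]> large = new ArrayList<>();
    large.add(new String[] {"k1", "a1"});
    large.add(new String[] {"k2", "a2"});
    for (String[] row : innerJoin(small, large)) {
      System.out.println(String.join(",", row));
    }
  }
}
```

Which argument position counts as the "small" one is purely an API convention, and that convention is what changed here.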

In order to remain backwards compatible with existing code which
relies on the right-side-in-memory behavior, but bring future code in
line with always having the left-side table as the smaller of the two
tables, the constructor was deprecated and the create() method was
added. The plan is to remove the public MapsideJoinStrategy constructor
in a future version of Crunch.
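
As a rough illustration of that compatibility pattern, the following plain-Java sketch keeps a deprecated constructor with the old right-side-in-memory behavior and adds a static create() factory that returns a side-swapping variant. It is an assumption-laden stand-in, not the Crunch source: SketchJoinStrategy, LoadLeftSide, and loadedSide are hypothetical names, and lists of {key, value} arrays stand in for PTables.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the pattern described above; not the Crunch source.
class SketchJoinStrategy {

  /** Legacy entry point: old callers keep the right-side-in-memory behavior. */
  @Deprecated
  SketchJoinStrategy() {}

  /** Preferred entry point: returns a variant that loads the LEFT side into memory. */
  static SketchJoinStrategy create() {
    return new LoadLeftSide();
  }

  /** Reports which argument this strategy holds in memory. */
  String loadedSide() {
    return "right";
  }

  /** Inner join; rows are {key, leftValue, rightValue}. Loads 'right' into a hash map. */
  List<String[]> join(List<String[]> left, List<String[]> right) {
    Map<String, String> lookup = new HashMap<>();
    for (String[] r : right) {
      lookup.put(r[0], r[1]);
    }
    List<String[]> out = new ArrayList<>();
    for (String[] l : left) {
      String match = lookup.get(l[0]);
      if (match != null) {
        out.add(new String[] {l[0], l[1], match});
      }
    }
    return out;
  }

  /** Swaps the inputs, reuses the legacy join, then restores left/right order per row. */
  private static class LoadLeftSide extends SketchJoinStrategy {
    @Override
    String loadedSide() {
      return "left";
    }

    @Override
    List<String[]> join(List<String[]> left, List<String[]> right) {
      List<String[]> out = new ArrayList<>();
      for (String[] row : super.join(right, left)) { // 'left' is now the in-memory side
        out.add(new String[] {row[0], row[2], row[1]}); // reverse the value pair
      }
      return out;
    }
  }

  public static void main(String[] args) {
    List<String[]> small = new ArrayList<>();
    small.add(new String[] {"k1", "s1"});
    List<String[]> big = new ArrayList<>();
    big.add(new String[] {"k1", "b1"});
    big.add(new String[] {"k2", "b2"});
    // With the factory-created strategy, the smaller table goes on the left.
    for (String[] row : create().join(small, big)) {
      System.out.println(String.join(",", row));
    }
  }
}
```

In the scenario from the original question, this convention would mean passing B (the small table) as the left argument and A as the right, so the table held in memory is still the small one.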

- Gabriel

On Wed, May 11, 2016 at 3:42 AM, 陈竞 <> wrote:
> MapsideJoinStrategy.create() uses LoadLeftSideMapsideJoinStrategy. I'm just
> confused about why LoadLeftSideMapsideJoinStrategy is better than the default
> strategy.
> According to the annotation, LoadLeftSideMapsideJoinStrategy performs better
> than the default strategy, but I don't know why.
> 2016-05-10 11:30 GMT+08:00 David Ortiz <>:
>> Try MapsideJoinStrategy.create()
>> On Mon, May 9, 2016, 9:29 PM 陈竞 <> wrote:
>>> Hi, I'm very confused about how to use MapsideJoinStrategy. The original
>>> constructor was deprecated, and LoadLeftSideMapsideJoinStrategy is
>>> recommended instead. The main improvement is that it loads the left-side
>>> table into memory, but that table is larger than the right side. When I
>>> want to use a map-side join, the left-side table is usually too large to
>>> store in memory.
>>> For example, I have two tables A and B, and we need A left join B, with
>>> size(A)>>size(B). Naturally we want to use a map-side join with A as the
>>> left side and B as the right side, then load B into memory to process it,
>>> which is very simple. However, if we use LoadLeftSideMapsideJoinStrategy,
>>> we have to use A as the right side and B as the left side, which brings no
>>> improvement while adding a reverse DoFn.
>>> --
>>> 陈竞, High Performance Computer Center, Institute of Computing Technology, Chinese Academy of Sciences
>>> Jing Chen HPCC.ICT.AC China
> --
> 陈竞, High Performance Computer Center, Institute of Computing Technology, Chinese Academy of Sciences
> Jing Chen HPCC.ICT.AC China
