crunch-user mailing list archives

From Ashish <>
Subject JoinFn queries
Date Fri, 16 Nov 2012 13:42:58 GMT

I have a few questions about JoinFn.

1. Which join functions need one of the PTables to be in memory? From the
documentation, I could tell that MapsideJoin does.

2. I am experimenting with JoinFn to merge two datasets. The scenario is
detailed below.

Scenario (made up to experiment with Crunch):

One file has the ads returned, each with a timestamp, in the format
<Ad Id>, <long timestamp>

The other file has just the Ad Ids for which impressions were received:
<Ad Id>

The objective is to join the data so that we know which ads got
impressions. The impression table would be about 90% (picked at random) of
the size of the ads table, so it is too large to fit in memory.

The way I did the join is to load both files into PTables: (Ad Id, timestamp)
for the ads-returned table, and Ad Id plus an Integer for the impression table.

I then join them using this code:

PTable<String, Pair<Long, Long>> joinedData =
Join.leftJoin(adsReturnedTable, impressionTable);

The result is Ad Id, timestamp, is-impressed.

The code works on a small test data set. One problem I am facing is that,
for Ad Ids where no impression is present, the output looks like

a18f1f89-21e1-4fa9-8d24-54702fb9bdeb [1353062206438,]

while for the others it is
f2978128-6e40-4edb-ad3a-5e0ce5e11440 [1353062206479,1]

a. How can I make a 0 (zero) appear when no match is found? From my
exploration, I would need to write my own join() and add a check on
pair.second() while emitting. Is there another way to achieve this?
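
To make 2a concrete, here is a minimal plain-Java sketch of the behaviour I am after. It does not use Crunch at all: ordinary maps stand in for the two PTables, and the names LeftJoinDefaultSketch, leftJoin and zeroFill are made up for illustration. The assumption is that a left join leaves the second value null when there is no match, and that a later pass replaces that null with 0.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the join; this is NOT the Crunch API.
public class LeftJoinDefaultSketch {

    /** Left join: every key of left is kept; the right value is null when unmatched. */
    static Map<String, Map.Entry<Long, Long>> leftJoin(Map<String, Long> left,
                                                       Map<String, Long> right) {
        Map<String, Map.Entry<Long, Long>> joined = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : left.entrySet()) {
            // SimpleImmutableEntry permits a null value, which models the missing match.
            joined.put(e.getKey(),
                    new SimpleImmutableEntry<>(e.getValue(), right.get(e.getKey())));
        }
        return joined;
    }

    /** Post-processing pass: replace a missing (null) second value with 0. */
    static Map<String, Map.Entry<Long, Long>> zeroFill(Map<String, Map.Entry<Long, Long>> joined) {
        Map<String, Map.Entry<Long, Long>> filled = new LinkedHashMap<>();
        for (Map.Entry<String, Map.Entry<Long, Long>> e : joined.entrySet()) {
            Long second = e.getValue().getValue();
            filled.put(e.getKey(),
                    new SimpleImmutableEntry<>(e.getValue().getKey(),
                            second == null ? Long.valueOf(0L) : second));
        }
        return filled;
    }

    public static void main(String[] args) {
        Map<String, Long> adsReturned = new LinkedHashMap<>();
        adsReturned.put("a18f1f89-21e1-4fa9-8d24-54702fb9bdeb", 1353062206438L);
        adsReturned.put("f2978128-6e40-4edb-ad3a-5e0ce5e11440", 1353062206479L);

        Map<String, Long> impressions = new HashMap<>();
        impressions.put("f2978128-6e40-4edb-ad3a-5e0ce5e11440", 1L);

        // Print each Ad Id with its timestamp and the zero-filled impression flag.
        for (Map.Entry<String, Map.Entry<Long, Long>> e
                : zeroFill(leftJoin(adsReturned, impressions)).entrySet()) {
            System.out.println(e.getKey() + " [" + e.getValue().getKey()
                    + "," + e.getValue().getValue() + "]");
        }
    }
}
```

In Crunch terms I imagine zeroFill would become a function applied to the joined PTable, but I have not verified how best to express that.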

3. How can I hook in a custom output formatter when writing a PTable? For
example, for the above output, I want to get something like


I plan to publish the finished code and all the findings in the 4th blog post on

