Folks,
I have a few queries about JoinFn.
1. Which join functions need one of the PTables to fit in memory? From the
documentation, I gather that MapsideJoin does.
2. I am playing around with JoinFn to merge two datasets; the scenario is
detailed below.
Scenario (I cooked this up to play around with Crunch):
One file has the Ads returned and a timestamp, in the format
<Ad Id>, <long timestamp>
The other file has just the Ad Ids for which impressions were received
<Ad Id>
The objective is to join the data so that we can know which Ads got
impressions. The impression table would be about 90% (random) of the size of
the Ads table. In short, the impression table cannot fit in memory.
The way I did the join was to load both of them into PTables: for the Ads
returned table, (Ad Id, timestamp), and for the Impression table, Ad Id and an
Integer. I then join them using the code
PTable<String, Pair<Long, Long>> joinedData =
Join.leftJoin(adsReturnedTable, impressionTable);
The result is Ad Id, timestamp, Is Impressed.
The code works for a small test data set. One problem I am facing is that,
for the Ad Ids where no impression is present, the output looks like
a18f1f89-21e1-4fa9-8d24-54702fb9bdeb [1353062206438,]
and for the others it is
f2978128-6e40-4edb-ad3a-5e0ce5e11440 [1353062206479,1]
a. How can I make a 0 (zero) appear when no match is found? From my
exploration, I need to write my own join() and add a check on pair.second()
while emitting. Is there another way to achieve this?
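One possible alternative, sketched here in plain Java rather than against the Crunch API: keep the stock left join and post-process its output, replacing a missing second value with 0. The class and method names below (ImpressionDefault, orZero, leftJoinWithDefault) are made up for illustration, and the in-memory maps only stand in for the two PTables. The sketch assumes the left join reports an unmatched Ad Id as a null second element, which is what the empty field in the output above suggests. In Crunch, the same null check could presumably live in a MapFn applied to the joined table with parallelDo.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Crunch: shows the "null becomes 0" step only.
class ImpressionDefault {

    // Returns the impression flag, or 0 when the left join found no match (null).
    static long orZero(Long impressed) {
        return impressed == null ? 0L : impressed;
    }

    // In-memory stand-in for a left join of ads (Ad Id -> timestamp) against
    // impressions (Ad Id -> flag). Every ad is kept; a missing flag becomes 0.
    static Map<String, long[]> leftJoinWithDefault(Map<String, Long> ads,
                                                   Map<String, Long> impressions) {
        Map<String, long[]> joined = new LinkedHashMap<>();
        for (Map.Entry<String, Long> ad : ads.entrySet()) {
            long flag = orZero(impressions.get(ad.getKey()));
            joined.put(ad.getKey(), new long[] { ad.getValue(), flag });
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, Long> ads = new LinkedHashMap<>();
        ads.put("a18f1f89-21e1-4fa9-8d24-54702fb9bdeb", 1353062206438L);
        ads.put("f2978128-6e40-4edb-ad3a-5e0ce5e11440", 1353062206479L);

        Map<String, Long> impressions = new HashMap<>();
        impressions.put("f2978128-6e40-4edb-ad3a-5e0ce5e11440", 1L);

        for (Map.Entry<String, long[]> row : leftJoinWithDefault(ads, impressions).entrySet()) {
            System.out.println(row.getKey() + " [" + row.getValue()[0] + "," + row.getValue()[1] + "]");
        }
    }
}
```

The point of the sketch is that the default can be applied after the join, so a custom JoinFn may not be needed just for this.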
3. How can I hook in a custom output formatter while writing a PTable? For
example, for the above output, I want to get something like
f2978128-6e40-4edb-ad3a-5e0ce5e11440,1353062206479,1
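To make the desired format concrete, here is a minimal plain-Java sketch of the formatting step on its own. AdLineFormatter and toCsvLine are hypothetical names, not Crunch classes. One way this might be wired in, as an assumption rather than a confirmed recipe, is to map each joined record to a String with a function like this and then write the resulting collection of strings as a text file, instead of writing the PTable directly.

```java
// Hypothetical formatter, not part of Crunch: turns one joined record into
// a comma-separated line of the form <Ad Id>,<timestamp>,<is impressed>.
class AdLineFormatter {

    // A null flag (no matching impression) is written as 0.
    static String toCsvLine(String adId, long timestamp, Long impressed) {
        long flag = (impressed == null) ? 0L : impressed;
        return adId + "," + timestamp + "," + flag;
    }

    public static void main(String[] args) {
        System.out.println(
            toCsvLine("f2978128-6e40-4edb-ad3a-5e0ce5e11440", 1353062206479L, 1L));
        System.out.println(
            toCsvLine("a18f1f89-21e1-4fa9-8d24-54702fb9bdeb", 1353062206438L, null));
    }
}
```

Doing the formatting in a map step keeps the output layout under the job's control and also covers the zero-for-no-match case in the same place.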
I plan to publish the finished code and all the findings in my 4th blog post
on Crunch.
--
thanks
ashish
Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal