pig-user mailing list archives

From Rohini Palaniswamy <roh...@apache.org>
Subject Re: Blog post on recent Pig content contributed to Apache DataFu
Date Tue, 22 Jan 2019 19:07:18 GMT
Thanks Eyal. dedup() sounds interesting and could be useful in a nested
foreach for picking the latest record. It is unfortunate that you had to
resort to CountDistinctUpTo because of memory issues. We have run into
similar issues as well and plan to optimize the nested count distinct to
handle millions of records. Created
https://issues.apache.org/jira/browse/PIG-5378 for that.

On Sun, Jan 20, 2019 at 9:22 AM Russell Jurney <russell.jurney@gmail.com>
wrote:

> Nice!
>
> On Sun, Jan 20, 2019 at 8:28 AM Eyal Allweil <eyal@apache.org> wrote:
>
> > I wrote a blog post for the PayPal engineering blog detailing some of the
> > (Pig) content I've contributed to DataFu on behalf of PayPal. The post
> > contains documentation and code samples of three macros and a UDF:
> >
> > *dedup* - for deduplicating rows based on a key and date updated fields
> >
> > *sample_by_keys* - a macro for generating a sample of a table based on a
> > list of unique ids
> >
> > *diff_macro* - for generating a human readable diff between two tables
> >
> > *CountDistinctUpTo* - a UDF which performs much better than pure Pig for
> > cases in which you don't need the actual records, but just to verify that a
> > certain amount exists
> >
> > https://medium.com/paypal-engineering/a-guide-to-paypals-contributions-to-apache-datafu-b30cc25e0312
> >
> > The blog post will be cross-posted to the Apache DataFu blog soon.
> >
> > Cheers,
> > Eyal
> >
> --
> Russell Jurney @rjurney <http://twitter.com/rjurney>
> russell.jurney@gmail.com LI <http://linkedin.com/in/russelljurney> FB
> <http://facebook.com/jurney> datasyndrome.com
>
