Author: eyal
Date: Tue Jan 29 19:35:44 2019
New Revision: 1852471
URL: http://svn.apache.org/viewvc?rev=1852471&view=rev
Log:
Add forgotten svn blog post to static content
Added:
datafu/site/blog/2019/01/29/
datafu/site/blog/2019/01/29/a-look-at-paypals-contributions-to-datafu.html
Added: datafu/site/blog/2019/01/29/a-look-at-paypals-contributions-to-datafu.html
URL: http://svn.apache.org/viewvc/datafu/site/blog/2019/01/29/a-look-at-paypals-contributions-to-datafu.html?rev=1852471&view=auto
==============================================================================
--- datafu/site/blog/2019/01/29/a-look-at-paypals-contributions-to-datafu.html (added)
+++ datafu/site/blog/2019/01/29/a-look-at-paypals-contributions-to-datafu.html Tue Jan 29
19:35:44 2019
@@ -0,0 +1,299 @@
+
+
+
+<!doctype html>
+<html>
+ <head>
+ <meta charset="utf-8">
+
+ <!-- Always force latest IE rendering engine or request Chrome Frame -->
+ <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible">
+ <meta name="google-site-verification" content="9N7qTOUYyX4kYfXYc0OIomWJku3PVvGrf6oTNWg2CHI"
/>
+
+ <meta name="twitter:card" content="summary" />
+ <meta name="twitter:site" content="@apachedatafu" />
+    <meta name="twitter:title" content="A Look into PayPal’s Contributions to Apache DataFu" />
+ <meta name="twitter:description" content=" Photo by Louis Reed on Unsplash As with many
Apache projects with robust communities and growing ecosystems, Apache DataFu has contributions
from individual code committers employed by various organizations. Users of Apache projects
who contribute..." />
+ <meta property="og:url" content="http://datafu.apache.org/blog/2019/01/29/a-look-at-paypals-contributions-to-datafu.html"
/>
+ <meta property="og:type" content="article" />
+    <meta property="og:title" content="A Look into PayPal’s Contributions to Apache DataFu" />
+ <meta property="og:description" content=" Photo by Louis Reed on Unsplash As with many
Apache projects with robust communities and growing ecosystems, Apache DataFu has contributions
from individual code committers employed by various organizations. Users of Apache projects
who contribute..." />
+
+
+ <!-- Use title if it's in the page YAML frontmatter -->
+    <title>A Look into PayPal’s Contributions to Apache DataFu</title>
+
+ <link href="/stylesheets/all.css" rel="stylesheet" /><link href="/stylesheets/highlight.css"
rel="stylesheet" />
+ <script src="/javascripts/all.js"></script>
+
+ <script type="text/javascript">
+ var _gaq = _gaq || [];
+ _gaq.push(['_setAccount', 'UA-30533336-2']);
+ _gaq.push(['_trackPageview']);
+
+ (function() {
+ var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async
= true;
+ ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www')
+ '.google-analytics.com/ga.js';
+ var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga,
s);
+ })();
+ </script>
+ </head>
+
+ <body class="blog blog_2019 blog_2019_01 blog_2019_01_29 blog_2019_01_29_a-look-at-paypals-contributions-to-datafu">
+
+ <div class="container">
+
+
+<div class="header">
+
+ <ul class="nav nav-pills pull-right">
+ <li><a href="/blog">Blog</a></li>
+ </ul>
+
+ <h3 class="header-title"><a href="/">Apache DataFu™</a></h3>
+
+</div>
+
+
+ <div class="row">
+ <article class="col-lg-10">
+        <h1>A Look into PayPal’s Contributions to Apache DataFu</h1>
+ <h5 class="text-muted"><time>Jan 29, 2019</time></h5>
+ <h5 class="text-muted">Eyal Allweil</h5>
+
+ <hr>
+
+          <p><img alt="Test tubes (photo by Louis Reed on Unsplash)" src="https://cdn-images-1.medium.com/max/1600/1*RZRPFvbZ7_IdJxY-6TeaxQ.jpeg" /></p>
+
+<p>Photo by <a href="https://unsplash.com/photos/pwcKF7L4-no?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Louis
Reed</a> on <a href="https://unsplash.com/search/photos/test-tubes?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a></p>
+
+<p><strong><em>As with many Apache projects with robust communities and growing ecosystems,</em></strong> <a href="http://datafu.apache.org/"><strong><em>Apache DataFu</em></strong></a> <strong><em>has contributions from individual code committers employed by various organizations. When users of Apache projects contribute code back to the project, everyone benefits. This is PayPal’s story.</em></strong></p>
+
+<p>At PayPal, we often work on large datasets in a Hadoop environment, crunching up to petabytes of data and using a variety of sophisticated tools in order to fight fraud. One of the tools we use to do so is <a href="https://pig.apache.org/">Apache Pig</a>. Pig is a simple, high-level programming language that consists of just a few dozen operators, but it allows you to write powerful queries and transformations over Hadoop.</p>
+
+<p>It also allows you to extend Pig’s capabilities by writing macros and UDFs (user-defined functions). At PayPal, we’ve written a variety of both, and contributed many of them to the <a href="http://datafu.apache.org/">Apache DataFu</a> project. In this blog post we’d like to explain what we’ve contributed and present a guide to how we use them.</p>
+
+<hr>
+
+<p><br></p>
+
+<p><strong>1. Finding the most recent update of a given record: the <em>dedup</em> (de-duplication) macro</strong></p>
+
+<p>A common scenario in data sent to HDFS (the Hadoop Distributed File System) is multiple rows representing updates for the same logical data. For example, in a table representing accounts, a record might be written every time customer data is updated, with each update receiving a newer timestamp. Let’s consider the following simplified example.</p>
+
+<p><br>
+<script src="https://gist.github.com/eyala/65b6750b2539db5895738a49be3d8c98.js"></script>
+<center>Raw customers’ data, with more than one row per customer</center>
+<br></p>
+
+<p>We can see that though most of the customers only appear once, <em>julia</em>
and <em>quentin</em> have 2 and 3 rows, respectively. How can we get just the
most recent record for each customer? For this we can use the <em>dedup</em> macro,
as below:</p>
+<pre class="highlight pig"><code><span class="k">REGISTER</span>
<span class="n">datafu</span><span class="o">-</span><span class="n">pig</span><span
class="o">-</span><span class="mi">1</span><span class="p">.</span><span
class="mi">5</span><span class="p">.</span><span class="mi">0</span><span
class="p">.</span><span class="n">jar</span><span class="p">;</span>
+
+<span class="k">IMPORT</span> <span class="s1">'datafu/dedup.pig'</span><span
class="p">;</span>
+
+<span class="n">data</span> <span class="o">=</span> <span class="k">LOAD</span>
<span class="s1">'customers.csv'</span> <span class="k">AS</span>
<span class="p">(</span><span class="n">id</span><span class="p">:</span>
<span class="n">int</span><span class="p">,</span> <span class="n">name</span><span
class="p">:</span> <span class="n">chararray</span><span class="p">,</span>
<span class="n">purchases</span><span class="p">:</span> <span
class="n">int</span><span class="p">,</span> <span class="n">date_updated</span><span
class="p">:</span> <span class="n">chararray</span><span class="p">);</span>
+
+<span class="n">dedup_data</span> <span class="o">=</span> <span
class="n">dedup</span><span class="p">(</span><span class="n">data</span><span
class="p">,</span> <span class="s1">'id'</span><span class="p">,</span>
<span class="s1">'date_updated'</span><span class="p">);</span>
+
+<span class="k">STORE</span> <span class="n">dedup_data</span> <span
class="k">INTO</span> <span class="s1">'dedup_out'</span><span class="p">;</span>
+</code></pre>
+
+<p>Our result will be as expected: each customer only appears once, as you can see below:</p>
+
+<p><br>
+<script src="https://gist.github.com/eyala/1dddebc39e9a3fe4501638a95f577752.js"></script>
+<center>“Deduplicated” data, with only the most recent record for each customer</center>
+<br></p>
+
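+<p>To make the semantics concrete, here is a rough Python sketch of what <em>dedup</em> does logically: for each key, keep the row with the greatest value of the timestamp field. This is only an illustration with made-up data, not how the macro is actually implemented (the real macro runs as a distributed Pig job):</p>

```python
# Hypothetical sketch of the dedup macro's logic: for each key, keep the
# row with the greatest value of the "order" field (here, date_updated).
def dedup(rows, key="id", order="date_updated"):
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[order] > latest[k][order]:
            latest[k] = row
    return list(latest.values())

rows = [
    {"id": 2, "name": "julia", "purchases": 3, "date_updated": "2018-01-10"},
    {"id": 2, "name": "julia", "purchases": 4, "date_updated": "2018-03-05"},
    {"id": 3, "name": "quentin", "purchases": 8, "date_updated": "2018-02-01"},
]
print(dedup(rows))  # one row per id, each the most recently updated
```

Note that ISO-formatted date strings compare correctly with plain string comparison, which keeps the sketch short.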
+<p>One nice thing about this macro is that you can use more than one field to dedup
the data. For example, if we wanted to use both the <em>id</em> and <em>name</em>
fields, we would change this line:</p>
+<pre class="highlight pig"><code><span class="n">dedup_data</span>
<span class="o">=</span> <span class="n">dedup</span><span class="p">(</span><span
class="n">data</span><span class="p">,</span> <span class="s1">'id'</span><span
class="p">,</span> <span class="s1">'date_updated'</span><span class="p">);</span>
+</code></pre>
+
+<p>to this:</p>
+<pre class="highlight pig"><code><span class="n">dedup_data</span>
<span class="o">=</span> <span class="n">dedup</span><span class="p">(</span><span
class="n">data</span><span class="p">,</span> <span class="s1">'(id,
name)'</span><span class="p">,</span> <span class="s1">'date_updated'</span><span
class="p">);</span>
+</code></pre>
+
+<hr>
+
+<p><br></p>
+
+<p><strong>2. Preparing a sample of records based on a list of keys: the <em>sample_by_keys</em> macro</strong></p>
+
+<p>Another common use case we’ve encountered is the need to prepare a sample based on a small subset of records. DataFu already includes a number of UDFs for sampling purposes, but they are all based on random selection. Sometimes, at PayPal, we need to create a table representing a manually chosen sample of customers, but with exactly the same fields as the original table. For that we use the <em>sample_by_keys</em> macro. For example, let’s say we want customers 2, 4 and 6 from <em>customers.csv</em>. If we have this list stored on HDFS as <em>sample.csv</em>, we can use the following Pig script:</p>
+<pre class="highlight pig"><code><span class="k">REGISTER</span>
<span class="n">datafu</span><span class="o">-</span><span class="n">pig</span><span
class="o">-</span><span class="mi">1</span><span class="p">.</span><span
class="mi">5</span><span class="p">.</span><span class="mi">0</span><span
class="p">.</span><span class="n">jar</span><span class="p">;</span>
+
+<span class="k">IMPORT</span> <span class="s1">'datafu/sample_by_keys.pig'</span><span
class="p">;</span>
+
+<span class="n">data</span> <span class="o">=</span> <span class="k">LOAD</span>
<span class="s1">'customers.csv'</span> <span class="k">USING</span>
<span class="n">PigStorage</span><span class="p">(</span><span
class="s1">','</span><span class="p">)</span> <span class="k">AS</span>
<span class="p">(</span><span class="n">id</span><span class="p">:</span>
<span class="n">int</span><span class="p">,</span> <span class="n">name</span><span
class="p">:</span> <span class="n">chararray</span><span class="p">,</span>
<span class="n">purchases</span><span class="p">:</span> <span
class="n">int</span><span class="p">,</span> <span class="n">updated</span><span
class="p">:</span> <span class="n">chararray</span><span class="p">);</span>
+
+<span class="n">customers</span> <span class="o">=</span> <span
class="k">LOAD</span> <span class="s1">'sample.csv'</span> <span class="k">AS</span>
<span class="p">(</span><span class="n">cust_id</span><span class="p">:</span>
<span class="n">int</span><span class="p">);</span>
+
+<span class="n">sampled</span> <span class="o">=</span> <span
class="n">sample_by_keys</span><span class="p">(</span><span class="n">data</span><span
class="p">,</span> <span class="n">customers</span><span class="p">,</span>
<span class="n">id</span><span class="p">,</span> <span class="n">cust_id</span><span
class="p">);</span>
+
+<span class="k">STORE</span> <span class="n">sampled</span> <span
class="k">INTO</span> <span class="s1">'sample_out'</span><span class="p">;</span>
+</code></pre>
+
+<p>The result will be all the records from our original table for customers 2, 4 and 6. Notice that the original row structure is preserved, and that customer 2 (<em>julia</em>) has two rows, as was the case in our original data. This is important for making sure that the code that will run on this sample will behave exactly as it would on the original data.</p>
+
+<p><br>
+<script src="https://gist.github.com/eyala/28985cc0e3f338d044cc5ebb779f6454.js"></script>
+<center>Only customers 2, 4, and 6 appear in our new sample</center>
+<br></p>
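+<p>Logically, <em>sample_by_keys</em> simply keeps every row whose key appears in the sample list. A hypothetical Python sketch of that semantics, with made-up data (again, not the actual distributed implementation):</p>

```python
# Hypothetical sketch of sample_by_keys semantics: keep every row of the
# large relation whose key appears in the small list of sample keys.
def sample_by_keys(rows, keys, row_key="id"):
    wanted = set(keys)  # the small side fits in memory, as in a replicated join
    return [r for r in rows if r[row_key] in wanted]

rows = [
    {"id": 1, "name": "quentin"},
    {"id": 2, "name": "julia"},
    {"id": 2, "name": "julia"},  # duplicate rows for a sampled key are all kept
    {"id": 3, "name": "alice"},
]
print(sample_by_keys(rows, [2, 3]))  # julia's two rows plus alice's row
```

Keeping duplicates is the point: the sample must behave exactly like the original data for the sampled keys.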
+
+<hr>
+
+<p><br></p>
+
+<p><strong>3. Comparing expected and actual results for regression tests: the <em>diff_macro</em></strong></p>
+
+<p>After making changes in an application’s logic, we are often interested in the effect they have on our output. One common use case is refactoring, where we don’t expect our output to change at all. Another is a surgical change which should only affect a very small subset of records. To easily perform such regression tests on actual data, we use the <em>diff_macro</em>, which is based on DataFu’s <em>TupleDiff</em> UDF.</p>
+
+<p>Let’s look at a table which is exactly like <em>dedup_out</em>, but with four changes:</p>
+
+<ol>
+<li> We will remove record 1, <em>quentin</em></li>
+<li> We will change <em>date_updated</em> for record 2, <em>julia</em></li>
+<li> We will change <em>purchases</em> and <em>date_updated</em>
for record 4, <em>alice</em></li>
+<li> We will add a new row, record 8, <em>amanda</em></li>
+</ol>
+
+<p><br>
+<script src="https://gist.github.com/eyala/699942d65471f3c305b0dcda09944a95.js"></script>
+<br></p>
+
+<p>We’ll run the following Pig script, using DataFu’s <em>diff_macro</em>:</p>
+<pre class="highlight pig"><code><span class="k">REGISTER</span>
<span class="n">datafu</span><span class="o">-</span><span class="n">pig</span><span
class="o">-</span><span class="mi">1</span><span class="p">.</span><span
class="mi">5</span><span class="p">.</span><span class="mi">0</span><span
class="p">.</span><span class="n">jar</span><span class="p">;</span>
+
+<span class="k">IMPORT</span> <span class="s1">'datafu/diff_macros.pig'</span><span
class="p">;</span>
+
+<span class="n">data</span> <span class="o">=</span> <span class="k">LOAD</span>
<span class="s1">'dedup_out.csv'</span> <span class="k">USING</span>
<span class="n">PigStorage</span><span class="p">(</span><span
class="s1">','</span><span class="p">)</span> <span class="k">AS</span>
<span class="p">(</span><span class="n">id</span><span class="p">:</span>
<span class="n">int</span><span class="p">,</span> <span class="n">name</span><span
class="p">:</span> <span class="n">chararray</span><span class="p">,</span>
<span class="n">purchases</span><span class="p">:</span> <span
class="n">int</span><span class="p">,</span> <span class="n">date_updated</span><span
class="p">:</span> <span class="n">chararray</span><span class="p">);</span>
+
+<span class="n">changed</span> <span class="o">=</span> <span
class="k">LOAD</span> <span class="s1">'dedup_out_changed.csv'</span>
<span class="k">USING</span> <span class="n">PigStorage</span><span
class="p">(</span><span class="s1">','</span><span class="p">)</span>
<span class="k">AS</span> <span class="p">(</span><span class="n">id</span><span
class="p">:</span> <span class="n">int</span><span class="p">,</span>
<span class="n">name</span><span class="p">:</span> <span class="n">chararray</span><span
class="p">,</span> <span class="n">purchases</span><span class="p">:</span>
<span class="n">int</span><span class="p">,</span> <span class="n">date_updated</span><span
class="p">:</span> <span class="n">chararray</span><span class="p">);</span>
+
+<span class="n">diffs</span> <span class="o">=</span> <span class="n">diff_macro</span><span
class="p">(</span><span class="n">data</span><span class="p">,</span><span
class="n">changed</span><span class="p">,</span><span class="n">id</span><span
class="p">,</span><span class="s1">''</span><span class="p">);</span>
+
+<span class="k">DUMP</span> <span class="n">diffs</span><span
class="p">;</span>
+</code></pre>
+
+<p>The results look like this:</p>
+
+<p><br>
+<script src="https://gist.github.com/eyala/3d36775faf081daad37a102f25add2a4.js"></script>
+<br></p>
+
+<p>Let’s take a moment to look at these results. They have the same general structure. Rows that start with <em>missing</em> indicate records that were in the first relation, but aren’t in the new one. Conversely, rows that start with <em>added</em> indicate records that are in the new relation, but not in the old one. Each of these rows is followed by the relevant tuple from the relations.</p>
+
+<p>The rows that start with <em>changed</em> are more interesting. The
word <em>changed</em> is followed by a list of the fields which have changed values
in the new table. For the row with <em>id</em> 2, this is the <em>date_updated</em>
field. For the row with <em>id</em> 4, this is the <em>purchases</em>
and <em>date_updated</em> fields.</p>
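+<p>The classification itself can be sketched in a few lines of Python. This is a hypothetical illustration of the missing/added/changed logic with made-up data, not the actual <em>TupleDiff</em> implementation:</p>

```python
# Hypothetical sketch of the diff logic: compare two relations keyed by "id"
# and report each row as missing, added, or changed (with the changed fields).
def diff(old, new, key="id", ignore=()):
    old_by_key = {r[key]: r for r in old}
    new_by_key = {r[key]: r for r in new}
    report = []
    for k, r in old_by_key.items():
        if k not in new_by_key:
            report.append(("missing", r))          # only in the old relation
    for k, r in new_by_key.items():
        if k not in old_by_key:
            report.append(("added", r))            # only in the new relation
        else:
            changed = [f for f in r
                       if f not in ignore and r[f] != old_by_key[k][f]]
            if changed:                            # same key, different values
                report.append(("changed", changed, old_by_key[k], r))
    return report

old = [{"id": 1, "name": "quentin"}, {"id": 2, "name": "julia"}]
new = [{"id": 2, "name": "julie"}, {"id": 8, "name": "amanda"}]
print(diff(old, new))
```

The <em>ignore</em> parameter drops the named fields from the comparison, mirroring the macro argument used to skip a field such as a timestamp.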
+
+<p>One field we might want to ignore is <em>date_updated</em>. If the only difference between two records is when they were last updated, we might want to skip them for a more concise diff. For this, we need to change the following line in our original Pig script, from this:</p>
+<pre class="highlight pig"><code><span class="n">diffs</span> <span
class="o">=</span> <span class="n">diff_macro</span><span class="p">(</span><span
class="n">data</span><span class="p">,</span><span class="n">changed</span><span
class="p">,</span><span class="n">id</span><span class="p">,</span><span
class="s1">''</span><span class="p">);</span>
+</code></pre>
+
+<p>to become this:</p>
+<pre class="highlight pig"><code><span class="n">diffs</span> <span
class="o">=</span> <span class="n">diff_macro</span><span class="p">(</span><span
class="n">data</span><span class="p">,</span><span class="n">changed</span><span
class="p">,</span><span class="n">id</span><span class="p">,</span><span
class="s1">'date_updated'</span><span class="p">);</span>
+</code></pre>
+
+<p>If we run our changed Pig script, we’ll get the following result.</p>
+
+<p><br>
+<script src="https://gist.github.com/eyala/d9b0d5c60ad4d8bbccc79c3527f99aca.js"></script>
+<br></p>
+
+<p>The row for <em>julia</em> is missing from our diff, because only <em>date_updated</em>
has changed, but the row for <em>alice</em> still appears, because the <em>purchases</em>
field has also changed.</p>
+
+<p>There’s one implementation detail that’s important to know: the macro uses a replicated join in order to run quickly on very large tables, so the sample table needs to fit in memory.</p>
+
+<hr>
+
+<p><br></p>
+
+<p><strong>4. Counting distinct records, but only up to a limited amount: the <em>CountDistinctUpTo</em> UDF</strong></p>
+
+<p>Sometimes our analytical logic requires us to filter out accounts that don’t have enough data. For example, we might want to look only at customers with a certain minimum number of transactions. This is not difficult to do in Pig: you can group by the customer’s id, count the number of distinct transactions, and filter out the customers that don’t have enough.</p>
+
+<p>Let’s use the following table as an example:</p>
+
+<p><br>
+<script src="https://gist.github.com/eyala/73dc69d0b5f513c53c4dac72c71daf7c.js"></script>
+<br></p>
+
+<p>You can use the following “pure” Pig script to get the number of distinct transactions per name:</p>
+<pre class="highlight pig"><code><span class="n">data</span> <span
class="o">=</span> <span class="k">LOAD</span> <span class="s1">'transactions.csv'</span>
<span class="k">USING</span> <span class="n">PigStorage</span><span
class="p">(</span><span class="s1">','</span><span class="p">)</span>
<span class="k">AS</span> <span class="p">(</span><span class="n">name</span><span
class="p">:</span> <span class="n">chararray</span><span class="p">,</span>
<span class="n">transaction_id</span><span class="p">:</span><span
class="n">int</span><span class="p">);</span>
+
+<span class="n">grouped</span> <span class="o">=</span> <span
class="k">GROUP</span> <span class="n">data</span> <span class="k">BY</span>
<span class="n">name</span><span class="p">;</span>
+
+<span class="n">counts</span> <span class="o">=</span> <span class="k">FOREACH</span>
<span class="n">grouped</span> <span class="p">{</span>
+ <span class="n">distincts</span> <span class="o">=</span> <span
class="k">DISTINCT</span> <span class="n">data</span><span class="p">.</span><span
class="n">transaction_id</span><span class="p">;</span>
+ <span class="k">GENERATE</span> <span class="k">group</span><span
class="p">,</span> <span class="k">COUNT</span><span class="p">(</span><span
class="n">distincts</span><span class="p">)</span> <span class="k">AS</span>
<span class="n">distinct_count</span><span class="p">;</span>
+ <span class="p">};</span>
+
+<span class="k">DUMP</span> <span class="n">counts</span><span
class="p">;</span>
+</code></pre>
+
+<p>This will produce the following output:</p>
+
+<p><br>
+<script src="https://gist.github.com/eyala/a9cd0ffb99039758f63b9d08c40b1124.js"></script>
+<br></p>
+
+<p>Note that <em>julia</em> has a count of 1: although she has 2 rows, they have the same transaction id.</p>
+
+<p>However, accounts in PayPal can differ wildly in their scope. For example, a transactions table might have only a few purchases for an individual, but millions for a large company. This is an example of data skew, and the procedure I described above would not work effectively in such cases. This has to do with how Pig translates the nested foreach statement: it will keep all the distinct records in memory while counting.</p>
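+<p>The idea behind the fix can be sketched in plain Python: stop accumulating distinct values once a cap is reached, so memory per group is bounded by the cap instead of by the number of distinct values. This is a hypothetical illustration of the principle, not the UDF’s actual code:</p>

```python
# Hypothetical sketch of the CountDistinctUpTo idea: stop collecting
# distinct values once the cap is reached, bounding memory use per group.
def count_distinct_up_to(values, cap):
    seen = set()
    for v in values:
        seen.add(v)
        if len(seen) == cap:  # cap reached: no need to look any further
            break
    return len(seen)

print(count_distinct_up_to([10, 11, 11, 12, 13], 3))  # prints 3 (stops early)
print(count_distinct_up_to([10, 11, 11, 12, 13], 5))  # prints 4 (true count)
```

For a group with millions of distinct values and a cap of 3, the sketch holds at most 3 values in memory, which is where the performance win comes from.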
+
+<p>In order to get the same count with much better performance, you can use the <em>CountDistinctUpTo</em> UDF. Let’s look at the following Pig script, which counts distinct transactions up to 3 and 5:</p>
+<pre class="highlight pig"><code><span class="k">REGISTER</span>
<span class="n">datafu</span><span class="o">-</span><span class="n">pig</span><span
class="o">-</span><span class="mi">1</span><span class="p">.</span><span
class="mi">5</span><span class="p">.</span><span class="mi">0</span><span
class="p">.</span><span class="n">jar</span><span class="p">;</span>
+
+<span class="k">DEFINE</span> <span class="n">CountDistinctUpTo3</span>
<span class="n">datafu</span><span class="p">.</span><span class="n">pig</span><span
class="p">.</span><span class="n">bags</span><span class="p">.</span><span
class="n">CountDistinctUpTo</span><span class="p">(</span><span class="s1">'3'</span><span
class="p">);</span>
+<span class="k">DEFINE</span> <span class="n">CountDistinctUpTo5</span>
<span class="n">datafu</span><span class="p">.</span><span class="n">pig</span><span
class="p">.</span><span class="n">bags</span><span class="p">.</span><span
class="n">CountDistinctUpTo</span><span class="p">(</span><span class="s1">'5'</span><span
class="p">);</span>
+
+<span class="n">data</span> <span class="o">=</span> <span class="k">LOAD</span>
<span class="s1">'transactions.csv'</span> <span class="k">USING</span>
<span class="n">PigStorage</span><span class="p">(</span><span
class="s1">','</span><span class="p">)</span> <span class="k">AS</span>
<span class="p">(</span><span class="n">name</span><span class="p">:</span>
<span class="n">chararray</span><span class="p">,</span> <span
class="n">transaction_id</span><span class="p">:</span><span class="n">int</span><span
class="p">);</span>
+
+<span class="n">grouped</span> <span class="o">=</span> <span
class="k">GROUP</span> <span class="n">data</span> <span class="k">BY</span>
<span class="n">name</span><span class="p">;</span>
+
+<span class="n">counts</span> <span class="o">=</span> <span class="k">FOREACH</span>
<span class="n">grouped</span> <span class="k">GENERATE</span> <span
class="k">group</span><span class="p">,</span><span class="n">CountDistinctUpTo3</span><span
class="p">(</span><span class="n">$1</span><span class="p">)</span>
<span class="k">as</span> <span class="n">cnt3</span><span class="p">,</span>
<span class="n">CountDistinctUpTo5</span><span class="p">(</span><span
class="n">$1</span><span class="p">)</span> <span class="k">AS</span>
<span class="n">cnt5</span><span class="p">;</span>
+
+<span class="k">DUMP</span> <span class="n">counts</span><span
class="p">;</span>
+</code></pre>
+
+<p>This results in the following output:</p>
+
+<p><br>
+<script src="https://gist.github.com/eyala/19e22fb251fe2222b3ccea6f78e37a85.js"></script>
+<br></p>
+
+<p>Notice that when we ask <em>CountDistinctUpTo</em> to stop at 3, <em>quentin</em>
gets a count of 3, even though he has 4 transactions. When we use 5 as a parameter to <em>CountDistinctUpTo</em>,
he gets the actual count of 4.</p>
+
+<p>In our example, there’s no real reason to use the <em>CountDistinctUpTo</em> UDF. But in our “real” use case, stopping the count at a small number instead of counting millions saves resources and time. The improvement comes from the fact that the UDF doesn’t need to keep all the records in memory in order to return the desired result.</p>
+
+<hr>
+
+<p><br></p>
+
+<p>I hope that I’ve managed to explain how to use our new contributions to DataFu. You can find all of the files used in this post by clicking through to the GitHub gists.</p>
+
+<hr>
+
+<p>A version of this post has appeared in the <a href="https://medium.com/paypal-engineering/a-guide-to-paypals-contributions-to-apache-datafu-b30cc25e0312">PayPal Engineering Blog</a>.</p>
+
+
+ </article>
+</div>
+
+
+
+<div class="footer">
+
+<div class="feather">
+<a href="http://www.apache.org/" target="_blank"><img src="/images/feather.png"
alt="Apache Feather" title="Apache Feather"/></a>
+</div>
+
+<div class="copyright">
+ Copyright © 2011-2019 The Apache Software Foundation, Licensed under the <a
href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.<br>
+ Apache DataFu, DataFu, Apache Pig, Apache Hadoop, Hadoop, Apache, and the Apache feather
logo are either registered trademarks or trademarks of the <a href="http://www.apache.org/">Apache
Software Foundation</a> in the United States and other countries.
+</div>
+</div>
+
+ </div>
+
+ </body>
+</html>
\ No newline at end of file