datafu-commits mailing list archives

From mha...@apache.org
Subject svn commit: r1709884 [2/8] - in /incubator/datafu/site: ./ blog/ blog/2012/01/10/ blog/2013/01/24/ blog/2013/09/04/ blog/2013/10/03/ blog/2014/04/27/ community/ docs/ docs/datafu/ docs/datafu/guide/ docs/hourglass/ javascripts/ stylesheets/
Date Wed, 21 Oct 2015 17:00:40 GMT
Modified: incubator/datafu/site/blog/2013/09/04/datafu-1-0.html
URL: http://svn.apache.org/viewvc/incubator/datafu/site/blog/2013/09/04/datafu-1-0.html?rev=1709884&r1=1709883&r2=1709884&view=diff
==============================================================================
--- incubator/datafu/site/blog/2013/09/04/datafu-1-0.html (original)
+++ incubator/datafu/site/blog/2013/09/04/datafu-1-0.html Wed Oct 21 17:00:40 2015
@@ -1,3 +1,5 @@
+
+
 <!doctype html>
 <html>
   <head>
@@ -10,11 +12,9 @@
     <!-- Use title if it's in the page YAML frontmatter -->
     <title>DataFu 1.0</title>
     
-    <link href="/stylesheets/all.css" media="screen" rel="stylesheet" type="text/css" />
-<link href="/stylesheets/highlight.css" media="screen" rel="stylesheet" type="text/css" />
-    <script src="/javascripts/all.js" type="text/javascript"></script>
+    <link href="/stylesheets/all.css" rel="stylesheet" /><link href="/stylesheets/highlight.css" rel="stylesheet" />
+    <script src="/javascripts/all.js"></script>
 
-    
     <script type="text/javascript">
       var _gaq = _gaq || [];
       _gaq.push(['_setAccount', 'UA-30533336-2']);
@@ -26,14 +26,14 @@
         var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
       })();
     </script>
-    
   </head>
   
   <body class="blog blog_2013 blog_2013_09 blog_2013_09_04 blog_2013_09_04_datafu-1-0">
 
     <div class="container">
 
-      <div class="header">
+      
+<div class="header">
 
   <ul class="nav nav-pills pull-right">
     <li><a href="/">Home</a></li>
@@ -49,94 +49,98 @@
   <article class="col-lg-10">
     <h1>DataFu 1.0</h1>
     <h5 class="text-muted"><time>Sep  4, 2013</time></h5>
-    
       <h5 class="text-muted">William Vaughan</h5>
-    
 
     <hr>
 
-    <p><a href="http://data.linkedin.com/opensource/datafu">DataFu</a> is an open-source collection of user-defined functions for working with large-scale data in <a href="http://hadoop.apache.org/">Hadoop</a> and <a href="http://pig.apache.org/">Pig</a>.</p>
+    <p><em>Update (10/15/2015): The links in this blog post have been updated to point to the correct locations within the Apache DataFu website.</em></p>
+
+<p><a href="/">DataFu</a> is an open-source collection of user-defined functions for working with large-scale data in <a href="http://hadoop.apache.org/">Hadoop</a> and <a href="http://pig.apache.org/">Pig</a>.</p>
 
 <p>About two years ago, we recognized a need for a stable, well-tested library of Pig UDFs that could assist in common data mining and statistics tasks. Over the years, we had developed several routines that were used across LinkedIn and had been thrown together into an internal package we affectionately called “littlepiggy.” The unfortunate part, and this is true of many such efforts, is that the UDFs were ill-documented, ill-organized, and easily broken when someone made a change. Along came <a href="http://pig.apache.org/docs/r0.11.1/test.html#pigunit">PigUnit</a>, which allowed UDF testing, so we spent the time to clean up these routines by adding documentation and rigorous unit tests. We thought the resulting “datafoo” package would help the community at large, and there you have the initial release of DataFu.</p>
 
-<p>Since then, the project has continued to evolve. We have accepted contributions from a number of sources, improved the style and quality of testing, and adapted to the changing features and versions of Pig. During this time DataFu has been used extensively at LinkedIn for many of our data-driven products like &quot;People You May Know&quot; and &quot;Skills and Endorsements.&quot; The library is used at numerous companies, and it has also been included in Cloudera&#39;s Hadoop distribution (<a href="http://www.cloudera.com/content/cloudera/en/products/cdh.html">CDH</a>) as well as the <a href="http://bigtop.apache.org/">Apache Bigtop</a> project. DataFu has matured, and we are proud to announce the <a href="https://github.com/linkedin/datafu/blob/master/changes.md">1.0 release</a>.</p>
+<p>Since then, the project has continued to evolve. We have accepted contributions from a number of sources, improved the style and quality of testing, and adapted to the changing features and versions of Pig. During this time DataFu has been used extensively at LinkedIn for many of our data-driven products like &quot;People You May Know&quot; and &quot;Skills and Endorsements.&quot; The library is used at numerous companies, and it has also been included in Cloudera&#39;s Hadoop distribution (<a href="http://www.cloudera.com/content/cloudera/en/products/cdh.html">CDH</a>) as well as the <a href="http://bigtop.apache.org/">Apache Bigtop</a> project. DataFu has matured, and we are proud to announce the <a href="/docs/datafu/1.0.0/">1.0 release</a>.</p>
 
 <p>This release of DataFu has a number of new features that can make writing Pig easier, cleaner, and more efficient. In this post, we are going to highlight some of these new features by walking through a large number of examples. Think of this as a HowTo Pig + DataFu guide.</p>
 
-<h2 id="toc_0">Counting events</h2>
+<h2 id="counting-events">Counting events</h2>
 
 <p>Let&#39;s consider a hypothetical recommendation system. In this system, a user will be recommended an item (an impression). The user can then accept that recommendation, explicitly reject that recommendation, or simply ignore it. A common task in such a system would be to count how many times users have seen and acted on items. How could we construct a Pig script to implement this task?</p>
 
-<h2 id="toc_1">Setup</h2>
+<h2 id="setup">Setup</h2>
 
 <p>Before we start, it&#39;s best to define what exactly we want to do, our inputs and our outputs. The task is to generate, for each user, a list of all items that user has seen with a count of how many impressions were seen, how many were accepted, and how many were rejected.</p>
 
 <p>In summary, our desired output schema is:</p>
-<pre class="highlight text">features: {user_id:int, items:{(item_id:int, impression_count:int, accept_count:int, reject_count:int)}}
-</pre>
+<pre class="highlight plaintext"><code>features: {user_id:int, items:{(item_id:int, impression_count:int, accept_count:int, reject_count:int)}}
+</code></pre>
+
 <p>For input, we can load a record for each event:</p>
-<pre class="highlight pig"><span class="n">impressions</span> <span class="o">=</span> <span class="k">LOAD</span> <span class="s1">'$impressions'</span> <span class="k">AS</span> <span class="p">(</span><span class="n">user_id</span><span class="p">:</span><span class="n">int</span><span class="p">,</span> <span class="n">item_id</span><span class="p">:</span><span class="n">int</span><span class="p">,</span> <span class="n">timestamp</span><span class="p">:</span><span class="n">long</span><span class="p">);</span>
+<pre class="highlight pig"><code><span class="n">impressions</span> <span class="o">=</span> <span class="k">LOAD</span> <span class="s1">'$impressions'</span> <span class="k">AS</span> <span class="p">(</span><span class="n">user_id</span><span class="p">:</span><span class="n">int</span><span class="p">,</span> <span class="n">item_id</span><span class="p">:</span><span class="n">int</span><span class="p">,</span> <span class="n">timestamp</span><span class="p">:</span><span class="n">long</span><span class="p">);</span>
 <span class="n">accepts</span> <span class="o">=</span> <span class="k">LOAD</span> <span class="s1">'$accepts'</span> <span class="k">AS</span> <span class="p">(</span><span class="n">user_id</span><span class="p">:</span><span class="n">int</span><span class="p">,</span> <span class="n">item_id</span><span class="p">:</span><span class="n">int</span><span class="p">,</span> <span class="n">timestamp</span><span class="p">:</span><span class="n">long</span><span class="p">);</span>
 <span class="n">rejects</span> <span class="o">=</span> <span class="k">LOAD</span> <span class="s1">'$rejects'</span> <span class="k">AS</span> <span class="p">(</span><span class="n">user_id</span><span class="p">:</span><span class="n">int</span><span class="p">,</span> <span class="n">item_id</span><span class="p">:</span><span class="n">int</span><span class="p">,</span> <span class="n">timestamp</span><span class="p">:</span><span class="n">long</span><span class="p">);</span>
-</pre>
-<h2 id="toc_2">A naive approach</h2>
+</code></pre>
+
+<h2 id="a-naive-approach">A naive approach</h2>
 
 <p>The straightforward approach to this problem generates each of the counts that we want, joins all of these counts together, and then groups them by user to produce the desired output:</p>
-<pre class="highlight pig"><span class="n">impressions_counted</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="p">(</span><span class="k">GROUP</span> <span class="n">impressions</span> <span class="k">BY</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">))</span> <span class="k">GENERATE</span>
+<pre class="highlight pig"><code><span class="n">impressions_counted</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="p">(</span><span class="k">GROUP</span> <span class="n">impressions</span> <span class="k">BY</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">))</span> <span class="k">GENERATE</span>
   <span class="k">FLATTEN</span><span class="p">(</span><span class="k">group</span><span class="p">)</span> <span class="k">as</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">),</span> <span class="k">COUNT_STAR</span><span class="p">(</span><span class="n">impressions</span><span class="p">)</span> <span class="k">as</span> <span class="k">count</span><span class="p">;</span>
 <span class="n">accepts_counted</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="p">(</span><span class="k">GROUP</span> <span class="n">accepts</span> <span class="k">BY</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">))</span> <span class="k">GENERATE</span>
   <span class="k">FLATTEN</span><span class="p">(</span><span class="k">group</span><span class="p">)</span> <span class="k">as</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">),</span> <span class="k">COUNT_STAR</span><span class="p">(</span><span class="n">accepts</span><span class="p">)</span> <span class="k">as</span> <span class="k">count</span><span class="p">;</span>
 <span class="n">rejects_counted</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="p">(</span><span class="k">GROUP</span> <span class="n">rejects</span> <span class="k">BY</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">))</span> <span class="k">GENERATE</span>
   <span class="k">FLATTEN</span><span class="p">(</span><span class="k">group</span><span class="p">)</span> <span class="k">as</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">),</span> <span class="k">COUNT_STAR</span><span class="p">(</span><span class="n">rejects</span><span class="p">)</span> <span class="k">as</span> <span class="k">count</span><span class="p">;</span>
 
-<span class="n">joined_accepts</span> <span class="o">=</span> <span class="k">JOIN</span> <span class="n">impressions_counted</span> <span class="k">BY</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">)</span> <span class="k">LEFT</span> <span class="k">OUTER</span><span class="p">,</span> <span class="n">accepts_counted</span> <span class="k">BY</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">);</span>  
-<span class="n">joined_accepts</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="n">joined_accepts</span> <span class="k">GENERATE</span> 
+<span class="n">joined_accepts</span> <span class="o">=</span> <span class="k">JOIN</span> <span class="n">impressions_counted</span> <span class="k">BY</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">)</span> <span class="k">LEFT</span> <span class="k">OUTER</span><span class="p">,</span> <span class="n">accepts_counted</span> <span class="k">BY</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">);</span>
+<span class="n">joined_accepts</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="n">joined_accepts</span> <span class="k">GENERATE</span>
   <span class="n">impressions_counted</span><span class="p">::</span><span class="n">user_id</span> <span class="k">as</span> <span class="n">user_id</span><span class="p">,</span>
   <span class="n">impressions_counted</span><span class="p">::</span><span class="n">item_id</span> <span class="k">as</span> <span class="n">item_id</span><span class="p">,</span>
   <span class="n">impressions_counted</span><span class="p">::</span><span class="k">count</span> <span class="k">as</span> <span class="n">impression_count</span><span class="p">,</span>
   <span class="p">((</span><span class="n">accepts_counted</span><span class="p">::</span><span class="k">count</span> <span class="k">is</span> <span class="n">null</span><span class="p">)</span><span class="o">?</span><span class="mi">0</span><span class="p">:</span><span class="n">accepts_counted</span><span class="p">::</span><span class="k">count</span><span class="p">)</span> <span class="k">as</span> <span class="n">accept_count</span><span class="p">;</span>
 
 <span class="n">joined_accepts_rejects</span> <span class="o">=</span> <span class="k">JOIN</span> <span class="n">joined_accepts</span> <span class="k">BY</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">)</span> <span class="k">LEFT</span> <span class="k">OUTER</span><span class="p">,</span> <span class="n">rejects_counted</span> <span class="k">BY</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">);</span>
-<span class="n">joined_accepts_rejects</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="n">joined_accepts_rejects</span> <span class="k">GENERATE</span> 
+<span class="n">joined_accepts_rejects</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="n">joined_accepts_rejects</span> <span class="k">GENERATE</span>
   <span class="n">joined_accepts</span><span class="p">::</span><span class="n">user_id</span> <span class="k">as</span> <span class="n">user_id</span><span class="p">,</span>
   <span class="n">joined_accepts</span><span class="p">::</span><span class="n">item_id</span> <span class="k">as</span> <span class="n">item_id</span><span class="p">,</span>
   <span class="n">joined_accepts</span><span class="p">::</span><span class="n">impression_count</span> <span class="k">as</span> <span class="n">impression_count</span><span class="p">,</span>
   <span class="n">joined_accepts</span><span class="p">::</span><span class="n">accept_count</span> <span class="k">as</span> <span class="n">accept_count</span><span class="p">,</span>
   <span class="p">((</span><span class="n">rejects_counted</span><span class="p">::</span><span class="k">count</span> <span class="k">is</span> <span class="n">null</span><span class="p">)</span><span class="o">?</span><span class="mi">0</span><span class="p">:</span><span class="n">rejects_counted</span><span class="p">::</span><span class="k">count</span><span class="p">)</span> <span class="k">as</span> <span class="n">reject_count</span><span class="p">;</span>
 
-<span class="n">features</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="p">(</span><span class="k">GROUP</span> <span class="n">joined_accepts_rejects</span> <span class="k">BY</span> <span class="n">user_id</span><span class="p">)</span> <span class="k">GENERATE</span> 
+<span class="n">features</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="p">(</span><span class="k">GROUP</span> <span class="n">joined_accepts_rejects</span> <span class="k">BY</span> <span class="n">user_id</span><span class="p">)</span> <span class="k">GENERATE</span>
   <span class="k">group</span> <span class="k">as</span> <span class="n">user_id</span><span class="p">,</span> <span class="n">joined_accepts_rejects</span><span class="p">.(</span><span class="n">item_id</span><span class="p">,</span> <span class="n">impression_count</span><span class="p">,</span> <span class="n">accept_count</span><span class="p">,</span> <span class="n">reject_count</span><span class="p">)</span> <span class="k">as</span> <span class="n">items</span><span class="p">;</span>
-</pre>
+</code></pre>
+
 <p>Unfortunately, this approach is not very efficient. It generates six mapreduce jobs during execution and streams a lot of the same data through these jobs.</p>
 
-<h2 id="toc_3">A better approach</h2>
+<h2 id="a-better-approach">A better approach</h2>
 
 <p>Recognizing that we can combine the outer joins and group operations into a single <code>cogroup</code> allows us to reduce the number of mapreduce jobs.</p>
-<pre class="highlight pig"><span class="n">features_grouped</span> <span class="o">=</span> <span class="k">COGROUP</span> <span class="n">impressions</span> <span class="k">BY</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">),</span> <span class="n">accepts</span> <span class="k">BY</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">),</span> <span class="n">rejects</span> <span class="k">BY</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">);</span>
-<span class="n">features_counted</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="n">features_grouped</span> <span class="k">GENERATE</span> 
+<pre class="highlight pig"><code><span class="n">features_grouped</span> <span class="o">=</span> <span class="k">COGROUP</span> <span class="n">impressions</span> <span class="k">BY</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">),</span> <span class="n">accepts</span> <span class="k">BY</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">),</span> <span class="n">rejects</span> <span class="k">BY</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">);</span>
+<span class="n">features_counted</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="n">features_grouped</span> <span class="k">GENERATE</span>
   <span class="k">FLATTEN</span><span class="p">(</span><span class="k">group</span><span class="p">)</span> <span class="k">as</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">),</span>
   <span class="k">COUNT_STAR</span><span class="p">(</span><span class="n">impressions</span><span class="p">)</span> <span class="k">as</span> <span class="n">impression_count</span><span class="p">,</span>
   <span class="k">COUNT_STAR</span><span class="p">(</span><span class="n">accepts</span><span class="p">)</span> <span class="k">as</span> <span class="n">accept_count</span><span class="p">,</span>
   <span class="k">COUNT_STAR</span><span class="p">(</span><span class="n">rejects</span><span class="p">)</span> <span class="k">as</span> <span class="n">reject_count</span><span class="p">;</span>
 
-<span class="n">features</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="p">(</span><span class="k">GROUP</span> <span class="n">features_counted</span> <span class="k">BY</span> <span class="n">user_id</span><span class="p">)</span> <span class="k">GENERATE</span> 
+<span class="n">features</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="p">(</span><span class="k">GROUP</span> <span class="n">features_counted</span> <span class="k">BY</span> <span class="n">user_id</span><span class="p">)</span> <span class="k">GENERATE</span>
   <span class="k">group</span> <span class="k">as</span> <span class="n">user_id</span><span class="p">,</span>
   <span class="n">features_counted</span><span class="p">.(</span><span class="n">item_id</span><span class="p">,</span> <span class="n">impression_count</span><span class="p">,</span> <span class="n">accept_count</span><span class="p">,</span> <span class="n">reject_count</span><span class="p">)</span> <span class="k">as</span> <span class="n">items</span><span class="p">;</span>
-</pre>
+</code></pre>
+
 <p>However, we still have to perform an extra group operation to bring everything together by <code>user_id</code> for a total of two mapreduce jobs.</p>
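 <p>To make the two-step shape of this approach concrete, here is a rough Python model of it, not Pig and not part of DataFu. The function names <code>cogroup_counts</code> and <code>group_by_user</code> and the sample events are made up for illustration; each function stands in for one of the two grouping passes.</p>

```python
from collections import defaultdict


def cogroup_counts(impressions, accepts, rejects):
    """Model COGROUP ... BY (user_id, item_id) followed by COUNT_STAR per bag."""
    # One [impression_count, accept_count, reject_count] triple per key.
    counts = defaultdict(lambda: [0, 0, 0])
    for slot, events in enumerate((impressions, accepts, rejects)):
        for user_id, item_id, _timestamp in events:
            counts[(user_id, item_id)][slot] += 1
    return counts


def group_by_user(counts):
    """Model the second GROUP ... BY user_id that collects items per user."""
    features = defaultdict(list)
    for (user_id, item_id), (imp, acc, rej) in sorted(counts.items()):
        features[user_id].append((item_id, imp, acc, rej))
    return dict(features)


# Hypothetical events: (user_id, item_id, timestamp)
impressions = [(1, 10, 100), (1, 10, 101), (1, 20, 102), (2, 10, 103)]
accepts = [(1, 10, 104)]
rejects = [(1, 20, 105)]

features = group_by_user(cogroup_counts(impressions, accepts, rejects))
print(features)
```

 <p>The second function only regroups data the first one has already touched, which is the redundancy the next section sets out to remove.</p>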
 
-<h2 id="toc_4">The best approach: DataFu</h2>
+<h2 id="the-best-approach-datafu">The best approach: DataFu</h2>
 
 <p>The two grouping operations in the last example operate on the same set of data. It would be great if we could just get rid of one of them somehow.</p>
 
 <p>One thing that we have noticed is that even very big data will frequently get reasonably small once you segment it sufficiently. In this case, we have to segment down to the user level for our output. That&#39;s small enough to fit in memory. So, with a little bit of DataFu, we can group up all of the data for that user, and process it in one pass:</p>
-<pre class="highlight pig"><span class="k">DEFINE</span> <span class="n">CountEach</span> <span class="n">datafu</span><span class="p">.</span><span class="n">pig</span><span class="p">.</span><span class="n">bags</span><span class="p">.</span><span class="n">CountEach</span><span class="p">(</span><span class="s1">'flatten'</span><span class="p">);</span>
+<pre class="highlight pig"><code><span class="k">DEFINE</span> <span class="n">CountEach</span> <span class="n">datafu</span><span class="p">.</span><span class="n">pig</span><span class="p">.</span><span class="n">bags</span><span class="p">.</span><span class="n">CountEach</span><span class="p">(</span><span class="s1">'flatten'</span><span class="p">);</span>
 <span class="k">DEFINE</span> <span class="n">BagLeftOuterJoin</span> <span class="n">datafu</span><span class="p">.</span><span class="n">pig</span><span class="p">.</span><span class="n">bags</span><span class="p">.</span><span class="n">BagLeftOuterJoin</span><span class="p">();</span>
 <span class="k">DEFINE</span> <span class="n">Coalesce</span> <span class="n">datafu</span><span class="p">.</span><span class="n">pig</span><span class="p">.</span><span class="n">util</span><span class="p">.</span><span class="n">Coalesce</span><span class="p">();</span>
 
 <span class="n">features_grouped</span> <span class="o">=</span> <span class="k">COGROUP</span> <span class="n">impressions</span> <span class="k">BY</span> <span class="n">user_id</span><span class="p">,</span> <span class="n">accepts</span> <span class="k">BY</span> <span class="n">user_id</span><span class="p">,</span> <span class="n">rejects</span> <span class="k">BY</span> <span class="n">user_id</span><span class="p">;</span>
 
-<span class="n">features_counted</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="n">features_grouped</span> <span class="k">GENERATE</span> 
+<span class="n">features_counted</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="n">features_grouped</span> <span class="k">GENERATE</span>
   <span class="k">group</span> <span class="k">as</span> <span class="n">user_id</span><span class="p">,</span>
   <span class="n">CountEach</span><span class="p">(</span><span class="n">impressions</span><span class="p">.</span><span class="n">item_id</span><span class="p">)</span> <span class="k">as</span> <span class="n">impressions</span><span class="p">,</span>
   <span class="n">CountEach</span><span class="p">(</span><span class="n">accepts</span><span class="p">.</span><span class="n">item_id</span><span class="p">)</span> <span class="k">as</span> <span class="n">accepts</span><span class="p">,</span>
@@ -158,35 +162,38 @@
     <span class="n">Coalesce</span><span class="p">(</span><span class="n">rejects</span><span class="p">::</span><span class="k">count</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="k">as</span> <span class="n">reject_count</span><span class="p">;</span>
   <span class="k">GENERATE</span> <span class="n">user_id</span><span class="p">,</span> <span class="n">projected</span> <span class="k">as</span> <span class="n">items</span><span class="p">;</span>
 <span class="p">}</span>
-</pre>
+</code></pre>
+
 <p>So, let&#39;s step through this example and see how it works and what our data looks like along the way.</p>
 
-<h3 id="toc_5">Group the features</h3>
+<h3 id="group-the-features">Group the features</h3>
 
 <p>First we group all of the data together by the user, getting a few bags with all of the respective event data in the bag.</p>
-<pre class="highlight pig"><span class="n">features_grouped</span> <span class="o">=</span> <span class="k">COGROUP</span> <span class="n">impressions</span> <span class="k">BY</span> <span class="n">user_id</span><span class="p">,</span> <span class="n">accepts</span> <span class="k">BY</span> <span class="n">user_id</span><span class="p">,</span> <span class="n">rejects</span> <span class="k">BY</span> <span class="n">user_id</span><span class="p">;</span>
+<pre class="highlight pig"><code><span class="n">features_grouped</span> <span class="o">=</span> <span class="k">COGROUP</span> <span class="n">impressions</span> <span class="k">BY</span> <span class="n">user_id</span><span class="p">,</span> <span class="n">accepts</span> <span class="k">BY</span> <span class="n">user_id</span><span class="p">,</span> <span class="n">rejects</span> <span class="k">BY</span> <span class="n">user_id</span><span class="p">;</span>
 
 <span class="c1">--features_grouped: {group: int,impressions: {(user_id: int,item_id: int,timestamp: long)},accepts: {(user_id: int,item_id: int,timestamp: long)},rejects: {(user_id: int,item_id: int,timestamp: long)}}
-</span></pre>
-<h3 id="toc_6">CountEach</h3>
+</span></code></pre>
+
+<h3 id="counteach">CountEach</h3>
 
 <p>Next we count the occurrences of each item in the impression, accept, and reject bags.</p>
-<pre class="highlight pig"><span class="k">DEFINE</span> <span class="n">CountEach</span> <span class="n">datafu</span><span class="p">.</span><span class="n">pig</span><span class="p">.</span><span class="n">bags</span><span class="p">.</span><span class="n">CountEach</span><span class="p">(</span><span class="s1">'flatten'</span><span class="p">);</span>
+<pre class="highlight pig"><code><span class="k">DEFINE</span> <span class="n">CountEach</span> <span class="n">datafu</span><span class="p">.</span><span class="n">pig</span><span class="p">.</span><span class="n">bags</span><span class="p">.</span><span class="n">CountEach</span><span class="p">(</span><span class="s1">'flatten'</span><span class="p">);</span>
 
-<span class="n">features_counted</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="n">features_grouped</span> <span class="k">GENERATE</span> 
+<span class="n">features_counted</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="n">features_grouped</span> <span class="k">GENERATE</span>
     <span class="k">group</span> <span class="k">as</span> <span class="n">user_id</span><span class="p">,</span>
     <span class="n">CountEach</span><span class="p">(</span><span class="n">impressions</span><span class="p">.</span><span class="n">item_id</span><span class="p">)</span> <span class="k">as</span> <span class="n">impressions</span><span class="p">,</span>
     <span class="n">CountEach</span><span class="p">(</span><span class="n">accepts</span><span class="p">.</span><span class="n">item_id</span><span class="p">)</span> <span class="k">as</span> <span class="n">accepts</span><span class="p">,</span>
     <span class="n">CountEach</span><span class="p">(</span><span class="n">rejects</span><span class="p">.</span><span class="n">item_id</span><span class="p">)</span> <span class="k">as</span> <span class="n">rejects</span><span class="p">;</span>
 
 <span class="c1">--features_counted: {user_id: int,impressions: {(item_id: int,count: int)},accepts: {(item_id: int,count: int)},rejects: {(item_id: int,count: int)}}
-</span></pre>
+</span></code></pre>
+
 <p>CountEach is a new UDF in DataFu that iterates through a bag counting the number of occurrences of each distinct tuple. In this case, we want to count occurrences of items, so we project the inner tuples of the bag to contain just the <code>item_id</code>. Since we specified the optional &#39;flatten&#39; argument in the constructor, the output of the UDF will be a bag of each distinct input tuple (item_id) with a count field appended.</p>
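 <p>As a rough mental model of the behavior described above, the 'flatten' mode can be sketched in a few lines of Python. This is an illustration only: <code>count_each</code> is a hypothetical helper, not the DataFu implementation, and the order of tuples in the real UDF's output bag should not be relied on.</p>

```python
from collections import Counter


def count_each(bag):
    """Return one tuple per distinct input tuple, with its count appended."""
    return [tup + (n,) for tup, n in Counter(bag).items()]


# A bag projected down to just item_id, as in impressions.item_id
impression_item_ids = [(10,), (10,), (20,)]

counted = count_each(impression_item_ids)
print(sorted(counted))
```

 <p>Each distinct <code>(item_id)</code> tuple appears once in the result, paired with the number of times it occurred in the input bag.</p>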
 
-<h3 id="toc_7">BagLeftOuterJoin</h3>
+<h3 id="bagleftouterjoin">BagLeftOuterJoin</h3>
 
 <p>Now, we want to combine all of the separate counts for each type of event together into one tuple per item.</p>
-<pre class="highlight pig"><span class="k">DEFINE</span> <span class="n">BagLeftOuterJoin</span> <span class="n">datafu</span><span class="p">.</span><span class="n">pig</span><span class="p">.</span><span class="n">bags</span><span class="p">.</span><span class="n">BagLeftOuterJoin</span><span class="p">();</span>
+<pre class="highlight pig"><code><span class="k">DEFINE</span> <span class="n">BagLeftOuterJoin</span> <span class="n">datafu</span><span class="p">.</span><span class="n">pig</span><span class="p">.</span><span class="n">bags</span><span class="p">.</span><span class="n">BagLeftOuterJoin</span><span class="p">();</span>
 
 <span class="n">features_joined</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="n">features_counted</span> <span class="k">GENERATE</span>
     <span class="n">user_id</span><span class="p">,</span>
@@ -197,20 +204,22 @@
     <span class="p">)</span> <span class="k">as</span> <span class="n">items</span><span class="p">;</span>
 
 <span class="c1">--features_joined: {user_id: int,items: {(impressions::item_id: int,impressions::count: int,accepts::item_id: int,accepts::count: int,rejects::item_id: int,rejects::count: int)}}
-</span></pre>
+</span></code></pre>
+
 <p>This is a join operation, but unfortunately, the only join operation that Pig allows on bags (in a nested foreach) is <code>CROSS</code>. DataFu provides the BagLeftOuterJoin UDF to make up for this limitation. This UDF performs an in-memory hash join of each bag using the specified field as the join key. The output of this UDF mimics what you would expect from this bit of not (yet) valid Pig:</p>
-<pre class="highlight pig"><span class="n">features_joined</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="n">features_counted</span> <span class="p">{</span>
+<pre class="highlight pig"><code><span class="n">features_joined</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="n">features_counted</span> <span class="p">{</span>
   <span class="n">items</span> <span class="o">=</span> <span class="k">JOIN</span> <span class="n">impressions</span> <span class="k">BY</span> <span class="n">item_id</span> <span class="k">LEFT</span> <span class="k">OUTER</span><span class="p">,</span> <span class="n">accepts</span> <span class="k">BY</span> <span class="n">item_id</span><span class="p">,</span> <span class="n">rejects</span> <span class="k">BY</span> <span class="n">item_id</span><span class="p">;</span>
   <span class="k">GENERATE</span>
     <span class="n">user_id</span><span class="p">,</span> <span class="n">items</span><span class="p">;</span>
 <span class="p">}</span>
-</pre>
+</code></pre>
+
 <p>Because <code>BagLeftOuterJoin</code> is a UDF and works in memory, a separate map-reduce job is not launched. This fact will save us some time as we&#39;ll see later on in the analysis.</p>
 
-<h3 id="toc_8">Coalesce</h3>
+<h3 id="coalesce">Coalesce</h3>
 
 <p>Finally, we have our data in about the right shape. We just need to clean up the schema and put some default values in place.</p>
-<pre class="highlight pig"><span class="k">DEFINE</span> <span class="n">Coalesce</span> <span class="n">datafu</span><span class="p">.</span><span class="n">pig</span><span class="p">.</span><span class="n">util</span><span class="p">.</span><span class="n">Coalesce</span><span class="p">();</span>
+<pre class="highlight pig"><code><span class="k">DEFINE</span> <span class="n">Coalesce</span> <span class="n">datafu</span><span class="p">.</span><span class="n">pig</span><span class="p">.</span><span class="n">util</span><span class="p">.</span><span class="n">Coalesce</span><span class="p">();</span>
 
 <span class="n">features</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="n">features_joined</span> <span class="p">{</span>
     <span class="n">projected</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="n">items</span> <span class="k">GENERATE</span>
@@ -222,10 +231,11 @@
 <span class="p">}</span>
 
 <span class="c1">--features: {user_id: int,items: {(item_id: int,impression_count: int,accept_count: int,reject_count: int)}}
-</span></pre>
+</span></code></pre>
+
 <p>The various counts were joined together using an outer join in the previous step because a user has not necessarily performed an accept or reject action on each item that he or she has seen. If they have not acted, those fields will be null. <code>Coalesce</code> returns its first non-null parameter, which allows us to cleanly replace that null with a zero, avoiding the need for a bincond operator and maintaining the correct schema. Done!</p>
 
-<h2 id="toc_9">Analysis</h2>
+<h2 id="analysis">Analysis</h2>
 
 <p>Ok great, we now have three ways to write the same script. We know that the naive way will trigger six mapreduce jobs, the better way two, and the DataFu way one, but does that really equate to a difference in performance?</p>
 
@@ -254,33 +264,33 @@
 
 <p>As we can see, the DataFu version provides a noticeable improvement in both metrics. Glad to know that work wasn&#39;t all for naught.</p>
 
-<h2 id="toc_10">Creating a custom purpose UDF</h2>
+<h2 id="creating-a-custom-purpose-udf">Creating a custom purpose UDF</h2>
 
 <p>Many UDFs, such as those presented in the previous section, are general purpose. DataFu serves to collect these UDFs and make sure they are tested and easily available. If you are writing such a UDF, then we will happily accept contributions. However, frequently when you sit down to write a UDF, it is because you need to insert some sort of custom business logic or calculation into your pig script. These types of UDFs can easily become complex, involving a large number of parameters or nested structures.</p>
 
-<h2 id="toc_11">Positional notation is bad</h2>
+<h2 id="positional-notation-is-bad">Positional notation is bad</h2>
 
 <p>Even once the code is written, you are not done. You have to maintain it.</p>
 
 <p>One of the difficult parts about this maintenance is that, as the Pig script that uses the UDF changes, a developer has to be sure not to change the parameters to the UDF. Worse, because a standard UDF references fields by position, it&#39;s very easy to introduce a subtle change that has an unintended side effect but does not trigger any errors at runtime, for example, when two fields of the same type swap positions.</p>
 
-<h2 id="toc_12">Aliases can be better</h2>
+<h2 id="aliases-can-be-better">Aliases can be better</h2>
 
 <p>Using aliases instead of positions makes it easier to maintain a consistent mapping between the UDF and the Pig script. If an alias is removed, the UDF will fail with an error. If an alias changes position in a tuple, the UDF does not need to care. The alias also has some semantic meaning to the developer, which can aid in the maintenance process.</p>
 
-<h2 id="toc_13">AliasableEvalFunc</h2>
+<h2 id="aliasableevalfunc">AliasableEvalFunc</h2>
 
 <p>Unfortunately, there is a problem with using aliases. As of Pig 0.11.1 they are not available when the UDF is exec&#39;ing on the back-end; they are only available on the front-end. The solution to this is to capture a mapping of alias to position on the front-end, store that mapping into the UDF context, retrieve it on the back-end, and use it to look up each position by alias. You also need to handle a few issues with complex schemas (nested tuples and bags), keeping track of UDF instances, etc. To make this process simpler, DataFu provides <code>AliasableEvalFunc</code>, an extension to the standard <code>EvalFunc</code> with all of this behavior included.</p>
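 
 <p>The alias-to-position idea can be illustrated without any Pig classes. The following is only a minimal sketch, not DataFu&#39;s implementation: <code>AliasLookupSketch</code>, <code>positionOf</code>, and the sample aliases are hypothetical names chosen for illustration, and plain arrays stand in for the schema and the tuple.</p>
 <pre class="highlight java"><code>public class AliasLookupSketch {
   // Hypothetical helper: find the position of an alias in a previously captured schema.
   static int positionOf(String[] aliases, String alias) {
     int position = 0;
     for (String candidate : aliases) {
       if (candidate.equals(alias)) {
         return position;
       }
       position++;
     }
     // Fail loudly when the alias is missing instead of silently reading the wrong field.
     throw new IllegalArgumentException("Unknown alias: " + alias);
   }
 
   public static void main(String[] args) {
     // The same logical record under two different field orders.
     String[] schemaBefore = {"principal", "num_payments"};
     Object[] tupleBefore = {200000.0, 360};
     String[] schemaAfter = {"num_payments", "principal"};
     Object[] tupleAfter = {360, 200000.0};
 
     // Lookup by alias reads the principal in both layouts.
     System.out.println(tupleBefore[positionOf(schemaBefore, "principal")]);
     System.out.println(tupleAfter[positionOf(schemaAfter, "principal")]);
   }
 }
 </code></pre>
 
 <p>A purely positional read of field 0 would return the principal for the first layout and the number of payments for the second, with no error raised in either case.</p>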
 
-<h3 id="toc_14">Mortgage payment example</h3>
+<h3 id="mortgage-payment-example">Mortgage payment example</h3>
 
 <p>Using <code>AliasableEvalFunc</code> is pretty simple; the primary difference is that you need to override <code>getOutputSchema</code> instead of <code>outputSchema</code>, and you have access to the alias-to-position map through a number of convenience methods. Consider the following example:</p>
-<pre class="highlight java"><span class="kd">public</span> <span class="kd">class</span> <span class="nc">MortgagePayment</span> <span class="kd">extends</span> <span class="n">AliasableEvalFunc</span><span class="o">&lt;</span><span class="n">DataBag</span><span class="o">&gt;</span> <span class="o">{</span>
+<pre class="highlight java"><code><span class="kd">public</span> <span class="kd">class</span> <span class="nc">MortgagePayment</span> <span class="kd">extends</span> <span class="n">AliasableEvalFunc</span><span class="o">&lt;</span><span class="n">DataBag</span><span class="o">&gt;</span> <span class="o">{</span>
   <span class="nd">@Override</span>
   <span class="kd">public</span> <span class="n">Schema</span> <span class="n">getOutputSchema</span><span class="o">(</span><span class="n">Schema</span> <span class="n">input</span><span class="o">)</span> <span class="o">{</span>
     <span class="k">try</span> <span class="o">{</span>
       <span class="n">Schema</span> <span class="n">tupleSchema</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Schema</span><span class="o">();</span>
-      <span class="n">tupleSchema</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="k">new</span> <span class="n">Schema</span><span class="o">.</span><span class="na">FieldSchema</span><span class="o">(</span><span class="s">&quot;monthly_payment&quot;</span><span class="o">,</span> <span class="n">DataType</span><span class="o">.</span><span class="na">DOUBLE</span><span class="o">));</span>
+      <span class="n">tupleSchema</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="k">new</span> <span class="n">Schema</span><span class="o">.</span><span class="na">FieldSchema</span><span class="o">(</span><span class="s">"monthly_payment"</span><span class="o">,</span> <span class="n">DataType</span><span class="o">.</span><span class="na">DOUBLE</span><span class="o">));</span>
       <span class="n">Schema</span> <span class="n">bagSchema</span><span class="o">;</span>
 
       <span class="n">bagSchema</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Schema</span><span class="o">(</span><span class="k">new</span> <span class="n">Schema</span><span class="o">.</span><span class="na">FieldSchema</span><span class="o">(</span><span class="k">this</span><span class="o">.</span><span class="na">getClass</span><span class="o">().</span><span class="na">getName</span><span class="o">().</span><span class="na">toLowerCase</span><span class="o">(),</span> <span class="n">tupleSchema</span><span class="o">,</span> <span class="n">DataType</span><span class="o">.</span><span class="na">BAG</span><span class="o">));</span>
@@ -295,35 +305,36 @@
     <span class="n">DataBag</span> <span class="n">output</span> <span class="o">=</span> <span class="n">BagFactory</span><span class="o">.</span><span class="na">getInstance</span><span class="o">().</span><span class="na">newDefaultBag</span><span class="o">();</span>
 
     <span class="c1">// get a value from the input tuple by alias</span>
-    <span class="n">Double</span> <span class="n">principal</span> <span class="o">=</span> <span class="n">getDouble</span><span class="o">(</span><span class="n">input</span><span class="o">,</span> <span class="s">&quot;principal&quot;</span><span class="o">);</span>
-    <span class="n">Integer</span> <span class="n">numPayments</span> <span class="o">=</span> <span class="n">getInteger</span><span class="o">(</span><span class="n">input</span><span class="o">,</span> <span class="s">&quot;num_payments&quot;</span><span class="o">);</span>
-    <span class="n">DataBag</span> <span class="kt">int</span><span class="n">erestRates</span> <span class="o">=</span> <span class="n">getBag</span><span class="o">(</span><span class="n">input</span><span class="o">,</span> <span class="s">&quot;interest_rates&quot;</span><span class="o">);</span>
+    <span class="n">Double</span> <span class="n">principal</span> <span class="o">=</span> <span class="n">getDouble</span><span class="o">(</span><span class="n">input</span><span class="o">,</span> <span class="s">"principal"</span><span class="o">);</span>
+    <span class="n">Integer</span> <span class="n">numPayments</span> <span class="o">=</span> <span class="n">getInteger</span><span class="o">(</span><span class="n">input</span><span class="o">,</span> <span class="s">"num_payments"</span><span class="o">);</span>
+    <span class="n">DataBag</span> <span class="n">interestRates</span> <span class="o">=</span> <span class="n">getBag</span><span class="o">(</span><span class="n">input</span><span class="o">,</span> <span class="s">"interest_rates"</span><span class="o">);</span>
 
-    <span class="k">for</span> <span class="o">(</span><span class="n">Tuple</span> <span class="kt">int</span><span class="n">erestTuple</span> <span class="o">:</span> <span class="kt">int</span><span class="n">erestRates</span><span class="o">)</span> <span class="o">{</span>
+    <span class="k">for</span> <span class="o">(</span><span class="n">Tuple</span> <span class="n">interestTuple</span> <span class="o">:</span> <span class="n">interestRates</span><span class="o">)</span> <span class="o">{</span>
       <span class="c1">// get a value from the inner bag tuple by alias</span>
-      <span class="n">Double</span> <span class="kt">int</span><span class="n">erest</span> <span class="o">=</span> <span class="n">getDouble</span><span class="o">(</span><span class="kt">int</span><span class="n">erestTuple</span><span class="o">,</span> <span class="n">getPrefixedAliasName</span><span class="o">(</span><span class="s">&quot;interest_rates&quot;</span><span class="o">,</span> <span class="s">&quot;interest_rate&quot;</span><span class="o">));</span>
-      <span class="kt">double</span> <span class="n">monthlyPayment</span> <span class="o">=</span> <span class="n">computeMonthlyPayment</span><span class="o">(</span><span class="n">principal</span><span class="o">,</span> <span class="n">numPayments</span><span class="o">,</span> <span class="kt">int</span><span class="n">erest</span><span class="o">);</span>
+      <span class="n">Double</span> <span class="n">interest</span> <span class="o">=</span> <span class="n">getDouble</span><span class="o">(</span><span class="n">interestTuple</span><span class="o">,</span> <span class="n">getPrefixedAliasName</span><span class="o">(</span><span class="s">"interest_rates"</span><span class="o">,</span> <span class="s">"interest_rate"</span><span class="o">));</span>
+      <span class="kt">double</span> <span class="n">monthlyPayment</span> <span class="o">=</span> <span class="n">computeMonthlyPayment</span><span class="o">(</span><span class="n">principal</span><span class="o">,</span> <span class="n">numPayments</span><span class="o">,</span> <span class="n">interest</span><span class="o">);</span>
       <span class="n">output</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="n">TupleFactory</span><span class="o">.</span><span class="na">getInstance</span><span class="o">().</span><span class="na">newTuple</span><span class="o">(</span><span class="n">monthlyPayment</span><span class="o">));</span>
     <span class="o">}</span>
 
     <span class="k">return</span> <span class="n">output</span><span class="o">;</span>
   <span class="o">}</span>
 
-  <span class="kd">private</span> <span class="kt">double</span> <span class="n">computeMonthlyPayment</span><span class="o">(</span><span class="n">Double</span> <span class="n">principal</span><span class="o">,</span> <span class="n">Integer</span> <span class="n">numPayments</span><span class="o">,</span> <span class="n">Double</span> <span class="kt">int</span><span class="n">erest</span><span class="o">)</span> <span class="o">{</span>
-    <span class="k">return</span> <span class="n">principal</span> <span class="o">*</span> <span class="o">(</span><span class="kt">int</span><span class="n">erest</span> <span class="o">*</span> <span class="n">Math</span><span class="o">.</span><span class="na">pow</span><span class="o">(</span><span class="kt">int</span><span class="n">erest</span><span class="o">+</span><span class="mi">1</span><span class="o">,</span> <span class="n">numPayments</span><span class="o">))</span> <span class="o">/</span> <span class="o">(</span><span class="n">Math</span><span class="o">.</span><span class="na">pow</span><span class="o">(</span><span class="kt">int</span><span class="n">erest</span><span class="o">+</span><span class="mi">1</span><span class="o">,</span> <span class="n">numPayments</span><span class="o">)</span> <span class="o">-</span> <span class="mf">1.0</span><span class="o">);</span>
+  <span class="kd">private</span> <span class="kt">double</span> <span class="n">computeMonthlyPayment</span><span class="o">(</span><span class="n">Double</span> <span class="n">principal</span><span class="o">,</span> <span class="n">Integer</span> <span class="n">numPayments</span><span class="o">,</span> <span class="n">Double</span> <span class="n">interest</span><span class="o">)</span> <span class="o">{</span>
+    <span class="k">return</span> <span class="n">principal</span> <span class="o">*</span> <span class="o">(</span><span class="n">interest</span> <span class="o">*</span> <span class="n">Math</span><span class="o">.</span><span class="na">pow</span><span class="o">(</span><span class="n">interest</span><span class="o">+</span><span class="mi">1</span><span class="o">,</span> <span class="n">numPayments</span><span class="o">))</span> <span class="o">/</span> <span class="o">(</span><span class="n">Math</span><span class="o">.</span><span class="na">pow</span><span class="o">(</span><span class="n">interest</span><span class="o">+</span><span class="mi">1</span><span class="o">,</span> <span class="n">numPayments</span><span class="o">)</span> <span class="o">-</span> <span class="mf">1.0</span><span class="o">);</span>
   <span class="o">}</span>
 <span class="o">}</span>
-</pre>
+</code></pre>
+
 <p>In this UDF we retrieve a few fields of different types from the input tuple by alias. One of these fields is a bag, and we also want to get values from the tuples in that bag. To avoid namespace collisions among the different levels of nested tuples, <code>AliasableEvalFunc</code> prepends the name of the enclosing bag or tuple. Thus, we use <code>getPrefixedAliasName</code> to find the field <code>interest_rate</code> inside the bag named <code>interest_rates</code>. That&#39;s all there is to using aliases in a UDF. As an added benefit, being able to dump schema information on errors helps in developing and debugging the UDF (see <code>datafu.pig.util.DataFuException</code>).</p>
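 
 <p>The arithmetic in <code>computeMonthlyPayment</code> is the standard fixed-rate annuity formula, <code>payment = principal * r * (1 + r)^n / ((1 + r)^n - 1)</code>, which treats <code>interest</code> as the rate per payment period rather than per year. The standalone check below is a sketch under assumed inputs: the class name <code>MortgageFormulaSketch</code> and the sample loan (200,000 principal, 360 payments, 0.5% per month) are illustrative and not taken from DataFu.</p>
 <pre class="highlight java"><code>public class MortgageFormulaSketch {
   // Same arithmetic as computeMonthlyPayment above; rate is per payment period.
   static double monthlyPayment(double principal, int numPayments, double rate) {
     double growth = Math.pow(1.0 + rate, numPayments);
     return principal * (rate * growth) / (growth - 1.0);
   }
 
   public static void main(String[] args) {
     // Assumed inputs: 6% per year is passed as 0.06 / 12 = 0.005 per month.
     System.out.printf("%.2f%n", monthlyPayment(200000.0, 360, 0.005));
   }
 }
 </code></pre>
 
 <p>For these inputs the payment comes to roughly 1199.10 per month. Note that a rate of exactly zero makes the denominator zero, so a production UDF would probably want to handle that case separately.</p>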
 
-<h3 id="toc_15">LinearRegression example</h3>
+<h3 id="linearregression-example">LinearRegression example</h3>
 
 <p>Having access to the schema opens up UDF development possibilities. Let&#39;s look back at the recommendation system example from the first part. The script in that part generated a bunch of features about the items that users saw and clicked. That&#39;s a good start to a recommendation workflow, but the end goal is to select which items to recommend. A common way to do this is to assign a score to each item based on some sort of machine learning algorithm. A simple algorithm for this task is linear regression. Ok, let&#39;s say we&#39;ve trained our first linear regression model and are ready to plug it in to our workflow to produce our scores.</p>
 
 <p>We could develop a custom UDF for this model that computes the score. It is just a weighted sum of the features. So, using <code>AliasableEvalFunc</code> we could retrieve each field that we need, multiply by the correct coefficient, and then sum these together. But, then every time we change the model, we are going to have to change the UDF to update the fields and coefficients. We know that our first model is not going to be very good and want to make it easy to plug in new models.</p>
 
 <p>The model for a linear regression is pretty simple; it&#39;s just a mapping of fields to coefficient values. The only things that will change between models are which fields we are interested in and what the coefficient for those fields will be. So, let&#39;s just pass in a string representation of the model and then let the UDF do the work of figuring out how to apply it.</p>
-<pre class="highlight pig"><span class="k">DEFINE</span> <span class="n">LinearRegression</span> <span class="n">datafu</span><span class="p">.</span><span class="n">test</span><span class="p">.</span><span class="n">blog</span><span class="p">.</span><span class="n">LinearRegression</span><span class="p">(</span><span class="s1">'intercept:1,impression_count:-0.1,accept_count:2.0,reject_count:-1.0'</span><span class="p">);</span>
+<pre class="highlight pig"><code><span class="k">DEFINE</span> <span class="n">LinearRegression</span> <span class="n">datafu</span><span class="p">.</span><span class="n">test</span><span class="p">.</span><span class="n">blog</span><span class="p">.</span><span class="n">LinearRegression</span><span class="p">(</span><span class="s1">'intercept:1,impression_count:-0.1,accept_count:2.0,reject_count:-1.0'</span><span class="p">);</span>
 
 <span class="n">features</span> <span class="o">=</span> <span class="k">LOAD</span> <span class="s1">'test/pig/datafu/test/blog/features.dat'</span> <span class="k">AS</span> <span class="p">(</span><span class="n">user_id</span><span class="p">:</span><span class="n">int</span><span class="p">,</span> <span class="n">items</span><span class="p">:</span><span class="n">bag</span><span class="p">{(</span><span class="n">item_id</span><span class="p">:</span><span class="n">int</span><span class="p">,</span><span class="n">impression_count</span><span class="p">:</span><span class="n">int</span><span class="p">,</span><span class="n">accept_count</span><span class="p">:</span><span class="n">int</span><span class="p">,</span><span class="n">reject_count</span><span class="p">:</span><span class="n">int</span><span class="p">)});</span>
 
@@ -331,25 +342,26 @@
   <span class="n">scored_items</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="n">items</span> <span class="k">GENERATE</span> <span class="n">item_id</span><span class="p">,</span> <span class="n">LinearRegression</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">as</span> <span class="n">score</span><span class="p">;</span>
   <span class="k">GENERATE</span> <span class="n">user_id</span><span class="p">,</span> <span class="n">scored_items</span> <span class="k">as</span> <span class="n">items</span><span class="p">;</span>
 <span class="p">}</span>
-</pre>
+</code></pre>
+
 <p>Nice, that&#39;s clean, and we could even pass that model string in as a parameter so we don&#39;t have to change the pig script to change the model either -- very reusable.</p>
 
 <p>Now, the hard work, writing the UDF:</p>
-<pre class="highlight java"><span class="kd">public</span> <span class="kd">class</span> <span class="nc">LinearRegression</span> <span class="kd">extends</span> <span class="n">AliasableEvalFunc</span><span class="o">&lt;</span><span class="n">Double</span><span class="o">&gt;</span>
+<pre class="highlight java"><code><span class="kd">public</span> <span class="kd">class</span> <span class="nc">LinearRegression</span> <span class="kd">extends</span> <span class="n">AliasableEvalFunc</span><span class="o">&lt;</span><span class="n">Double</span><span class="o">&gt;</span>
 <span class="o">{</span>
   <span class="n">Map</span><span class="o">&lt;</span><span class="n">String</span><span class="o">,</span> <span class="n">Double</span><span class="o">&gt;</span> <span class="n">parameters</span><span class="o">;</span>
 
   <span class="kd">public</span> <span class="n">LinearRegression</span><span class="o">(</span><span class="n">String</span> <span class="n">parameterString</span><span class="o">)</span> <span class="o">{</span>
     <span class="n">parameters</span> <span class="o">=</span> <span class="k">new</span> <span class="n">HashMap</span><span class="o">&lt;</span><span class="n">String</span><span class="o">,</span> <span class="n">Double</span><span class="o">&gt;();</span>
-    <span class="k">for</span> <span class="o">(</span><span class="n">String</span> <span class="n">token</span> <span class="o">:</span> <span class="n">parameterString</span><span class="o">.</span><span class="na">split</span><span class="o">(</span><span class="s">&quot;,&quot;</span><span class="o">))</span> <span class="o">{</span>
-      <span class="n">String</span><span class="o">[]</span> <span class="n">keyValue</span> <span class="o">=</span> <span class="n">token</span><span class="o">.</span><span class="na">split</span><span class="o">(</span><span class="s">&quot;:&quot;</span><span class="o">);</span>
+    <span class="k">for</span> <span class="o">(</span><span class="n">String</span> <span class="n">token</span> <span class="o">:</span> <span class="n">parameterString</span><span class="o">.</span><span class="na">split</span><span class="o">(</span><span class="s">","</span><span class="o">))</span> <span class="o">{</span>
+      <span class="n">String</span><span class="o">[]</span> <span class="n">keyValue</span> <span class="o">=</span> <span class="n">token</span><span class="o">.</span><span class="na">split</span><span class="o">(</span><span class="s">":"</span><span class="o">);</span>
       <span class="n">parameters</span><span class="o">.</span><span class="na">put</span><span class="o">(</span><span class="n">keyValue</span><span class="o">[</span><span class="mi">0</span><span class="o">].</span><span class="na">trim</span><span class="o">(),</span> <span class="n">Double</span><span class="o">.</span><span class="na">parseDouble</span><span class="o">(</span><span class="n">keyValue</span><span class="o">[</span><span class="mi">1</span><span class="o">].</span><span class="na">trim</span><span class="o">()));</span>
-    <span class="o">}</span>     
+    <span class="o">}</span>
   <span class="o">}</span>
 
   <span class="nd">@Override</span>
   <span class="kd">public</span> <span class="n">Schema</span> <span class="n">getOutputSchema</span><span class="o">(</span><span class="n">Schema</span> <span class="n">input</span><span class="o">)</span> <span class="o">{</span>
-    <span class="k">return</span> <span class="k">new</span> <span class="n">Schema</span><span class="o">(</span><span class="k">new</span> <span class="n">Schema</span><span class="o">.</span><span class="na">FieldSchema</span><span class="o">(</span><span class="s">&quot;score&quot;</span><span class="o">,</span> <span class="n">DataType</span><span class="o">.</span><span class="na">DOUBLE</span><span class="o">));</span>
+    <span class="k">return</span> <span class="k">new</span> <span class="n">Schema</span><span class="o">(</span><span class="k">new</span> <span class="n">Schema</span><span class="o">.</span><span class="na">FieldSchema</span><span class="o">(</span><span class="s">"score"</span><span class="o">,</span> <span class="n">DataType</span><span class="o">.</span><span class="na">DOUBLE</span><span class="o">));</span>
   <span class="o">}</span>
 
   <span class="nd">@Override</span>
@@ -357,7 +369,7 @@
     <span class="kt">double</span> <span class="n">score</span> <span class="o">=</span> <span class="mf">0.0</span><span class="o">;</span>
     <span class="k">for</span> <span class="o">(</span><span class="n">String</span> <span class="n">parameter</span> <span class="o">:</span> <span class="n">parameters</span><span class="o">.</span><span class="na">keySet</span><span class="o">())</span> <span class="o">{</span>
       <span class="kt">double</span> <span class="n">coefficient</span> <span class="o">=</span> <span class="n">parameters</span><span class="o">.</span><span class="na">get</span><span class="o">(</span><span class="n">parameter</span><span class="o">);</span>
-      <span class="k">if</span> <span class="o">(</span><span class="n">parameter</span><span class="o">.</span><span class="na">equals</span><span class="o">(</span><span class="s">&quot;intercept&quot;</span><span class="o">))</span> <span class="o">{</span>
+      <span class="k">if</span> <span class="o">(</span><span class="n">parameter</span><span class="o">.</span><span class="na">equals</span><span class="o">(</span><span class="s">"intercept"</span><span class="o">))</span> <span class="o">{</span>
         <span class="n">score</span> <span class="o">+=</span> <span class="n">coefficient</span><span class="o">;</span>
       <span class="o">}</span> <span class="k">else</span> <span class="o">{</span>
         <span class="n">score</span> <span class="o">+=</span> <span class="n">coefficient</span> <span class="o">*</span> <span class="n">getDouble</span><span class="o">(</span><span class="n">input</span><span class="o">,</span> <span class="n">parameter</span><span class="o">);</span>
@@ -366,41 +378,44 @@
     <span class="k">return</span> <span class="n">score</span><span class="o">;</span>
   <span class="o">}</span>
 <span class="o">}</span>
-</pre>
+</code></pre>
+
 <p>Ok, maybe not that hard... The UDF parses out the mapping of field to coefficient in the constructor and then looks up the specified fields by name in the exec function. So, what happens when we change the model? If we decide to drop a field from our model, it just gets ignored, even if it is in the input tuple. If we add a new feature that&#39;s already available in the data, it will just work. If we try to use a model with a new feature and forget to update the Pig script, it will throw an error and tell us the feature that does not exist (as part of the behavior of <code>getDouble()</code>).</p>
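 
 <p>The same parse-then-look-up logic can be exercised outside of Pig. This is a minimal sketch rather than the UDF itself: <code>ModelStringSketch</code>, <code>featureValue</code>, and the sample record are hypothetical, and parallel arrays stand in for the aliased input tuple.</p>
 <pre class="highlight java"><code>public class ModelStringSketch {
   // Look up a feature by name; fails if the model names a feature the record lacks.
   static double featureValue(String[] names, double[] values, String name) {
     int i = 0;
     for (String candidate : names) {
       if (candidate.equals(name)) {
         return values[i];
       }
       i++;
     }
     throw new IllegalArgumentException("Unknown feature: " + name);
   }
 
   // Parse "name:coefficient,..." and compute the weighted sum plus intercept.
   static double score(String model, String[] names, double[] values) {
     double total = 0.0;
     for (String token : model.split(",")) {
       String[] keyValue = token.split(":");
       String parameter = keyValue[0].trim();
       double coefficient = Double.parseDouble(keyValue[1].trim());
       if (parameter.equals("intercept")) {
         total += coefficient;
       } else {
         total += coefficient * featureValue(names, values, parameter);
       }
     }
     return total;
   }
 
   public static void main(String[] args) {
     String model = "intercept:1,impression_count:-0.1,accept_count:2.0,reject_count:-1.0";
     // item_id is present in the record but not named by the model, so it is ignored.
     String[] names = {"item_id", "impression_count", "accept_count", "reject_count"};
     double[] values = {42, 10, 3, 1};
     System.out.println(score(model, names, values));
   }
 }
 </code></pre>
 
 <p>For the sample record the score works out to 1 - 1 + 6 - 1 = 5. Naming a feature in the model string that is missing from <code>names</code> raises an exception, which mirrors the error behavior described above.</p>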
 
 <p>Combining this example with the feature counting example presented earlier, we have the basis for a recommendation system that was easy to write, will execute quickly, and will be simple to maintain.</p>
 
-<h2 id="toc_16">Sampling the data</h2>
+<h2 id="sampling-the-data">Sampling the data</h2>
 
 <p>Working with big data can be a bit overwhelming and time consuming. Sometimes you want to avoid some of this hassle and just look at a portion of this data. Pig has built-in support for random sampling with the <code>SAMPLE</code> operator. But sometimes a random percentage of the records is not quite what you need. Fortunately, DataFu has a few sampling UDFs that will help in some situations, and as always, we would be happy to accept any contributions of additional sampling UDFs, if you happen to have some lying around.</p>
 
 <p>These things always are easier to understand with a bit of code, so let&#39;s go back to our recommendation system context and look at a few more examples.</p>
 
-<h2 id="toc_17">Example 1. Generate training data</h2>
+<h2 id="example-1-generate-training-data">Example 1. Generate training data</h2>
 
 <p>We mentioned previously that we were going to use a machine learning algorithm, linear regression, to generate scores for our items. We waved our hands and it happened, but generally this task involves some work. One of the first steps is to generate the training data set for the learning algorithm. In order to make this training efficient, we only want to use a sample of all of our raw data.</p>
 
-<h3 id="toc_18">Setup</h3>
+<h3 id="setup">Setup</h3>
 
 <p>Given impressions, accepts, rejects and some pre-computed features about users and items, we&#39;d like to generate a training set, which will have all of this information for each <code>(user_id, item_id)</code> pair, for some sample of users.</p>
 
 <p>So, from this input:</p>
-<pre class="highlight pig"><span class="n">impressions</span> <span class="o">=</span> <span class="k">LOAD</span> <span class="s1">'$impressions'</span> <span class="k">AS</span> <span class="p">(</span><span class="n">user_id</span><span class="p">:</span><span class="n">int</span><span class="p">,</span> <span class="n">item_id</span><span class="p">:</span><span class="n">int</span><span class="p">,</span> <span class="n">timestamp</span><span class="p">:</span><span class="n">long</span><span class="p">);</span>
+<pre class="highlight pig"><code><span class="n">impressions</span> <span class="o">=</span> <span class="k">LOAD</span> <span class="s1">'$impressions'</span> <span class="k">AS</span> <span class="p">(</span><span class="n">user_id</span><span class="p">:</span><span class="n">int</span><span class="p">,</span> <span class="n">item_id</span><span class="p">:</span><span class="n">int</span><span class="p">,</span> <span class="n">timestamp</span><span class="p">:</span><span class="n">long</span><span class="p">);</span>
 <span class="n">accepts</span> <span class="o">=</span> <span class="k">LOAD</span> <span class="s1">'$accepts'</span> <span class="k">AS</span> <span class="p">(</span><span class="n">user_id</span><span class="p">:</span><span class="n">int</span><span class="p">,</span> <span class="n">item_id</span><span class="p">:</span><span class="n">int</span><span class="p">,</span> <span class="n">timestamp</span><span class="p">:</span><span class="n">long</span><span class="p">);</span>
 <span class="n">rejects</span> <span class="o">=</span> <span class="k">LOAD</span> <span class="s1">'$rejects'</span> <span class="k">AS</span> <span class="p">(</span><span class="n">user_id</span><span class="p">:</span><span class="n">int</span><span class="p">,</span> <span class="n">item_id</span><span class="p">:</span><span class="n">int</span><span class="p">,</span> <span class="n">timestamp</span><span class="p">:</span><span class="n">long</span><span class="p">);</span>
 <span class="n">features</span> <span class="o">=</span> <span class="k">LOAD</span> <span class="s1">'$features'</span> <span class="k">AS</span> <span class="p">(</span><span class="n">user_id</span><span class="p">:</span><span class="n">int</span><span class="p">,</span> <span class="n">item_id</span><span class="p">:</span><span class="n">int</span><span class="p">,</span> <span class="n">feature_1</span><span class="p">:</span><span class="n">int</span><span class="p">,</span> <span class="n">feature_2</span><span class="p">:</span><span class="n">int</span><span class="p">);</span>
-</pre>
+</code></pre>
+
 <p>We want to produce this type of output:</p>
-<pre class="highlight text">{user_id, item_id, is_impressed, is_accepted, is_rejected, feature_1, feature_2}
-</pre>
+<pre class="highlight json"><code><span class="p">{</span><span class="err">user_id,</span><span class="w"> </span><span class="err">item_id,</span><span class="w"> </span><span class="err">is_impressed,</span><span class="w"> </span><span class="err">is_accepted,</span><span class="w"> </span><span class="err">is_rejected,</span><span class="w"> </span><span class="err">feature_1,</span><span class="w"> </span><span class="err">feature_2</span><span class="p">}</span><span class="w">
+</span></code></pre>
+
 <p>One key point on sampling here: We want the sampling to be done by <code>user_id</code>. This means that if we choose one <code>user_id</code> to be included in the sample, all the data for that <code>user_id</code> should be included in the sample. This requirement is needed to preserve the original characteristics of raw data in the sampled data as well.</p>
 
-<h3 id="toc_19">Naive approach</h3>
+<h3 id="naive-approach">Naive approach</h3>
 
 <p>The straightforward solution for this task is to group the tracking data for each (user, item) pair, then group it by <code>user_id</code>, sample this grouped data, and then flatten it all out again:</p>
-<pre class="highlight pig"><span class="n">grouped</span> <span class="o">=</span> <span class="k">COGROUP</span> <span class="n">impressions</span> <span class="k">BY</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">),</span> <span class="n">accepts</span> <span class="k">BY</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">),</span> <span class="n">rejects</span> <span class="k">BY</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">)</span> <span class="n">features</span> <span class="k">BY</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">);</span>
-<span class="n">full_result</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="n">grouped</span> <span class="n">GENREATE</span> 
+<pre class="highlight pig"><code><span class="n">grouped</span> <span class="o">=</span> <span class="k">COGROUP</span> <span class="n">impressions</span> <span class="k">BY</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">),</span> <span class="n">accepts</span> <span class="k">BY</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">),</span> <span class="n">rejects</span> <span class="k">BY</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">)</span> <span class="n">features</span> <span class="k">BY</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">);</span>
+<span class="n">full_result</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="n">grouped</span> <span class="n">GENREATE</span>
   <span class="k">FLATTEN</span><span class="p">(</span><span class="k">group</span><span class="p">)</span> <span class="k">AS</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">),</span>
   <span class="p">(</span><span class="n">impressions</span><span class="p">::</span><span class="n">timestamp</span> <span class="k">is</span> <span class="k">not</span> <span class="n">null</span><span class="p">)</span><span class="o">?</span><span class="mi">1</span><span class="p">:</span><span class="mi">0</span> <span class="k">AS</span> <span class="n">is_impressed</span><span class="p">,</span>
   <span class="p">(</span><span class="n">accepts</span><span class="p">::</span><span class="n">timestamp</span> <span class="k">is</span> <span class="k">not</span> <span class="n">null</span><span class="p">)</span><span class="o">?</span><span class="mi">1</span><span class="p">:</span><span class="mi">0</span> <span class="k">AS</span> <span class="n">is_accepted</span><span class="p">,</span>
@@ -410,16 +425,17 @@
 
 <span class="n">grouped_full_result</span> <span class="o">=</span> <span class="k">GROUP</span> <span class="n">full_result</span> <span class="k">BY</span> <span class="n">user_id</span><span class="p">;</span>
 <span class="n">sampled</span> <span class="o">=</span> <span class="k">SAMPLE</span> <span class="n">grouped_full_result</span> <span class="mi">0</span><span class="p">.</span><span class="mi">01</span><span class="p">;</span>
-<span class="n">result</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="n">sampled</span> <span class="k">GENERATE</span> 
+<span class="n">result</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="n">sampled</span> <span class="k">GENERATE</span>
   <span class="k">group</span> <span class="k">AS</span> <span class="n">user_id</span><span class="p">,</span>
   <span class="k">FLATTEN</span><span class="p">(</span><span class="n">full_result</span><span class="p">);</span>
-</pre>
+</code></pre>
+
 <p>This job includes two group operations, which translates to two map-reduce jobs. Also, the group operation is being done on the full data even though we will sample it down later. Can we do any better than this?</p>
 
-<h3 id="toc_20">A sample of DataFu -- SampleByKey</h3>
+<h3 id="a-sample-of-datafu-samplebykey">A sample of DataFu -- SampleByKey</h3>
 
 <p>Yep.</p>
-<pre class="highlight pig"><span class="k">DEFINE</span> <span class="n">SampleByKey</span> <span class="n">datafu</span><span class="p">.</span><span class="n">pig</span><span class="p">.</span><span class="n">sampling</span><span class="p">.</span><span class="n">SampleByKey</span><span class="p">(</span><span class="s1">'whatever_the_salt_you_want_to_use'</span><span class="p">,</span><span class="s1">'0.01'</span><span class="p">);</span>
+<pre class="highlight pig"><code><span class="k">DEFINE</span> <span class="n">SampleByKey</span> <span class="n">datafu</span><span class="p">.</span><span class="n">pig</span><span class="p">.</span><span class="n">sampling</span><span class="p">.</span><span class="n">SampleByKey</span><span class="p">(</span><span class="s1">'whatever_the_salt_you_want_to_use'</span><span class="p">,</span><span class="s1">'0.01'</span><span class="p">);</span>
 
 <span class="n">impressions</span> <span class="o">=</span> <span class="k">FILTER</span> <span class="n">impressions</span> <span class="k">BY</span> <span class="n">SampleByKey</span><span class="p">(</span><span class="s1">'user_id'</span><span class="p">);</span>
 <span class="n">accepts</span> <span class="o">=</span> <span class="k">FILTER</span> <span class="n">accepts</span> <span class="k">BY</span> <span class="n">SampleByKey</span><span class="p">(</span><span class="s1">'user_id'</span><span class="p">);</span>
@@ -427,57 +443,61 @@
 <span class="n">features</span> <span class="o">=</span> <span class="k">FILTER</span> <span class="n">features</span> <span class="k">BY</span> <span class="n">SampleByKey</span><span class="p">(</span><span class="s1">'user_id'</span><span class="p">);</span>
 
 <span class="n">grouped</span> <span class="o">=</span> <span class="k">COGROUP</span> <span class="n">impressions</span> <span class="k">BY</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">),</span> <span class="n">accepts</span> <span class="k">BY</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">),</span> <span class="n">rejects</span> <span class="k">BY</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">),</span> <span class="n">features</span> <span class="k">BY</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">);</span>
-<span class="n">result</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="n">grouped</span> <span class="n">GENREATE</span> 
+<span class="n">result</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="n">grouped</span> <span class="n">GENREATE</span>
   <span class="k">FLATTEN</span><span class="p">(</span><span class="k">group</span><span class="p">)</span> <span class="k">AS</span> <span class="p">(</span><span class="n">user_id</span><span class="p">,</span> <span class="n">item_id</span><span class="p">),</span>
   <span class="p">(</span><span class="n">impressions</span><span class="p">::</span><span class="n">timestamp</span> <span class="k">is</span> <span class="k">not</span> <span class="n">null</span><span class="p">)</span><span class="o">?</span><span class="mi">1</span><span class="p">:</span><span class="mi">0</span> <span class="k">AS</span> <span class="n">is_impressed</span><span class="p">,</span>
   <span class="p">(</span><span class="n">accepts</span><span class="p">::</span><span class="n">timestamp</span> <span class="k">is</span> <span class="k">not</span> <span class="n">null</span><span class="p">)</span><span class="o">?</span><span class="mi">1</span><span class="p">:</span><span class="mi">0</span> <span class="k">AS</span> <span class="n">is_accepted</span><span class="p">,</span>
   <span class="p">(</span><span class="n">rejects</span><span class="p">::</span><span class="n">timestamp</span> <span class="k">is</span> <span class="k">not</span> <span class="n">null</span><span class="p">)</span><span class="o">?</span><span class="mi">1</span><span class="p">:</span><span class="mi">0</span> <span class="k">AS</span> <span class="n">is_rejected</span><span class="p">,</span>
   <span class="n">Coalesce</span><span class="p">(</span><span class="n">features</span><span class="p">::</span><span class="n">feature_1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="k">AS</span> <span class="n">feature_1</span><span class="p">,</span>
   <span class="n">Coalesce</span><span class="p">(</span><span class="n">features</span><span class="p">::</span><span class="n">feature_2</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="k">AS</span> <span class="n">feature_2</span><span class="p">;</span>
-</pre>
+</code></pre>
+
 <p>We can use the <code>SampleByKey</code> FilterFunc to do this with only one group operation. And, since the group is operating on the already sampled (significantly smaller) data this job will be far more efficient.</p>
 
 <p><code>SampleByKey</code> lets you designate which fields you want to use as keys for the sampling, and guarantees that for each selected key, all other records with that key will also be selected, which is exactly what we want. Another characteristic of <code>SampleByKey</code> is that it is deterministic, as long as the same salt is given on initialization. Thanks to this characteristic, we were able to sample each relation separately before joining them in the example above.</p>
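<p>The idea of deterministic, key-based sampling can be sketched outside of Pig. The Python below is only an illustration and not DataFu's implementation: the function name <code>sample_by_key</code>, the choice of MD5, and the toy relations are assumptions made for this sketch. It hashes the salt together with the key and keeps the record when the hash falls inside the requested fraction, so the same key gets the same decision in every relation.</p>

```python
import hashlib


def sample_by_key(key, salt, probability):
    """Return True when this key falls inside the sampled fraction.

    Hypothetical helper: the decision depends only on (salt, key),
    so it is repeatable across relations and across runs.
    """
    digest = hashlib.md5((salt + ":" + str(key)).encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Keep the key when its bucket lies below probability * 2**64.
    return probability * 2**64 > bucket


# Toy relations of (user_id, item_id); the values are made up.
impressions = [(1, 10), (1, 11), (2, 10), (3, 12)]
accepts = [(1, 10), (3, 12)]

salt = "my_salt"
kept_impressions = [row for row in impressions if sample_by_key(row[0], salt, 0.5)]
kept_accepts = [row for row in accepts if sample_by_key(row[0], salt, 0.5)]
```

<p>Because each relation is filtered independently yet consistently, any <code>user_id</code> that survives in <code>kept_accepts</code> also survives in <code>kept_impressions</code>, which is what makes sampling before the join safe.</p>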
 
-<h2 id="toc_21">Example 2. Recommending your output</h2>
+<h2 id="example-2-recommending-your-output">Example 2. Recommending your output</h2>
 
 <p>OK, we&#39;ve now created some training data and used it to build a model that produces a score for each recommendation. So now we&#39;ve got to pick which items to show the user. But we&#39;ve got a bit of a problem: we only have limited real estate on the screen to present our recommendations, so how do we select which ones to show? We&#39;ve got a score from our model, so we could just always pick the top-scoring items. But then we might be showing the same recommendations all the time, and we want to shake things up a bit so things aren&#39;t so static (OK, yes, I admit this is a contrived example; you wouldn&#39;t do it this way in real life). So let&#39;s take a sample of the output.</p>
 
-<h3 id="toc_22">Setup</h3>
+<h3 id="setup">Setup</h3>
 
 <p>With this input:</p>
-<pre class="highlight pig"><span class="n">recommendations</span> <span class="o">=</span> <span class="k">LOAD</span> <span class="s1">'$recommendations'</span> <span class="k">AS</span> <span class="p">(</span><span class="n">user_id</span><span class="p">:</span><span class="n">int</span><span class="p">,</span> <span class="n">recs</span><span class="p">{</span><span class="n">item_id</span><span class="p">:</span><span class="n">int</span><span class="p">,</span> <span class="n">score</span><span class="p">:</span><span class="n">double</span><span class="p">});</span>
-</pre>
+<pre class="highlight pig"><code><span class="n">recommendations</span> <span class="o">=</span> <span class="k">LOAD</span> <span class="s1">'$recommendations'</span> <span class="k">AS</span> <span class="p">(</span><span class="n">user_id</span><span class="p">:</span><span class="n">int</span><span class="p">,</span> <span class="n">recs</span><span class="p">{</span><span class="n">item_id</span><span class="p">:</span><span class="n">int</span><span class="p">,</span> <span class="n">score</span><span class="p">:</span><span class="n">double</span><span class="p">});</span>
+</code></pre>
+
 <p>We want to produce the exact same output, but with fewer items per user -- let&#39;s say no more than 10.</p>
 
-<h3 id="toc_23">Naive approach</h3>
+<h3 id="naive-approach">Naive approach</h3>
 
 <p>We can randomize using Pig&#39;s default Sample command.</p>
-<pre class="highlight pig"><span class="n">results</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="n">recommendations</span> <span class="p">{</span>
+<pre class="highlight pig"><code><span class="n">results</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="n">recommendations</span> <span class="p">{</span>
   <span class="n">sampled</span> <span class="o">=</span> <span class="k">SAMPLE</span> <span class="n">recs</span> <span class="mi">1</span><span class="p">;</span>
   <span class="n">limited</span> <span class="o">=</span> <span class="k">LIMIT</span> <span class="n">sampled</span> <span class="mi">10</span><span class="p">;</span>
   <span class="k">GENERATE</span> <span class="n">user_id</span><span class="p">,</span> <span class="n">limited</span> <span class="k">AS</span> <span class="n">recs</span><span class="p">;</span>
 <span class="p">}</span>
-</pre>
+</code></pre>
+
 <p>The problem with this approach is that results are sampled from the population uniformly at random, so the score you created with your learning algorithm has no effect on the final results.</p>
 
-<h3 id="toc_24">The DataFu you most likely need -- WeightedSample</h3>
+<h3 id="the-datafu-you-most-likely-need-weightedsample">The DataFu you most likely need -- WeightedSample</h3>
 
 <p>We should use that score we generated to help bias our sample.</p>
-<pre class="highlight pig"><span class="k">DEFINE</span> <span class="n">WeightedSample</span> <span class="n">datafu</span><span class="p">.</span><span class="n">pig</span><span class="p">.</span><span class="n">sampling</span><span class="p">.</span><span class="n">WeightedSample</span><span class="p">();</span>
+<pre class="highlight pig"><code><span class="k">DEFINE</span> <span class="n">WeightedSample</span> <span class="n">datafu</span><span class="p">.</span><span class="n">pig</span><span class="p">.</span><span class="n">sampling</span><span class="p">.</span><span class="n">WeightedSample</span><span class="p">();</span>
 <span class="n">results</span> <span class="o">=</span> <span class="k">FOREACH</span> <span class="n">recommendations</span> <span class="k">GENERATE</span> <span class="n">user_id</span><span class="p">,</span> <span class="n">WeightedSample</span><span class="p">(</span><span class="n">recs</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">10</span><span class="p">);</span>
 <span class="c1">-- from recs, using index 1(second column) as weight, select up to 10 items
-</span></pre>
+</span></code></pre>
+
 <p>Fortunately, <code>WeightedSample</code> can do exactly that. It will randomly select from the candidates, but the score of each candidate is used as its weight in the selection. So, the tuples with a higher weight will have a higher chance of being included in the sample - perfect.</p>
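<p>To make the idea of score-biased selection concrete, here is a minimal Python sketch of weighted sampling without replacement. It is not how <code>WeightedSample</code> is implemented in DataFu; the function name <code>weighted_sample</code>, its arguments, and the sample scores are assumptions for illustration, and it assumes every weight is positive.</p>

```python
import random


def weighted_sample(items, weight_index, limit, rng=None):
    """Pick up to `limit` items without replacement.

    At each step an item is drawn with probability proportional to the
    value found at `weight_index`. Hypothetical helper that assumes all
    weights are positive numbers.
    """
    rng = rng or random.Random()
    pool = list(items)
    chosen = []
    while pool and limit > len(chosen):
        weights = [item[weight_index] for item in pool]
        # random.Random.choices draws one index, biased by the weights.
        pick = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(pick))
    return chosen


# Toy recommendations of (item_id, score); the values are made up.
recs = [(101, 0.9), (102, 0.05), (103, 0.05)]
top_two = weighted_sample(recs, 1, 2, random.Random(0))
```

<p>Items with a high score are likely to be drawn early, but low-scoring items still get an occasional turn, which is the "shake things up" behaviour described above.</p>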
 
-<h2 id="toc_25">Additional Examples</h2>
+<h2 id="additional-examples">Additional Examples</h2>
 
 <p>If you&#39;ve made it this far into the post, you deserve an encore. So here are two more examples of how DataFu can make writing Pig a bit simpler for you:</p>
 
-<h2 id="toc_26">Filtering with In</h2>
+<h2 id="filtering-with-in">Filtering with In</h2>
 
 <p>One case where conditional logic can be painful is filtering based on a set of values. Suppose you want to filter tuples based on a field equalling one of a list of values. In Pig this can be achieved by joining a list of conditional checks with OR:</p>
-<pre class="highlight pig"><span class="n">data</span> <span class="o">=</span> <span class="k">LOAD</span> <span class="s1">'input'</span> <span class="k">using</span> <span class="n">PigStorage</span><span class="p">(</span><span class="s1">','</span><span class="p">)</span> <span class="k">AS</span> <span class="p">(</span><span class="n">what</span><span class="p">:</span><span class="n">chararray</span><span class="p">,</span> <span class="n">adj</span><span class="p">:</span><span class="n">chararray</span><span class="p">);</span>
+<pre class="highlight pig"><code><span class="n">data</span> <span class="o">=</span> <span class="k">LOAD</span> <span class="s1">'input'</span> <span class="k">using</span> <span class="n">PigStorage</span><span class="p">(</span><span class="s1">','</span><span class="p">)</span> <span class="k">AS</span> <span class="p">(</span><span class="n">what</span><span class="p">:</span><span class="n">chararray</span><span class="p">,</span> <span class="n">adj</span><span class="p">:</span><span class="n">chararray</span><span class="p">);</span>
 
 <span class="k">dump</span> <span class="n">data</span><span class="p">;</span>
 <span class="c1">-- (roses,red)
@@ -489,9 +509,10 @@
 <span class="k">dump</span> <span class="n">data2</span><span class="p">;</span>
 <span class="c1">-- (roses,red)
 -- (violets,blue)
-</span></pre>
+</span></code></pre>
+
 <p>However, as the number of items to check for grows, this becomes very verbose. The <code>In</code> filter function solves this and makes the resulting code very concise:</p>
-<pre class="highlight pig"><span class="k">DEFINE</span> <span class="n">In</span> <span class="n">datafu</span><span class="p">.</span><span class="n">pig</span><span class="p">.</span><span class="n">util</span><span class="p">.</span><span class="n">In</span><span class="p">();</span>
+<pre class="highlight pig"><code><span class="k">DEFINE</span> <span class="n">In</span> <span class="n">datafu</span><span class="p">.</span><span class="n">pig</span><span class="p">.</span><span class="n">util</span><span class="p">.</span><span class="n">In</span><span class="p">();</span>
 
 <span class="n">data</span> <span class="o">=</span> <span class="k">LOAD</span> <span class="s1">'input'</span> <span class="k">using</span> <span class="n">PigStorage</span><span class="p">(</span><span class="s1">','</span><span class="p">)</span> <span class="k">AS</span> <span class="p">(</span><span class="n">what</span><span class="p">:</span><span class="n">chararray</span><span class="p">,</span> <span class="n">adj</span><span class="p">:</span><span class="n">chararray</span><span class="p">);</span>
 
@@ -505,11 +526,12 @@
 <span class="k">dump</span> <span class="n">data2</span><span class="p">;</span>
 <span class="c1">-- (roses,red)
 -- (violets,blue)
-</span></pre>
-<h2 id="toc_27">Left Outer Join of three or more relations with EmptyBagToNullFields</h2>
+</span></code></pre>
+
+<h2 id="left-outer-join-of-three-or-more-relations-with-emptybagtonullfields">Left Outer Join of three or more relations with EmptyBagToNullFields</h2>
 
 <p>Pig&#39;s <code>JOIN</code> operator supports performing left outer joins on two relations only. If you want to perform a join on more than two relations you have two options. One is to perform a sequence of joins.</p>
-<pre class="highlight pig"><span class="n">input1</span> <span class="o">=</span> <span class="k">LOAD</span> <span class="s1">'input1'</span> <span class="k">using</span> <span class="n">PigStorage</span><span class="p">(</span><span class="s1">','</span><span class="p">)</span> <span class="k">AS</span> <span class="p">(</span><span class="n">val1</span><span class="p">:</span><span class="n">INT</span><span class="p">,</span><span class="n">val2</span><span class="p">:</span><span class="n">INT</span><span class="p">);</span>
+<pre class="highlight pig"><code><span class="n">input1</span> <span class="o">=</span> <span class="k">LOAD</span> <span class="s1">'input1'</span> <span class="k">using</span> <span class="n">PigStorage</span><span class="p">(</span><span class="s1">','</span><span class="p">)</span> <span class="k">AS</span> <span class="p">(</span><span class="n">val1</span><span class="p">:</span><span class="n">INT</span><span class="p">,</span><span class="n">val2</span><span class="p">:</span><span class="n">INT</span><span class="p">);</span>
 <span class="n">input2</span> <span class="o">=</span> <span class="k">LOAD</span> <span class="s1">'input2'</span> <span class="k">using</span> <span class="n">PigStorage</span><span class="p">(</span><span class="s1">','</span><span class="p">)</span> <span class="k">AS</span> <span class="p">(</span><span class="n">val1</span><span class="p">:</span><span class="n">INT</span><span class="p">,</span><span class="n">val2</span><span class="p">:</span><span class="n">INT</span><span class="p">);</span>
 <span class="n">input3</span> <span class="o">=</span> <span class="k">LOAD</span> <span class="s1">'input3'</span> <span class="k">using</span> <span class="n">PigStorage</span><span class="p">(</span><span class="s1">','</span><span class="p">)</span> <span class="k">AS</span> <span class="p">(</span><span class="n">val1</span><span class="p">:</span><span class="n">INT</span><span class="p">,</span><span class="n">val2</span><span class="p">:</span><span class="n">INT</span><span class="p">);</span>
 
@@ -518,24 +540,26 @@
 
 <span class="n">data2</span> <span class="o">=</span> <span class="k">JOIN</span> <span class="n">data1</span> <span class="k">BY</span> <span class="n">input1</span><span class="p">::</span><span class="n">val1</span> <span class="k">LEFT</span><span class="p">,</span> <span class="n">input3</span> <span class="k">BY</span> <span class="n">val1</span><span class="p">;</span>
 <span class="n">data2</span> <span class="o">=</span> <span class="k">FILTER</span> <span class="n">data2</span> <span class="k">BY</span> <span class="n">input1</span><span class="p">::</span><span class="n">val1</span> <span class="k">IS</span> <span class="k">NOT</span> <span class="n">NULL</span><span class="p">;</span>
-</pre>
+</code></pre>
+
 <p>However, this can be inefficient as it requires multiple MapReduce jobs. For many situations, a better option is to use a single <code>COGROUP</code>, which requires only a single MapReduce job. However, the code gets pretty ugly.</p>
-<pre class="highlight pig"><span class="n">input1</span> <span class="o">=</span> <span class="k">LOAD</span> <span class="s1">'input1'</span> <span class="k">using</span> <span class="n">PigStorage</span><span class="p">(</span><span class="s1">','</span><span class="p">)</span> <span class="k">AS</span> <span class="p">(</span><span class="n">val1</span><span class="p">:</span><span class="n">INT</span><span class="p">,</span><span class="n">val2</span><span class="p">:</span><span class="n">INT</span><span class="p">);</span>
+<pre class="highlight pig"><code><span class="n">input1</span> <span class="o">=</span> <span class="k">LOAD</span> <span class="s1">'input1'</span> <span class="k">using</span> <span class="n">PigStorage</span><span class="p">(</span><span class="s1">','</span><span class="p">)</span> <span class="k">AS</span> <span class="p">(</span><span class="n">val1</span><span class="p">:</span><span class="n">INT</span><span class="p">,</span><span class="n">val2</span><span class="p">:</span><span class="n">INT</span><span class="p">);</span>

[... 77 lines stripped ...]


