datafu-commits mailing list archives

From mha...@apache.org
Subject svn commit: r1709884 [1/8] - in /incubator/datafu/site: ./ blog/ blog/2012/01/10/ blog/2013/01/24/ blog/2013/09/04/ blog/2013/10/03/ blog/2014/04/27/ community/ docs/ docs/datafu/ docs/datafu/guide/ docs/hourglass/ javascripts/ stylesheets/
Date Wed, 21 Oct 2015 17:00:40 GMT
Author: mhayes
Date: Wed Oct 21 17:00:40 2015
New Revision: 1709884

URL: http://svn.apache.org/viewvc?rev=1709884&view=rev
Log:
Update datafu website

Added:
    incubator/datafu/site/community/contributing.html
    incubator/datafu/site/docs/quick-start.html
Removed:
    incubator/datafu/site/docs/datafu/contributing.html
    incubator/datafu/site/docs/hourglass/contributing.html
Modified:
    incubator/datafu/site/blog/2012/01/10/introducing-datafu.html
    incubator/datafu/site/blog/2013/01/24/datafu-the-wd-40-of-big-data.html
    incubator/datafu/site/blog/2013/09/04/datafu-1-0.html
    incubator/datafu/site/blog/2013/10/03/datafus-hourglass-incremental-data-processing-in-hadoop.html
    incubator/datafu/site/blog/2014/04/27/datafu-at-apachecon.html
    incubator/datafu/site/blog/index.html
    incubator/datafu/site/community/mailing-lists.html
    incubator/datafu/site/docs/datafu/getting-started.html
    incubator/datafu/site/docs/datafu/guide.html
    incubator/datafu/site/docs/datafu/guide/bag-operations.html
    incubator/datafu/site/docs/datafu/guide/estimation.html
    incubator/datafu/site/docs/datafu/guide/hashing.html
    incubator/datafu/site/docs/datafu/guide/link-analysis.html
    incubator/datafu/site/docs/datafu/guide/more-tips-and-tricks.html
    incubator/datafu/site/docs/datafu/guide/sampling.html
    incubator/datafu/site/docs/datafu/guide/sessions.html
    incubator/datafu/site/docs/datafu/guide/set-operations.html
    incubator/datafu/site/docs/datafu/guide/statistics.html
    incubator/datafu/site/docs/datafu/javadoc.html
    incubator/datafu/site/docs/hourglass/concepts.html
    incubator/datafu/site/docs/hourglass/getting-started.html
    incubator/datafu/site/docs/hourglass/javadoc.html
    incubator/datafu/site/index.html
    incubator/datafu/site/javascripts/all.js
    incubator/datafu/site/sitemap.xml
    incubator/datafu/site/stylesheets/all.css
    incubator/datafu/site/stylesheets/highlight.css

Modified: incubator/datafu/site/blog/2012/01/10/introducing-datafu.html
URL: http://svn.apache.org/viewvc/incubator/datafu/site/blog/2012/01/10/introducing-datafu.html?rev=1709884&r1=1709883&r2=1709884&view=diff
==============================================================================
--- incubator/datafu/site/blog/2012/01/10/introducing-datafu.html (original)
+++ incubator/datafu/site/blog/2012/01/10/introducing-datafu.html Wed Oct 21 17:00:40 2015
@@ -1,3 +1,5 @@
+
+
 <!doctype html>
 <html>
   <head>
@@ -10,11 +12,9 @@
     <!-- Use title if it's in the page YAML frontmatter -->
     <title>Introducing DataFu, an open source collection of useful Apache Pig UDFs</title>
     
-    <link href="/stylesheets/all.css" media="screen" rel="stylesheet" type="text/css" />
-<link href="/stylesheets/highlight.css" media="screen" rel="stylesheet" type="text/css" />
-    <script src="/javascripts/all.js" type="text/javascript"></script>
+    <link href="/stylesheets/all.css" rel="stylesheet" /><link href="/stylesheets/highlight.css" rel="stylesheet" />
+    <script src="/javascripts/all.js"></script>
 
-    
     <script type="text/javascript">
       var _gaq = _gaq || [];
       _gaq.push(['_setAccount', 'UA-30533336-2']);
@@ -26,14 +26,14 @@
         var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
       })();
     </script>
-    
   </head>
   
   <body class="blog blog_2012 blog_2012_01 blog_2012_01_10 blog_2012_01_10_introducing-datafu">
 
     <div class="container">
 
-      <div class="header">
+      
+<div class="header">
 
   <ul class="nav nav-pills pull-right">
     <li><a href="/">Home</a></li>
@@ -49,9 +49,7 @@
   <article class="col-lg-10">
     <h1>Introducing DataFu, an open source collection of useful Apache Pig UDFs</h1>
     <h5 class="text-muted"><time>Jan 10, 2012</time></h5>
-    
       <h5 class="text-muted">Matthew Hayes</h5>
-    
 
     <hr>
 
@@ -61,7 +59,7 @@
 
 <p>DataFu includes UDFs for common statistics tasks, PageRank, set operations, bag
operations, and a comprehensive suite of tests. Read on to learn more.</p>
 
-<h3 id="toc_0">What&#39;s included?</h3>
+<h3 id="what-39-s-included">What&#39;s included?</h3>
 
 <p>Here&#39;s a taste of what you can do with DataFu:</p>
 
@@ -74,7 +72,7 @@
 <li>And <a href="/docs/datafu/1.2.0/">lots more</a>.</li>
 </ul>
 
-<h3 id="toc_1">Example: Computing Quantiles</h3>
+<h3 id="example-computing-quantiles">Example: Computing Quantiles</h3>
 
 <p>Let&#39;s walk through an example of how we could use DataFu. We will compute
<a href="http://en.wikipedia.org/wiki/Quantile">quantiles</a> for a fake data
set. You can grab all the code for this example, including scripts to generate test data,
from this gist.</p>
 
@@ -85,7 +83,7 @@
 <p>We can use DataFu to compute quantiles using the <a href="/docs/datafu/1.2.0/datafu/pig/stats/Quantile.html">Quantile
UDF</a>. The constructor for the UDF takes the quantiles to be computed. In this case
we provide 0.25, 0.5, and 0.75 to compute the 25th, 50th, and 75th percentiles (a.k.a <a
href="http://en.wikipedia.org/wiki/Quartile">quartiles</a>). We also provide 0.0
and 1.0 to compute the min and max.</p>
 
 <p>Quantile UDF example script:</p>
-<pre class="highlight pig"><span class="k">define</span> <span class="n">Quartile</span>
<span class="n">datafu</span><span class="p">.</span><span class="n">pig</span><span
class="p">.</span><span class="n">stats</span><span class="p">.</span><span
class="n">Quantile</span><span class="p">(</span><span class="s1">'0.0'</span><span
class="p">,</span><span class="s1">'0.25'</span><span class="p">,</span><span
class="s1">'0.5'</span><span class="p">,</span><span class="s1">'0.75'</span><span
class="p">,</span><span class="s1">'1.0'</span><span class="p">);</span>
+<pre class="highlight pig"><code><span class="k">define</span> <span
class="n">Quartile</span> <span class="n">datafu</span><span class="p">.</span><span
class="n">pig</span><span class="p">.</span><span class="n">stats</span><span
class="p">.</span><span class="n">Quantile</span><span class="p">(</span><span
class="s1">'0.0'</span><span class="p">,</span><span class="s1">'0.25'</span><span
class="p">,</span><span class="s1">'0.5'</span><span class="p">,</span><span
class="s1">'0.75'</span><span class="p">,</span><span class="s1">'1.0'</span><span
class="p">);</span>
 
 <span class="n">temperature</span> <span class="o">=</span> <span
class="k">LOAD</span> <span class="s1">'temperature.txt'</span> <span
class="k">AS</span> <span class="p">(</span><span class="n">id</span><span
class="p">:</span><span class="n">chararray</span><span class="p">,</span>
<span class="n">temp</span><span class="p">:</span><span class="n">double</span><span
class="p">);</span>
 
@@ -97,20 +95,22 @@
 <span class="p">}</span>
 
 <span class="k">DUMP</span> <span class="n">temperature_quartiles</span>
-</pre>
+</code></pre>
+
 <p>Quantile UDF example output, 10,000 measurements:</p>
-<pre class="highlight text">(1,(41.58171454288797,56.559375253601715,59.91093458980706,63.335574106080365,79.2841731889925))
+<pre class="highlight plaintext"><code>(1,(41.58171454288797,56.559375253601715,59.91093458980706,63.335574106080365,79.2841731889925))
 (2,(14.393515179526304,43.39558395897533,50.081758806889766,56.54245916209963,91.03574746442487))
 (3,(29.865710766927595,37.86257868882021,39.97075970657039,41.989584898364704,51.31349575866486))
-</pre>
+</code></pre>
+
 <p>The values in each row of the output are the min, 25th percentile, 50th percentile
(median), 75th percentile, and max.</p>
 
-<h3 id="toc_2">StreamingQuantile UDF</h3>
+<h3 id="streamingquantile-udf">StreamingQuantile UDF</h3>
 
 <p>The Quantile UDF determines the quantiles by reading the input values for a key
in sorted order and picking out the quantiles based on the size of the input DataBag. Alternatively
we can estimate quantiles using the <a href="/docs/datafu/1.2.0/datafu/pig/stats/StreamingQuantile.html">StreamingQuantile
UDF</a>, contributed to DataFu by <a href="http://www.linkedin.com/pub/josh-wills/0/82b/138">Josh
Wills of Cloudera</a>, which does not require that the input data be sorted.</p>
 
 <p>StreamingQuantile UDF example script:</p>
-<pre class="highlight pig"><span class="k">define</span> <span class="n">Quartile</span>
<span class="n">datafu</span><span class="p">.</span><span class="n">pig</span><span
class="p">.</span><span class="n">stats</span><span class="p">.</span><span
class="n">StreamingQuantile</span><span class="p">(</span><span class="s1">'0.0'</span><span
class="p">,</span><span class="s1">'0.25'</span><span class="p">,</span><span
class="s1">'0.5'</span><span class="p">,</span><span class="s1">'0.75'</span><span
class="p">,</span><span class="s1">'1.0'</span><span class="p">);</span>
+<pre class="highlight pig"><code><span class="k">define</span> <span
class="n">Quartile</span> <span class="n">datafu</span><span class="p">.</span><span
class="n">pig</span><span class="p">.</span><span class="n">stats</span><span
class="p">.</span><span class="n">StreamingQuantile</span><span class="p">(</span><span
class="s1">'0.0'</span><span class="p">,</span><span class="s1">'0.25'</span><span
class="p">,</span><span class="s1">'0.5'</span><span class="p">,</span><span
class="s1">'0.75'</span><span class="p">,</span><span class="s1">'1.0'</span><span
class="p">);</span>
 
 <span class="n">temperature</span> <span class="o">=</span> <span
class="k">LOAD</span> <span class="s1">'temperature.txt'</span> <span
class="k">AS</span> <span class="p">(</span><span class="n">id</span><span
class="p">:</span><span class="n">chararray</span><span class="p">,</span>
<span class="n">temp</span><span class="p">:</span><span class="n">double</span><span
class="p">);</span>
 
@@ -122,39 +122,43 @@
 <span class="p">}</span>
 
 <span class="k">DUMP</span> <span class="n">temperature_quartiles</span>
-</pre>
+</code></pre>
+
 <p>StreamingQuantile UDF example output, 10,000 measurements:</p>
-<pre class="highlight text">(1,(41.58171454288797,56.24183579452584,59.61727093346221,62.919576028265375,79.2841731889925))
+<pre class="highlight plaintext"><code>(1,(41.58171454288797,56.24183579452584,59.61727093346221,62.919576028265375,79.2841731889925))
 (2,(14.393515179526304,42.55929349057328,49.50432161293486,56.020101184758644,91.03574746442487))
 (3,(29.865710766927595,37.64744333815733,39.84941055349095,41.77693877565934,51.31349575866486))
-</pre>
+</code></pre>
+
 <p>Notice that the 25th, 50th, and 75th percentile values computed by StreamingQuantile
are fairly close to the exact values computed by Quantile.</p>
 
-<h3 id="toc_3">Accuracy vs. Runtime</h3>
+<h3 id="accuracy-vs-runtime">Accuracy vs. Runtime</h3>
 
 <p>StreamingQuantile samples the data with in-memory buffers. It implements the <a
href="http://pig.apache.org/docs/r0.7.0/udf.html#Accumulator+Interface">Accumulator interface</a>,
which makes it much more efficient than the Quantile UDF for very large input data. Where
Quantile needs access to all the input data, StreamingQuantile can be fed the data incrementally.
With Quantile, the input data will be spilled to disk as the DataBag is materialized if it
is too large to fit in memory. For very large input data, this can be significant.</p>
 
 <p>To demonstrate this, we can change our experiment so that instead of processing
three sets of 10,000 measurements, we will process three sets of 1 billion. Let’s compare
the output of Quantile and StreamingQuantile on this data set:</p>
 
 <p>Quantile UDF example output, 1 billion measurements:</p>
-<pre class="highlight text">(1,(30.524038,56.62764,60.000134,63.372384,90.561695))
+<pre class="highlight plaintext"><code>(1,(30.524038,56.62764,60.000134,63.372384,90.561695))
 (2,(-9.845137,43.25512,49.999536,56.74441,109.714687))
 (3,(21.564769,37.976644,40.000025,42.023622,58.057268))
-</pre>
+</code></pre>
+
 <p>StreamingQuantile UDF example output, 1 billion measurements:</p>
-<pre class="highlight text">(1,(30.524038,55.993967,59.488968,62.775554,90.561695))
+<pre class="highlight plaintext"><code>(1,(30.524038,55.993967,59.488968,62.775554,90.561695))
 (2,(-9.845137,41.95725,48.977708,55.554239,109.714687))
 (3,(21.564769,37.569332,39.692373,41.666762,58.057268))
-</pre>
+</code></pre>
+
 <p>The 25th, 50th, and 75th percentile values computed using StreamingQuantile are
only estimates, but they are pretty close to the exact values computed with Quantile. With
StreamingQuantile and Quantile there is a tradeoff between accuracy and runtime. The script
using Quantile takes <strong>5 times as long</strong> to run as the one using
StreamingQuantile when the input is the three sets of 1 billion measurements.</p>
 
-<h3 id="toc_4">Testing</h3>
+<h3 id="testing">Testing</h3>
 
 <p>DataFu has a suite of unit tests for each UDF. Instead of just testing the Java
code for a UDF directly, which might overlook issues with the way the UDF works in an actual
Pig script, we used <a href="http://pig.apache.org/docs/r0.8.1/pigunit.html">PigUnit</a>
to do our testing. This let us run Pig scripts locally and still integrate our tests into
a framework such as <a href="http://www.junit.org/">JUnit</a> or <a href="http://testng.org/">TestNG</a>.</p>
 
 <p>We have also integrated the code coverage tracking tool <a href="http://cobertura.sourceforge.net/">Cobertura</a>
into our Ant build file. This helps us flag areas in DataFu which lack sufficient testing.</p>
 
-<h3 id="toc_5">Conclusion</h3>
+<h3 id="conclusion">Conclusion</h3>
 
 <p>We hope this gives you a taste of what you can do with DataFu. We are accepting
contributions, so if you are interested in helping out, please fork the code and send us your
pull requests!</p>
 
@@ -163,8 +167,9 @@
 </div>
 
     
-      <div class="footer">
-Copyright &copy; 2011-2014 <a href="http://www.apache.org/licenses/">The Apache Software Foundation</a>. <br>
+      
+<div class="footer">
+Copyright &copy; 2011-2015 <a href="http://www.apache.org/licenses/">The Apache Software Foundation</a>. <br>
 Apache DataFu, DataFu, Apache Pig, Apache Hadoop, Hadoop, Apache, and the Apache feather
logo are either registered trademarks or trademarks of the Apache Software Foundation in the
United States and other countries.
 </div>
 

Modified: incubator/datafu/site/blog/2013/01/24/datafu-the-wd-40-of-big-data.html
URL: http://svn.apache.org/viewvc/incubator/datafu/site/blog/2013/01/24/datafu-the-wd-40-of-big-data.html?rev=1709884&r1=1709883&r2=1709884&view=diff
==============================================================================
--- incubator/datafu/site/blog/2013/01/24/datafu-the-wd-40-of-big-data.html (original)
+++ incubator/datafu/site/blog/2013/01/24/datafu-the-wd-40-of-big-data.html Wed Oct 21 17:00:40 2015
@@ -1,3 +1,5 @@
+
+
 <!doctype html>
 <html>
   <head>
@@ -10,11 +12,9 @@
     <!-- Use title if it's in the page YAML frontmatter -->
     <title>DataFu, The WD-40 of Big Data</title>
     
-    <link href="/stylesheets/all.css" media="screen" rel="stylesheet" type="text/css" />
-<link href="/stylesheets/highlight.css" media="screen" rel="stylesheet" type="text/css" />
-    <script src="/javascripts/all.js" type="text/javascript"></script>
+    <link href="/stylesheets/all.css" rel="stylesheet" /><link href="/stylesheets/highlight.css" rel="stylesheet" />
+    <script src="/javascripts/all.js"></script>
 
-    
     <script type="text/javascript">
       var _gaq = _gaq || [];
       _gaq.push(['_setAccount', 'UA-30533336-2']);
@@ -26,14 +26,14 @@
         var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
       })();
     </script>
-    
   </head>
   
   <body class="blog blog_2013 blog_2013_01 blog_2013_01_24 blog_2013_01_24_datafu-the-wd-40-of-big-data">
 
     <div class="container">
 
-      <div class="header">
+      
+<div class="header">
 
   <ul class="nav nav-pills pull-right">
     <li><a href="/">Home</a></li>
@@ -49,9 +49,7 @@
   <article class="col-lg-10">
     <h1>DataFu, The WD-40 of Big Data</h1>
     <h5 class="text-muted"><time>Jan 24, 2013</time></h5>
-    
       <h5 class="text-muted">Matthew Hayes, Sam Shah</h5>
-    
 
     <hr>
 
@@ -90,10 +88,10 @@ G = foreach F generate
 
 <p>You can grab sample data and code you can run on your own for this sessionization
example below.</p>
 
-<h3 id="toc_0">Sessionization Example</h3>
+<h3 id="sessionization-example">Sessionization Example</h3>
 
 <p>Suppose that we have a stream of page views from which we have extracted a member
ID and UNIX timestamp. It might look something like this:</p>
-<pre class="highlight text">memberId timestamp      url
+<pre class="highlight plaintext"><code>memberId timestamp      url
 1        1357718725941  /
 1        1357718871442  /profile
 1        1357719038706  /inbox
@@ -102,11 +100,12 @@ G = foreach F generate
 2        1357752955401  /inbox
 2        1357752982385  /profile
 ...
-</pre>
+</code></pre>
+
 <p>The full data set for this example can be found <a href="https://gist.github.com/raw/4614332/8231534822295e4626af75b3341239177ec44fbe/clicks.csv">here</a>.</p>
 
 <p>Using DataFu we can assign session IDs to each of these events and group by session
ID in order to compute the length of each session. From there we can complete the exercise
by simply applying the statistics UDFs provided by DataFu.</p>
-<pre class="highlight pig"><span class="k">REGISTER</span> <span class="n">piggybank</span><span
class="p">.</span><span class="n">jar</span><span class="p">;</span>
+<pre class="highlight pig"><code><span class="k">REGISTER</span>
<span class="n">piggybank</span><span class="p">.</span><span class="n">jar</span><span
class="p">;</span>
 <span class="k">REGISTER</span> <span class="n">datafu</span><span
class="o">-</span><span class="mi">0</span><span class="p">.</span><span
class="mi">0</span><span class="p">.</span><span class="mi">6</span><span
class="p">.</span><span class="n">jar</span><span class="p">;</span>
 <span class="k">REGISTER</span> <span class="n">guava</span><span
class="o">-</span><span class="mi">13</span><span class="p">.</span><span
class="mi">0</span><span class="p">.</span><span class="mi">1</span><span
class="p">.</span><span class="n">jar</span><span class="p">;</span>
<span class="c1">-- needed by StreamingQuantile
 </span>
@@ -149,16 +148,18 @@ G = foreach F generate
 
 <span class="k">DUMP</span> <span class="n">session_stats</span>
 <span class="c1">--(15.737532575757575,31.29552045993877,(2.848041666666667),(14.648516666666666,31.88788333333333,86.69525))
-</span></pre>
-<p>This is just a taste. There’s plenty more in the library for you to peruse.
Take a look <a href="http://data.linkedin.com/opensource/datafu">here</a>. DataFu
is freely available under the Apache 2 license. We welcome contributions, so please send us
your pull requests!</p>
+</span></code></pre>
+
+<p>This is just a taste. There’s plenty more in the library for you to peruse.
Take a look <a href="/docs/datafu/guide.html">here</a>. DataFu is freely available
under the Apache 2 license. We welcome contributions, so please send us your pull requests!</p>
 
 
   </article>
 </div>
 
     
-      <div class="footer">
-Copyright &copy; 2011-2014 <a href="http://www.apache.org/licenses/">The Apache Software Foundation</a>. <br>
+      
+<div class="footer">
+Copyright &copy; 2011-2015 <a href="http://www.apache.org/licenses/">The Apache Software Foundation</a>. <br>
 Apache DataFu, DataFu, Apache Pig, Apache Hadoop, Hadoop, Apache, and the Apache feather
logo are either registered trademarks or trademarks of the Apache Software Foundation in the
United States and other countries.
 </div>
 


