datafu-commits mailing list archives

From mha...@apache.org
Subject [2/2] incubator-datafu git commit: Update documentation and fix various issues
Date Wed, 21 Oct 2015 16:55:28 GMT
Update documentation and fix various issues

I've reviewed the website content and made many changes to ensure it reflects the current state of the code base.  Most of the changes update the Hourglass instructions.  I also made some changes to the Hourglass build to support the demo, which was lost in the migration from Ant to Gradle.  I've also fixed many of the links so they point to the right location.


Project: http://git-wip-us.apache.org/repos/asf/incubator-datafu/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-datafu/commit/87f55b42
Tree: http://git-wip-us.apache.org/repos/asf/incubator-datafu/tree/87f55b42
Diff: http://git-wip-us.apache.org/repos/asf/incubator-datafu/diff/87f55b42

Branch: refs/heads/master
Commit: 87f55b425035201036e5823bdcd081b6ec7c962f
Parents: 6435437
Author: Matthew Hayes <matthew.terence.hayes@gmail.com>
Authored: Thu Oct 15 18:37:07 2015 -0700
Committer: Matthew Hayes <matthew.terence.hayes@gmail.com>
Committed: Wed Oct 21 09:55:14 2015 -0700

----------------------------------------------------------------------
 README.md                                       |   4 +-
 datafu-hourglass/README.md                      | 311 +------------------
 datafu-hourglass/build.gradle                   |  97 ++++--
 .../datafu/pig/stats/HyperLogLogPlusPlus.java   |  34 +-
 site/Gemfile                                    |   4 +-
 site/Gemfile.lock                               | 200 +++++++-----
 site/README.md                                  |   2 +-
 site/lib/pig.rb                                 |  58 ++--
 ...-01-24-datafu-the-wd-40-of-big-data.markdown |  18 +-
 site/source/blog/2013-09-04-datafu-1-0.markdown | 110 +++----
 ...cremental-data-processing-in-hadoop.markdown | 126 ++++----
 .../source/community/contributing.html.markdown |  83 +++++
 .../community/mailing-lists.html.markdown       |   3 +-
 .../docs/datafu/contributing.html.markdown      |  68 ----
 .../datafu/getting-started.html.markdown.erb    |  33 +-
 site/source/docs/datafu/guide.html.markdown.erb |   4 +-
 .../docs/datafu/javadoc.html.markdown.erb       |   2 +-
 .../docs/hourglass/contributing.html.markdown   |  41 ---
 .../hourglass/getting-started.html.markdown.erb |  87 ++----
 .../docs/hourglass/javadoc.html.markdown.erb    |   2 +-
 site/source/docs/quick-start.html.markdown.erb  |  89 ++++++
 site/source/index.markdown.erb                  |   6 +-
 site/source/layouts/_docs_nav.erb               |   9 +-
 site/source/layouts/_footer.erb                 |   2 +-
 24 files changed, 606 insertions(+), 787 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/87f55b42/README.md
----------------------------------------------------------------------
diff --git a/README.md b/README.md
index 0fa5112..d2aac70 100644
--- a/README.md
+++ b/README.md
@@ -24,7 +24,7 @@ If you'd like to jump in and get started, check out the corresponding guides for
 * [DataFu 1.0](http://datafu.incubator.apache.org/blog/2013/09/04/datafu-1-0.html)
 * [DataFu's Hourglass: Incremental Data Processing in Hadoop](http://datafu.incubator.apache.org/blog/2013/10/03/datafus-hourglass-incremental-data-processing-in-hadoop.html)
 
-## Presentations 
+## Presentations
 
 * [A Brief Tour of DataFu](http://www.slideshare.net/matthewterencehayes/datafu)
 * [Building Data Products at LinkedIn with DataFu](http://www.slideshare.net/matthewterencehayes/building-data-products-at-linkedin-with-datafu)
@@ -101,7 +101,7 @@ To run tests for a single class, use the `test.single` property.  For example, t
 
     ./gradlew :datafu-pig:test -Dtest.single=QuantileTests
 
-The tests can also be run from within eclipse.  You'll need to install the TestNG plugin for Eclipse.  See: http://testng.org/doc/download.html. 
+The tests can also be run from within eclipse.  You'll need to install the TestNG plugin for Eclipse.  See: http://testng.org/doc/download.html.
 
 Potential issues and workaround:
 * You may run out of heap when executing tests in Eclipse. To fix this adjust your heap settings for the TestNG plugin. Go to Eclipse->Preferences. Select TestNG->Run/Debug. Add "-Xmx1G" to the JVM args.

http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/87f55b42/datafu-hourglass/README.md
----------------------------------------------------------------------
diff --git a/datafu-hourglass/README.md b/datafu-hourglass/README.md
index 31d2361..3b4a908 100644
--- a/datafu-hourglass/README.md
+++ b/datafu-hourglass/README.md
@@ -1,312 +1,3 @@
 # DataFu: Hourglass
 
-Hourglass is a framework for incrementally processing partitioned data sets in Hadoop.  
-
-## Quick Start Example
-
-Let's walk through a use case where Hourglass is helpful.  Suppose that we have a website that tracks a particular event,
-and for each event a member ID is recorded.  These events are collected and stored in HDFS in Avro under paths having the
-format /data/event/yyyy/MM/dd.  Suppose for this example our Avro schema is:
-
-```json
-{
-  "type" : "record",
-  "name" : "ExampleEvent",
-  "namespace" : "datafu.hourglass.test",
-  "fields" : [ {
-    "name" : "id",
-    "type" : "long",
-    "doc" : "ID"
-  } ]
-}
-```
-
-Suppose that the goal is to count how many times this event has occurred per member over the entire history and produce
-a daily report summarizing these counts.  One solution is to simply consume all data under /data/event each day and 
-aggregate by member ID.  However, most of the data is the same day to day.  The only difference is that each day a new
-day of data appears in HDFS.  So while this solution works, it is wasteful.  Wouldn't it be better if we could merge
-the previous result with the new data?  With Hourglass you can.
-
-To continue our example, let's say there are two days of data currently available, 2013/03/15 and 2013/03/16, and that
-their contents are:
-
-```
-2013/03/15:
-{"id": 1}
-{"id": 1}
-{"id": 1}
-{"id": 2}
-{"id": 3}
-{"id": 3}
-
-2013/03/16:
-{"id": 1}
-{"id": 1}
-{"id": 2}
-{"id": 2}
-{"id": 3}
-```
-
-Let's aggregate the counts by member ID using Hourglass.  To perform the aggregation we will use `PartitionCollapsingIncrementalJob`,
-which essentially takes a partitioned data set like the one we have and collapses all the partitions together into a single output.
-
-```Java
-PartitionCollapsingIncrementalJob job = new PartitionCollapsingIncrementalJob(Example.class);
-```
-
-Next we will define the schemas for the key and value used by the job.  The key affects how data is grouped in the reducer when
-we perform the aggregation.  In this case it will be the member ID.  The value is the piece of data being aggregated, which will
-be an integer representing the count in this case.  Hourglass uses Avro for its data types.  Let's define the schemas:
-
-```Java
-final String namespace = "com.example";
-
-final Schema keySchema = Schema.createRecord("Key",null,namespace,false);
-keySchema.setFields(Arrays.asList(new Field("member_id",Schema.create(Type.LONG),null,null)));
-final String keySchemaString = keySchema.toString(true);
-
-final Schema valueSchema = Schema.createRecord("Value",null,namespace,false);
-valueSchema.setFields(Arrays.asList(new Field("count",Schema.create(Type.INT),null,null)));
-final String valueSchemaString = valueSchema.toString(true);
-```
-
-This produces schemas having the following representation:
-
-```json
-{
-  "type" : "record",
-  "name" : "Key",
-  "namespace" : "com.example",
-  "fields" : [ {
-    "name" : "member_id",
-    "type" : "long"
-  } ]
-}
-
-{
-  "type" : "record",
-  "name" : "Value",
-  "namespace" : "com.example",
-  "fields" : [ {
-    "name" : "count",
-    "type" : "int"
-  } ]
-}
-```
-
-Now we can tell the job what our schemas are.  Hourglass allows two different value types.  One is the intermediate value type
-that is output by the mapper and combiner.  The other is the output value type, the output of the reducer.  In this case we
-will use the same value type for each.
-
-```Java
-job.setKeySchema(keySchema);
-job.setIntermediateValueSchema(valueSchema);
-job.setOutputValueSchema(valueSchema);
-```
-
-Next we will tell Hourglass where to find the data, where to write the data, and that we want to reuse the previous output.
-
-```Java
-job.setInputPaths(Arrays.asList(new Path("/data/event")));
-job.setOutputPath(new Path("/output"));
-job.setReusePreviousOutput(true);
-```
-
-Now let's get into some application logic.  The mapper will produce a key-value pair from each record consisting of
-the member ID and a count, which for each input record will just be `1`.
-
-```Java
-job.setMapper(new Mapper<GenericRecord,GenericRecord,GenericRecord>() 
-{
-  private transient Schema kSchema;
-  private transient Schema vSchema;
-  
-  @Override
-  public void map(GenericRecord input,
-                  KeyValueCollector<GenericRecord, GenericRecord> collector) throws IOException,
-      InterruptedException
-  {
-    if (kSchema == null) kSchema = new Schema.Parser().parse(keySchemaString);
-    if (vSchema == null) vSchema = new Schema.Parser().parse(valueSchemaString);
-    GenericRecord key = new GenericData.Record(kSchema);
-    key.put("member_id", input.get("id"));
-    GenericRecord value = new GenericData.Record(vSchema);
-    value.put("count", 1);
-    collector.collect(key,value);
-  }      
-});
-```
-
-An accumulator is responsible for aggregating this data.  Records will be grouped by member and then passed to the accumulator
-one-by-one.  The accumulator keeps a running count and adds each input count to it.  When all data has been passed to it
-the `getFinal()` method will be called, which returns the output record containing the count.
-
-```Java
-job.setReducerAccumulator(new Accumulator<GenericRecord,GenericRecord>() 
-{
-  private transient int count;
-  private transient Schema vSchema;
-  
-  @Override
-  public void accumulate(GenericRecord value)
-  {
-    this.count += (Integer)value.get("count");
-  }
-
-  @Override
-  public GenericRecord getFinal()
-  {
-    if (vSchema == null) vSchema = new Schema.Parser().parse(valueSchemaString);
-    GenericRecord output = new GenericData.Record(vSchema);
-    output.put("count", count);
-    return output;
-  }
-
-  @Override
-  public void cleanup()
-  {
-    this.count = 0;
-  }      
-});
-```
-
-Since the intermediate and output values have the same schema, the accumulator can also be used for the combiner,
-so let's indicate that we want it to be used for that:
-
-```Java
-job.setCombinerAccumulator(job.getReducerAccumulator());
-job.setUseCombiner(true);
-```
-
-Finally, we run the job.
-
-```Java
-job.run();
-```
-
-When we inspect the output we find that the counts match what we expect:
-
-```
-{"key": {"member_id": 1}, "value": {"count": 5}}
-{"key": {"member_id": 2}, "value": {"count": 3}}
-{"key": {"member_id": 3}, "value": {"count": 3}}
-```
-
-Now suppose that a new day of data becomes available:
-
-```
-2013/03/17:
-{"id": 1}
-{"id": 1}
-{"id": 2}
-{"id": 2}
-{"id": 2}
-{"id": 3}
-{"id": 3}
-```
-
-Let's run the job again.
-Since Hourglass already has a result for the previous day, it consumes the new day of input and the previous output, rather
-than all the input data it already processed.  The previous output is passed to the accumulator implementation where it is
-aggregated with the new data.  This produces the output we expect:
-
-```json
-{"key": {"member_id": 1}, "value": {"count": 7}}
-{"key": {"member_id": 2}, "value": {"count": 6}}
-{"key": {"member_id": 3}, "value": {"count": 5}}
-```
-
-In this example we only have a few days of input data, so the impact of incrementally processing the new data is small.
-However, as the size of the input data grows, the benefit of incrementally processing data becomes very significant.
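
As a quick sanity check (plain Java, independent of Hourglass; the class name `CountCheck` is just for illustration), tallying the ids across the three sample days reproduces the counts shown above:

```java
import java.util.HashMap;
import java.util.Map;

public class CountCheck {
  // Count occurrences of each id across any number of daily partitions.
  static Map<Long, Integer> tally(long[]... days) {
    Map<Long, Integer> counts = new HashMap<>();
    for (long[] day : days) {
      for (long id : day) {
        counts.merge(id, 1, Integer::sum);
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    long[] day15 = {1, 1, 1, 2, 3, 3};
    long[] day16 = {1, 1, 2, 2, 3};
    long[] day17 = {1, 1, 2, 2, 2, 3, 3};
    // Matches the Hourglass output after the third day is added.
    System.out.println(tally(day15, day16, day17)); // {1=7, 2=6, 3=5}
  }
}
```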
-
-## Motivation
-
-Data sets that are temporal in nature are very often stored in such a way that each directory corresponds to a separate
-time range.  For example, one convention could be to divide the data by day.  One benefit of a partitioning scheme such 
-as this is that it makes it possible to consume a subset of the data for specific time ranges, instead of consuming
-the entire data set.
-
-Very often computations on data such as this are performed daily over sliding windows.  For example, a metric of interest
-may be the last time each member logged into the site.  The most straightforward implementation is to consume all login events
-across all days.  This is inefficient, however, since day-to-day the input data is mostly the same.
-A more efficient solution is to merge the previous output with new data since the last run.  As a result there is less 
-data to process.  Another metric of interest may be the number of pages viewed per member over the last 30 days.  
-A straightforward implementation is to consume the page view data over the last 30 days each time the job runs.
-However, again, the input data is mostly the same day-to-day.
-Instead, given the previous output of the job, the new output can be produced by adding the new data and subtracting the old data.
-
-Although these incremental jobs are conceptually easy to understand, the implementations can be complex.  
-Hourglass defines an easy-to-use programming model and provides jobs for incrementally processing partitioned data as just described.
-It handles the underlying details and complexity of an incremental system so that programmers can focus on
-application logic.
-
-## Capabilities
-
-Hourglass uses Avro for input, intermediate, and output data.  Input data must be partitioned by day according to the
-naming convention yyyy/MM/dd.  Joining multiple inputs is supported.
-
-Hourglass provides two types of jobs: partition-preserving and partition-collapsing.  A *partition-preserving* job 
-consumes input data partitioned by day and produces output data partitioned by day. This is equivalent to running a 
-MapReduce job for each individual day of input data, but much more efficient.  It compares the input data against 
-the existing output data and only processes input data with no corresponding output. A *partition-collapsing* job 
-consumes input data partitioned by day and produces a single output.  What distinguishes this job from a standard 
-MapReduce job is that it can reuse the previous output.  This enables it to process data much more efficiently.  
-Rather than consuming all input data on each run, it can consume only the new data since the previous run and
-merge it with the previous output.  Since the partition-preserving job outputs partitioned data, the two jobs
-can be chained together.
-
-Given these two jobs, processing data over sliding windows can be done much more efficiently.  There are two types
-of sliding windows of particular interest that can be implemented using Hourglass.
-A *fixed-start* sliding window has a start date that remains the same over multiple runs and an end date that is 
-flexible, where the end date advances forward as new input data becomes available.
-A *fixed-length* sliding window has a window length that remains the same over multiple runs and flexible start
-and end dates, where the start and end advance forward as new input data becomes available.
-Hourglass makes defining these sliding windows easy.
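
As a rough sketch of how these two windows might be configured (the setter names `setStartDate` and `setNumDays`, and the classes `LastLogin` and `PageViews`, are illustrative assumptions; consult the Hourglass javadocs for the actual API):

```Java
// Fixed-start window: the start date is pinned and the end advances
// as new daily partitions arrive.  (setStartDate is an assumed method name.)
PartitionCollapsingIncrementalJob lastLogin = new PartitionCollapsingIncrementalJob(LastLogin.class);
lastLogin.setInputPaths(Arrays.asList(new Path("/data/login")));
lastLogin.setOutputPath(new Path("/output/last_login"));
lastLogin.setStartDate(new GregorianCalendar(2013, Calendar.MARCH, 15).getTime());
lastLogin.setReusePreviousOutput(true);

// Fixed-length window: both ends advance, keeping only the last 30 days.
// (setNumDays is an assumed method name.)
PartitionCollapsingIncrementalJob pageViews = new PartitionCollapsingIncrementalJob(PageViews.class);
pageViews.setInputPaths(Arrays.asList(new Path("/data/page_view")));
pageViews.setOutputPath(new Path("/output/page_views"));
pageViews.setNumDays(30);
pageViews.setReusePreviousOutput(true);
```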
-
-An example of a fixed-start sliding window problem is computing the last login time for all members of a website
-using a login event that records the login time.  This could be solved efficiently by using a partition-collapsing
-job, which is capable of reusing the previous output and merging it with new login data as it arrives.
-
-An example of a fixed-length sliding window problem is computing the pages viewed per member of the last 30 days
-using a page-view event.  This could also be solved efficiently using a partition-collapsing job, which is
-capable of reusing the previous output and merging it with the new page-view data while subtracting off the old
-page-view data.
-
-For some fixed-length sliding window problems it is not possible to subtract off the oldest day of data.
-For example, suppose the goal is to estimate the distinct number of members who have logged into a website in the last
-30 days using a login event that records the member ID.  A HyperLogLog counter could be used to estimate the cardinality.
-The internal data for this counter could be serialized to bytes and stored as output alongside the estimated count.
-However, although multiple HyperLogLog counters can be merged together, they cannot be subtracted or unmerged.
-In other words, the operation is not reversible.  So a partition-collapsing job by itself could not be used.
-However, we could chain together a partition-preserving and a partition-collapsing job.  The first job would estimate cardinality
-per day and store the value with the counter's byte representation.  The second job would merge the per-day counters
-together to produce the estimate over the full 30-day window.  This makes the computation extremely efficient.
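
The irreversibility can be illustrated with an analogous plain-Java structure: like a merged HyperLogLog counter, a set union can always absorb another day's data but cannot give back one day's contribution (class name `MergeNotSubtract` is illustrative):

```java
import java.util.HashSet;
import java.util.Set;

public class MergeNotSubtract {
  public static void main(String[] args) {
    // Distinct member ids seen on two days.
    Set<Long> day1 = Set.of(1L, 2L, 3L);
    Set<Long> day2 = Set.of(2L, 3L, 4L);

    // Merging is well-defined: the union covers both days.
    Set<Long> merged = new HashSet<>(day1);
    merged.addAll(day2);
    System.out.println(merged.size()); // 4 distinct members over both days

    // "Subtracting" day1 afterwards is not: removing its ids also drops
    // members 2 and 3, which day2 contributed as well.
    Set<Long> attempt = new HashSet<>(merged);
    attempt.removeAll(day1);
    System.out.println(attempt.size()); // 1, yet day2 alone has 3 distinct members
  }
}
```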
-
-## Programming Model
-
-To implement an incremental job, a developer must specify Avro schemas for the key, intermediate value, and output value types.
-The key and intermediate value types are used for the output of the mapper and an optional combiner.  The key and output value
-types are used for the output of the reducer.  The input schemas are automatically determined by the job by inspecting the
-input data.
-
-A developer must also define their application logic through interfaces that are based on the MapReduce programming model.
-The mapper is defined through a *Mapper* interface, which given a record produces zero or more key-value pairs.
-The key-value pairs must conform to the key and intermediate value schemas just mentioned.
-
-Reduce is defined through an *Accumulator* interface, which is passed all the records for a given key and then returns
-the final result.  Both the combiner and reducer use an Accumulator for the reduce functionality.
-This is similar to the standard reduce function of MapReduce.  The key difference is that
-no more than one record can be returned.  The input records to the Accumulator are of the intermediate value type; 
-the output is of the output value type.
-
-If the intermediate and output value types are the same then the Accumulator can
-naturally be used to merge the new input data with the previous output.  However if they are different then a class implementing
-the *Merge* interface must be provided.  Merge is a binary operation on two records of the output value type that returns a
-record as a result.
-
-The other case where an implementation of Merge must be provided is when output is reused for a partition-collapsing job
-over a fixed-length sliding window.  Merge in this case is used to essentially subtract old output from the current output.
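
To make the Merge contract concrete, here is a plain-Java sketch using a stand-in `Count` record in place of Avro's GenericRecord (the class and method names below are illustrative, not the actual Hourglass interface):

```java
public class MergeSketch {
  // Stand-in for an Avro output record with a single "count" field.
  record Count(int count) {}

  // Binary merge: combine previous output with newly produced output.
  static Count merge(Count previous, Count current) {
    return new Count(previous.count() + current.count());
  }

  // For a fixed-length window, merge is also used in reverse: subtract
  // output for days that have slid out of the window.
  static Count unmerge(Count current, Count old) {
    return new Count(current.count() - old.count());
  }

  public static void main(String[] args) {
    Count merged = merge(new Count(5), new Count(2));
    Count windowed = unmerge(merged, new Count(3));
    System.out.println(merged.count() + " " + windowed.count()); // 7 4
  }
}
```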
-
-## Contribute
-
-The source code is available under the Apache 2.0 license.  Contributions are welcome.
\ No newline at end of file
+Hourglass is a framework for incrementally processing partitioned data sets in Hadoop.  Please see [Getting Started](http://datafu.incubator.apache.org/docs/hourglass/getting-started.html) and [Concepts](http://datafu.incubator.apache.org/docs/hourglass/concepts.html) for more information.

http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/87f55b42/datafu-hourglass/build.gradle
----------------------------------------------------------------------
diff --git a/datafu-hourglass/build.gradle b/datafu-hourglass/build.gradle
index 867cc6d..027d8f4 100644
--- a/datafu-hourglass/build.gradle
+++ b/datafu-hourglass/build.gradle
@@ -32,6 +32,33 @@ buildscript {
   }
 }
 
+// Core dependencies are those that we explicitly depend on and are not picked up as transitive dependencies of Hadoop.
+configurations.create('core')
+configurations.create('testCore')
+
+// Hadoop dependencies are those that we need for compile time but are provided by Hadoop when jobs are submitted.
+configurations.create('hadoop')
+
+// Dependencies needed in order to run the demo against Hadoop.
+configurations.create('demoRuntime')
+
+configurations {
+  // compile is split into core and hadoop configurations
+  compile {
+    extendsFrom core, hadoop
+  }
+
+  testCompile {
+    extendsFrom compile, testCore
+  }
+
+  // hadoopRuntime is an alternative to runtime that excludes all the hadoop jars (in the hadoop configuration) and
+  // their transitive dependencies.
+  demoRuntime {
+    extendsFrom core, testCore
+  }
+}
+
 cleanEclipse {
   doLast {
     delete ".apt_generated"
@@ -41,34 +68,68 @@ cleanEclipse {
   }
 }
 
+task testJar(type: Jar) {
+  classifier = 'tests'
+  from sourceSets.test.output
+}
+
+def demoDependenciesDir = new File("$buildDir/demo_dependencies")
+
+task replace_demo_dependencies_dir() << {
+  println "Creating $demoDependenciesDir"
+
+  if (demoDependenciesDir.exists()) {
+    demoDependenciesDir.deleteDir()
+  }
+  demoDependenciesDir.mkdirs()
+}
+
+task copyDemoDependencies(type: Copy, dependsOn: replace_demo_dependencies_dir) {
+    from configurations.demoRuntime
+    into demoDependenciesDir
+    def copyDetails = []
+    eachFile { copyDetails << it }
+    doLast {
+      copyDetails.each { FileCopyDetails details ->
+        def target = new File(demoDependenciesDir, details.path)
+        if(target.exists()) {
+          target.setLastModified(details.lastModified)
+        }
+      }
+    }
+}
+
 dependencies {
   // core dependencies, listed as dependencies in pom
-  compile "log4j:log4j:$log4jVersion"
-  compile "org.json:json:$jsonVersion"
-  compile "org.apache.avro:avro:$avroVersion"
-  compile "org.apache.avro:avro-compiler:$avroVersion"
-  compile "org.apache.commons:commons-math:$commonsMathVersion"
+  core "log4j:log4j:$log4jVersion"
+  core "org.json:json:$jsonVersion"
+  core "org.apache.avro:avro:$avroVersion"
+  core "org.apache.avro:avro-compiler:$avroVersion"
+  core "org.apache.commons:commons-math:$commonsMathVersion"
 
   // needed for testing, not listed as a dependencies in pom
-  testCompile "com.clearspring.analytics:stream:$streamVersion"
+  testCore "com.clearspring.analytics:stream:$streamVersion"
+  testCompile "commons-io:commons-io:$commonsIoVersion"
   testCompile "javax.ws.rs:jsr311-api:$jsr311Version"
   testCompile "org.slf4j:slf4j-log4j12:$slf4jVersion"
-  testCompile "commons-io:commons-io:$commonsIoVersion"
   testCompile "org.testng:testng:$testngVersion"
+
+  // only needed for running the demo
+  demoRuntime "org.apache.avro:avro-tools:$avroVersion"
 }
 
 if (hadoopVersion.startsWith("2.") || hadoopVersion.startsWith("0.23.")) {
   dependencies {
     // core dependencies, listed as dependencies in pom
-    compile "org.apache.avro:avro-mapred:$avroVersion:hadoop2"
+    core "org.apache.avro:avro-mapred:$avroVersion:hadoop2"
 
     // needed for compilation and testing, not listed as a dependencies in pom
-    compile "org.apache.hadoop:hadoop-common:$hadoopVersion"
-    compile "org.apache.hadoop:hadoop-hdfs:$hadoopVersion"
-    compile "org.apache.hadoop:hadoop-mapreduce-client-jobclient:$hadoopVersion"
-    compile "org.apache.hadoop:hadoop-archives:$hadoopVersion"
-    compile "org.apache.hadoop:hadoop-auth:$hadoopVersion"
-    compile "org.apache.hadoop:hadoop-mapreduce-client-core:$hadoopVersion"
+    hadoop "org.apache.hadoop:hadoop-common:$hadoopVersion"
+    hadoop "org.apache.hadoop:hadoop-hdfs:$hadoopVersion"
+    hadoop "org.apache.hadoop:hadoop-mapreduce-client-jobclient:$hadoopVersion"
+    hadoop "org.apache.hadoop:hadoop-archives:$hadoopVersion"
+    hadoop "org.apache.hadoop:hadoop-auth:$hadoopVersion"
+    hadoop "org.apache.hadoop:hadoop-mapreduce-client-core:$hadoopVersion"
 
     // needed for testing, not listed as a dependencies in pom
     testCompile "org.apache.hadoop:hadoop-minicluster:$hadoopVersion"
@@ -76,17 +137,17 @@ if (hadoopVersion.startsWith("2.") || hadoopVersion.startsWith("0.23.")) {
 } else {
   dependencies {
     // core dependencies, listed as dependencies in pom
-    compile "org.apache.avro:avro-mapred:$avroVersion"
+    core "org.apache.avro:avro-mapred:$avroVersion"
 
     // needed for compilation and testing, not listed as a dependencies in pom
-    compile "org.apache.hadoop:hadoop-core:$hadoopVersion"
-    compile "org.apache.hadoop:hadoop-tools:$hadoopVersion"
+    hadoop "org.apache.hadoop:hadoop-core:$hadoopVersion"
+    hadoop "org.apache.hadoop:hadoop-tools:$hadoopVersion"
 
     // needed for testing, not listed as a dependencies in pom
     testCompile "org.apache.hadoop:hadoop-test:$hadoopVersion"
   }
 }
- 
+
 
 // modify the pom dependencies so we don't include hadoop and the testing related artifacts
 modifyPom {

http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/87f55b42/datafu-pig/src/main/java/datafu/pig/stats/HyperLogLogPlusPlus.java
----------------------------------------------------------------------
diff --git a/datafu-pig/src/main/java/datafu/pig/stats/HyperLogLogPlusPlus.java b/datafu-pig/src/main/java/datafu/pig/stats/HyperLogLogPlusPlus.java
index 2068801..104e99b 100644
--- a/datafu-pig/src/main/java/datafu/pig/stats/HyperLogLogPlusPlus.java
+++ b/datafu-pig/src/main/java/datafu/pig/stats/HyperLogLogPlusPlus.java
@@ -40,24 +40,24 @@ import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;
 
 /**
  * A UDF that applies the HyperLogLog++ cardinality estimation algorithm.
- * 
+ *
  * <p>
  * This uses the implementation of HyperLogLog++ from <a href="https://github.com/addthis/stream-lib" target="_blank">stream-lib</a>.
- * The HyperLogLog++ algorithm is an enhanced version of HyperLogLog as described in 
+ * The HyperLogLog++ algorithm is an enhanced version of HyperLogLog as described in
  * <a href="http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/40671.pdf">here</a>.
  * </p>
- * 
+ *
  * <p>
  * This is a streaming implementation, and therefore the input data does not need to be sorted.
  * </p>
- * 
+ *
  */
 public class HyperLogLogPlusPlus extends AlgebraicEvalFunc<Long>
 {
   private static TupleFactory mTupleFactory = TupleFactory.getInstance();
 
   private String p;
-  
+
   /**
    * Constructs a HyperLogLog++ estimator.
    */
@@ -65,11 +65,11 @@ public class HyperLogLogPlusPlus extends AlgebraicEvalFunc<Long>
   {
     this("20");
   }
-  
+
   /**
    * Constructs a HyperLogLog++ estimator.
-   * 
-   * @param par precision value
+   *
+   * @param p precision value
    */
   public HyperLogLogPlusPlus(String p)
   {
@@ -77,7 +77,7 @@ public class HyperLogLogPlusPlus extends AlgebraicEvalFunc<Long>
     this.p = p;
     cleanup();
   }
-    
+
   @Override
   public Schema outputSchema(Schema input)
   {
@@ -86,21 +86,21 @@ public class HyperLogLogPlusPlus extends AlgebraicEvalFunc<Long>
       {
         throw new RuntimeException("Expected input to have only a single field");
       }
-      
+
       Schema.FieldSchema inputFieldSchema = input.getField(0);
 
       if (inputFieldSchema.type != DataType.BAG)
       {
         throw new RuntimeException("Expected a BAG as input");
       }
-      
+
       return new Schema(new Schema.FieldSchema(null, DataType.LONG));
     }
     catch (FrontendException e) {
       throw new RuntimeException(e);
     }
   }
-  
+
   private String param = null;
   private String getParam()
   {
@@ -113,7 +113,7 @@ public class HyperLogLogPlusPlus extends AlgebraicEvalFunc<Long>
     }
     return param;
   }
-  
+
   @Override
   public String getFinal() {
       return Final.class.getName() + getParam();
@@ -133,7 +133,7 @@ public class HyperLogLogPlusPlus extends AlgebraicEvalFunc<Long>
 	public Initial() {};
 	public Initial(String p) {};
 
-	  
+
     @Override
     public Tuple exec(Tuple input) throws IOException {
       // Since Initial is guaranteed to be called
@@ -156,7 +156,7 @@ public class HyperLogLogPlusPlus extends AlgebraicEvalFunc<Long>
 	};
 	private String p;
 	public Intermediate(String p) {this.p = p;};
-	  
+
     @Override
     public Tuple exec(Tuple input) throws IOException {
       try {
@@ -179,7 +179,7 @@ public class HyperLogLogPlusPlus extends AlgebraicEvalFunc<Long>
 	};
 	private String p;
 	public Final(String p) {this.p = p;};
-	  
+
     @Override
     public Long exec(Tuple input) throws IOException {
       try {
@@ -217,5 +217,5 @@ public class HyperLogLogPlusPlus extends AlgebraicEvalFunc<Long>
     }
     return estimator;
   }
-  
+
 }

http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/87f55b42/site/Gemfile
----------------------------------------------------------------------
diff --git a/site/Gemfile b/site/Gemfile
index 0360e4f..01cb75f 100644
--- a/site/Gemfile
+++ b/site/Gemfile
@@ -17,10 +17,10 @@
 
 source 'http://rubygems.org'
 
-gem "middleman", "~>3.2.0"
+gem "middleman"
 
 # Live-reloading plugin
-gem "middleman-livereload", "~> 3.1.0"
+gem "middleman-livereload"
 
 gem "middleman-blog"
 

http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/87f55b42/site/Gemfile.lock
----------------------------------------------------------------------
diff --git a/site/Gemfile.lock b/site/Gemfile.lock
index 81a3b4b..3578a36 100644
--- a/site/Gemfile.lock
+++ b/site/Gemfile.lock
@@ -1,118 +1,147 @@
 GEM
   remote: http://rubygems.org/
   specs:
-    activesupport (3.2.16)
-      i18n (~> 0.6, >= 0.6.4)
-      multi_json (~> 1.0)
-    addressable (2.3.5)
-    atomic (1.1.14)
-    builder (3.1.4)
-    chunky_png (1.2.9)
-    coffee-script (2.2.0)
+    activesupport (4.2.4)
+      i18n (~> 0.7)
+      json (~> 1.7, >= 1.7.7)
+      minitest (~> 5.1)
+      thread_safe (~> 0.3, >= 0.3.4)
+      tzinfo (~> 1.1)
+    addressable (2.3.8)
+    builder (3.2.2)
+    capybara (2.4.4)
+      mime-types (>= 1.16)
+      nokogiri (>= 1.3.3)
+      rack (>= 1.0.0)
+      rack-test (>= 0.5.4)
+      xpath (~> 2.0)
+    chunky_png (1.3.4)
+    coffee-script (2.4.1)
       coffee-script-source
       execjs
-    coffee-script-source (1.6.3)
-    commonjs (0.2.6)
-    compass (0.12.2)
+    coffee-script-source (1.9.1.1)
+    commonjs (0.2.7)
+    compass (1.0.3)
       chunky_png (~> 1.2)
-      fssm (>= 0.2.7)
-      sass (~> 3.1)
-    em-websocket (0.5.0)
-      eventmachine (>= 0.12.9)
-      http_parser.rb (~> 0.5.3)
-    eventmachine (1.0.3)
-    execjs (1.4.0)
+      compass-core (~> 1.0.2)
+      compass-import-once (~> 1.0.5)
+      rb-fsevent (>= 0.9.3)
+      rb-inotify (>= 0.9)
+      sass (>= 3.3.13, < 3.5)
+    compass-core (1.0.3)
       multi_json (~> 1.0)
-    ffi (1.9.3)
-    fssm (0.2.10)
-    haml (4.0.4)
+      sass (>= 3.3.0, < 3.5)
+    compass-import-once (1.0.5)
+      sass (>= 3.2, < 3.5)
+    em-websocket (0.5.1)
+      eventmachine (>= 0.12.9)
+      http_parser.rb (~> 0.6.0)
+    erubis (2.7.0)
+    eventmachine (1.0.8)
+    execjs (2.6.0)
+    ffi (1.9.10)
+    haml (4.0.7)
       tilt
     hike (1.2.3)
-    http_parser.rb (0.5.3)
-    i18n (0.6.9)
-    kramdown (1.3.0)
-    less (2.2.2)
-      commonjs (~> 0.2.6)
-    libv8 (3.16.14.3)
-    listen (1.3.1)
+    hooks (0.4.1)
+      uber (~> 0.0.14)
+    http_parser.rb (0.6.0)
+    i18n (0.7.0)
+    json (1.8.3)
+    kramdown (1.9.0)
+    less (2.6.0)
+      commonjs (~> 0.2.7)
+    libv8 (3.16.14.11)
+    listen (3.0.3)
       rb-fsevent (>= 0.9.3)
       rb-inotify (>= 0.9)
-      rb-kqueue (>= 0.2)
-    middleman (3.2.0)
-      coffee-script (~> 2.2.0)
-      compass (>= 0.12.2)
-      execjs (~> 1.4.0)
-      haml (>= 3.1.6)
+    middleman (3.4.0)
+      coffee-script (~> 2.2)
+      compass (>= 1.0.0, < 2.0.0)
+      compass-import-once (= 1.0.5)
+      execjs (~> 2.0)
+      haml (>= 4.0.5)
       kramdown (~> 1.2)
-      middleman-core (= 3.2.0)
+      middleman-core (= 3.4.0)
       middleman-sprockets (>= 3.1.2)
-      sass (>= 3.1.20)
-      uglifier (~> 2.1.0)
-    middleman-blog (3.5.0)
+      sass (>= 3.4.0, < 4.0)
+      uglifier (~> 2.5)
+    middleman-blog (3.5.3)
       addressable (~> 2.3.5)
       middleman-core (~> 3.2)
       tzinfo (>= 0.3.0)
-    middleman-core (3.2.0)
-      activesupport (~> 3.2.6)
+    middleman-core (3.4.0)
+      activesupport (~> 4.1)
       bundler (~> 1.1)
-      i18n (~> 0.6.1)
-      listen (~> 1.1)
-      rack (>= 1.4.5)
-      rack-test (~> 0.6.1)
+      capybara (~> 2.4.4)
+      erubis
+      hooks (~> 0.3)
+      i18n (~> 0.7.0)
+      listen (~> 3.0.3)
+      padrino-helpers (~> 0.12.3)
+      rack (>= 1.4.5, < 2.0)
       thor (>= 0.15.2, < 2.0)
-      tilt (~> 1.3.6)
-    middleman-livereload (3.1.0)
-      em-websocket (>= 0.2.0)
-      middleman-core (>= 3.0.2)
-      multi_json (~> 1.0)
-      rack-livereload
-    middleman-sprockets (3.2.0)
+      tilt (~> 1.4.1, < 2.0)
+    middleman-livereload (3.4.3)
+      em-websocket (~> 0.5.1)
+      middleman-core (>= 3.3)
+      rack-livereload (~> 0.3.15)
+    middleman-sprockets (3.4.2)
+      middleman-core (>= 3.3)
+      sprockets (~> 2.12.1)
+      sprockets-helpers (~> 1.1.0)
+      sprockets-sass (~> 1.3.0)
+    middleman-syntax (2.0.0)
       middleman-core (~> 3.2)
-      sprockets (~> 2.1)
-      sprockets-helpers (~> 1.0.0)
-      sprockets-sass (~> 1.0.0)
-    middleman-syntax (1.2.1)
-      middleman-core (~> 3.0)
-      rouge (~> 0.3.0)
-    multi_json (1.8.2)
-    nokogiri (1.5.6)
-    rack (1.5.2)
-    rack-livereload (0.3.15)
+      rouge (~> 1.0)
+    mime-types (2.6.2)
+    mini_portile (0.6.2)
+    minitest (5.8.1)
+    multi_json (1.11.2)
+    nokogiri (1.6.6.2)
+      mini_portile (~> 0.6.0)
+    padrino-helpers (0.12.5)
+      i18n (~> 0.6, >= 0.6.7)
+      padrino-support (= 0.12.5)
+      tilt (~> 1.4.1)
+    padrino-support (0.12.5)
+      activesupport (>= 3.1)
+    rack (1.6.4)
+    rack-livereload (0.3.16)
       rack
-    rack-test (0.6.2)
+    rack-test (0.6.3)
       rack (>= 1.0)
-    rb-fsevent (0.9.3)
-    rb-inotify (0.9.2)
+    rb-fsevent (0.9.6)
+    rb-inotify (0.9.5)
       ffi (>= 0.5.0)
-    rb-kqueue (0.2.0)
-      ffi (>= 0.5.0)
-    redcarpet (2.2.2)
-    ref (1.0.5)
-    rouge (0.3.10)
-      thor
-    sass (3.2.12)
-    sprockets (2.10.1)
+    redcarpet (3.3.3)
+    ref (2.0.0)
+    rouge (1.10.1)
+    sass (3.4.19)
+    sprockets (2.12.4)
       hike (~> 1.2)
       multi_json (~> 1.0)
       rack (~> 1.0)
       tilt (~> 1.1, != 1.3.0)
-    sprockets-helpers (1.0.1)
+    sprockets-helpers (1.1.0)
       sprockets (~> 2.0)
-    sprockets-sass (1.0.2)
+    sprockets-sass (1.3.1)
       sprockets (~> 2.0)
       tilt (~> 1.1)
-    therubyracer (0.12.1)
+    therubyracer (0.12.2)
       libv8 (~> 3.16.14.0)
       ref
-    thor (0.18.1)
-    thread_safe (0.1.3)
-      atomic
-    tilt (1.3.7)
-    tzinfo (1.1.0)
+    thor (0.19.1)
+    thread_safe (0.3.5)
+    tilt (1.4.1)
+    tzinfo (1.2.2)
       thread_safe (~> 0.1)
-    uglifier (2.1.2)
+    uber (0.0.15)
+    uglifier (2.7.2)
       execjs (>= 0.3.0)
-      multi_json (~> 1.0, >= 1.0.2)
+      json (>= 1.8.0)
+    xpath (2.0.0)
+      nokogiri (~> 1.3)
 
 PLATFORMS
   ruby
@@ -120,11 +149,14 @@ PLATFORMS
 DEPENDENCIES
   builder
   less
-  middleman (~> 3.2.0)
+  middleman
   middleman-blog
-  middleman-livereload (~> 3.1.0)
+  middleman-livereload
   middleman-syntax
   nokogiri
   redcarpet
   therubyracer
   wdm (~> 0.1.0)
+
+BUNDLED WITH
+   1.10.6

http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/87f55b42/site/README.md
----------------------------------------------------------------------
diff --git a/site/README.md b/site/README.md
index dd2777d..eda98c9 100644
--- a/site/README.md
+++ b/site/README.md
@@ -1,6 +1,6 @@
 # Apache DataFu website
 
-We use [Middleman](http://middlemanapp.com/) to generate the website content. This requires Ruby.
+We use [Middleman](http://middlemanapp.com/) to generate the website content. This requires Ruby.  It's highly recommended that you use something like [rbenv](https://github.com/sstephenson/rbenv) to manage your Ruby versions.  The website content has been successfully generated using Ruby version `2.2.2`.
 
 ## Setup
 

http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/87f55b42/site/lib/pig.rb
----------------------------------------------------------------------
diff --git a/site/lib/pig.rb b/site/lib/pig.rb
index f6ad912..6b7cd48 100644
--- a/site/lib/pig.rb
+++ b/site/lib/pig.rb
@@ -28,8 +28,8 @@ class Pig < Rouge::RegexLexer
       ASSERT COGROUP CROSS DEFINE DISTINCT FILTER
       FOREACH GROUP IMPORT JOIN LIMIT LOAD MAPREDUCE
       ORDER BY SAMPLE SPLIT STORE STREAM UNION
-      GENERATE ALL DUMP AS REGISTER USING ASC DESC ANY 
-      FULL INNER OUTER EXEC DESCRIBE CASE EXPLAIN 
+      GENERATE ALL DUMP AS REGISTER USING ASC DESC ANY
+      FULL INNER OUTER EXEC DESCRIBE CASE EXPLAIN
       ILLUSTRATE IS INTO IF LEFT RIGHT MATCHES PARALLEL
       ROLLUP SHIP AND OR NOT
 
@@ -39,51 +39,51 @@ class Pig < Rouge::RegexLexer
   end
 
   state :root do
-    rule /\s+/m, 'Text'
-    rule /--.*?\n/, 'Comment.Single'
-    rule %r(/\*), 'Comment.Multiline', :multiline_comments
-    rule /\d+/, 'Literal.Number.Integer'
-    rule /'/, 'Literal.String.Single', :single_string
-    rule /"/, 'Name.Variable', :double_string
-    rule /`/, 'Name.Variable', :backtick
+    rule /\s+/m, Text
+    rule /--.*?\n/, Comment::Single
+    rule %r(/\*), Comment::Multiline, :multiline_comments
+    rule /\d+/, Num::Integer
+    rule /'/, Str::Single, :single_string
+    rule /"/, Name::Variable, :double_string
+    rule /`/, Name::Variable, :backtick
 
     rule /[$]?\w[\w\d]*/ do |m|
       if self.class.keywords.include? m[0].upcase
-        token 'Keyword'
+        token Keyword
       else
-        token 'Name'
+        token Name
       end
     end
 
-    rule %r([+*/<>=~!@#%^&|?^-]), 'Operator'
-    rule /[;:(){}\[\],.]/, 'Punctuation'
+    rule %r([+*/<>=~!@#%^&|?^-]), Operator
+    rule /[;:(){}\[\],.]/, Punctuation
   end
 
   state :multiline_comments do
-    rule %r(/[*]), 'Comment.Multiline', :multiline_comments
-    rule %r([*]/), 'Comment.Multiline', :pop!
-    rule %r([^/*]+), 'Comment.Multiline'
-    rule %r([/*]), 'Comment.Multiline'
+    rule %r(/[*]), Comment::Multiline, :multiline_comments
+    rule %r([*]/), Comment::Multiline, :pop!
+    rule %r([^/*]+), Comment::Multiline
+    rule %r([/*]), Comment::Multiline
   end
 
   state :backtick do
-    rule /\\./, 'Literal.String.Escape'
-    rule /``/, 'Literal.String.Escape'
-    rule /`/, 'Name.Variable', :pop!
-    rule /[^\\`]+/, 'Name.Variable'
+    rule /\\./, Str::Escape
+    rule /``/, Str::Escape
+    rule /`/, Name::Variable, :pop!
+    rule /[^\\`]+/, Name::Variable
   end
 
   state :single_string do
-    rule /\\./, 'Literal.String.Escape'
-    rule /''/, 'Literal.String.Escape'
-    rule /'/, 'Literal.String.Single', :pop!
-    rule /[^\\']+/, 'Literal.String.Single'
+    rule /\\./, Str::Escape
+    rule /''/, Str::Escape
+    rule /'/, Str::Single, :pop!
+    rule /[^\\']+/, Str::Single
   end
 
   state :double_string do
-    rule /\\./, 'Literal.String.Escape'
-    rule /""/, 'Literal.String.Escape'
-    rule /"/, 'Name.Variable', :pop!
-    rule /[^\\"]+/, 'Name.Variable'
+    rule /\\./, Str::Escape
+    rule /""/, Str::Escape
+    rule /"/, Name::Variable, :pop!
+    rule /[^\\"]+/, Name::Variable
   end
 end

http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/87f55b42/site/source/blog/2013-01-24-datafu-the-wd-40-of-big-data.markdown
----------------------------------------------------------------------
diff --git a/site/source/blog/2013-01-24-datafu-the-wd-40-of-big-data.markdown b/site/source/blog/2013-01-24-datafu-the-wd-40-of-big-data.markdown
index dec65cd..0ebd80b 100644
--- a/site/source/blog/2013-01-24-datafu-the-wd-40-of-big-data.markdown
+++ b/site/source/blog/2013-01-24-datafu-the-wd-40-of-big-data.markdown
@@ -75,34 +75,34 @@ Using DataFu we can assign session IDs to each of these events and group by sess
 REGISTER piggybank.jar;
 REGISTER datafu-0.0.6.jar;
 REGISTER guava-13.0.1.jar; -- needed by StreamingQuantile
- 
+
 DEFINE UnixToISO   org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO();
 DEFINE Sessionize  datafu.pig.sessions.Sessionize('10m');
 DEFINE Median      datafu.pig.stats.Median();
 DEFINE Quantile    datafu.pig.stats.StreamingQuantile('0.75','0.90','0.95');
 DEFINE VAR         datafu.pig.stats.VAR();
- 
+
 pv = LOAD 'clicks.csv' USING PigStorage(',') AS (memberId:int, time:long, url:chararray);
- 
+
 pv = FOREACH pv
      -- Sessionize expects an ISO string
      GENERATE UnixToISO(time) as isoTime,
               time,
               memberId;
- 
+
 pv_sessionized = FOREACH (GROUP pv BY memberId) {
   ordered = ORDER pv BY isoTime;
   GENERATE FLATTEN(Sessionize(ordered)) AS (isoTime, time, memberId, sessionId);
 };
- 
+
 pv_sessionized = FOREACH pv_sessionized GENERATE sessionId, time;
- 
+
 -- compute length of each session in minutes
 session_times = FOREACH (GROUP pv_sessionized BY sessionId)
                 GENERATE group as sessionId,
                          (MAX(pv_sessionized.time)-MIN(pv_sessionized.time))
                             / 1000.0 / 60.0 as session_length;
- 
+
 -- compute stats on session length
 session_stats = FOREACH (GROUP session_times ALL) {
   ordered = ORDER session_times BY session_length;
@@ -112,9 +112,9 @@ session_stats = FOREACH (GROUP session_times ALL) {
     Median(ordered.session_length) as median_session,
     Quantile(ordered.session_length) as quantiles_session;
 };
- 
+
 DUMP session_stats
 --(15.737532575757575,31.29552045993877,(2.848041666666667),(14.648516666666666,31.88788333333333,86.69525))
 ```
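Outside of Pig, the 10-minute boundary rule that `Sessionize` applies can be sketched in plain Java. This is a simplified illustration of the idea only, not DataFu's actual implementation; the class and method names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Illustrative sketch: assign a new session ID whenever the gap between
// consecutive events (sorted by time) exceeds the timeout, here 10 minutes.
public class SessionizeSketch {
    static final long TIMEOUT_MS = 10 * 60 * 1000L;

    // Input: one member's event timestamps in ascending order (epoch millis).
    // Output: a session ID per event, parallel to the input list.
    public static List<String> sessionize(List<Long> sortedTimes) {
        List<String> sessionIds = new ArrayList<>();
        String current = null;
        long lastTime = Long.MIN_VALUE;
        for (long t : sortedTimes) {
            if (current == null || t - lastTime > TIMEOUT_MS) {
                current = UUID.randomUUID().toString(); // gap too large: new session
            }
            lastTime = t;
            sessionIds.add(current);
        }
        return sessionIds;
    }

    public static void main(String[] args) {
        // Three events within 10 minutes of each other, then one 30 minutes later:
        List<String> ids = sessionize(List.of(0L, 60_000L, 120_000L, 1_920_000L));
        System.out.println(ids.get(0).equals(ids.get(2))); // true: same session
        System.out.println(ids.get(2).equals(ids.get(3))); // false: new session
    }
}
```

This mirrors why the Pig script sorts each member's events by `isoTime` before calling `Sessionize`: the gap test only makes sense over time-ordered events.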
 
-This is just a taste. There’s plenty more in the library for you to peruse. Take a look [here](http://data.linkedin.com/opensource/datafu). DataFu is freely available under the Apache 2 license. We welcome contributions, so please send us your pull requests!
\ No newline at end of file
+This is just a taste. There’s plenty more in the library for you to peruse. Take a look [here](/docs/datafu/guide.html). DataFu is freely available under the Apache 2 license. We welcome contributions, so please send us your pull requests!
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/87f55b42/site/source/blog/2013-09-04-datafu-1-0.markdown
----------------------------------------------------------------------
diff --git a/site/source/blog/2013-09-04-datafu-1-0.markdown b/site/source/blog/2013-09-04-datafu-1-0.markdown
index fa64e35..4ca6de4 100644
--- a/site/source/blog/2013-09-04-datafu-1-0.markdown
+++ b/site/source/blog/2013-09-04-datafu-1-0.markdown
@@ -18,11 +18,13 @@ license: >
    limitations under the License.
 ---
 
-[DataFu](http://data.linkedin.com/opensource/datafu) is an open-source collection of user-defined functions for working with large-scale data in [Hadoop](http://hadoop.apache.org/) and [Pig](http://pig.apache.org/).
+_Update (10/15/2015): The links in this blog post have been updated to point to the correct locations within the Apache DataFu website._
+
+[DataFu](/) is an open-source collection of user-defined functions for working with large-scale data in [Hadoop](http://hadoop.apache.org/) and [Pig](http://pig.apache.org/).
 
 About two years ago, we recognized a need for a stable, well-tested library of Pig UDFs that could assist in common data mining and statistics tasks. Over the years, we had developed several routines that were used across LinkedIn and were thrown together into an internal package we affectionately called “littlepiggy.” The unfortunate part, and this is true of many such efforts, is that the UDFs were ill-documented, ill-organized, and easily got broken when someone made a change. Along came [PigUnit](http://pig.apache.org/docs/r0.11.1/test.html#pigunit), which allowed UDF testing, so we spent the time to clean up these routines by adding documentation and rigorous unit tests. From this “datafoo” package, we thought this would help the community at large, and there you have the initial release of DataFu.
 
-Since then, the project has continued to evolve. We have accepted contributions from a number of sources, improved the style and quality of testing, and adapted to the changing features and versions of Pig. During this time DataFu has been used extensively at LinkedIn for many of our data driven products like "People You May Known" and "Skills and Endorsements." The library is used at numerous companies, and it has also been included in Cloudera's Hadoop distribution ([CDH](http://www.cloudera.com/content/cloudera/en/products/cdh.html)) as well as the [Apache BigTop](http://bigtop.apache.org/) project. DataFu has matured, and we are proud to announce the [1.0 release](https://github.com/linkedin/datafu/blob/master/changes.md).
+Since then, the project has continued to evolve. We have accepted contributions from a number of sources, improved the style and quality of testing, and adapted to the changing features and versions of Pig. During this time DataFu has been used extensively at LinkedIn for many of our data-driven products like "People You May Know" and "Skills and Endorsements." The library is used at numerous companies, and it has also been included in Cloudera's Hadoop distribution ([CDH](http://www.cloudera.com/content/cloudera/en/products/cdh.html)) as well as the [Apache BigTop](http://bigtop.apache.org/) project. DataFu has matured, and we are proud to announce the [1.0 release](/docs/datafu/1.0.0/).
 
 This release of DataFu has a number of new features that can make writing Pig easier, cleaner, and more efficient. In this post, we are going to highlight some of these new features by walking through a large number of examples. Think of this as a HowTo Pig + DataFu guide.
 
@@ -59,23 +61,23 @@ accepts_counted = FOREACH (GROUP accepts BY (user_id, item_id)) GENERATE
   FLATTEN(group) as (user_id, item_id), COUNT_STAR(accepts) as count;
 rejects_counted = FOREACH (GROUP rejects BY (user_id, item_id)) GENERATE
   FLATTEN(group) as (user_id, item_id), COUNT_STAR(rejects) as count;
- 
-joined_accepts = JOIN impressions_counted BY (user_id, item_id) LEFT OUTER, accepts_counted BY (user_id, item_id);  
-joined_accepts = FOREACH joined_accepts GENERATE 
+
+joined_accepts = JOIN impressions_counted BY (user_id, item_id) LEFT OUTER, accepts_counted BY (user_id, item_id);
+joined_accepts = FOREACH joined_accepts GENERATE
   impressions_counted::user_id as user_id,
   impressions_counted::item_id as item_id,
   impressions_counted::count as impression_count,
   ((accepts_counted::count is null)?0:accepts_counted::count) as accept_count;
- 
+
 joined_accepts_rejects = JOIN joined_accepts BY (user_id, item_id) LEFT OUTER, rejects_counted BY (user_id, item_id);
-joined_accepts_rejects = FOREACH joined_accepts_rejects GENERATE 
+joined_accepts_rejects = FOREACH joined_accepts_rejects GENERATE
   joined_accepts::user_id as user_id,
   joined_accepts::item_id as item_id,
   joined_accepts::impression_count as impression_count,
   joined_accepts::accept_count as accept_count,
   ((rejects_counted::count is null)?0:rejects_counted::count) as reject_count;
- 
-features = FOREACH (GROUP joined_accepts_rejects BY user_id) GENERATE 
+
+features = FOREACH (GROUP joined_accepts_rejects BY user_id) GENERATE
   group as user_id, joined_accepts_rejects.(item_id, impression_count, accept_count, reject_count) as items;
 ```
 
@@ -87,13 +89,13 @@ Recognizing that we can combine the outer joins and group operations into a sing
 
 ```pig
 features_grouped = COGROUP impressions BY (user_id, item_id), accepts BY (user_id, item_id), rejects BY (user_id, item_id);
-features_counted = FOREACH features_grouped GENERATE 
+features_counted = FOREACH features_grouped GENERATE
   FLATTEN(group) as (user_id, item_id),
   COUNT_STAR(impressions) as impression_count,
   COUNT_STAR(accepts) as accept_count,
   COUNT_STAR(rejects) as reject_count;
- 
-features = FOREACH (GROUP features_counted BY user_id) GENERATE 
+
+features = FOREACH (GROUP features_counted BY user_id) GENERATE
   group as user_id,
   features_counted.(item_id, impression_count, accept_count, reject_count) as items;
 ```
@@ -110,15 +112,15 @@ One thing that we have noticed is that even very big data will frequently get re
 DEFINE CountEach datafu.pig.bags.CountEach('flatten');
 DEFINE BagLeftOuterJoin datafu.pig.bags.BagLeftOuterJoin();
 DEFINE Coalesce datafu.pig.util.Coalesce();
- 
+
 features_grouped = COGROUP impressions BY user_id, accepts BY user_id, rejects BY user_id;
- 
-features_counted = FOREACH features_grouped GENERATE 
+
+features_counted = FOREACH features_grouped GENERATE
   group as user_id,
   CountEach(impressions.item_id) as impressions,
   CountEach(accepts.item_id) as accepts,
   CountEach(rejects.item_id) as rejects;
- 
+
 features_joined = FOREACH features_counted GENERATE
   user_id,
   BagLeftOuterJoin(
@@ -126,7 +128,7 @@ features_joined = FOREACH features_counted GENERATE
     accepts, 'item_id',
     rejects, 'item_id'
   ) as items;
- 
+
 features = FOREACH features_joined {
   projected = FOREACH items GENERATE
     impressions::item_id as item_id,
@@ -156,7 +158,7 @@ Next we count the occurences of each item in the impression, accept and reject b
 ```pig
 DEFINE CountEach datafu.pig.bags.CountEach('flatten');
 
-features_counted = FOREACH features_grouped GENERATE 
+features_counted = FOREACH features_grouped GENERATE
     group as user_id,
     CountEach(impressions.item_id) as impressions,
     CountEach(accepts.item_id) as accepts,
@@ -277,33 +279,33 @@ public class MortgagePayment extends AliasableEvalFunc<DataBag> {
       Schema tupleSchema = new Schema();
       tupleSchema.add(new Schema.FieldSchema("monthly_payment", DataType.DOUBLE));
       Schema bagSchema;
-    
+
       bagSchema = new Schema(new Schema.FieldSchema(this.getClass().getName().toLowerCase(), tupleSchema, DataType.BAG));
       return bagSchema;
     } catch (FrontendException e) {
       throw new RuntimeException(e);
     }
   }
- 
+
   @Override
   public DataBag exec(Tuple input) throws IOException  {
     DataBag output = BagFactory.getInstance().newDefaultBag();
-    
+
     // get a value from the input tuple by alias
     Double principal = getDouble(input, "principal");
     Integer numPayments = getInteger(input, "num_payments");
     DataBag interestRates = getBag(input, "interest_rates");
-    
+
     for (Tuple interestTuple : interestRates) {
       // get a value from the inner bag tuple by alias
       Double interest = getDouble(interestTuple, getPrefixedAliasName("interest_rates", "interest_rate"));
       double monthlyPayment = computeMonthlyPayment(principal, numPayments, interest);
       output.add(TupleFactory.getInstance().newTuple(monthlyPayment));
     }
-    
+
     return output;
   }
- 
+
   private double computeMonthlyPayment(Double principal, Integer numPayments, Double interest) {
     return principal * (interest * Math.pow(interest+1, numPayments)) / (Math.pow(interest+1, numPayments) - 1.0);
   }
@@ -322,9 +324,9 @@ The model for a linear regression is pretty simple; it's just a mapping of field
 
 ```pig
 DEFINE LinearRegression datafu.test.blog.LinearRegression('intercept:1,impression_count:-0.1,accept_count:2.0,reject_count:-1.0');
- 
+
 features = LOAD 'test/pig/datafu/test/blog/features.dat' AS (user_id:int, items:bag{(item_id:int,impression_count:int,accept_count:int,reject_count:int)});
- 
+
 recommendations = FOREACH features {
   scored_items = FOREACH items GENERATE item_id, LinearRegression(*) as score;
   GENERATE user_id, scored_items as items;
@@ -339,20 +341,20 @@ Now, the hard work, writing the UDF:
 public class LinearRegression extends AliasableEvalFunc<Double>
 {
   Map<String, Double> parameters;
-  
+
   public LinearRegression(String parameterString) {
     parameters = new HashMap<String, Double>();
     for (String token : parameterString.split(",")) {
       String[] keyValue = token.split(":");
       parameters.put(keyValue[0].trim(), Double.parseDouble(keyValue[1].trim()));
-    }     
+    }
   }
- 
+
   @Override
   public Schema getOutputSchema(Schema input) {
     return new Schema(new Schema.FieldSchema("score", DataType.DOUBLE));
   }
- 
+
   @Override
   public Double exec(Tuple input) throws IOException {
     double score = 0.0;
@@ -408,17 +410,17 @@ The staright-foward solution for this task will be group the tracking data for e
 
 ```pig
 grouped = COGROUP impressions BY (user_id, item_id), accepts BY (user_id, item_id), rejects BY (user_id, item_id) features BY (user_id, item_id);
-full_result = FOREACH grouped GENREATE 
+full_result = FOREACH grouped GENERATE
   FLATTEN(group) AS user_id, item_id,
   (impressions::timestamp is null)?1:0 AS is_impressed,
   (accepts::timestamp is null)?1:0 AS is_accepted,
   (rejects::timestamp is null)?1:0 AS is_rejected,
   Coalesce(features::feature_1, 0) AS feature_1,
   Coalesce(features::feature_2, 0) AS feature_2;
- 
+
 grouped_full_result = GROUP full_result BY user_id;
 sampled = SAMPLE grouped_full_result BY group 0.01;
-result = FOREACH sampled GENERATE 
+result = FOREACH sampled GENERATE
   group AS user_id,
   FLATTEN(full_result);
 ```
@@ -431,14 +433,14 @@ Yep.
 
 ```pig
 DEFINE SampleByKey datafu.pig.sampling.SampleByKey('whatever_the_salt_you_want_to_use','0.01');
- 
+
 impressions = FILTER impressions BY SampleByKey('user_id');
 accepts = FILTER impressions BY SampleByKey('user_id');
 rejects = FILTER rejects BY SampleByKey('user_id');
 features = FILTER features BY SampleByKey('user_id');
- 
+
 grouped = COGROUP impressions BY (user_id, item_id), accepts BY (user_id, item_id), rejects BY (user_id, item_id), features BY (user_id, item_id);
-result = FOREACH grouped GENREATE 
+result = FOREACH grouped GENERATE
   FLATTEN(group) AS (user_id, item_id),
   (impressions::timestamp is null)?1:0 AS is_impressed,
   (accepts::timestamp is null)?1:0 AS is_accepted,
@@ -501,14 +503,14 @@ One case where conditional logic can be painful is filtering based on a set of v
 
 ```pig
 data = LOAD 'input' using PigStorage(',') AS (what:chararray, adj:chararray);
-  
+
 dump data;
 -- (roses,red)
 -- (violets,blue)
 -- (sugar,sweet)
-  
+
 data2 = FILTER data BY adj == 'red' OR adj == 'blue';
-  
+
 dump data2;
 -- (roses,red)
 -- (violets,blue)
@@ -518,16 +520,16 @@ However as the number of items to check for grows this becomes very verbose. The
 
 ```pig
 DEFINE In datafu.pig.util.In();
- 
+
 data = LOAD 'input' using PigStorage(',') AS (what:chararray, adj:chararray);
-  
+
 dump data;
 -- (roses,red)
 -- (violets,blue)
 -- (sugar,sweet)
-  
+
 data2 = FILTER data BY In(adj, 'red','blue');
-  
+
 dump data2;
 -- (roses,red)
 -- (violets,blue)
@@ -541,10 +543,10 @@ Pig's `JOIN` operator supports performing left outer joins on two relations only
 input1 = LOAD 'input1' using PigStorage(',') AS (val1:INT,val2:INT);
 input2 = LOAD 'input2' using PigStorage(',') AS (val1:INT,val2:INT);
 input3 = LOAD 'input3' using PigStorage(',') AS (val1:INT,val2:INT);
-  
+
 data1 = JOIN input1 BY val1 LEFT, input2 BY val1;
 data1 = FILTER data1 BY input1::val1 IS NOT NULL;
-  
+
 data2 = JOIN data1 BY input1::val1 LEFT, input3 BY val1;
 data2 = FILTER data2 BY input1::val1 IS NOT NULL;
 ```
@@ -555,13 +557,13 @@ However this can be inefficient as it requires multiple MapReduce jobs. For many
 input1 = LOAD 'input1' using PigStorage(',') AS (val1:INT,val2:INT);
 input2 = LOAD 'input2' using PigStorage(',') AS (val1:INT,val2:INT);
 input3 = LOAD 'input3' using PigStorage(',') AS (val1:INT,val2:INT);
-  
+
 data1 = COGROUP input1 BY val1, input2 BY val1, input3 BY val1;
 data2 = FOREACH data1 GENERATE
   FLATTEN(input1), -- left join on this
-  FLATTEN((IsEmpty(input2) ? TOBAG(TOTUPLE((int)null,(int)null)) : input2)) 
+  FLATTEN((IsEmpty(input2) ? TOBAG(TOTUPLE((int)null,(int)null)) : input2))
       as (input2::val1,input2::val2),
-  FLATTEN((IsEmpty(input3) ? TOBAG(TOTUPLE((int)null,(int)null)) : input3)) 
+  FLATTEN((IsEmpty(input3) ? TOBAG(TOTUPLE((int)null,(int)null)) : input3))
       as (input3::val1,input3::val2);
 ```
 
@@ -571,11 +573,11 @@ To clean up this code we have created `EmptyBagToNullFields`, which replicates t
 
 ```pig
 DEFINE EmptyBagToNullFields datafu.pig.bags.EmptyBagToNullFields();
- 
+
 input1 = LOAD 'input1' using PigStorage(',') AS (val1:INT,val2:INT);
 input2 = LOAD 'input2' using PigStorage(',') AS (val1:INT,val2:INT);
 input3 = LOAD 'input3' using PigStorage(',') AS (val1:INT,val2:INT);
-  
+
 data1 = COGROUP input1 BY val1, input2 BY val1, input3 BY val1;
 data2 = FOREACH data1 GENERATE
   FLATTEN(input1),
@@ -592,9 +594,9 @@ Ok, a second encore, but no more. If you are doing a lot of these, you can turn
 ```pig
 DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3) returns joined {
   cogrouped = COGROUP $relation1 BY $key1, $relation2 BY $key2, $relation3 BY $key3;
-  $joined = FOREACH cogrouped GENERATE 
-    FLATTEN($relation1), 
-    FLATTEN(EmptyBagToNullFields($relation2)), 
+  $joined = FOREACH cogrouped GENERATE
+    FLATTEN($relation1),
+    FLATTEN(EmptyBagToNullFields($relation2)),
     FLATTEN(EmptyBagToNullFields($relation3));
 }
 ```
@@ -607,6 +609,6 @@ features = left_outer_join(input1, val1, input2, val2, input3, val3);
 
 ## Wrap-up
 
-So, that's a lot to digest, but it's just a highlight into a few interesting pieces of DataFu. Check out the [DataFu 1.0 release](http://data.linkedin.com/opensource/datafu) as there's even more in store.
+So, that's a lot to digest, but it's just a highlight into a few interesting pieces of DataFu. Check out the [DataFu 1.0 release](/docs/datafu/1.0.0/) as there's even more in store.
 
 We hope that it proves valuable to you and as always welcome any contributions. Please let us know how you're using the library — we would love to hear from you.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/87f55b42/site/source/blog/2013-10-03-datafus-hourglass-incremental-data-processing-in-hadoop.markdown
----------------------------------------------------------------------
diff --git a/site/source/blog/2013-10-03-datafus-hourglass-incremental-data-processing-in-hadoop.markdown b/site/source/blog/2013-10-03-datafus-hourglass-incremental-data-processing-in-hadoop.markdown
index 9482da8..cd9c73e 100644
--- a/site/source/blog/2013-10-03-datafus-hourglass-incremental-data-processing-in-hadoop.markdown
+++ b/site/source/blog/2013-10-03-datafus-hourglass-incremental-data-processing-in-hadoop.markdown
@@ -18,6 +18,8 @@ license: >
    limitations under the License.
 ---
 
+_Update (10/15/2015): The links in this blog post have been updated to point to the correct locations within the Apache DataFu website._
+
 For a large scale site such as LinkedIn, tracking metrics accurately and efficiently is an important task. For example, imagine we need a dashboard that shows the number of visitors to every page on the site over the last thirty days. To keep this dashboard up to date, we can schedule a query that runs daily and gathers the stats for the last 30 days. However, this simple implementation would be wasteful: only one day of data has changed, but we'd be consuming and recalculating the stats for all 30.
 
 A more efficient solution is to make the query incremental: using basic arithmetic, we can update the output from the previous day by adding and subtracting input data. This enables the job to process only the new data, significantly reducing the computational resources required. Unfortunately, although there are many benefits to the incremental approach, getting incremental jobs right is hard:
@@ -26,7 +28,7 @@ A more efficient solution is to make the query incremental: using basic arithmet
 * If the previous output is reused, then the job needs to be written to consume not just new input data, but also previous outputs.
 * There are more things that can go wrong with an incremental job, so you typically need to spend more time writing automated tests to make sure things are working.
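The add-and-subtract arithmetic can be made concrete. The following is a minimal plain-Java illustration of the idea (not Hourglass code; the names are hypothetical): update a 30-day per-page total by consuming only the newest day and retiring the oldest.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of an incremental sliding-window update:
// newTotal = previousTotal + newestDay - oldestDay, per key.
public class IncrementalWindow {
    public static Map<String, Long> update(Map<String, Long> previousTotals,
                                           Map<String, Long> oldestDay,
                                           Map<String, Long> newestDay) {
        Map<String, Long> totals = new HashMap<>(previousTotals);
        oldestDay.forEach((page, count) ->
            totals.merge(page, -count, Long::sum)); // retire data leaving the window
        newestDay.forEach((page, count) ->
            totals.merge(page, count, Long::sum));  // absorb the new day
        totals.values().removeIf(v -> v == 0);      // drop pages with no remaining views
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Long> prev   = Map.of("/home", 100L, "/jobs", 40L);
        Map<String, Long> oldest = Map.of("/home", 10L, "/jobs", 40L);
        Map<String, Long> newest = Map.of("/home", 25L);
        System.out.println(update(prev, oldest, newest)); // {/home=115}
    }
}
```

Only two days of input are read, instead of thirty; the previous output supplies the rest.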
 
-To solve these problems, we are happy to announce that we have open sourced [Hourglass](https://github.com/linkedin/datafu/tree/master/contrib/hourglass), a framework that makes it much easier to write incremental Hadoop jobs. We are releasing Hourglass under the Apache 2.0 License as part of the [DataFu](https://github.com/linkedin/datafu) project. We will be presenting our "Hourglass: a Library for Incremental Processing on Hadoop" paper at the [IEEE BigData 2013](http://cci.drexel.edu/bigdata/bigdata2013/index.htm) conference on October 9th.
+To solve these problems, we are happy to announce that we have open sourced [Hourglass](/docs/hourglass/getting-started.html), a framework that makes it much easier to write incremental Hadoop jobs. We are releasing Hourglass under the Apache 2.0 License as part of the [DataFu](/) project. We will be presenting our "Hourglass: a Library for Incremental Processing on Hadoop" paper at the [IEEE BigData 2013](http://cci.drexel.edu/bigdata/bigdata2013/index.htm) conference on October 9th.
 
 In this post, we will give an overview of the basic concepts behind Hourglass and walk through examples of using the framework to solve processing tasks incrementally. The first example presents a job that counts how many times a member has logged in to a site. The second example presents a job that estimates the number of members who have visited in the past thirty days. Lastly, we will show you how to get the code and start writing your own incremental hadoop jobs.
 
@@ -83,7 +85,7 @@ Hourglass uses [Avro](http://avro.apache.org/) for all of the input and output d
 With the basic concepts out of the way, let's look at an example. Suppose that we have a website that tracks user logins as an event, and for each event, the member ID is recorded. These events are collected and stored in HDFS in Avro under paths with the format `/data/event/yyyy/MM/dd`. Suppose for this example our Avro schema is:
 
     {
-      "type" : "record", "name" : "ExampleEvent", 
+      "type" : "record", "name" : "ExampleEvent",
       "namespace" : "datafu.hourglass.test",
       "fields" : [ {
         "name" : "id",
@@ -98,9 +100,9 @@ To continue our example, let's say there are two days of data currently availabl
 
     2013/03/15:
     {"id": 1}, {"id": 1}, {"id": 1}, {"id": 2}, {"id": 3}, {"id": 3}
-     
+
     2013/03/16:
-    {"id": 1}, {"id": 1}, {"id": 2}, {"id": 2}, {"id": 3}, 
+    {"id": 1}, {"id": 1}, {"id": 2}, {"id": 2}, {"id": 3},
 
 Let's aggregate the counts by member ID using Hourglass. To perform the aggregation we will use [PartitionCollapsingIncrementalJob](/docs/hourglass/0.1.3/datafu/hourglass/jobs/PartitionCollapsingIncrementalJob.html), which takes a partitioned data set and collapses all the partitions together into a single output. The goal is to aggregate the two days of input and produce a single day of output, as in the following diagram:
 
@@ -109,7 +111,7 @@ Let's aggregate the counts by member ID using Hourglass. To perform the aggregat
 First, create the job:
 
 ```java
-PartitionCollapsingIncrementalJob job = 
+PartitionCollapsingIncrementalJob job =
     new PartitionCollapsingIncrementalJob(Example.class);
 ```
 
@@ -117,22 +119,22 @@ Next, we will define schemas for the key and value used by the job. The key affe
 
 ```java
 final String namespace = "com.example";
- 
-final Schema keySchema = 
+
+final Schema keySchema =
   Schema.createRecord("Key",null,namespace,false);
- 
+
 keySchema.setFields(Arrays.asList(
   new Field("member_id",Schema.create(Type.LONG),null,null)));
- 
+
 final String keySchemaString = keySchema.toString(true);
- 
-final Schema valueSchema = 
+
+final Schema valueSchema =
   Schema.createRecord("Value",null,namespace,false);
- 
+
 valueSchema.setFields(Arrays.asList(
   new Field("count",Schema.create(Type.INT),null,null)));
 final String valueSchemaString = valueSchema.toString(true);
 ```
-
+
 
 This produces the following representation:
@@ -144,7 +146,7 @@ This produces the following representation:
         "type" : "long"
       } ]
     }
-     
+
     {
       "type" : "record", "name" : "Value", "namespace" : "com.example",
       "fields" : [ {
@@ -176,58 +178,58 @@ job.setMapper(new Mapper<GenericRecord,GenericRecord,GenericRecord>()
 {
   private transient Schema kSchema;
   private transient Schema vSchema;
-  
+
   @Override
   public void map(
     GenericRecord input,
-    KeyValueCollector<GenericRecord, GenericRecord> collector) 
-  throws IOException, InterruptedException 
+    KeyValueCollector<GenericRecord, GenericRecord> collector)
+  throws IOException, InterruptedException
   {
-    if (kSchema == null) 
+    if (kSchema == null)
       kSchema = new Schema.Parser().parse(keySchemaString);
- 
-    if (vSchema == null) 
+
+    if (vSchema == null)
       vSchema = new Schema.Parser().parse(valueSchemaString);
- 
+
     GenericRecord key = new GenericData.Record(kSchema);
     key.put("member_id", input.get("id"));
- 
+
     GenericRecord value = new GenericData.Record(vSchema);
     value.put("count", 1);
- 
+
     collector.collect(key,value);
-  }      
+  }
 });
 ```
 
 An accumulator is responsible for aggregating this data. Records will be grouped by member ID and then passed to the accumulator one-by-one. The accumulator keeps a running total and adds each input count to it. When all data has been passed to it, the `getFinal()` method will be called, which returns the output record containing the count.
 
 ```java
-job.setReducerAccumulator(new Accumulator<GenericRecord,GenericRecord>() 
+job.setReducerAccumulator(new Accumulator<GenericRecord,GenericRecord>()
 {
   private transient int count;
   private transient Schema vSchema;
-  
+
   @Override
   public void accumulate(GenericRecord value) {
     this.count += (Integer)value.get("count");
   }
- 
+
   @Override
   public GenericRecord getFinal() {
-    if (vSchema == null) 
+    if (vSchema == null)
       vSchema = new Schema.Parser().parse(valueSchemaString);
- 
+
     GenericRecord output = new GenericData.Record(vSchema);
     output.put("count", count);
- 
+
     return output;
   }
- 
+
   @Override
   public void cleanup() {
     this.count = 0;
-  }      
+  }
 });
 ```
 
@@ -289,43 +291,43 @@ HyperLogLog is a good fit for this use case. For this example, we will use [Hype
 Let's start by defining the mapper. The key it uses is just a dummy value, as we are only producing a single statistic in this case. For the value we use a record with two fields: one is the count estimate; the other we'll just call "data", which can be either a single member ID or the bytes from the serialized estimator. For the map output we use the member ID.
 
 ```java
-Mapper<GenericRecord,GenericRecord,GenericRecord> mapper = 
+Mapper<GenericRecord,GenericRecord,GenericRecord> mapper =
   new Mapper<GenericRecord,GenericRecord,GenericRecord>() {
     private transient Schema kSchema;
     private transient Schema vSchema;
-  
+
     @Override
     public void map(
       GenericRecord input,
-      KeyValueCollector<GenericRecord, GenericRecord> collector) 
+      KeyValueCollector<GenericRecord, GenericRecord> collector)
     throws IOException, InterruptedException
     {
-      if (kSchema == null) 
+      if (kSchema == null)
         kSchema = new Schema.Parser().parse(keySchemaString);
-      
-      if (vSchema == null) 
+
+      if (vSchema == null)
         vSchema = new Schema.Parser().parse(valueSchemaString);
-      
+
       GenericRecord key = new GenericData.Record(kSchema);
       key.put("name", "member_count");
-      
+
       GenericRecord value = new GenericData.Record(vSchema);
       value.put("data",input.get("id")); // member id
       value.put("count", 1L);            // just a single member
-      
-      collector.collect(key,value);        
-    }      
+
+      collector.collect(key,value);
+    }
   };
 ```
 
 Next, we'll define the accumulator, which can be used for both the combiner and the reducer. This accumulator can handle either member IDs or estimator bytes. When it receives a member ID it adds it to the HyperLogLog estimator. When it receives an estimator it merges it with the current estimator to produce a new one. To produce the final result, it gets the current estimate and also serializes the current estimator as a sequence of bytes.
 
 ```java
-Accumulator<GenericRecord,GenericRecord> accumulator = 
+Accumulator<GenericRecord,GenericRecord> accumulator =
   new Accumulator<GenericRecord,GenericRecord>() {
     private transient HyperLogLogPlus estimator;
     private transient Schema vSchema;
-  
+
     @Override
     public void accumulate(GenericRecord value)
     {
@@ -341,10 +343,10 @@ Accumulator<GenericRecord,GenericRecord> accumulator =
         HyperLogLogPlus newEstimator;
         try
         {
-          newEstimator = 
+          newEstimator =
             HyperLogLogPlus.Builder.build(bytes.array());
- 
-          estimator = 
+
+          estimator =
             (HyperLogLogPlus)estimator.merge(newEstimator);
         }
         catch (IOException e)
@@ -354,21 +356,21 @@ Accumulator<GenericRecord,GenericRecord> accumulator =
         catch (CardinalityMergeException e)
         {
           throw new RuntimeException(e);
-        }      
+        }
       }
     }
- 
+
     @Override
     public GenericRecord getFinal()
     {
-      if (vSchema == null) 
+      if (vSchema == null)
         vSchema = new Schema.Parser().parse(valueSchemaString);
-      
+
       GenericRecord output = new GenericData.Record(vSchema);
-      
+
       try
       {
-        ByteBuffer bytes = 
+        ByteBuffer bytes =
           ByteBuffer.wrap(estimator.getBytes());
         output.put("data", bytes);
         output.put("count", estimator.cardinality());
@@ -379,28 +381,30 @@ Accumulator<GenericRecord,GenericRecord> accumulator =
       }
       return output;
     }
- 
+
     @Override
     public void cleanup()
     {
       estimator = null;
-    }      
+    }
   };
 ```
 
 So there you have it. With the mapper and accumulator now defined, it is just a matter of passing them to the jobs and providing some other configuration. The key piece is to ensure that the second job uses a 30 day sliding window:
 
 ```java
-PartitionCollapsingIncrementalJob job2 = 
-  new PartitionCollapsingIncrementalJob(Example.class);    
- 
+PartitionCollapsingIncrementalJob job2 =
+  new PartitionCollapsingIncrementalJob(Example.class);
+
 // ...
- 
+
 job2.setNumDays(30); // 30 day sliding window
 ```
 
 ## Try it yourself!
 
+_Update (10/15/2015): Please see the updated version of these instructions at [Getting Started](/docs/hourglass/getting-started.html), which have changed significantly.  The instructions below will not work with the current code base, which has moved to Apache._
+
 Here is how you can start using Hourglass. We'll test out the job from the first example against some test data we'll create in Hadoop. First, clone the DataFu repository and navigate to the Hourglass directory:
 
     git clone https://github.com/linkedin/datafu.git
@@ -456,4 +460,4 @@ If you're interested in the project, we also encourage you to try running the un
 
 ## Conclusion
 
-We hope this whets your appetite for incremental data processing with DataFu's Hourglass. The [code](https://github.com/linkedin/datafu/tree/master/contrib/hourglass) is available on Github in the [DataFu](https://github.com/linkedin/datafu) repository under an Apache 2.0 license. Documentation is available [here](/docs/hourglass/javadoc.html). We are accepting contributions, so if you are interesting in helping out, please fork the code and send us your pull requests!
\ No newline at end of file
+We hope this whets your appetite for incremental data processing with DataFu's Hourglass. The [code](https://github.com/apache/incubator-datafu/tree/master/datafu-hourglass) is available on GitHub in the [DataFu](https://github.com/apache/incubator-datafu) repository under an Apache 2.0 license. Documentation is available [here](/docs/hourglass/javadoc.html). We are accepting contributions, so if you are interested in helping out, please fork the code and send us your pull requests!
\ No newline at end of file

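The partition-collapsing flow walked through in the blog post above can be sketched in a few lines of plain Python. This is a toy in-memory model of the dataflow only, not the Hourglass API; the function and variable names here are invented for illustration:

```python
# Toy sketch of partition collapsing: per-day partitions are mapped to
# (key, value) pairs, then an accumulator folds all partitions into a
# single output, mirroring PartitionCollapsingIncrementalJob's dataflow.
from collections import defaultdict

def collapse_partitions(partitions):
    """Collapse day-partitioned login events into counts per member id."""
    totals = defaultdict(int)          # accumulator state, keyed by member id
    for day_events in partitions:      # each partition is one day's records
        for event in day_events:       # map step: each event contributes 1
            totals[event["id"]] += 1   # reduce step: running total per key
    return dict(totals)

# The two days of example data from the post:
day1 = [{"id": 1}, {"id": 1}, {"id": 1}, {"id": 2}, {"id": 3}, {"id": 3}]
day2 = [{"id": 1}, {"id": 1}, {"id": 2}, {"id": 2}, {"id": 3}]

print(collapse_partitions([day1, day2]))  # {1: 5, 2: 3, 3: 3}
```

The real job gains efficiency by reusing the previous day's collapsed output instead of rescanning all partitions; this sketch only shows the aggregation semantics.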
http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/87f55b42/site/source/community/contributing.html.markdown
----------------------------------------------------------------------
diff --git a/site/source/community/contributing.html.markdown b/site/source/community/contributing.html.markdown
new file mode 100644
index 0000000..f25a17c
--- /dev/null
+++ b/site/source/community/contributing.html.markdown
@@ -0,0 +1,83 @@
+---
+title: Contributing - Community
+section_name: Community
+license: >
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+---
+
+# Contributing
+
+We welcome contributions to Apache DataFu.  If you're interested, please read the following guide:
+
+https://cwiki.apache.org/confluence/display/DATAFU/Contributing+to+Apache+DataFu
+
+## Working in the Code Base
+
+Common tasks for working in the DataFu code can be found below.  For information on how to contribute patches, please
+follow the wiki link above.
+
+### Get the Code
+
+If you haven't done so already:
+
+    git clone https://git-wip-us.apache.org/repos/asf/incubator-datafu.git
+    cd incubator-datafu
+
+### Generate Eclipse Files
+
+The following command generates the necessary files to load the project in Eclipse:
+
+    ./gradlew eclipse
+
+To clean up the eclipse files:
+
+    ./gradlew cleanEclipse
+
+Note that you may run out of heap when executing tests in Eclipse.  To fix this, adjust your heap settings for the TestNG plugin.  Go to Eclipse->Preferences.  Select TestNG->Run/Debug.  Add "-Xmx1G" to the JVM args.
+
+### Building
+
+All the JARs for the project can be built with the following command:
+
+    ./gradlew assemble
+
+This builds SNAPSHOT versions of the JARs for both DataFu Pig and Hourglass.  The built JARs can be found under `datafu-pig/build/libs` and `datafu-hourglass/build/libs`, respectively.
+
+The DataFu Pig and Hourglass libraries can also be built individually by running the commands below.
+
+    ./gradlew :datafu-pig:assemble
+    ./gradlew :datafu-hourglass:assemble
+
+### Running Tests
+
+Tests can be run with the following command:
+
+    ./gradlew test
+
+All the tests can also be run from within Eclipse.
+
+To run the DataFu Pig or Hourglass tests specifically:
+
+    ./gradlew :datafu-pig:test
+    ./gradlew :datafu-hourglass:test
+
+To run a specific set of tests from the command line, you can define the `test.single` system property with a value matching the test class you want to run.  For example, to run all tests defined in the `QuantileTests` test class for DataFu Pig:
+
+    ./gradlew :datafu-pig:test -Dtest.single=QuantileTests
+
+You can similarly run a specific Hourglass test like so:
+
+    ./gradlew :datafu-hourglass:test -Dtest.single=PartitionCollapsingTests

http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/87f55b42/site/source/community/mailing-lists.html.markdown
----------------------------------------------------------------------
diff --git a/site/source/community/mailing-lists.html.markdown b/site/source/community/mailing-lists.html.markdown
index 4c4476f..3e40b68 100644
--- a/site/source/community/mailing-lists.html.markdown
+++ b/site/source/community/mailing-lists.html.markdown
@@ -1,5 +1,6 @@
 ---
-title: Mailing Lists - Apache DataFu Community
+title: Mailing Lists - Community
+section_name: Community
 license: >
    Licensed to the Apache Software Foundation (ASF) under one or more
    contributor license agreements.  See the NOTICE file distributed with

http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/87f55b42/site/source/docs/datafu/contributing.html.markdown
----------------------------------------------------------------------
diff --git a/site/source/docs/datafu/contributing.html.markdown b/site/source/docs/datafu/contributing.html.markdown
deleted file mode 100644
index 1b70219..0000000
--- a/site/source/docs/datafu/contributing.html.markdown
+++ /dev/null
@@ -1,68 +0,0 @@
----
-title: Contributing - Apache DataFu Pig
-section_name: Apache DataFu Pig
-license: >
-   Licensed to the Apache Software Foundation (ASF) under one or more
-   contributor license agreements.  See the NOTICE file distributed with
-   this work for additional information regarding copyright ownership.
-   The ASF licenses this file to You under the Apache License, Version 2.0
-   (the "License"); you may not use this file except in compliance with
-   the License.  You may obtain a copy of the License at
-
-       http://www.apache.org/licenses/LICENSE-2.0
-
-   Unless required by applicable law or agreed to in writing, software
-   distributed under the License is distributed on an "AS IS" BASIS,
-   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-   See the License for the specific language governing permissions and
-   limitations under the License.
----
-
-# Contributing
-
-We welcome contributions to the Apache DataFu Pig library!  Please read the following guide on how to contribute to DataFu.  
-
-https://cwiki.apache.org/confluence/display/DATAFU/Contributing+to+Apache+DataFu
-
-## Common Tasks
-
-Common tasks for working with the DataFu Pig code can be found below.  For information on how to contribute patches, please
-follow the wiki link above.
-
-### Get the Code
-
-To clone the repository run the following command:
-
-    git clone git://git.apache.org/incubator-datafu.git
-
-### Generate Eclipse Files
-
-The following command generates the necessary files to load the project in Eclipse:
-
-    ./gradlew eclipse
-
-To clean up the eclipse files:
-
-    ./gradlew cleanEclipse
-
-Note that you may run out of heap when executing tests in Eclipse.  To fix this adjust your heap settings for the TestNG plugin.  Go to Eclipse->Preferences.  Select TestNG->Run/Debug.  Add "-Xmx1G" to the JVM args.
-
-### Build the JAR
-
-The Apache DataFu Pig library can be built by running the command below. 
-
-    ./gradlew assemble
-
-The built JAR can be found under `datafu-pig/build/libs` by the name `datafu-pig-x.y.z.jar`, where x.y.z is the version.
-    
-### Running Tests
-
-All the tests can be run from within eclipse.  However they can also be run from the command line.  To run all the tests:
-
-    ./gradlew test
-
-To run a specific set of tests from the command line, you can define the `test.single` system property.  For example, to run all tests defined in `QuantileTests`:
-
-    ./gradlew :datafu-pig:test -Dtest.single=QuantileTests
-
-

http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/87f55b42/site/source/docs/datafu/getting-started.html.markdown.erb
----------------------------------------------------------------------
diff --git a/site/source/docs/datafu/getting-started.html.markdown.erb b/site/source/docs/datafu/getting-started.html.markdown.erb
index f147b33..399dccb 100644
--- a/site/source/docs/datafu/getting-started.html.markdown.erb
+++ b/site/source/docs/datafu/getting-started.html.markdown.erb
@@ -1,6 +1,7 @@
 ---
 title: Getting Started - Apache DataFu Pig
 version: 1.2.0
+snapshot_version: 1.3.0-SNAPSHOT
 section_name: Apache DataFu Pig
 license: >
    Licensed to the Apache Software Foundation (ASF) under one or more
@@ -76,35 +77,11 @@ Apache DataFu Pig is a collection of user-defined functions for working with lar
 If you'd like to read more details about these functions, check out the [Guide](/docs/datafu/guide.html).  Otherwise if you are
 ready to get started using DataFu Pig, keep reading.
 
-## Download
-
-DataFu Pig is available as a JAR that can be downloaded and registered with Pig.  It can be found in the Maven central repository
-under the group ID [com.linkedin.datafu](http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22com.linkedin.datafu%22) by the
-name `datafu`.
-
-If you are using Ivy, you can download `datafu` and its dependencies with:
-
-```xml
-<dependency org="com.linkedin.datafu" name="datafu" rev="<%= current_page.data.version %>"/>
-```
-
-Or if you are using Maven:
-
-```xml
-<dependency>
-  <groupId>com.linkedin.datafu</groupId>
-  <artifactId>datafu</artifactId>
-  <version><%= current_page.data.version %></version>
-</dependency>
-```
-
-Your other option is to [download](https://github.com/linkedin/datafu/archive/master.zip) the code and build the JAR yourself.
-After unzipping the archive you can build the JAR by running `ant jar`.  The dependencies will be 
-downloaded to `lib/common`.
+The rest of this page assumes you already have a built JAR available.  If this is not the case, please see [Quick Start](/docs/quick-start.html).
 
 ## Basic Example: Computing Median
 
-Now that we have downloaded DataFu, let's use it to perform a very basic task: computing the median of some data.
+Let's use DataFu Pig to perform a very basic task: computing the median of some data.
 Suppose we have a file `input` in Hadoop with the following content:
 
     1
@@ -122,7 +99,7 @@ We can clearly see that the median is 2 for this data set.  First we'll start up
 then register the DataFu JAR:
 
 ```pig
-register datafu-<%= current_page.data.version %>.jar
+register datafu-<%= current_page.data.snapshot_version %>.jar
 ```
 
 To compute the median we'll use DataFu's `StreamingMedian`, which computes an estimate of the median but has the benefit
@@ -146,4 +123,4 @@ This produces the expected output:
 
 ## Next Steps
 
-Check out the [Guide](/docs/datafu/guide.html) for more information on what you can do with DataFu.
\ No newline at end of file
+Check out the [Guide](/docs/datafu/guide.html) for more information on what you can do with DataFu Pig.
\ No newline at end of file

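The getting-started page patched above demonstrates DataFu's `StreamingMedian`, which computes an estimate of the median in a single pass. The streaming idea itself can be illustrated with an exact two-heap median in Python; this is not DataFu's algorithm, and the class and method names below are invented for the sketch:

```python
# Exact streaming median via two heaps: the lower half lives in a
# max-heap (stored negated), the upper half in a min-heap, rebalanced
# after every value so the median is always at the heap roots.
import heapq

class StreamingMedian:
    def __init__(self):
        self.lo = []  # max-heap (negated values) for the lower half
        self.hi = []  # min-heap for the upper half

    def accumulate(self, x):
        heapq.heappush(self.lo, -x)                     # tentatively to lower half
        heapq.heappush(self.hi, -heapq.heappop(self.lo))  # move largest up
        if len(self.hi) > len(self.lo):                 # keep lo >= hi in size
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if len(self.lo) > len(self.hi):
            return float(-self.lo[0])                   # odd count: lower root
        return (-self.lo[0] + self.hi[0]) / 2.0         # even count: average roots

m = StreamingMedian()
for x in [1, 2, 3, 2, 2, 2]:
    m.accumulate(x)
print(m.median())  # 2.0
```

Each value is processed in O(log n) time with O(n) state; DataFu's estimator trades exactness for constant memory, which matters at Hadoop scale.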
http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/87f55b42/site/source/docs/datafu/guide.html.markdown.erb
----------------------------------------------------------------------
diff --git a/site/source/docs/datafu/guide.html.markdown.erb b/site/source/docs/datafu/guide.html.markdown.erb
index c345262..462286c 100644
--- a/site/source/docs/datafu/guide.html.markdown.erb
+++ b/site/source/docs/datafu/guide.html.markdown.erb
@@ -35,13 +35,13 @@ It has a number of useful functions available.  This guide provides examples of
 * [More Tips and Tricks](/docs/datafu/guide/more-tips-and-tricks.html)
 
 There is also [Javadoc](/docs/datafu/javadoc.html) available for all UDFs in the library.  We continue to add
-UDFs to the library.  If you are interested in helping out please follow the [Contributing](/docs/datafu/contributing.html)
+UDFs to the library.  If you are interested in helping out please follow the [Contributing](/community/contributing.html)
 guide.
 
 ## Pig Compatibility
 
 The current version of DataFu has been tested against Pig 0.11.1 and 0.12.0.  DataFu should be compatible with some older versions of Pig, however we do not do any sort of testing with prior versions of Pig and do not guarantee compatibility.
-Our policy is to test against the most recent version of Pig whenever we release and make sure DataFu works with that version. 
+Our policy is to test against the most recent version of Pig whenever we release and make sure DataFu works with that version.
 
 ## Blog Posts
 

http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/87f55b42/site/source/docs/datafu/javadoc.html.markdown.erb
----------------------------------------------------------------------
diff --git a/site/source/docs/datafu/javadoc.html.markdown.erb b/site/source/docs/datafu/javadoc.html.markdown.erb
index 8d2185b..85c34db 100644
--- a/site/source/docs/datafu/javadoc.html.markdown.erb
+++ b/site/source/docs/datafu/javadoc.html.markdown.erb
@@ -21,7 +21,7 @@ license: >
 
 # Javadoc
 
-The latest released version is [<%= current_page.data.latest %>](/docs/datafu/<%= current_page.data.latest %>/).
+The latest Javadocs available are for release [<%= current_page.data.latest %>](/docs/datafu/<%= current_page.data.latest %>/).
 
 Older versions:
 

http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/87f55b42/site/source/docs/hourglass/contributing.html.markdown
----------------------------------------------------------------------
diff --git a/site/source/docs/hourglass/contributing.html.markdown b/site/source/docs/hourglass/contributing.html.markdown
deleted file mode 100644
index 3ca38dc..0000000
--- a/site/source/docs/hourglass/contributing.html.markdown
+++ /dev/null
@@ -1,41 +0,0 @@
----
-title: Contributing - Apache DataFu Hourglass
-section_name: Apache DataFu Hourglass
-license: >
-   Licensed to the Apache Software Foundation (ASF) under one or more
-   contributor license agreements.  See the NOTICE file distributed with
-   this work for additional information regarding copyright ownership.
-   The ASF licenses this file to You under the Apache License, Version 2.0
-   (the "License"); you may not use this file except in compliance with
-   the License.  You may obtain a copy of the License at
-
-       http://www.apache.org/licenses/LICENSE-2.0
-
-   Unless required by applicable law or agreed to in writing, software
-   distributed under the License is distributed on an "AS IS" BASIS,
-   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-   See the License for the specific language governing permissions and
-   limitations under the License.
----
-
-# Contributing
-
-We welcome contributions to the Apache DataFu Hourglass library!  Please read the following guide on how to contribute to DataFu.  
-
-https://cwiki.apache.org/confluence/display/DATAFU/Contributing+to+Apache+DataFu
-
-## Common Tasks
-
-Common tasks for working with the DataFu Hourglass code can be found below.  For information on how to contribute patches, please
-follow the wiki link above.
-
-### Build the JAR
-
-    cd contrib/hourglass
-    ant jar
-
-### Running Tests
-
-All the tests can be run from within eclipse.  However they can also be run from the command line.  To run all the tests:
-
-    ant test
\ No newline at end of file

