datafu-commits mailing list archives

From mha...@apache.org
Subject [1/2] incubator-datafu git commit: Update documentation and fix various issues
Date Wed, 21 Oct 2015 16:55:27 GMT
Repository: incubator-datafu
Updated Branches:
  refs/heads/master 643543706 -> 87f55b425


http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/87f55b42/site/source/docs/hourglass/getting-started.html.markdown.erb
----------------------------------------------------------------------
diff --git a/site/source/docs/hourglass/getting-started.html.markdown.erb b/site/source/docs/hourglass/getting-started.html.markdown.erb
index eb44f3c..d9740b6 100644
--- a/site/source/docs/hourglass/getting-started.html.markdown.erb
+++ b/site/source/docs/hourglass/getting-started.html.markdown.erb
@@ -2,6 +2,7 @@
 title: Getting Started - Apache DataFu Hourglass
 section_name: Apache DataFu Hourglass
 version: 0.1.3
+snapshot_version: 1.3.0-SNAPSHOT
 license: >
    Licensed to the Apache Software Foundation (ASF) under one or more
    contributor license agreements.  See the NOTICE file distributed with
@@ -28,38 +29,15 @@ A typical example of a sliding window is a dashboard that shows the number of vi
 To keep this dashboard up to date, we can schedule a query that runs daily and gathers the stats for the last 30 days.
 However, this simple implementation would be wasteful: only one day of data has changed, but we'd be consuming and recalculating
 the stats for all 30.  A more efficient solution is to make the query incremental: using basic arithmetic, we can update the output
-from the previous day by adding and subtracting input data. This enables the job to process only the new data, significantly reducing 
+from the previous day by adding and subtracting input data. This enables the job to process only the new data, significantly reducing
 the computational resources required.
 
 Hourglass is a framework that makes it much easier to write incremental Hadoop jobs to perform this type of computation efficiently.
 It provides incremental jobs that abstract away the complexity of implementing a robust incremental solution, with the appropriate hooks
 so that developers can supply custom logic to perform the aggregation task.
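
The add-and-subtract arithmetic described above can be illustrated outside of Hadoop. The following is a hypothetical stand-alone sketch, not the Hourglass API: a sliding window of per-id counts is advanced by adding the newest day and subtracting the day that fell out of the window, touching only two days of input instead of the whole window.

```java
import java.util.HashMap;
import java.util.Map;

public class SlidingCountSketch {
    // Add each id's occurrences for one day's events into the window totals.
    static void add(Map<Long, Long> totals, long[] dayEvents) {
        for (long id : dayEvents) totals.merge(id, 1L, Long::sum);
    }

    // Subtract a day that slid out of the window; drop ids that reach zero.
    static void subtract(Map<Long, Long> totals, long[] dayEvents) {
        for (long id : dayEvents) {
            totals.merge(id, -1L, Long::sum);
            if (totals.get(id) == 0L) totals.remove(id);
        }
    }

    public static void main(String[] args) {
        long[] day1 = {1, 2, 2};
        long[] day2 = {2, 3};
        long[] day3 = {1, 3, 3};

        // Window over days 1-2.
        Map<Long, Long> window = new HashMap<>();
        add(window, day1);
        add(window, day2);

        // Slide to days 2-3: only day1 and day3 are consumed, not day2.
        subtract(window, day1);
        add(window, day3);

        // Recomputing days 2-3 from scratch gives the same totals.
        Map<Long, Long> recomputed = new HashMap<>();
        add(recomputed, day2);
        add(recomputed, day3);
        System.out.println(window.equals(recomputed));  // true
    }
}
```

This only works because the aggregation is invertible; Hourglass's hooks are where a real job supplies the equivalent add/subtract logic for its own data.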
 
-## Download
-
-DataFu's Hourglass library is available as a JAR that can be found in the Maven central repository
-under the group ID [com.linkedin.datafu](http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22com.linkedin.datafu%22) by the
-name `datafu-hourglass`.
-
-If you are using Ivy, you can download `datafu-hourglass` and its dependencies with:
-
-```xml
-<dependency org="com.linkedin.datafu" name="datafu-hourglass" rev="<%= current_page.data.version %>"/>
-```
-
-Or if you are using Maven:
-
-```xml
-<dependency>
-  <groupId>com.linkedin.datafu</groupId>
-  <artifactId>datafu-hourglass</artifactId>
-  <version><%= current_page.data.version %></version>
-</dependency>
-```
-
-Your other option is to [download](https://github.com/linkedin/datafu/archive/master.zip) the code and build the JAR yourself.
-After unzipping the archive, navigate to `contrib/hourglass`  and build the JAR by running `ant jar`.  The dependencies will be 
-downloaded to `lib/common`.
+If you have not already checked out the code, please see: [Quick Start](/docs/quick-start.html).  There you'll also see how to declare a
+dependency on Hourglass in a project.
 
 ## Examples
 
@@ -72,25 +50,26 @@ to compute daily, such as how many times each user has logged in or viewed a pag
 These sample jobs are packaged in a tool that can be run from the command line.  The same tool also serves as a test data generator.
 Here we will walk through how to generate test data and run the jobs against it.
 
-To start, get the source code and navigate to the Hourglass directory.
+Build the main Hourglass JAR, build the test JAR, and copy the dependencies necessary for the demo to a single directory:
 
-    git clone git://git.apache.org/incubator-datafu.git
-    cd contrib/hourglass
+    ./gradlew :datafu-hourglass:jar :datafu-hourglass:testJar :datafu-hourglass:copyDemoDependencies
 
-Build the Hourglass JAR, and in addition build the test jar that contains the sample jobs.
+Define some variables that we'll need for the `hadoop jar` command. These list the JAR dependencies, as well as the JARs we just built.
 
-    ant jar
-    ant testjar
+    export LIBJARS=$(find "datafu-hourglass/build/libs" -name '*.jar' | xargs echo | tr ' ' ',')
 
-Define some variables that we'll need for the `hadoop jar` command later. These list the JAR dependencies, as well as the two JARs we just built.
+    export LIBJARS=$LIBJARS,$(find "datafu-hourglass/build/demo_dependencies" -name '*.jar' | xargs echo | tr ' ' ',')
 
-    export LIBJARS=$(find "lib/common" -name '*.jar' | xargs echo | tr ' ' ',')
-    export LIBJARS=$LIBJARS,$(find "lib/test" -name '*.jar' | xargs echo | tr ' ' ',')
-    export LIBJARS=$LIBJARS,$(find "build" -name '*.jar' | xargs echo | tr ' ' ',')
     export HADOOP_CLASSPATH=`echo ${LIBJARS} | sed s/,/:/g`
 
 Assuming you've set up the `hadoop` command to run against your Hadoop cluster, you are now ready to run the jobs.
 
+Let's define some shorthand commands to run the Hourglass JAR and dump JSON from an Avro file:
+
+    export HOURGLASS_CMD="hadoop jar datafu-hourglass/build/libs/datafu-hourglass-incubating-<%= current_page.data.snapshot_version %>-tests.jar datafu.hourglass.demo.Main"
+
+    export TO_JSON_CMD="java -jar datafu-hourglass/build/demo_dependencies/avro-tools-1.7.4.jar tojson"
+
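
As a sanity check on what those exports produce: `-libjars` expects a comma-separated list, while `HADOOP_CLASSPATH` expects the same list colon-separated, which is all the `tr` and `sed` calls are doing. A hypothetical sketch of the same transformation in Java:

```java
import java.util.List;

public class JarListSketch {
    // Mirror of: find ... -name '*.jar' | xargs echo | tr ' ' ','
    static String toLibJars(List<String> jars) {
        return String.join(",", jars);
    }

    // Mirror of: echo ${LIBJARS} | sed s/,/:/g
    static String toClasspath(String libJars) {
        return libJars.replace(',', ':');
    }

    public static void main(String[] args) {
        String libJars = toLibJars(List.of("a.jar", "b.jar"));
        System.out.println(libJars);               // a.jar,b.jar
        System.out.println(toClasspath(libJars));  // a.jar:b.jar
    }
}
```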
 ### Counting Events
 
 In this example we will run a job that counts how many times each `id` value has appeared in an input data set.  Then we will run the
@@ -100,12 +79,12 @@ First we'll generate some test data under the path`/data/event` using the `gener
 The command below will create some random events for dates between 2013/03/01 and 2013/03/14, inclusive.
 Each record consists of just a single long value from the range 1-100.
 
-    hadoop jar build/datafu-hourglass-test.jar generate -libjars ${LIBJARS} /data/event 2013/03/01-2013/03/14
+    $HOURGLASS_CMD generate -libjars ${LIBJARS} /data/event 2013/03/01-2013/03/14
 
 Just to get a sense for what the data looks like, we can copy it locally and dump the first several records.
 
     hadoop fs -copyToLocal /data/event/2013/03/01/part-00000.avro temp.avro
-    java -jar lib/test/avro-tools-jar-1.7.4.jar tojson temp.avro | head
+    $TO_JSON_CMD temp.avro | head
 
 This will produce output looking something like this:
 
@@ -120,10 +99,10 @@ This will produce output looking something like this:
     {"id":6}
     {"id":44}
 
-Now run the `countbyid` command, which executes the sample [CountById](https://github.com/linkedin/datafu/blob/master/contrib/hourglass/test/java/datafu/hourglass/demo/CountById.java) job.
+Now run the `countbyid` command, which executes the sample [CountById](https://github.com/apache/incubator-datafu/blob/master/datafu-hourglass/src/test/java/datafu/hourglass/demo/CountById.java) job.
 This will count the number of events for each ID value.
 
-    hadoop jar build/datafu-hourglass-test.jar countbyid -libjars ${LIBJARS} /data/event /output
+    $HOURGLASS_CMD countbyid -libjars ${LIBJARS} /data/event /output
 
 In the console output you will notice that it reads all fourteen days of input that are available.
 We can see what this produced by copying the output locally and dumping the first several records.
@@ -131,7 +110,7 @@ Each record consists of an ID and a count.
 
     rm temp.avro
     hadoop fs -copyToLocal /output/20130314/part-r-00000.avro temp.avro
-    java -jar lib/test/avro-tools-jar-1.7.4.jar tojson temp.avro | head
+    $TO_JSON_CMD temp.avro | head
 
 This will produce output looking something like this:
 
@@ -148,13 +127,13 @@ This will produce output looking something like this:
 
 Now let's generate an additional day of data, for 2013/03/15:
 
-    hadoop jar build/datafu-hourglass-test.jar generate -libjars ${LIBJARS} /data/event 2013/03/15
+    $HOURGLASS_CMD generate -libjars ${LIBJARS} /data/event 2013/03/15
 
 The job is configured to consume all available input data.  But since a previous output already exists,
 it is able to reuse this result and therefore it only needs to consume the previous output and the new
 day of input.  Let's run the incremental job again:
 
-    hadoop jar build/datafu-hourglass-test.jar countbyid -libjars ${LIBJARS} /data/event /output
+    $HOURGLASS_CMD countbyid -libjars ${LIBJARS} /data/event /output
 
 You'll notice in console output that the job considers two alternative plans.  In one version it consumes
 all available input data to produce the new output.  In the other version it reuses the previous output
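
The equivalence behind the second plan is simple arithmetic: because per-id counts merge associatively, combining the previous output with just the new day of input yields the same result as recounting every day from scratch. A hypothetical sketch of that equivalence (not the planner's actual cost model):

```java
import java.util.HashMap;
import java.util.Map;

public class ReusePlanSketch {
    // Merge per-id counts from one source into the running totals.
    static void merge(Map<Long, Long> totals, Map<Long, Long> delta) {
        delta.forEach((id, c) -> totals.merge(id, c, Long::sum));
    }

    public static void main(String[] args) {
        Map<Long, Long> day1 = Map.of(1L, 3L, 2L, 1L);
        Map<Long, Long> day2 = Map.of(2L, 2L, 3L, 5L);

        // Plan A: consume all days of input from scratch.
        Map<Long, Long> planA = new HashMap<>();
        merge(planA, day1);
        merge(planA, day2);

        // Plan B: reuse the previous output (covering day1) and add only day2.
        Map<Long, Long> planB = new HashMap<>(day1);
        merge(planB, day2);

        System.out.println(planA.equals(planB));  // true
    }
}
```

The job picks between the two plans based on which would consume less data.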
@@ -165,9 +144,9 @@ We can download the new output and inspect the counts:
 
     rm temp.avro
     hadoop fs -copyToLocal /output/20130315/part-r-00000.avro temp.avro
-    java -jar lib/test/avro-tools-jar-1.7.4.jar tojson temp.avro | head
+    $TO_JSON_CMD temp.avro | head
 
-The implementation of the `CountById` job can be found [here](https://github.com/linkedin/datafu/blob/master/contrib/hourglass/test/java/datafu/hourglass/demo/CountById.java).
+The implementation of the `CountById` job can be found [here](https://github.com/apache/incubator-datafu/blob/master/datafu-hourglass/src/test/java/datafu/hourglass/demo/CountById.java).
 A more detailed explanation of how the job works and how it is implemented can be found in our
 [blog post](/blog/2013/10/03/datafus-hourglass-incremental-data-processing-in-hadoop.html).
 
@@ -185,11 +164,11 @@ Let's start by cleaning up the output directory:
 
     hadoop fs -rmr /output
 
-If you have been following along from the previous example, you already have fifteen days of input data available, 
+If you have been following along from the previous example, you already have fifteen days of input data available,
 so we don't need to regenerate it.  We can run the `cardinality` command to execute the two jobs.  This executes
-the sample jobs in [EstimateCardinality](https://github.com/linkedin/datafu/blob/master/contrib/hourglass/test/java/datafu/hourglass/demo/EstimateCardinality.java).
+the sample jobs in [EstimateCardinality](https://github.com/apache/incubator-datafu/blob/master/datafu-hourglass/src/test/java/datafu/hourglass/demo/EstimateCardinality.java).
 
-    hadoop jar build/datafu-hourglass-test.jar cardinality -libjars ${LIBJARS} /data/event /intermediate /output 15
+    $HOURGLASS_CMD cardinality -libjars ${LIBJARS} /data/event /intermediate /output 15
 
 You will notice in the console output that the job consumes fifteen days of input.  We can then inspect the output to
 see the count of distinct IDs.  Note that the output record consists of the count *and* the serialized HyperLogLog estimator,
@@ -197,27 +176,27 @@ so we use `grep` to return just the count.
 
     rm temp.avro
     hadoop fs -copyToLocal /output/20130315/part-r-00000.avro temp.avro
-    java -jar lib/test/avro-tools-jar-1.7.4.jar tojson temp.avro | grep -E -oh "\"count\":[0-9]+"
+    $TO_JSON_CMD temp.avro | grep -E -oh "\"count\":[0-9]+"
 
 As the IDs in the test data are generated from the range 1-100, this produces the expected output of 100.
-We'll add a new day of data, but this time we'll use the range 101-200.  
+We'll add a new day of data, but this time we'll use the range 101-200.
 
-    hadoop jar build/datafu-hourglass-test.jar generate -libjars ${LIBJARS} /data/event 2013/03/16 101-200
+    $HOURGLASS_CMD generate -libjars ${LIBJARS} /data/event 2013/03/16 101-200
 
 Now we'll run the job again.  It automatically consumes fifteen days of data ending with the most recent data that's
 available.
 
-    hadoop jar build/datafu-hourglass-test.jar cardinality -libjars ${LIBJARS} /data/event /intermediate /output 15
+    $HOURGLASS_CMD cardinality -libjars ${LIBJARS} /data/event /intermediate /output 15
 
 We can now inspect the output again:
 
     rm temp.avro
     hadoop fs -copyToLocal /output/20130316/part-r-00000.avro temp.avro
-    java -jar lib/test/avro-tools-jar-1.7.4.jar tojson temp.avro | grep -E -oh "\"count\":[0-9]+"
+    $TO_JSON_CMD temp.avro | grep -E -oh "\"count\":[0-9]+"
 
 This produces the expected result of 200.
 
-The implementation of the `EstimateCardinality` job can be found [here](https://github.com/linkedin/datafu/blob/master/contrib/hourglass/test/java/datafu/hourglass/demo/EstimateCardinality.java).
+The implementation of the `EstimateCardinality` job can be found [here](https://github.com/apache/incubator-datafu/blob/master/datafu-hourglass/src/test/java/datafu/hourglass/demo/EstimateCardinality.java).
 A more detailed explanation of how the job works and how it is implemented can be found in our
 [blog post](/blog/2013/10/03/datafus-hourglass-incremental-data-processing-in-hadoop.html).
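
The reason the output record keeps the serialized estimator alongside the count is that estimators, unlike plain counts, can be merged across days: distinct counts are not additive, but merged estimators are. As a stand-in for HyperLogLog (which the real job uses), an exact set shows the same merge property in this hypothetical sketch:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CardinalitySketch {
    // Merge per-day "estimators" (exact sets standing in for HyperLogLog)
    // and return the distinct count over the merged window.
    static long mergedCardinality(List<Set<Long>> days) {
        Set<Long> merged = new HashSet<>();
        for (Set<Long> day : days) merged.addAll(day);
        return merged.size();
    }

    public static void main(String[] args) {
        Set<Long> day1 = Set.of(1L, 2L, 3L);
        Set<Long> day2 = Set.of(3L, 4L);
        // Naively adding per-day counts would give 5; merging gives the
        // true distinct count, which is why the estimator is stored per day.
        System.out.println(mergedCardinality(List.of(day1, day2)));  // 4
    }
}
```

A real HyperLogLog uses a fixed-size register array instead of an exact set, trading a small error for constant memory, but the merge-then-count structure is the same.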
 

http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/87f55b42/site/source/docs/hourglass/javadoc.html.markdown.erb
----------------------------------------------------------------------
diff --git a/site/source/docs/hourglass/javadoc.html.markdown.erb b/site/source/docs/hourglass/javadoc.html.markdown.erb
index 6677e1d..b1a47e1 100644
--- a/site/source/docs/hourglass/javadoc.html.markdown.erb
+++ b/site/source/docs/hourglass/javadoc.html.markdown.erb
@@ -21,4 +21,4 @@ license: >
 
 # Javadoc
 
-The latest released version is [<%= current_page.data.latest %>](/docs/hourglass/<%= current_page.data.latest %>/).
\ No newline at end of file
+The latest Javadocs available are for release [<%= current_page.data.latest %>](/docs/hourglass/<%= current_page.data.latest %>/).
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/87f55b42/site/source/docs/quick-start.html.markdown.erb
----------------------------------------------------------------------
diff --git a/site/source/docs/quick-start.html.markdown.erb b/site/source/docs/quick-start.html.markdown.erb
new file mode 100644
index 0000000..ddb0458
--- /dev/null
+++ b/site/source/docs/quick-start.html.markdown.erb
@@ -0,0 +1,89 @@
+---
+title: Quick Start - Apache DataFu
+section_name: Apache DataFu
+version: 1.2.0
+next_version: 1.3.0
+snapshot_version: 1.3.0-SNAPSHOT
+license: >
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+---
+
+# Quick Start
+
+Clone the repository with the following command:
+
+    git clone https://git-wip-us.apache.org/repos/asf/incubator-datafu.git
+    cd incubator-datafu
+
+## Build
+
+To build the JARs, run:
+
+    ./gradlew assemble
+
+This will produce snapshot JARs for the next upcoming release, `<%= current_page.data.next_version %>`.  There is not _yet_ an official Apache release for DataFu, so for the moment we recommend building the latest version from HEAD.
+
+### DataFu Pig
+
+After building, the DataFu Pig artifacts can be found in `datafu-pig/build/libs`.  This should contain:
+
+* `datafu-pig-incubating-<%= current_page.data.snapshot_version %>.jar`
+* `datafu-pig-incubating-<%= current_page.data.snapshot_version %>-javadoc.jar`
+* `datafu-pig-incubating-<%= current_page.data.snapshot_version %>-sources.jar`
+
+The `datafu-pig-incubating-<%= current_page.data.snapshot_version %>.jar` file can now be used in Pig!
+
+See [DataFu Pig - Getting Started](/docs/datafu/getting-started.html) for next steps.
+
+### DataFu Hourglass
+
+After building, the DataFu Hourglass artifacts can be found in `datafu-hourglass/build/libs`.  This should contain:
+
+* `datafu-hourglass-incubating-<%= current_page.data.snapshot_version %>.jar`
+* `datafu-hourglass-incubating-<%= current_page.data.snapshot_version %>-javadoc.jar`
+* `datafu-hourglass-incubating-<%= current_page.data.snapshot_version %>-sources.jar`
+
+DataFu Hourglass has several external library dependencies that are required in order to use it.  Therefore, the easiest way to get started using it is to install DataFu to your local maven repository:
+
+    ./gradlew install
+
+Assuming your local maven repository is at `~/.m2`, you should see the DataFu Hourglass libraries under `~/.m2/repository/org/apache/datafu/datafu-hourglass-incubating/<%= current_page.data.snapshot_version %>`.
+
+You should now be able to declare a dependency on DataFu Hourglass.
+
+Gradle:
+
+```groovy
+compile "org.apache.datafu:datafu-hourglass-incubating:<%= current_page.data.snapshot_version %>"
+```
+
+Ivy:
+
+```xml
+<dependency org="org.apache.datafu" name="datafu-hourglass-incubating" rev="<%= current_page.data.snapshot_version %>"/>
+```
+
+Maven:
+
+```xml
+<dependency>
+  <groupId>org.apache.datafu</groupId>
+  <artifactId>datafu-hourglass-incubating</artifactId>
+  <version><%= current_page.data.snapshot_version %></version>
+</dependency>
+```
+
+See [DataFu Hourglass - Getting Started](/docs/hourglass/getting-started.html) for next steps.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/87f55b42/site/source/index.markdown.erb
----------------------------------------------------------------------
diff --git a/site/source/index.markdown.erb b/site/source/index.markdown.erb
index 05b582e..ca8aa3a 100644
--- a/site/source/index.markdown.erb
+++ b/site/source/index.markdown.erb
@@ -19,7 +19,7 @@ license: >
 
 # Apache DataFu
 
-Apache DataFu&trade; is a collection of libraries for working with large-scale data in Hadoop. 
+Apache DataFu&trade; is a collection of libraries for working with large-scale data in Hadoop.
 The project was inspired by the need for stable, well-tested libraries for data mining and statistics.
 
 It consists of two libraries:
@@ -27,6 +27,10 @@ It consists of two libraries:
 * **Apache DataFu Pig**: a collection of user-defined functions for [Apache Pig](http://pig.apache.org/)
 * **Apache DataFu Hourglass**: an incremental processing framework for [Apache Hadoop](http://hadoop.apache.org/) in MapReduce
 
+To begin using it, see our [Quick Start](/docs/quick-start.html) guide.  If you'd like to help contribute, see [Contributing](/community/contributing.html).
+
+## About the Project
+
 ### Apache DataFu Pig
 
 Apache DataFu Pig is a collection of useful user-defined functions for data analysis in [Apache Pig](http://pig.apache.org/).

http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/87f55b42/site/source/layouts/_docs_nav.erb
----------------------------------------------------------------------
diff --git a/site/source/layouts/_docs_nav.erb b/site/source/layouts/_docs_nav.erb
index 7c0304a..b5f8dc0 100644
--- a/site/source/layouts/_docs_nav.erb
+++ b/site/source/layouts/_docs_nav.erb
@@ -17,12 +17,17 @@
 # under the License.
 %>
 
+<h4>Apache DataFu</h4>
+<ul class="nav nav-pills nav-stacked">
+  <li><a href="/">Home</a></li>
+  <li><a href="/docs/quick-start.html">Quick Start</a></li>
+</ul>
+
 <h4>Apache DataFu Pig</h4>
 <ul class="nav nav-pills nav-stacked">
   <li><a href="/docs/datafu/getting-started.html">Getting Started</a></li>
   <li><a href="/docs/datafu/guide.html">Guide</a></li>
   <li><a href="/docs/datafu/javadoc.html">Javadoc</a></li>
-  <li><a href="/docs/datafu/contributing.html">Contributing</a></li>
 </ul>
 
 <h4>Apache DataFu Hourglass</h4>
@@ -30,11 +35,11 @@
   <li><a href="/docs/hourglass/getting-started.html">Getting Started</a></li>
   <li><a href="/docs/hourglass/concepts.html">Concepts</a></li>
   <li><a href="/docs/hourglass/javadoc.html">Javadoc</a></li>
-  <li><a href="/docs/hourglass/contributing.html">Contributing</a></li>
 </ul>
 
 <h4>Community</h4>
 <ul class="nav nav-pills nav-stacked">
+  <li><a href="/community/contributing.html">Contributing</a></li>
   <li><a href="/community/mailing-lists.html">Mailing Lists</a></li>
   <li><a href="https://issues.apache.org/jira/browse/DATAFU">Bugs</a></li>
 </ul>
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-datafu/blob/87f55b42/site/source/layouts/_footer.erb
----------------------------------------------------------------------
diff --git a/site/source/layouts/_footer.erb b/site/source/layouts/_footer.erb
index eb63698..085ac21 100644
--- a/site/source/layouts/_footer.erb
+++ b/site/source/layouts/_footer.erb
@@ -18,6 +18,6 @@
 %>
 
 <div class="footer">
-Copyright &copy; 2011-2014 <a href="http://www.apache.org/licenses/">The Apache Software Foundation</a>. <br>
+Copyright &copy; 2011-2015 <a href="http://www.apache.org/licenses/">The Apache Software Foundation</a>. <br>
 Apache DataFu, DataFu, Apache Pig, Apache Hadoop, Hadoop, Apache, and the Apache feather logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and other countries.
 </div>
\ No newline at end of file

