datafu-commits mailing list archives

From e...@apache.org
Subject [datafu] branch spark-tmp updated: Minor code review changes
Date Thu, 30 May 2019 13:55:12 GMT
This is an automated email from the ASF dual-hosted git repository.

eyal pushed a commit to branch spark-tmp
in repository https://gitbox.apache.org/repos/asf/datafu.git


The following commit(s) were added to refs/heads/spark-tmp by this push:
     new b56109e  Minor code review changes
b56109e is described below

commit b56109eb6a7e9a12129930a3a562e71f705efdc2
Author: Eyal Allweil <eyal@apache.org>
AuthorDate: Thu May 30 16:54:47 2019 +0300

    Minor code review changes
---
 datafu-spark/README.md                             |  18 +-
 datafu-spark/build.gradle                          |   2 -
 datafu-spark/build_and_test_spark.sh               |  16 +-
 datafu-spark/src/main/resources/META-INF/LICENSE   | 191 ---------------------
 datafu-spark/src/main/resources/META-INF/NOTICE    |  56 +-----
 .../test/resources/python_tests/df_utils_tests.py  |   2 +
 .../src/test/resources/python_tests/pyfromscala.py |   4 +-
 .../python_tests/pyfromscala_with_error.py         |   2 +
 8 files changed, 30 insertions(+), 261 deletions(-)

diff --git a/datafu-spark/README.md b/datafu-spark/README.md
index 55f8859..cb5b7ed 100644
--- a/datafu-spark/README.md
+++ b/datafu-spark/README.md
@@ -2,11 +2,21 @@
 
 datafu-spark contains a number of spark API's and a "Scala-Python bridge" that makes calling Scala code from Python, and vice-versa, easier.
 
-It has been tested on Spark releases from with 2.1.0 to 2.4.0, using Scala 2.10 and 2.11.
+Here are some examples of things you can do with it:
+
+* "Dedup" a table - remove duplicates based on a key and ordering (typically a date updated field, to get only the mostly recently updated record).
+
+* Join a table with a numeric field with a table with a range
+
+* Do a skewed join between tables (where the small table is still too big to fit in memory)
+
+* Count distinct up to - an efficient implementation when you just want to verify that a certain minimum of distinct rows appear in a table
+
+It has been tested on Spark releases from 2.1.0 to 2.4.0, using Scala 2.10 and 2.11. You can check if your Spark/Scala version combination has been tested by looking [here.](https://github.com/apache/datafu/blob/spark-tmp/datafu-spark/build_and_test_spark.sh#L20)
 
 -----------
 
-In order to call the spark-datafu API's from Pyspark, you can do the following (tested on a Hortonworks vm)
+In order to call the datafu-spark API's from Pyspark, you can do the following (tested on a Hortonworks vm)
 
 First, call pyspark with the following parameters
 
@@ -58,9 +68,9 @@ This should produce the following output
 
 # Development
 
-Building and testing spark-datafu can be done as described in the [the main DataFu README](https://github.com/apache/datafu/blob/master/README.md#developers).
+Building and testing datafu-spark can be done as described in the [the main DataFu README](https://github.com/apache/datafu/blob/master/README.md#developers).
 
-There is a [script](https://github.com/apache/datafu/tree/spark-tmp/datafu-spark/build_and_test_spark.sh) for building and testing spark-datafu across the multiple Scala/Spark combinations.
+There is a [script](https://github.com/apache/datafu/tree/spark-tmp/datafu-spark/build_and_test_spark.sh) for building and testing datafu-spark across the multiple Scala/Spark combinations.
 
 To see the available options run it like this:
 
diff --git a/datafu-spark/build.gradle b/datafu-spark/build.gradle
index d1897bb..2ac4522 100644
--- a/datafu-spark/build.gradle
+++ b/datafu-spark/build.gradle
@@ -39,8 +39,6 @@ allprojects {
 
 archivesBaseName = 'datafu-spark_' + scalaVersion + '_' + sparkVersion
 
-import groovy.xml.MarkupBuilder
-
 cleanEclipse {
   doLast {
     delete ".apt_generated"
diff --git a/datafu-spark/build_and_test_spark.sh b/datafu-spark/build_and_test_spark.sh
index 212fb8b..8d38b95 100755
--- a/datafu-spark/build_and_test_spark.sh
+++ b/datafu-spark/build_and_test_spark.sh
@@ -17,11 +17,11 @@
 
 #!/bin/bash
 
-export SPARKS_210="2.1.0 2.1.1 2.1.2 2.1.3 2.2.0 2.2.1 2.2.2"
-export SPARKS_211="2.1.0 2.1.1 2.1.2 2.1.3 2.2.0 2.2.1 2.2.2 2.3.0 2.3.1 2.3.2 2.4.0"
+export SPARK_VERSIONS_FOR_SCALA_210="2.1.0 2.1.1 2.1.2 2.1.3 2.2.0 2.2.1 2.2.2"
+export SPARK_VERSIONS_FOR_SCALA_211="2.1.0 2.1.1 2.1.2 2.1.3 2.2.0 2.2.1 2.2.2 2.3.0 2.3.1 2.3.2 2.4.0"
 
-export LATEST_SPARKS_210="2.1.3 2.2.2"
-export LATEST_SPARKS_211="2.1.3 2.2.2 2.3.2 2.4.0"
+export LATEST_SPARK_VERSIONS_FOR_SCALA_210="2.1.3 2.2.2"
+export LATEST_SPARK_VERSIONS_FOR_SCALA_211="2.1.3 2.2.2 2.3.2 2.4.0"
 
 # No Spark support for Scala 2.12 before 2.4.0, and no spark-testing-base yet
 export SPARKS_212="2.4.0" 
@@ -74,8 +74,8 @@ while getopts "l:j:t:hq" arg; do
                         TEST_PARAMS=$OPTARG
                         ;;
                 q)
-                        SPARKS_210=$LATEST_SPARKS_210
-                        SPARKS_211=$LATEST_SPARKS_211
+                        SPARK_VERSIONS_FOR_SCALA_210=$LATEST_SPARK_VERSIONS_FOR_SCALA_210
+                        SPARK_VERSIONS_FOR_SCALA_211=$LATEST_SPARK_VERSIONS_FOR_SCALA_211
                         ;;
                 h)
                         echo "Builds and tests datafu-spark in multiple Scala/Spark combinations"
@@ -100,12 +100,12 @@ if [[ $JARS_DIR != "NONE" ]]; then
 fi
 
 export scala=2.10
-for spark in $SPARKS_210; do
+for spark in $SPARK_VERSIONS_FOR_SCALA_210; do
 	build
 done
 
 export scala=2.11
-for spark in $SPARKS_211; do
+for spark in $SPARK_VERSIONS_FOR_SCALA_211; do
 	build
 done
 
diff --git a/datafu-spark/src/main/resources/META-INF/LICENSE b/datafu-spark/src/main/resources/META-INF/LICENSE
index 6bd634b..57bc88a 100644
--- a/datafu-spark/src/main/resources/META-INF/LICENSE
+++ b/datafu-spark/src/main/resources/META-INF/LICENSE
@@ -200,194 +200,3 @@
    See the License for the specific language governing permissions and
    limitations under the License.
 
-=======================================================================
-
-APACHE DATAFU-PIG SUBCOMPONENTS:
-
-The Apache DataFu datafu-pig-incubating JAR bundles the following
-Apache-2.0-licensed dependencies:
-
-* it.unimi.dsi:fastutil:6.5.7
-* org.apache.commons:commons-math:2.2
-* com.clearspring.analytics:stream:2.5.0
-* com.google.guava:guava:11.0.2
-* org.apache.opennlp:opennlp-tools:1.5.3
-* org.apache.opennlp:opennlp-uima:1.5.3
-* org.apache.opennlp:opennlp-maxent:3.0.3
-
-
-Contents copied from LICENSE.txt for org.apache.commons:commons-math:2.2:
-
-=======================================================================
-
-APACHE COMMONS MATH DERIVATIVE WORKS:
-
-The Apache commons-math library includes a number of subcomponents
-whose implementation is derived from original sources written
-in C or Fortran.  License terms of the original sources
-are reproduced below.
-
-===============================================================================
-For the lmder, lmpar and qrsolv Fortran routine from minpack and translated in
-the LevenbergMarquardtOptimizer class in package
-org.apache.commons.math.optimization.general
-Original source copyright and license statement:
-
-Minpack Copyright Notice (1999) University of Chicago.  All rights reserved
-
-Redistribution and use in source and binary forms, with or
-without modification, are permitted provided that the
-following conditions are met:
-
-1. Redistributions of source code must retain the above
-copyright notice, this list of conditions and the following
-disclaimer.
-
-2. Redistributions in binary form must reproduce the above
-copyright notice, this list of conditions and the following
-disclaimer in the documentation and/or other materials
-provided with the distribution.
-
-3. The end-user documentation included with the
-redistribution, if any, must include the following
-acknowledgment:
-
-   "This product includes software developed by the
-   University of Chicago, as Operator of Argonne National
-   Laboratory.
-
-Alternately, this acknowledgment may appear in the software
-itself, if and wherever such third-party acknowledgments
-normally appear.
-
-4. WARRANTY DISCLAIMER. THE SOFTWARE IS SUPPLIED "AS IS"
-WITHOUT WARRANTY OF ANY KIND. THE COPYRIGHT HOLDER, THE
-UNITED STATES, THE UNITED STATES DEPARTMENT OF ENERGY, AND
-THEIR EMPLOYEES: (1) DISCLAIM ANY WARRANTIES, EXPRESS OR
-IMPLIED, INCLUDING BUT NOT LIMITED TO ANY IMPLIED WARRANTIES
-OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE
-OR NON-INFRINGEMENT, (2) DO NOT ASSUME ANY LEGAL LIABILITY
-OR RESPONSIBILITY FOR THE ACCURACY, COMPLETENESS, OR
-USEFULNESS OF THE SOFTWARE, (3) DO NOT REPRESENT THAT USE OF
-THE SOFTWARE WOULD NOT INFRINGE PRIVATELY OWNED RIGHTS, (4)
-DO NOT WARRANT THAT THE SOFTWARE WILL FUNCTION
-UNINTERRUPTED, THAT IT IS ERROR-FREE OR THAT ANY ERRORS WILL
-BE CORRECTED.
-
-5. LIMITATION OF LIABILITY. IN NO EVENT WILL THE COPYRIGHT
-HOLDER, THE UNITED STATES, THE UNITED STATES DEPARTMENT OF
-ENERGY, OR THEIR EMPLOYEES: BE LIABLE FOR ANY INDIRECT,
-INCIDENTAL, CONSEQUENTIAL, SPECIAL OR PUNITIVE DAMAGES OF
-ANY KIND OR NATURE, INCLUDING BUT NOT LIMITED TO LOSS OF
-PROFITS OR LOSS OF DATA, FOR ANY REASON WHATSOEVER, WHETHER
-SUCH LIABILITY IS ASSERTED ON THE BASIS OF CONTRACT, TORT
-(INCLUDING NEGLIGENCE OR STRICT LIABILITY), OR OTHERWISE,
-EVEN IF ANY OF SAID PARTIES HAS BEEN WARNED OF THE
-POSSIBILITY OF SUCH LOSS OR DAMAGES.
-===============================================================================
-
-Copyright and license statement for the odex Fortran routine developed by
-E. Hairer and G. Wanner and translated in GraggBulirschStoerIntegrator class
-in package org.apache.commons.math.ode.nonstiff:
-
-
-Copyright (c) 2004, Ernst Hairer
-
-Redistribution and use in source and binary forms, with or without
-modification, are permitted provided that the following conditions are
-met:
-
-- Redistributions of source code must retain the above copyright
-notice, this list of conditions and the following disclaimer.
-
-- Redistributions in binary form must reproduce the above copyright
-notice, this list of conditions and the following disclaimer in the
-documentation and/or other materials provided with the distribution.
-
-THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
-IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
-TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR
-CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-===============================================================================
-
-Copyright and license statement for the original lapack fortran routines
-translated in EigenDecompositionImpl class in package
-org.apache.commons.math.linear:
-
-Copyright (c) 1992-2008 The University of Tennessee.  All rights reserved.
-
-$COPYRIGHT$
-
-Additional copyrights may follow
-
-$HEADER$
-
-Redistribution and use in source and binary forms, with or without
-modification, are permitted provided that the following conditions are
-met:
-
-- Redistributions of source code must retain the above copyright
-  notice, this list of conditions and the following disclaimer.
-
-- Redistributions in binary form must reproduce the above copyright
-  notice, this list of conditions and the following disclaimer listed
-  in this license in the documentation and/or other materials
-  provided with the distribution.
-
-- Neither the name of the copyright holders nor the names of its
-  contributors may be used to endorse or promote products derived from
-  this software without specific prior written permission.
-
-THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
-"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
-LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
-A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
-OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
-SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
-LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
-DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
-THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
-OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-===============================================================================
-
-Copyright and license statement for the original Mersenne twister C
-routines translated in MersenneTwister class in package
-org.apache.commons.math.random:
-
-   Copyright (C) 1997 - 2002, Makoto Matsumoto and Takuji Nishimura,
-   All rights reserved.
-
-   Redistribution and use in source and binary forms, with or without
-   modification, are permitted provided that the following conditions
-   are met:
-
-     1. Redistributions of source code must retain the above copyright
-        notice, this list of conditions and the following disclaimer.
-
-     2. Redistributions in binary form must reproduce the above copyright
-        notice, this list of conditions and the following disclaimer in the
-        documentation and/or other materials provided with the distribution.
-
-     3. The names of its contributors may not be used to endorse or promote
-        products derived from this software without specific prior written
-        permission.
-
-   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
-   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
-   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
-   A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
-   CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
diff --git a/datafu-spark/src/main/resources/META-INF/NOTICE b/datafu-spark/src/main/resources/META-INF/NOTICE
index 18fba06..123f612 100644
--- a/datafu-spark/src/main/resources/META-INF/NOTICE
+++ b/datafu-spark/src/main/resources/META-INF/NOTICE
@@ -1,60 +1,6 @@
 Apache DataFu
-Copyright 2010-2017 The Apache Software Foundation
+Copyright 2010-2018 The Apache Software Foundation
 
 This product includes software developed at
 The Apache Software Foundation (http://www.apache.org/).
 
-
-Contents copied from NOTICE.txt for org.apache.commons:commons-math:2.2:
-
-===============================================================================
-
-The BracketFinder (package org.apache.commons.math.optimization.univariate)
-and PowellOptimizer (package org.apache.commons.math.optimization.general)
-classes are based on the Python code in module "optimize.py" (version 0.5)
-developed by Travis E. Oliphant for the SciPy library (http://www.scipy.org/)
-Copyright © 2003-2009 SciPy Developers.
-===============================================================================
-
-The LinearConstraint, LinearObjectiveFunction, LinearOptimizer,
-RelationShip, SimplexSolver and SimplexTableau classes in package
-org.apache.commons.math.optimization.linear include software developed by
-Benjamin McCann (http://www.benmccann.com) and distributed with
-the following copyright: Copyright 2009 Google Inc.
-===============================================================================
-
-This product includes software developed by the
-University of Chicago, as Operator of Argonne National
-Laboratory.
-The LevenbergMarquardtOptimizer class in package
-org.apache.commons.math.optimization.general includes software
-translated from the lmder, lmpar and qrsolv Fortran routines
-from the Minpack package
-Minpack Copyright Notice (1999) University of Chicago.  All rights reserved
-===============================================================================
-
-The GraggBulirschStoerIntegrator class in package
-org.apache.commons.math.ode.nonstiff includes software translated
-from the odex Fortran routine developed by E. Hairer and G. Wanner.
-Original source copyright:
-Copyright (c) 2004, Ernst Hairer
-===============================================================================
-
-The EigenDecompositionImpl class in package
-org.apache.commons.math.linear includes software translated
-from some LAPACK Fortran routines.  Original source copyright:
-Copyright (c) 1992-2008 The University of Tennessee.  All rights reserved.
-===============================================================================
-
-The MersenneTwister class in package org.apache.commons.math.random
-includes software translated from the 2002-01-26 version of
-the Mersenne-Twister generator written in C by Makoto Matsumoto and Takuji
-Nishimura. Original source copyright:
-Copyright (C) 1997 - 2002, Makoto Matsumoto and Takuji Nishimura,
-All rights reserved
-===============================================================================
-
-The complete text of licenses and disclaimers associated with the the original
-sources enumerated above at the time of code translation are in the
-LICENSE.txt [copied into LICENSE for Apache DataFu] file.
-
diff --git a/datafu-spark/src/test/resources/python_tests/df_utils_tests.py b/datafu-spark/src/test/resources/python_tests/df_utils_tests.py
index d393a87..c33a88f 100644
--- a/datafu-spark/src/test/resources/python_tests/df_utils_tests.py
+++ b/datafu-spark/src/test/resources/python_tests/df_utils_tests.py
@@ -15,6 +15,8 @@
 # specific language governing permissions and limitations
 # under the License.
 
+# This file is used by the datafu-spark unit tests
+
 import os
 import sys
 from pprint import pprint as p
diff --git a/datafu-spark/src/test/resources/python_tests/pyfromscala.py b/datafu-spark/src/test/resources/python_tests/pyfromscala.py
index 73939b6..3162ff4 100644
--- a/datafu-spark/src/test/resources/python_tests/pyfromscala.py
+++ b/datafu-spark/src/test/resources/python_tests/pyfromscala.py
@@ -15,7 +15,9 @@
 # specific language governing permissions and limitations
 # under the License.
 
-# Some usage examples of python-Scala functionality
+# Some examples of cross python-Scala functionality
+# This file is used by the datafu-spark unit tests
+
 
 # print the PYTHONPATH
 import sys
diff --git a/datafu-spark/src/test/resources/python_tests/pyfromscala_with_error.py b/datafu-spark/src/test/resources/python_tests/pyfromscala_with_error.py
index fca4138..d784662 100644
--- a/datafu-spark/src/test/resources/python_tests/pyfromscala_with_error.py
+++ b/datafu-spark/src/test/resources/python_tests/pyfromscala_with_error.py
@@ -15,4 +15,6 @@
 # specific language governing permissions and limitations
 # under the License.
 
+# This file is used by the datafu-spark unit tests
+
 sqlContext.sql("select * from edw.table_not_exists")
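
The "Dedup" operation that the README change in this commit describes (remove duplicates based on a key and an ordering, keeping the most recently updated record) can be illustrated without Spark. What follows is a minimal plain-Python sketch of the semantics only. The function name `dedup_latest` and the sample rows are hypothetical and are not part of the datafu-spark API.

```python
from operator import itemgetter


def dedup_latest(rows, key, order_by):
    """Keep, for each distinct value of `key`, the row with the largest `order_by` value.

    Hypothetical helper for illustration; not a datafu-spark function.
    """
    latest = {}
    for row in rows:
        k = row[key]
        # Replace the stored row only when this one sorts later on `order_by`.
        if k not in latest or row[order_by] > latest[k][order_by]:
            latest[k] = row
    # Sort by key so the output order is deterministic.
    return sorted(latest.values(), key=itemgetter(key))


# Hypothetical sample data: two versions of id 1, one version of id 2.
rows = [
    {"id": 1, "updated": "2019-05-01", "value": "a"},
    {"id": 1, "updated": "2019-05-30", "value": "b"},
    {"id": 2, "updated": "2019-04-12", "value": "c"},
]

result = dedup_latest(rows, key="id", order_by="updated")
```

The ISO-formatted date strings compare correctly as plain strings, which is why the sketch needs no date parsing. A distributed implementation would differ in mechanics but should yield the same one-row-per-key result.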

