datafu-commits mailing list archives

From e...@apache.org
Subject [datafu] branch spark-tmp updated: Fix for Scala/Python bridge to support Spark 2.4.1 - 2.4.3
Date Fri, 14 Jun 2019 19:35:22 GMT
This is an automated email from the ASF dual-hosted git repository.

eyal pushed a commit to branch spark-tmp
in repository https://gitbox.apache.org/repos/asf/datafu.git


The following commit(s) were added to refs/heads/spark-tmp by this push:
     new 642772a  Fix for Scala/Python bridge to support Spark 2.4.1 - 2.4.3
642772a is described below

commit 642772a2488b0051ab57c876577c9794b74a6c00
Author: Eyal Allweil <eyal@apache.org>
AuthorDate: Thu Jun 13 15:30:19 2019 +0300

    Fix for Scala/Python bridge to support Spark 2.4.1 - 2.4.3
---
 datafu-spark/README.md                             | 22 +++++++++++++++-------
 datafu-spark/build_and_test_spark.sh               |  8 ++++----
 .../spark/utils/overwrites/SparkPythonRunner.scala |  1 +
 3 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/datafu-spark/README.md b/datafu-spark/README.md
index cb5b7ed..429f136 100644
--- a/datafu-spark/README.md
+++ b/datafu-spark/README.md
@@ -4,15 +4,15 @@ datafu-spark contains a number of spark API's and a "Scala-Python bridge" that m
 
 Here are some examples of things you can do with it:
 
-* "Dedup" a table - remove duplicates based on a key and ordering (typically a date updated
field, to get only the mostly recently updated record).
+* ["Dedup" a table](https://github.com/apache/datafu/blob/spark-tmp/datafu-spark/src/main/scala/datafu/spark/SparkDFUtils.scala#L139)
- remove duplicates based on a key and ordering (typically a date updated field, to get only
the mostly recently updated record).
 
-* Join a table with a numeric field with a table with a range
+* [Join a table with a numeric field with a table with a range](https://github.com/apache/datafu/blob/spark-tmp/datafu-spark/src/main/scala/datafu/spark/SparkDFUtils.scala#L361)
 
-* Do a skewed join between tables (where the small table is still too big to fit in memory)
+* [Do a skewed join between tables](https://github.com/apache/datafu/blob/spark-tmp/datafu-spark/src/main/scala/datafu/spark/SparkDFUtils.scala#L274) (where the small table is still too big to fit in memory)
 
-* Count distinct up to - an efficient implementation when you just want to verify that a certain minimum of distinct rows appear in a table
+* [Count distinct up to](https://github.com/apache/datafu/blob/spark-tmp/datafu-spark/src/main/scala/datafu/spark/SparkUDAFs.scala#L224) - an efficient implementation when you just want to verify that a certain minimum of distinct rows appear in a table
 
-It has been tested on Spark releases from 2.1.0 to 2.4.0, using Scala 2.10 and 2.11. You can check if your Spark/Scala version combination has been tested by looking [here.](https://github.com/apache/datafu/blob/spark-tmp/datafu-spark/build_and_test_spark.sh#L20)
+It has been tested on Spark releases from 2.1.0 to 2.4.3, using Scala 2.10, 2.11 and 2.12. You can check if your Spark/Scala version combination has been tested by looking [here.](https://github.com/apache/datafu/blob/spark-tmp/datafu-spark/build_and_test_spark.sh#L20)
 
 -----------
 
@@ -23,7 +23,7 @@ First, call pyspark with the following parameters
 ```bash
 export PYTHONPATH=datafu-spark_2.11_2.3.0-1.5.0-SNAPSHOT.jar
 
-pyspark  --jars datafu-spark_2.11_2.3.0-1.5.0-SNAPSHOT.jar --conf spark.executorEnv.PYTHONPATH=datafu-spark_2.11_2.3.0-1.5.0-SNAPSHOT.jar
+pyspark --jars datafu-spark_2.11_2.3.0-1.5.0-SNAPSHOT.jar --conf spark.executorEnv.PYTHONPATH=datafu-spark_2.11_2.3.0-1.5.0-SNAPSHOT.jar
 ```
 
 The following is an example of calling the Spark version of the datafu _dedup_ method
@@ -68,7 +68,15 @@ This should produce the following output
 
 # Development
 
-Building and testing datafu-spark can be done as described in [the main DataFu README](https://github.com/apache/datafu/blob/master/README.md#developers).
+Building and testing datafu-spark can be done as described in [the main DataFu README.](https://github.com/apache/datafu/blob/master/README.md#developers)
+
+If you wish to build for a specific Scala/Spark version, there are two options. One is to change the *scalaVersion* and *sparkVersion* in [the main gradle.properties file.](https://github.com/apache/datafu/blob/spark-tmp/gradle.properties#L22)
+
+The other is to pass these parameters on the command line. For example, to build and test for Scala 2.12 and Spark 2.4.0, you would use
+
+```bash
+./gradlew :datafu-spark:test -PscalaVersion=2.12 -PsparkVersion=2.4.0
+```
 
 There is a [script](https://github.com/apache/datafu/tree/spark-tmp/datafu-spark/build_and_test_spark.sh) for building and testing datafu-spark across the multiple Scala/Spark combinations.
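As background for the dedup method referenced in the README changes above: the operation keeps only the most recently updated row per key. The snippet below is only an illustration of that idea using a plain Spark window function, with made-up column names; it is not the datafu-spark implementation, which the README links to directly.

```python
# Illustration only (not the datafu implementation): "dedup" expressed with a plain
# Spark window function. Assumes it runs inside the pyspark shell started as shown
# in the README, where `spark` is already defined.
from pyspark.sql import Window
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("a", "2019-01-01"), ("a", "2019-06-01"), ("b", "2019-03-15")],
    ["id", "date_updated"])

# Rank rows within each key by recency, then keep only the newest row per key.
w = Window.partitionBy("id").orderBy(F.col("date_updated").desc())
deduped = (df.withColumn("rn", F.row_number().over(w))
             .filter("rn = 1")
             .drop("rn"))
deduped.show()  # one row per id: the most recently updated record
```

Per the README, the datafu method itself becomes callable from pyspark once the datafu-spark jar is added to `--jars` and `PYTHONPATH` as shown in the usage section above.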
 
diff --git a/datafu-spark/build_and_test_spark.sh b/datafu-spark/build_and_test_spark.sh
index 180614e..0b876eb 100755
--- a/datafu-spark/build_and_test_spark.sh
+++ b/datafu-spark/build_and_test_spark.sh
@@ -18,12 +18,12 @@
 #!/bin/bash
 
 export SPARK_VERSIONS_FOR_SCALA_210="2.1.0 2.1.1 2.1.2 2.1.3 2.2.0 2.2.1 2.2.2"
-export SPARK_VERSIONS_FOR_SCALA_211="2.1.0 2.1.1 2.1.2 2.1.3 2.2.0 2.2.1 2.2.2 2.3.0 2.3.1 2.3.2 2.4.0"
-export SPARK_VERSIONS_FOR_SCALA_212="2.4.0"
+export SPARK_VERSIONS_FOR_SCALA_211="2.1.0 2.1.1 2.1.2 2.1.3 2.2.0 2.2.1 2.2.2 2.3.0 2.3.1 2.3.2 2.4.0 2.4.1 2.4.2 2.4.3"
+export SPARK_VERSIONS_FOR_SCALA_212="2.4.0 2.4.1 2.4.2 2.4.3"
 
 export LATEST_SPARK_VERSIONS_FOR_SCALA_210="2.1.3 2.2.2"
-export LATEST_SPARK_VERSIONS_FOR_SCALA_211="2.1.3 2.2.2 2.3.2 2.4.0"
-export LATEST_SPARK_VERSIONS_FOR_SCALA_212="2.4.0"
+export LATEST_SPARK_VERSIONS_FOR_SCALA_211="2.1.3 2.2.2 2.3.2 2.4.3"
+export LATEST_SPARK_VERSIONS_FOR_SCALA_212="2.4.3"
 
 STARTTIME=$(date +%s)
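For clarity, the lists changed above drive one build-and-test run per Scala/Spark combination. The following standalone snippet (not part of the script) simply expands the new matrix into the equivalent per-combination gradle commands shown in the README:

```python
# Illustration only: expand the Scala/Spark version matrix from build_and_test_spark.sh
# into the per-combination gradle test commands documented in the README.
spark_versions_for_scala = {
    "2.10": "2.1.0 2.1.1 2.1.2 2.1.3 2.2.0 2.2.1 2.2.2",
    "2.11": "2.1.0 2.1.1 2.1.2 2.1.3 2.2.0 2.2.1 2.2.2 2.3.0 2.3.1 2.3.2 2.4.0 2.4.1 2.4.2 2.4.3",
    "2.12": "2.4.0 2.4.1 2.4.2 2.4.3",
}
for scala_version, spark_versions in spark_versions_for_scala.items():
    for spark_version in spark_versions.split():
        print(f"./gradlew :datafu-spark:test "
              f"-PscalaVersion={scala_version} -PsparkVersion={spark_version}")
```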
 
diff --git a/datafu-spark/src/main/scala/spark/utils/overwrites/SparkPythonRunner.scala b/datafu-spark/src/main/scala/spark/utils/overwrites/SparkPythonRunner.scala
index 6345467..e86bed5 100644
--- a/datafu-spark/src/main/scala/spark/utils/overwrites/SparkPythonRunner.scala
+++ b/datafu-spark/src/main/scala/spark/utils/overwrites/SparkPythonRunner.scala
@@ -96,6 +96,7 @@ case class SparkPythonRunner(pyPaths: String,
     // This is equivalent to setting the -u flag; we use it because ipython doesn't support -u:
     env.put("PYTHONUNBUFFERED", "YES") // value is needed to be set to a non-empty string
     env.put("PYSPARK_GATEWAY_PORT", "" + gatewayServer.getListeningPort)
+    env.put("PYSPARK_ALLOW_INSECURE_GATEWAY", "1") // needed for Spark 2.4.1 and newer, will
stop working in Spark 3.x
     builder.redirectErrorStream(true) // Ugly but needed for stdout and stderr to synchronize
     val process = builder.start()
     val writer = new BufferedWriter(
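For context on the one-line change above: the runner exports the py4j gateway port to the Python subprocess, which then connects back to the JVM without an authentication secret. Per the commit comment, Spark 2.4.1 and newer require PYSPARK_ALLOW_INSECURE_GATEWAY to permit such a secret-less connection, and that escape hatch goes away in Spark 3.x. The sketch below shows roughly what a connect-back on the Python side looks like using standard py4j calls; it is illustrative only and may differ from datafu-spark's actual bundled scripts.

```python
# Illustrative sketch of the Python end of the bridge: connect back to the JVM's
# py4j gateway on the port the Scala runner exported via PYSPARK_GATEWAY_PORT.
import os
from py4j.java_gateway import JavaGateway, GatewayParameters

port = int(os.environ["PYSPARK_GATEWAY_PORT"])
gateway = JavaGateway(gateway_parameters=GatewayParameters(port=port, auto_convert=True))

# The gateway exposes the JVM; e.g. call into java.lang.System as a smoke test.
print(gateway.jvm.java.lang.System.getProperty("java.version"))
```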

