From Maryann Xue <maryann....@databricks.com>
Subject [OSS DIGEST] The major changes of Apache Spark from Mar 11 to Mar 24
Date Tue, 21 Apr 2020 04:00:32 GMT
Hi all,

This is the bi-weekly Apache Spark digest from the Databricks OSS team.
For each API/configuration/behavior change, there will be an *[API]* tag in
the title.

CORE
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-18-~-Mar-24,-2020#api70spark-30667core-change-barriertaskcontext-allgather-method-return-type-4--4>[API][3.0][SPARK-30667][CORE]
Change BarrierTaskContext allGather method return type (+4, -4)>
<https://github.com/apache/spark/commit/6fd3138e9c3c1d703f2a66bcdf17555803547774>

This PR changes the return type of the BarrierTaskContext.allGather method
to Array[String] instead of ArrayBuffer[String] since it should be
immutable.
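
A minimal sketch of the API in barrier mode, assuming a SparkSession named
spark with enough slots for the barrier stage:

import org.apache.spark.BarrierTaskContext

// allGather now returns an immutable Array[String] instead of ArrayBuffer[String]
val rdd = spark.sparkContext.parallelize(1 to 4, 2)
val gathered: Array[String] = rdd.barrier().mapPartitions { iter =>
  val ctx = BarrierTaskContext.get()
  val messages: Array[String] = ctx.allGather(s"partition-${ctx.partitionId()}")
  Iterator(messages.mkString(","))
}.collect()
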
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-18-~-Mar-24,-2020#sql>
SQL
[2.4][SPARK-31163][SQL] TruncateTableCommand with acl/permission should
handle non-existent path (+22, -1)>
<https://github.com/apache/spark/commit/cb26f636b08aea4c5c6bf5035a359cd3cbf335c0>

SPARK-30312 <https://issues.apache.org/jira/browse/SPARK-30312> added the
feature to preserve path permission & ACL when truncating a table. Before
SPARK-30312 <https://issues.apache.org/jira/browse/SPARK-30312>, the
Truncate Table command could be executed successfully even if the
(partition/table) path did not exist. However, after SPARK-30312
<https://issues.apache.org/jira/browse/SPARK-30312>, the Truncate Table
command fails if the path does not exist. This PR fixes the behavior
change by correctly handling non-existent paths.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-11-~-Mar-17,-2020#70spark-31030sql-backward-compatibility-for-parsing-and-formatting-datetime-341--66>[3.0][SPARK-31030][SQL]
Backward Compatibility for Parsing and Formatting Datetime (+341, -66)>
<https://github.com/apache/spark/commit/3493162c78822de0563f7d736040aebdb81e936b>

In Spark 2.4 and earlier, datetime parsing, formatting and conversion are
performed by using the hybrid calendar (Julian + Gregorian). Spark 3.0 has
switched to the Proleptic Gregorian calendar as it is the de-facto calendar,
and the implementation depends on Java 8 APIs. However, the calendar switch
causes breaking changes: some patterns are not compatible between the Java
8 and Java 7 time APIs. Therefore, this PR introduces Spark's own pattern
definition rather than depending on the patterns of the Java time API. To
keep backward compatibility, this PR shadows the incompatible letters.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-11-~-Mar-17,-2020#api70spark-30958sql-do-not-set-default-era-for-datetimeformatter-53--6>[API][3.0][SPARK-30958][SQL]
Do not set default era for DateTimeFormatter (+53, -6)>
<https://github.com/apache/spark/commit/50a29672e061ca17598e657593db45f55aebc34b>

By default, Spark uses "uuuu" as the year pattern, which already indicates
the era. If we also set a default era, the two can conflict and fail the
parsing. Besides, we replace "y" with "u" if there is no "G". So the era is
either explicitly specified (e.g. "yyyy G") or can be inferred from the
year (e.g. "uuuu").

After the change, Spark now can parse date/timestamp with negative year via
the "yyyy" pattern, which will be converted to "uuuu" under the hood.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-11-~-Mar-17,-2020#70spark-31076sql-convert-catalysts-datetimestamp-to-java-datetimestamp-via-local-date-time-102--15>[3.0][SPARK-31076][SQL]
Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local
date-time (+102, -15)>
<https://github.com/apache/spark/commit/3d3e366aa836cb7d2295f54e78e544c7b15c9c08>

This PR changes the conversion of java.sql.Timestamp/Date values to/from
internal values of Catalyst's TimestampType/DateType before the cutover day
1582-10-15 of the Gregorian calendar. It constructs the local date-time from
microseconds/days since the epoch, takes each date-time component (year,
month, day, hour, minute, second and second fraction), and constructs
java.sql.Timestamp/Date using the extracted components.

Before this change:

scala> sql("select date '1100-10-10'").collect()
res0: Array[org.apache.spark.sql.Row] = Array([1100-10-03])

After the changes:

scala> sql("select date '1100-10-10'").collect()
res1: Array[org.apache.spark.sql.Row] = Array([1100-10-10])

<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-11-~-Mar-17,-2020#70spark-31116sql-fix-nested-schema-case-sensitivity-in-parquetrowconverter-50--2>[3.0][SPARK-31116][SQL]
Fix nested schema case-sensitivity in ParquetRowConverter (+50, -2)>
<https://github.com/apache/spark/commit/e736c62764137b2c3af90d2dc8a77e391891200a>

This PR added a case-sensitivity parameter to ParquetRowConverter so that
it can resolve fields of the materialized Parquet schema properly with
respect to case sensitivity. Without the fix, the following example throws
IllegalArgumentException in the case-insensitive mode.

import org.apache.spark.sql.types.{LongType, StructType}

val path = "/some/temp/path"

// write a struct column whose fields use mixed-case names
spark
  .range(1L)
  .selectExpr("NAMED_STRUCT('lowercase', id, 'camelCase', id + 1) AS StructColumn")
  .write.parquet(path)

// read it back with a user schema whose field names differ only in case
val caseInsensitiveSchema = new StructType()
  .add(
    "StructColumn",
    new StructType()
      .add("LowerCase", LongType)
      .add("camelcase", LongType))

spark.read.schema(caseInsensitiveSchema).parquet(path).show()

<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-11-~-Mar-17,-2020#70spark-31124sql-change-the-default-value-of-minpartitionnum-in-aqe-20--10>[3.0][SPARK-31124][SQL]
Change the default value of minPartitionNum in AQE (+20, -10)>
<https://github.com/apache/spark/commit/77c49cb702862a0c60733dba797201ada2f5b51a>

Adaptive Query Execution (AQE) has a perf regression with the default
settings: if we coalesce the shuffle partitions into one or a few partitions,
we may leave many CPU cores idle, and the perf is worse than with AQE off
(which leverages all CPU cores).

We should try to avoid any perf regression in AQE with the default settings
as much as possible. This PR changes the default value of minPartitionNum
when coalescing shuffle partitions to SparkContext.defaultParallelism,
so that AQE can leverage all the CPU cores.
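
A hedged sketch of overriding the floor explicitly; the config key below is
the Spark 3.0 name of this setting and should be treated as an assumption:

// if unset, minPartitionNum now defaults to SparkContext.defaultParallelism;
// the explicit override below is only illustrative
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.minPartitionNum", "8")
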
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-11-~-Mar-17,-2020#api70spark-31146sql-leverage-the-helper-method-for-aliasing-in-built-in-sql-expressions-70--55>[API][3.0][SPARK-31146][SQL]
Leverage the helper method for aliasing in built-in SQL expressions (+70,
-55)>
<https://github.com/apache/spark/commit/6704103499d2003b1879ff0b4b8e29141e401b9f>

This PR leverages the helper method for aliasing in built-in SQL
expressions to use the alias as the output column name where it's
applicable.

   - Expression, UnaryMathExpression and BinaryMathExpression search for the
   alias in the tags by default.
   - When the naming is different in its implementation, it has to be
   overridden for the expression specifically, e.g.,
   CallMethodViaReflection, Remainder, CurrentTimestamp, FormatString and
   XPathDouble.

This PR fixes the automatically generated aliases of the functions below:

   Class                     Alias
   Rand                      random
   Ceil                      ceiling
   Remainder                 mod
   Pow                       pow
   Signum                    sign
   Chr                       char
   Length                    char_length
   Length                    character_length
   FormatString              printf
   Substring                 substr
   Upper                     ucase
   XPathDouble               xpath_number
   DayOfMonth                day
   CurrentTimestamp          now
   Size                      cardinality
   Sha1                      sha
   CallMethodViaReflection   java_method

Note: the EqualTo aliases = and == were excluded because this helper method
cannot be leveraged for them; fixing them would require a parser change.

Note: this PR also excludes some instances, such as ToDegrees, ToRadians,
UnaryMinus and UnaryPositive, that need an explicit name override, to keep
the scope of this PR smaller.

This might change the output schema of queries: users will see the new
output column names if they do not manually specify aliases.
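
A quick way to observe the generated name, assuming a SparkSession named
spark; the exact column string depends on the expression, so the comment is
only illustrative:

// without a manual alias, the auto-generated column name now reflects the
// alias the function was called by (e.g. "now()") rather than the class's
// default name
spark.sql("SELECT now()").columns.foreach(println)
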
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-11-~-Mar-17,-2020#api70spark-31150sql-parsing-seconds-fraction-with-variable-length-for-timestamp-555--270>[API][3.0][SPARK-31150][SQL]
Parsing seconds fraction with variable length for timestamp (+555, -270)>
<https://github.com/apache/spark/commit/0946a9514f56565c78b0555383c1ece14aaf2b7b>

This PR supports parsing timestamp values with variable-length second
fraction parts, like Spark 2.4.

E.g., 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]' can parse timestamps with second
fractions of 0~6 digits, but returns NULL when the fraction has 7 or more
digits:

-- for 0 ~ 6 fraction digits
select to_timestamp(v, 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]')
from values
 ('2019-10-06 10:11:12.'),
 ('2019-10-06 10:11:12.0'),
 ('2019-10-06 10:11:12.1'),
 ('2019-10-06 10:11:12.12'),
 ('2019-10-06 10:11:12.123UTC'),
 ('2019-10-06 10:11:12.1234'),
 ('2019-10-06 10:11:12.12345CST'),
 ('2019-10-06 10:11:12.123456PST') t(v)

2019-10-06 03:11:12.123
2019-10-06 08:11:12.12345
2019-10-06 10:11:12
2019-10-06 10:11:12
2019-10-06 10:11:12.1
2019-10-06 10:11:12.12
2019-10-06 10:11:12.1234
2019-10-06 10:11:12.123456

-- for >= 7 fraction digits
select to_timestamp('2019-10-06 10:11:12.1234567PST',
  'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]')

NULL

<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-11-~-Mar-17,-2020#api70spark-31171sql-sizenull-should-return-null-under-ansi-mode-24--3>[API][3.0][SPARK-31171][SQL]
size(null) should return null under ansi mode (+24, -3)>
<https://github.com/apache/spark/commit/dc5ebc2d5b8122121d89a9175737bea95ae10126>

PR #27834 <https://github.com/apache/spark/pull/27834> changed the
result of size(null) to -1 to match the Spark 2.4 behavior and avoid
breaking changes. However, the "return -1" behavior is error-prone when
used with aggregate functions.

The current ANSI mode controls a bunch of "better behaviors" like failing
on overflow. We don't enable these "better behaviors" by default because
they break the previous behaviors. The "return null" behavior of size(null) is
a good fit for the ANSI mode. This PR makes size(null) return null under
ANSI mode, regardless of the spark.sql.legacy.sizeOfNull config.
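
A minimal sketch of the behavior, assuming a SparkSession named spark:

// with ANSI mode on, size(null) yields NULL regardless of the legacy config;
// with ANSI mode off it still follows spark.sql.legacy.sizeOfNull (-1 by default)
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT size(null)").show()
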
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-11-~-Mar-17,-2020#api80spark-30874sql-support-postgres-kerberos-login-in-jdbc-connector-663--58>[API][3.1][SPARK-30874][SQL]
Support Postgres Kerberos login in JDBC connector (+663, -58)>
<https://github.com/apache/spark/commit/231e65092fa97516e30c4ef12e635bfe3e97c7f0>

When loading DataFrames from a JDBC datasource with Kerberos authentication,
remote executors (yarn-client/cluster etc. modes) fail to establish a
connection due to the lack of a Kerberos ticket or the ability to generate it.

This is a real issue when trying to ingest data from kerberized data
sources (SQL Server, Oracle) in an enterprise environment where exposing
simple authentication access is not an option due to IT policy issues.

This PR added Postgres support (other supported databases will come in
later PRs). The newly added JDBC options (see the sketch after this list):

   - *keytab* : Location of the Kerberos keytab file (which must be
   pre-uploaded to all nodes, either by the --files option of spark-submit or
   manually) for the JDBC client. When path information is found, Spark
   considers the keytab to be distributed manually; otherwise --files is
   assumed. If both keytab and principal are defined, Spark tries to do
   Kerberos authentication.
   - *principal* : Specifies the Kerberos principal name for the JDBC client.
   If both keytab and principal are defined, Spark tries to do Kerberos
   authentication.

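A minimal sketch of these options in use, assuming a kerberized PostgreSQL
instance; the URL, table name, principal and keytab path are placeholders:

// the keytab must already be distributed to all nodes (e.g. via --files)
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")
  .option("dbtable", "public.my_table")
  .option("keytab", "/path/to/client.keytab")
  .option("principal", "client@EXAMPLE.COM")
  .load()
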
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-11-~-Mar-17,-2020#80spark-31071sql-allow-annotating-non-null-fields-when-encoding-java-beans-136--5>[3.1][SPARK-31071][SQL]
Allow annotating non-null fields when encoding Java Beans (+136, -5)>
<https://github.com/apache/spark/commit/15557a7d05d9a12dffcf82daa219c0cdb9cba39e>

When encoding Java Beans to a Spark DataFrame, non-primitive types are
encoded as nullable fields by default. Although this works for most cases, it
can still be an issue. For example, when saving a DataFrame in the Avro
format with a non-Spark-generated Avro schema containing a non-null field,
Spark would assume that the field is nullable. This assumption conflicts
with the Avro schema semantics and an exception is thrown.

This PR proposes to respect javax.annotation.Nonnull and produce the
non-null fields when encoding Java Beans to Spark DataFrame.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-11-~-Mar-17,-2020#70spark-31090spark-25457-revert-integraldivide-returns-data-type-of-the-operands-110--263>[3.0][SPARK-31090][SPARK-25457]
Revert "IntegralDivide returns data type of the operands" (+110, -263)>
<https://github.com/apache/spark/commit/b27b3c91f113ec49ee87c877dac0a602849d76b1>

There is no standard requiring that div must return the type of the operands,
and always returning the long type looks fine. This is kind of a cosmetic
change, and we should avoid it if it breaks existing queries. Thus, the
original commit got reverted.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-18-~-Mar-24,-2020#70revert-spark-31170sql-spark-sql-cli-should-respect-hive-sitexml-and-sparksqlwarehousedir-42--55>[3.0]
Revert "[SPARK-31170][SQL] Spark SQL Cli should respect hive-site.xml and
spark.sql.warehouse.dir" (+42, -55)>
<https://github.com/apache/spark/commit/c6a6d5e006d2bdd794581464a64746c4f03abff5>

This PR reverts commit 5bc0d76.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-18-~-Mar-24,-2020#api70spark-25121sql-supports-multi-part-relation-names-for-join-strategy-hint-resolution-230--27>[API][3.0][SPARK-25121][SQL]
Supports multi-part relation names for join strategy hint resolution (+230,
-27)>
<https://github.com/apache/spark/commit/ca499e94091ae62a6ee76ea779d7b2b4cf2dbc5c>

This PR adds support for multi-part relation names specified in join
strategy hints. Before this PR, the database name in a multi-part relation
name in SQL hints was ignored.
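
A hedged sketch, assuming a SparkSession named spark; the database and table
names are hypothetical:

// the database-qualified name in the hint is now resolved instead of ignored
spark.sql("""
  SELECT /*+ BROADCAST(db.t1) */ *
  FROM db.t1 JOIN db.t2 ON t1.id = t2.id
""")
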
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-18-~-Mar-24,-2020#api70spark-30127sql-support-case-class-parameter-for-typed-scala-udf-519--410>[API][3.0][SPARK-30127][SQL]
Support case class parameter for typed Scala UDF (+519, -410)>
<https://github.com/apache/spark/commit/f6ff7d0cf8c0e562f3b086180d5418e6996055bb>

This PR adds Scala case class to the allowed data types for typed Scala
UDFs. For example, users can now write:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.udf
import spark.implicits._  // needed for toDF outside spark-shell

case class TestData(key: Int, value: String)
val f = (d: TestData) => d.key * d.value.toInt
val myUdf = udf(f)
val df = Seq(("data", TestData(50, "2"))).toDF("col1", "col2")
// checkAnswer is a Spark test helper; outside tests the equivalent is:
df.select(myUdf(Column("col2"))).show()  // 100

<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-18-~-Mar-24,-2020#api70spark-30292sqlfollowup-ansi-cast-from-strings-to-integral-numbers-byteshortintlong-should-fail-with-fraction-16--10>[API][3.0][SPARK-30292][SQL][FOLLOWUP]
ansi cast from strings to integral numbers (byte/short/int/long) should
fail with fraction (+16, -10)>
<https://github.com/apache/spark/commit/ac262cb27255f989f6a6dd864bd5114a928b96da>

This PR makes ANSI cast fail when casting floating-point number strings
(e.g., "1.23") to integral types. This is a follow-up of #26933.
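
A minimal sketch of the new behavior, assuming a SparkSession named spark:

// under ANSI mode, casting a fractional string to an integral type raises an
// error instead of returning a truncated value
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT CAST('1.23' AS int)").show()
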
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-18-~-Mar-24,-2020#66spark-30494sql-fix-cached-data-leakage-during-replacing-an-existing-view-68--12>[2.4][SPARK-30494][SQL]
Fix cached data leakage during replacing an existing view (+68, -12)>
<https://github.com/apache/spark/commit/929b794e25ff5454dadde7da304e6df25526d60e>

This PR calls "uncache" on views that have been replaced to make sure the
memory is released for an obsolete view.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-18-~-Mar-24,-2020#api70spark-31119sql-add-interval-value-support-for-extract-expression-as-extract-source-484--449>[API][3.0][SPARK-31119][SQL]
Add interval value support for extract expression as extract source (+484,
-449)>
<https://github.com/apache/spark/commit/f1d27cdd91d0d5d107add662f7c35fc4a5dd278c>

This PR adds support for INTERVAL values as the extract source in the
extract expression, in addition to DATETIME values. Now the extract grammar
is:

<extract expression> ::= EXTRACT <left paren> <extract field> FROM
<extract source> <right paren>

<extract source> ::= <datetime value expression> | <interval value expression>

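A minimal sketch, assuming a SparkSession named spark; the interval literal
is illustrative:

// extract a field from an INTERVAL value, not just from a datetime
spark.sql("SELECT EXTRACT(MINUTE FROM interval 1 hours 59 minutes)").show()  // 59
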
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-18-~-Mar-24,-2020#api70spark-31159sql-rebase-datetimestamp-fromto-julian-calendar-in-parquet-343--17>[API][3.0][SPARK-31159][SQL]
Rebase date/timestamp from/to Julian calendar in parquet (+343, -17)>
<https://github.com/apache/spark/commit/bb295d80e37440e6ee67a00bc09df2e9ff6e4e46>

The PR addresses the issue of compatibility with Spark 2.4 and earlier
versions in reading/writing date and timestamp values via the Parquet
datasource. Previous releases are based on a hybrid calendar - Julian +
Gregorian. Since Spark 3.0, the Proleptic Gregorian calendar is used by
default, see SPARK-26651. In particular, the issue pops up for
date/timestamp values before 1582-10-15, when the hybrid calendar switches
between Gregorian and Julian. The same local date in different calendars is
converted to a different number of days since the epoch
1970-01-01. For example, the date 1001-01-01 is converted to:

   - -719164 in Julian calendar. Spark 2.4 saves the number as a value of
   DATE type into parquet.
   - -719162 in Proleptic Gregorian calendar. Spark 3.0 saves the number as
   a date value.

This PR adds a SQL config for the option of enabling the backward
compatible behavior, which rebases from/to Proleptic Gregorian calendar to
the hybrid one:

spark.sql.legacy.parquet.rebaseDateTime.enabled

The default config value is false.
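
A minimal sketch of enabling the legacy behavior when reading old files,
assuming a SparkSession named spark; the path is a placeholder:

// rebase dates/timestamps between the Proleptic Gregorian and hybrid calendars
spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", "true")
spark.read.parquet("/path/written/by/spark-2.4").show()
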
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-18-~-Mar-24,-2020#api70spark-31183sql-rebase-datetimestamp-fromto-julian-calendar-in-avro-159--15>[API][3.0][SPARK-31183][SQL]
Rebase date/timestamp from/to Julian calendar in Avro (+159, -15)>
<https://github.com/apache/spark/commit/4766a3664729644b5391b13805cdee44501025d8>

The PR addresses the issue of compatibility with Spark 2.4 and earlier
version in reading/writing date and timestamp values via Avro datasource.
Previous releases are based on a hybrid calendar - Julian + Gregorian.
Since Spark 3.0, Proleptic Gregorian calendar is used by default, see
SPARK-26651. In particular, the issue pops up for date/timestamp values
before 1582-10-15 when the hybrid calendar switches from/to Gregorian
to/from Julian calendar. The same local date in different calendar is
converted to different number of days since the epoch 1970-01-01. For
example, the 1001-01-01 date is converted to:

   - -719164 in Julian calendar. Spark 2.4 saves the number as a value of
   DATE type into Avro files.
   - -719162 in Proleptic Gregorian calendar. Spark 3.0 saves the number as
   a date value.

This PR adds a SQL config for the option of enabling the backward
compatible behavior, which rebases from/to Proleptic Gregorian calendar to
the hybrid one:

spark.sql.legacy.avro.rebaseDateTime.enabled

The default config value is false.
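
Similarly, a minimal sketch for Avro, assuming a SparkSession named spark and
the spark-avro package on the classpath; the path is a placeholder:

spark.conf.set("spark.sql.legacy.avro.rebaseDateTime.enabled", "true")
spark.read.format("avro").load("/path/written/by/spark-2.4").show()
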
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-18-~-Mar-24,-2020#api70spark-31176sql-remove-support-for-ec-as-datetime-pattern-character-141--57>[API][3.0][SPARK-31176][SQL]
Remove support for 'e'/'c' as datetime pattern character (+141, -57)>
<https://github.com/apache/spark/commit/57fcc49306296b474c5b8b685ad13082f9b50a49>

This PR removes support for 'e'/'c' as datetime pattern characters to avoid
ambiguity. Besides, it also fixes a bug in convertIncompatiblePattern in
which a single quote (') would be lost if used as the last character.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-18-~-Mar-24,-2020#70spark-31178sql-prevent-v2-exec-nodes-from-executing-multiple-times-112--25>[3.0][SPARK-31178][SQL]
Prevent V2 exec nodes from executing multiple times (+112, -25)>
<https://github.com/apache/spark/commit/4237251861c79f3176de7cf5232f0388ec5d946e>

This PR extends V2CommandExec for all the data writing commands so that
they only get executed once over multiple collect() calls.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-18-~-Mar-24,-2020#70spark-31190sql-scalareflection-should-not-erasure-user-defined-anyval-type-39--15>[3.0][SPARK-31190][SQL]
ScalaReflection should not erasure user defined AnyVal type (+39, -15)>
<https://github.com/apache/spark/commit/5c4d44bb83aafc296265524ec1ee09bcd11365ae>

Improve ScalaReflection to avoid erasure for user-defined AnyVal types only,
while keeping it for other types, e.g. Any.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-18-~-Mar-24,-2020#api70spark-31205sql-support-string-literal-as-the-second-argument-of-date_adddate_sub-functions-118--9>[API][3.0][SPARK-31205][SQL]
Support string literal as the second argument of date_add/date_sub
functions (+118, -9)>
<https://github.com/apache/spark/commit/1d0f54951ea66cfbfc712300ed04f6d848b2fd5a>

This PR adds support for a String literal value as the second parameter of
functions date_add and date_sub by enclosing the string parameter in an
ansi_cast function and failing the query at compile-time in case of invalid
string values.
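
A minimal sketch of the new form, assuming a SparkSession named spark:

// the string second argument is wrapped in an ansi_cast; per this PR, invalid
// string values fail the query at compile time
spark.sql("SELECT date_add(date'2020-01-01', '1')").show()  // 2020-01-02
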
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-18-~-Mar-24,-2020#api70spark-31211sql-fix-rebasing-of-29-february-of-julian-leap-years-77--19>[API][3.0][SPARK-31211][SQL]
Fix rebasing of 29 February of Julian leap years (+77, -19)>
<https://github.com/apache/spark/commit/db6247faa8780bca8f8d3ba71b568ea63b162973>

This PR fixes the issue of rebasing leap years in Julian calendar to
Proleptic Gregorian calendar in which the years are not leap years. For
example, before this PR, a java.time.DateTimeException would be raised
while loading the date 1000-02-29 from parquet files saved by Spark 2.4.5;
after this PR, the date can be resolved as 1000-03-01 when turning on
spark.sql.legacy.parquet.rebaseDateTime.enabled.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-18-~-Mar-24,-2020#api70spark-31221sql-rebase-any-date-times-in-conversions-tofrom-java-types-109--100>[API][3.0][SPARK-31221][SQL]
Rebase any date-times in conversions to/from Java types (+109, -100)>
<https://github.com/apache/spark/commit/1fd4607d844905f10fe1edbfd6282c25b95b233d>

This PR applies rebasing to all dates/timestamps in the conversion functions
fromJavaDate(), toJavaDate(), toJavaTimestamp() and fromJavaTimestamp().
The rebasing is performed by building a local date-time in the original
calendar, extracting date-time fields from the result, and creating a new
local date-time in the target calendar.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-18-~-Mar-24,-2020#80spark-31184sql-support-gettablesbytype-api-of-hive-client-92--16>[3.1][SPARK-31184][SQL]
Support getTablesByType API of Hive Client (+92, -16)>
<https://github.com/apache/spark/commit/3a48ea1fe0fb85253f12d86caea01ffcb7e678d0>

This PR adds getTablesByType in HiveShim. For those Hive versions that do
not support this API, UnsupportedOperationException will be thrown, and the
upper-level logic will catch the exception and fall back to the "getTables +
getTablesByName + filter with type" solution.
SS
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-11-~-Mar-17,-2020#70spark-31126ss-upgrade-kafka-to-241-1--1>[3.0][SPARK-31126][SS]
Upgrade Kafka to 2.4.1 (+1, -1)>
<https://github.com/apache/spark/commit/614323d326db192540c955b4fa9b3b7af7527001>

Upgrade Kafka library from 2.4.0 to 2.4.1 to fix a client-side bug
KAFKA-8933 <https://issues.apache.org/jira/browse/KAFKA-8933>. See the
full release
notes <https://downloads.apache.org/kafka/2.4.1/RELEASE_NOTES.html>.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-11-~-Mar-17,-2020#python>
PYTHON
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-11-~-Mar-17,-2020#api80spark-30569sqlpysparksparkr-add-percentile_approx-dsl-functions-176--1>[API][3.1][SPARK-30569][SQL][PYSPARK][SPARKR]
Add percentile_approx DSL functions (+176, -1)>
<https://github.com/apache/spark/commit/01f20394acd5cd00edc793657c43dd790cd2112a>

Currently, we support percentile_approx only in SQL expressions. This PR adds
the percentile_approx DSL functions (see the sketch after this list):

   - Adds the following overloaded variants to Scala o.a.s.sql.functions:
      - percentile_approx(e: Column, percentage: Array[Double], accuracy:
      Long): Column
      - percentile_approx(columnName: String, percentage: Array[Double],
      accuracy: Long): Column
      - percentile_approx(e: Column, percentage: Double, accuracy: Long):
      Column
      - percentile_approx(columnName: String, percentage: Double, accuracy:
      Long): Column
      - percentile_approx(e: Column, percentage: Seq[Double], accuracy:
      Long): Column (primarily for Python interop).
      - percentile_approx(columnName: String, percentage: Seq[Double],
      accuracy: Long): Column
   - Adds percentile_approx to pyspark.sql.functions.
   - Adds the percentile_approx function to SparkR.

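A minimal sketch using one of the Scala overloads listed above; the DataFrame
df and its numeric column "value" are hypothetical:

import org.apache.spark.sql.functions.percentile_approx

// approximate median of `value` with an accuracy of 10000
val medians = df.agg(percentile_approx("value", 0.5, 10000L))
medians.show()
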
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-18-~-Mar-24,-2020#ui>
UI
[3.1][SPARK-30654][WEBUI] Bootstrap4 WebUI upgrade (+647, -1666)>
<https://github.com/apache/spark/commit/2a4fed0443d6fe066219124833782293630f8a89>

Spark Web UI is using an older version of Bootstrap (v. 2.3.2) for the
portal pages. Bootstrap 2.x was moved to EOL in Aug 2013 and Bootstrap 3.x
was moved to EOL in July 2019 (https://github.com/twbs/release). Older
versions of Bootstrap are also getting flagged in security scans for
various CVEs:

   - https://snyk.io/vuln/SNYK-JS-BOOTSTRAP-72889
   - https://snyk.io/vuln/SNYK-JS-BOOTSTRAP-173700
   - https://snyk.io/vuln/npm:bootstrap:20180529
   - https://snyk.io/vuln/npm:bootstrap:20160627

This PR upgrades it to Bootstrap 4 to resolve any potential issues and get on
a supported release. The change has been manually tested.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-18-~-Mar-24,-2020#70spark-31081uisql-make-display-of-stageidstageattemptidtaskid-of-sql-metrics-toggleable-83--7>[3.0][SPARK-31081][UI][SQL]
Make display of stageId/stageAttemptId/taskId of sql metrics toggleable
(+83, -7)>
<https://github.com/apache/spark/commit/999c9ed10c2362d89afd3bbb48e35f3c7ac3cf89>

This PR adds a checkbox that can toggle the display of the stageId/taskId
corresponding to the max metric on the SQL DAG page.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-18-~-Mar-24,-2020#ml>
ML
[3.1][SPARK-31032][ML] GMM compute summary and update distributions in
one job (+44, -73)>
<https://github.com/apache/spark/commit/7f3c8fa42e5850b629da6f3a3822a33c6f4a7dac>
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-11-~-Mar-17,-2020#80spark-31077ml-remove-chisqselector-dependency-on-mllibchisqselectormodel-124--43>[3.1][SPARK-31077][ML]
Remove ChiSqSelector dependency on mllib.ChiSqSelectorModel (+124, -43)>
<https://github.com/apache/spark/commit/a1a665bece21c94c557b9fd2ce4835b49fc79649>
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-18-~-Mar-24,-2020#api70spark-30773ml-support-nativeblas-for-level-1-routines-49--19>[API][3.0][SPARK-30773][ML]
Support NativeBlas for level-1 routines (+49, -19)>
<https://github.com/apache/spark/commit/fae981e5f32e0ddb86591a616829423dfafb4ed0>

This PR provides a way to allow users to take advantage of NativeBLAS for
level-1 routines.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-18-~-Mar-24,-2020#api80spark-31138ml-add-anova-selector-for-continuous-features-and-categorical-labels-860--6>[API][3.1][SPARK-31138][ML]
Add ANOVA Selector for continuous features and categorical labels (+860, -6)
>
<https://github.com/apache/spark/commit/9a990133f6ea9776e56694b3371fabc6600387b1>

This PR adds ANOVASelector for continuous features and categorical labels.
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-18-~-Mar-24,-2020#api80spark-31185ml-implement-variancethresholdselector-411--0>[API][3.1][SPARK-31185][ML]
Implement VarianceThresholdSelector (+411, -0)>
<https://github.com/apache/spark/commit/307cfe1f8eacef38fc41a46d94273a5bb6b95fbe>

Implement a feature selector that removes all low-variance features.
Features with a variance lower than the threshold will be removed. The
default is to keep all features with non-zero variance, i.e. remove the
features that have the same value in all samples.
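
A minimal sketch of the new selector, assuming a SparkSession named spark;
the parameter setters follow the usual ML selector conventions and should be
treated as assumptions:

import org.apache.spark.ml.feature.VarianceThresholdSelector
import org.apache.spark.ml.linalg.Vectors

// columns 0 and 2 are constant (zero variance) and get removed
val data = Seq(
  Vectors.dense(1.0, 2.0, 0.0),
  Vectors.dense(1.0, 4.0, 0.0),
  Vectors.dense(1.0, 6.0, 0.0)
).map(Tuple1.apply)
val df = spark.createDataFrame(data).toDF("features")

val selector = new VarianceThresholdSelector()
  .setVarianceThreshold(0.5)
  .setFeaturesCol("features")
  .setOutputCol("selected")

selector.fit(df).transform(df).show(false)
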
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-18-~-Mar-24,-2020#other>
OTHER
[2.4][SPARK-31095][BUILD] Upgrade netty-all to 4.1.47.Final (+4, -4)>
<https://github.com/apache/spark/commit/93def95b0801842e0288a77b3a97f84d31b57366>
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-11-~-Mar-17,-2020#80spark-25355k8s-add-proxy-user-to-driver-if-present-on-spark-submit-89--11>[3.1][SPARK-25355][K8S]
Add proxy user to driver if present on spark-submit (+89, -11)>
<https://github.com/apache/spark/commit/ed06d980440666619e5af54919437c5dfdc53cfc>
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-11-~-Mar-17,-2020#80spark-31125k8s-terminating-pods-have-a-deletion-timestamp-but-they-are-not-yet-dead-12--1>[3.1][SPARK-31125][K8S]
Terminating pods have a deletion timestamp but they are not yet dead (+12,
-1)>
<https://github.com/apache/spark/commit/57d27e900f79e6c5699b9a23db236aae98e761ad>
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-11-~-Mar-17,-2020#80spark-31120build-support-enabling-maven-profiles-for-importing-via-sbt-on-intellij-idea-2--1>[3.1][SPARK-31120][BUILD]
Support enabling maven profiles for importing via sbt on Intellij IDEA (+2,
-1)>
<https://github.com/apache/spark/commit/3b6da36cd62e734b12405bf994a228c7f5049e77>
<https://github.com/databricks/runtime/wiki/OSS-Digest-Mar-18-~-Mar-24,-2020#66spark-31101build-upgrade-janino-to-3016-7--7>[2.4][SPARK-31101][BUILD]
Upgrade Janino to 3.0.16 (+7, -7)>
<https://github.com/apache/spark/commit/f55f6b569beea3636549f8a71949cd2bca2a813b>
