spark-dev mailing list archives

From Hyukjin Kwon <gurwls...@gmail.com>
Subject Re: [OSS DIGEST] The major changes of Apache Spark from May 20 to June 2
Date Tue, 07 Jul 2020 08:53:22 GMT
Wu is currently unable to access the dev mailing list due to an unexpected
issue, and the previous email was malformed as a result.
I am resending the digest since it looks like the issue will take a while
to fix. Please ignore the previous email and read this one instead.

Thanks.


CORE
<https://github.com/databricks/runtime/wiki/wy#api31spark-8981core-add-mdc-support-in-executor-27--1>[API][3.1][SPARK-8981][CORE]
Add MDC support in Executor (+27, -1)>
<https://github.com/apache/spark/commit/eaf7a2a4edf2351907fde174ce9e694633cc4765>

This PR added MDC (Mapped Diagnostic Context) support for task threads. By
default, each log line printed by the same task thread will include the
same unique task name. Besides, users can also add custom content to the
logs by configuring the log4j pattern, for example application IDs/names.
This is important when a cluster is shared by different
users/applications.

Before:

scala> testDf.collect()
...
20/04/28 16:41:58 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KB for computing block broadcast_1 in memory.
20/04/28 16:41:58 WARN MemoryStore: Not enough space to cache broadcast_1 in memory! (computed 384.0 B so far)
20/04/28 16:41:58 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KB for computing block broadcast_0 in memory.
20/04/28 16:41:58 WARN MemoryStore: Not enough space to cache broadcast_0 in memory! (computed 384.0 B so far)
20/04/28 16:41:58 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
20/04/28 16:41:58 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
20/04/28 16:41:58 WARN RowBasedKeyValueBatch: Failed to allocate page (1048576 bytes).
20/04/28 16:41:58 WARN RowBasedKeyValueBatch: Failed to allocate page (1048576 bytes).
20/04/28 16:41:58 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 262144 bytes of memory, got 22200
...

After (please note the end of each line):

scala> testDf.collect()
...
20/04/28 16:40:59 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KB for computing block broadcast_1 in memory [task 1.0 in stage 0.0].
20/04/28 16:40:59 WARN MemoryStore: Not enough space to cache broadcast_1 in memory! (computed 384.0 B so far) [task 1.0 in stage 0.0]
20/04/28 16:40:59 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KB for computing block broadcast_0 in memory. [task 1.0 in stage 0.0]
20/04/28 16:40:59 WARN MemoryStore: Not enough space to cache broadcast_0 in memory! (computed 384.0 B so far) [task 1.0 in stage 0.0]
20/04/28 16:40:59 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0. [task 0.0 in stage 0.0]
20/04/28 16:40:59 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0. [task 1.0 in stage 0.0]
20/04/28 16:40:59 WARN RowBasedKeyValueBatch: Failed to allocate page (1048576 bytes). [task 0.0 in stage 0.0]
20/04/28 16:40:59 WARN RowBasedKeyValueBatch: Failed to allocate page (1048576 bytes). [task 1.0 in stage 0.0]
20/04/28 16:41:00 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 262144 bytes of memory, got 22200 [task 0.0 in stage 0.0]
...
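
As a hedged illustration of the "custom content via the log4j pattern" part
above: a %X{...} conversion pattern pulls MDC values into each log line. The
exact MDC key set by the executor (assumed here to be mdc.taskName) should be
checked against the commit.

# conf/log4j.properties (sketch, not the authoritative configuration)
# %X{mdc.taskName} injects the task name the executor puts into the MDC.
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m %X{mdc.taskName}%n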

<https://github.com/databricks/runtime/wiki/wy#api31spark-29150core-update-rdd-api-for-stage-level-scheduling-to-be-public-29--25>[API][3.1][SPARK-29150][CORE]
Update RDD API for Stage level scheduling to be public (+29, -25)>
<https://github.com/apache/spark/commit/ff4a97d4b61dbddc96f9a3d7746531117621033a>

This PR makes the RDD API for stage-level scheduling public.
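
A minimal sketch of the now-public API (the "gpu" resource name is an
assumption; running this also requires a cluster manager and dynamic
allocation that support stage-level scheduling):

import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Build a ResourceProfile asking for 2 cores and 1 GPU per executor,
// and 1 CPU plus 1 GPU per task.
val execReqs = new ExecutorResourceRequests().cores(2).resource("gpu", 1)
val taskReqs = new TaskResourceRequests().cpus(1).resource("gpu", 1)
val profile  = new ResourceProfileBuilder().require(execReqs).require(taskReqs).build()

// Attach the profile to an RDD so the stages computing it use these resources.
val rdd = sc.parallelize(1 to 100, numSlices = 4).withResources(profile)
rdd.map(_ * 2).collect()
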
<https://github.com/databricks/runtime/wiki/wy#sql>SQL
<https://github.com/databricks/runtime/wiki/wy#api30spark-31750sql-eliminate-upcast-if-childs-datatype-is-decimaltype-52--8>[API][3.0][SPARK-31750][SQL]
Eliminate UpCast if child's dataType is DecimalType (+52, -8)>
<https://github.com/apache/spark/commit/0fd98abd859049dc3b200492487041eeeaa8f737>

Eliminate the UpCast that is implicitly added by Spark if its child data
type is already a decimal type. Otherwise, cases like:

sql("select cast(1 as decimal(38, 0)) as d")
  .write.mode("overwrite")
  .parquet(f.getAbsolutePath)

spark.read.parquet(f.getAbsolutePath).as[BigDecimal]

could fail as follows:

[info]   org.apache.spark.sql.AnalysisException: Cannot up cast `d`
from decimal(38,0) to decimal(38,18).
[info] The type path of the target object is:
[info] - root class: "scala.math.BigDecimal"
[info] You can either add an explicit cast to the input data or choose
a higher precision type of the field in the target object;
[info]   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:3060)
[info]   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3087)
[info]   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3071)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309)
[info]   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314)

<https://github.com/databricks/runtime/wiki/wy#api30spark-31755sql-allow-missing-yearhour-when-parsing-datetimestamp-string-370--92>[API][3.0][SPARK-31755][SQL]
Allow missing year/hour when parsing date/timestamp string (+370, -92)>
<https://github.com/apache/spark/commit/ce4da29ec3b90b43ad03a6284da791431a8c6b5f>

In order to keep backward compatibility with Spark 2.4, this PR allows
missing year and hour fields when parsing date/timestamp strings, using
the year 1970 and the hour 0 as the default values.

In Spark 3.0,

Before:

scala> sql("select to_timestamp('16',
'dd')").show+--------------------+|to_timestamp(16,
dd)|+--------------------+|                null|+--------------------+


scala> sql("select to_date('16',
'dd')").show+---------------+|to_date(16, dd)|+---------------+|
    null|+---------------+

scala> sql("select to_timestamp('2019 40', 'yyyy
mm')").show+------------------------------+|to_timestamp(2019 40, yyyy
mm)|+------------------------------+|           2019-01-01
00:00:00|+------------------------------+


scala> sql("select to_timestamp('2019 10:10:10', 'yyyy
hh:mm:ss')").show+------------------------------------------+|to_timestamp(2019
10:10:10, yyyy hh:mm:ss)|+------------------------------------------+|
                      2019-01-01
00:00:00|+------------------------------------------+

After:

scala> sql("select to_timestamp('16',
'dd')").show+------------------------+|to_timestamp('16',
'dd')|+------------------------+|     1970-01-16
00:00:00|+------------------------+


scala> sql("select to_date('16',
'dd')").show+-------------------+|to_date('16',
'dd')|+-------------------+|         1970-01-16|+-------------------+


scala> sql("select to_timestamp('2019 40', 'yyyy
mm')").show+----------------------------------+|to_timestamp('2019
40', 'yyyy mm')|+----------------------------------+|
2019-01-01 00:40:00|+----------------------------------+


scala> sql("select to_timestamp('2019 10:10:10', 'yyyy
hh:mm:ss')").show+----------------------------------------------+|to_timestamp('2019
10:10:10', 'yyyy
hh:mm:ss')|+----------------------------------------------+|
                2019-01-01
10:10:10|+----------------------------------------------+

<https://github.com/databricks/runtime/wiki/wy#30spark-31762sql-fix-perf-regression-of-datetimestamp-formatting-in-tohivestring-138--52>[3.0][SPARK-31762][SQL]
Fix perf regression of date/timestamp formatting in toHiveString (+138, -52)
>
<https://github.com/apache/spark/commit/5d673319af81bb826e5f532b5ff25f2d4b2da122>

This PR avoids both unnecessary overhead of converting Java date-time types
to micros/days before formatting and unnecessary conversion from input
micros/days to Java types for the formatters.
<https://github.com/databricks/runtime/wiki/wy#api30spark-31771sql-disable-narrow-textstyle-for-datetime-pattern-gmleuqq-1441--50>[API][3.0][SPARK-31771][SQL]
Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q' (+1441, -50)>
<https://github.com/apache/spark/commit/695cb617d42507eded9c7e50bc7cd5333bbe6f83>

In Spark 3.0, five consecutive pattern characters of 'G/M/L/E/u/Q/q' mean
Narrow-Text Style, while in Spark 2.4 they mean Full-Text Style. With
Narrow-Text Style, Spark will only output the leading single letter of the
value, e.g. December would be D. This PR disables Narrow-Text Style for
these pattern characters in order to avoid silent data changes.

After this PR, queries with datetime operations using datetime patterns,
e.g. G/M/L/E/u, will fail if the pattern length is 5. Other patterns,
e.g. 'k', 'm', can still accept a certain number of letters.

As a result, using datetime patterns like "GGGGG", "MMMMM", "LLLLL",
"EEEEE", "uuuuu", "aa", "aaa", which are supported by the legacy parser
but not by the new parser, will hit a SparkUpgradeException. To bypass the
exception, users can switch to the legacy parser or change to new
datetime patterns. However, using datetime patterns like "QQQQQ", "qqqqq",
which are supported by neither the new parser nor the legacy parser, will
hit an IllegalArgumentException. This exception will be swallowed by Spark
and the value null will be returned.
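
A hedged illustration of the new behavior (exact error text omitted):

scala> sql("select date_format(current_timestamp(), 'MMMMM')").show()
// After this change the query fails with a SparkUpgradeException, since the
// 5-letter narrow form is no longer accepted; use 'MMMM' instead, or fall
// back to the legacy parser via spark.sql.legacy.timeParserPolicy=LEGACY.
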
<https://github.com/databricks/runtime/wiki/wy#api30spark-31808sql-makes-struct-functions-output-name-and-class-name-pretty-40--16>[API][3.0][SPARK-31808][SQL]
Makes struct function's output name and class name pretty (+40, -16)>
<https://github.com/apache/spark/commit/df2a1fe131476aac128d63df9b06ec4bca0c2c07>

This PR corrects struct's alias name and class name in ExpressionInfo.

Before:

scala> sql("DESC FUNCTION
struct").show(false)+------------------------------------------------------------------------------------+|function_desc

|+------------------------------------------------------------------------------------+|Function:
struct
   ||Class: org.apache.spark.sql.catalyst.expressions.NamedStruct
                  ||Usage: struct(col1, col2, col3, ...) - Creates a
struct with the given field
values.|+------------------------------------------------------------------------------------+

scala> sql("SELECT struct(1,
2)").show(false)+------------------------------+|named_struct(col1, 1,
col2, 2)|+------------------------------+|[1, 2]
 |+------------------------------+

After:

scala> sql("DESC FUNCTION
struct").show(false)+------------------------------------------------------------------------------------+|function_desc

|+------------------------------------------------------------------------------------+|Function:
struct
   ||Class: org.apache.spark.sql.catalyst.expressions.CreateNamedStruct
                 ||Usage: struct(col1, col2, col3, ...) - Creates a
struct with the given field
values.|+------------------------------------------------------------------------------------+

scala> sql("SELECT struct(1, 2)").show(false)+------------+|struct(1,
2)|+------------+|[1, 2]      |+------------+

<https://github.com/databricks/runtime/wiki/wy#api30spark-31818sql-fix-pushing-down-filters-with-javatimeinstant-values-in-orc-77--48>[API][3.0][SPARK-31818][SQL]
Fix pushing down filters with java.time.Instant values in ORC (+77, -48)>
<https://github.com/apache/spark/commit/6c80ebbccb7f354f645dd63a73114332d26f901f>

When spark.sql.datetime.java8API.enabled=true, pushing down filters with
java.time.Instant to the ORC datasource can fail with:

java.lang.IllegalArgumentException: Wrong value class
java.time.Instant for TIMESTAMP.EQUALS leaf
 at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.checkLiteralType(SearchArgumentImpl.java:192)
 at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.<init>(SearchArgumentImpl.java:75)

This PR fixes the error by converting java.time.Instant to
java.sql.Timestamp in the filters.
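
A minimal sketch (path and column name are assumptions) of the pattern that
used to fail: with the Java 8 datetime API enabled, an equality filter on a
timestamp column carries a java.time.Instant value into the ORC pushdown path.

import org.apache.spark.sql.functions.col

spark.conf.set("spark.sql.datetime.java8API.enabled", "true")
val cutoff = java.time.Instant.parse("2020-05-20T00:00:00Z")

spark.read.orc("/tmp/events_orc")   // hypothetical path with a "ts" timestamp column
  .filter(col("ts") === cutoff)     // the Instant literal is pushed down to ORC
  .show()
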
<https://github.com/databricks/runtime/wiki/wy#api31spark-30352sql-datasourcev2-add-current_catalog-function-43--5>[API][3.1][SPARK-30352][SQL]
DataSourceV2: Add CURRENT_CATALOG function (+43, -5)>
<https://github.com/apache/spark/commit/0df8dd60733066076967f0525210bbdb5e12415a>

This PR adds the built-in SQL function CURRENT_CATALOG for DataSourceV2
only.
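
A quick illustration (the returned name depends on the configured default
catalog):

scala> sql("SELECT CURRENT_CATALOG()").show()
// Returns a single row with the name of the current catalog,
// e.g. the default session catalog.
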
<https://github.com/databricks/runtime/wiki/wy#api31spark-31673sql-queryexectiondebugtofile-to-take-an-additional-explain-mode-param-90--34>[API][3.1][SPARK-31673][SQL]
QueryExection.debug.toFile() to take an additional explain mode param (+90,
-34)>
<https://github.com/apache/spark/commit/b44acee95343f3508f12384b17e43cde65220924>

Previously, QueryExecution.debug.toFile always used the Extended mode to
dump the query plan information. This PR allows users to specify the
desired explain mode.
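
A hypothetical usage sketch; the name and type of the new explain-mode
parameter are assumptions here, so check the commit for the actual signature:

val df = spark.range(100).selectExpr("id % 7 AS k").groupBy("k").count()
// Dump the plan with a non-default explain mode instead of the previously
// hard-coded extended output (parameter shape assumed).
df.queryExecution.debug.toFile("/tmp/query-plan.txt", explainMode = Some("formatted"))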
<https://github.com/databricks/runtime/wiki/wy#api31spark-31710sql-adds-timestamp_seconds-timestamp_millis-and-timestamp_micros-functions-237--2>[API][3.1][SPARK-31710][SQL]
Adds TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions
(+237, -2)>
<https://github.com/apache/spark/commit/2115c55efe5d4bfe60ff4271f15624bd16a01929>

This PR adds three built-in SQL functions [TIMESTAMP_SECONDS,
TIMESTAMP_MILLIS, TIMESTAMP_MICROS] to provide convenient ways to create
timestamps by interpreting the values as the number of seconds,
milliseconds, and microseconds since 1970-01-01 00:00:00 UTC, respectively.

For example,

sql("select TIMESTAMP_SECONDS(t.a) as timestamp from
values(1230219000),(-1230219000) as
t(a)").show(false)+-------------------------+|timestamp
|+-------------------------+|2008-12-25 23:30:00      ||1931-01-07
16:30:00      |+-------------------------+

sql("select TIMESTAMP_MILLIS(t.a) as timestamp from
values(1230219000123),(-1230219000123) as
t(a)").show(false)+-------------------------------+|timestamp
            |+-------------------------------+|2008-12-25 23:30:00.123
       ||1931-01-07 16:29:59.877
|+-------------------------------+


sql("select TIMESTAMP_MICROS(t.a) as timestamp from
values(1230219000123123),(-1230219000123123) as
t(a)").show(false)+------------------------------------+|timestamp

|+------------------------------------+|2008-12-25 23:30:00.123123
     ||1931-01-07 16:29:59.876877
|+------------------------------------+

<https://github.com/databricks/runtime/wiki/wy#api30spark-31761sql-cast-integer-to-long-to-avoid-integeroverflow-for-integraldivide-operator-57--12>[API][3.0][SPARK-31761][SQL]
Cast integer to Long to avoid IntegerOverflow for IntegralDivide operator
(+57, -12)>
<https://github.com/apache/spark/commit/cf7463f30968ae599d34015f8fa8c379da74757c>

This PR casts Byte/Short/Integer to Long for the left and right children of
IntegralDivide to avoid overflow.

For cases like:

// the df is constructed from: (-2147483648, -1) --> (_c0, _c1)
val res = df.selectExpr("_c0 div _c1")
res.collect

Before:

res1: Array[org.apache.spark.sql.Row] = Array([-2147483648])

After:

res1: Array[org.apache.spark.sql.Row] = Array([2147483648])

<https://github.com/databricks/runtime/wiki/wy#31spark-31793sql-reduce-the-memory-usage-in-file-scan-location-metadata-61--4>[3.1][SPARK-31793][SQL]
Reduce the memory usage in file scan location metadata (+61, -4)>
<https://github.com/apache/spark/commit/9fdc2a08011f8571516fa9103102481d4cf9cfde>

Currently, the Data Source Scan operator stores all the paths in its
metadata. The metadata is kept when a SparkPlan is converted into
SparkPlanInfo, and SparkPlanInfo can be used to construct the Spark plan
graph in the UI. However, the paths can still be very large (e.g., when
many partitions remain after partition pruning), while UI pages can only
show up to 100 bytes for the location metadata. Thus, this PR reduces the
number of paths stored in the metadata to reduce memory usage.
<https://github.com/databricks/runtime/wiki/wy#30spark-31354-sparkcontext-should-only-register-one-sparksession-applicationend-listener-41--11>[3.0][SPARK-31354]
SparkContext should only register one SparkSession ApplicationEnd listener
(+41, -11)>
<https://github.com/apache/spark/commit/dae79888dc6476892877d3b3b233381cdbf7fa74>

This PR makes sure that getOrCreate registers the SparkSession
ApplicationEnd listener only once. For example,

SparkSession.builder().master("local").getOrCreate()
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()

SparkSession.builder().master("local").getOrCreate()
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()

Before this PR, there are two listeners registered at ListenerBus:

[org.apache.spark.status.AppStatusListener5f610071,
org.apache.spark.HeartbeatReceiverd400c17,
org.apache.spark.sql.SparkSession$$anon$125849aeb, <- first listener
org.apache.spark.sql.SparkSession$$anon$1fadb9a0]  <- second listener

After this PR, there's only one listener registered at ListenerBus:

[org.apache.spark.status.AppStatusListener5f610071,
org.apache.spark.HeartbeatReceiverd400c17,
org.apache.spark.sql.SparkSession$$anon$125849aeb] <- only one listener

<https://github.com/databricks/runtime/wiki/wy#api24spark-31854sql-invoke-in-mapelementsexec-should-not-propagate-null-12--2>[API][2.4][SPARK-31854][SQL]
Invoke in MapElementsExec should not propagate null (+12, -2)>
<https://github.com/apache/spark/commit/b806fc458265578fddf544363b60fb5e122439b5>

This PR fixes a NullPointerException caused by Dataset.map when whole-stage
codegen is enabled, by setting propagateNull to false when initializing
Invoke.
<https://github.com/databricks/runtime/wiki/wy#30spark-31827sql-fail-datetime-parsingformatting-if-detect-the-java-8-bug-of-stand-alone-form-19--4>[3.0][SPARK-31827][SQL]
fail datetime parsing/formatting if detect the Java 8 bug of stand-alone
form (+19, -4)>
<https://github.com/apache/spark/commit/1528fbced83a7fbcf70e09d6a898728370d8fa62>

This PR detects the usage of the LLL/qqq datetime pattern string under JDK8
and throws an exception with a clear error message, to avoid hitting a JDK8
bug: https://bugs.openjdk.java.net/browse/JDK-8114833.
<https://github.com/databricks/runtime/wiki/wy#30spark-31867sql-disable-year-type-datetime-patterns-which-are-longer-than-10-99--10>[3.0][SPARK-31867][SQL]
Disable year type datetime patterns which are longer than 10 (+99, -10)>
<https://github.com/apache/spark/commit/547c5bf55265772780098ee5e29baa6f095c246b>

This PR rejects datetime patterns of the form y..y and Y..Y with lengths
greater than 10, to avoid hitting a JDK bug.
<https://github.com/databricks/runtime/wiki/wy#30spark-31874sql-use-fastdateformat-as-the-legacy-fractional-formatter-6--1>[3.0][SPARK-31874][SQL]
Use FastDateFormat as the legacy fractional formatter (+6, -1)>
<https://github.com/apache/spark/commit/47dc332258bec20c460f666de50d9a8c5c0fbc0a>

This PR:

   1. replaces SimpleDateFormat with FastDateFormat as the legacy formatter
   of FractionTimestampFormatter, to utilize the internal cache of
   FastDateFormat and avoid parsing the default pattern yyyy-MM-dd HH:mm:ss.
   2. optimizes LegacyFastTimestampFormatter for java.sql.Timestamp w/o a
   fractional part, to avoid conversions to microseconds for patterns
   without the fractional part.

<https://github.com/databricks/runtime/wiki/wy#30spark-31885sql-fix-filter-push-down-for-old-millis-timestamps-to-parquet-25--22>[3.0][SPARK-31885][SQL]
Fix filter push down for old millis timestamps to Parquet (+25, -22)>
<https://github.com/apache/spark/commit/9c0dc28a6c33361a412672d63f3b974b75944965>

This PR fixes conversions of java.sql.Timestamp to milliseconds in
ParquetFilter by using the existing DateTimeUtils functions
fromJavaTimestamp() and microsToMillis().
<https://github.com/databricks/runtime/wiki/wy#30apispark-31888sql-support-javatimeinstant-in-parquet-filter-pushdown-83--70>[3.0][API][SPARK-31888][SQL]
Support java.time.Instant in Parquet filter pushdown (+83, -70)>
<https://github.com/apache/spark/commit/00b355b97b9bb55d39f4e32e14c14b428cca6fda>

This PR enables push down of filters with java.time.Instant attributes by:

   1. modifying ParquetFilters.valueCanMakeFilterOn() to accept filters
   with java.time.Instant attributes.
   2. adding ParquetFilters.timestampToMicros() to support both
   java.sql.Timestamp and java.time.Instant in conversions to microseconds.
   3. reusing timestampToMicros when constructing Parquet filters.

<https://github.com/databricks/runtime/wiki/wy#31apispark-28067sql-fix-incorrect-results-for-decimal-aggregate-sum-by-returning-null-on-decimal-overflow-211--25>[3.1][API][SPARK-28067][SQL]
Fix incorrect results for decimal aggregate sum by returning null on
decimal overflow (+211, -25)>
<https://github.com/apache/spark/commit/4161c624294dbdcb86939350bba61bebf1d99505>

This PR fixes wrong results from the sum aggregate function on decimals in
case of overflow, by adding an extra flag field to the sum aggregate
function.
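
An illustrative sketch of the behavior this targets: with ANSI mode off, a
sum that overflows decimal(38, 0) now returns NULL instead of an incorrect
value.

sql("set spark.sql.ansi.enabled=false")
sql("""
  select sum(v) from values
    (cast(99999999999999999999999999999999999999 as decimal(38, 0))),
    (cast(99999999999999999999999999999999999999 as decimal(38, 0)))
  as t(v)
""").show()
// The overflowing sum now shows null (previously a wrong value could be returned).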
<https://github.com/databricks/runtime/wiki/wy#31spark-28481sql-more-expressions-should-extend-nullintolerant-180--103>[3.1][SPARK-28481][SQL]
More expressions should extend NullIntolerant (+180, -103)>
<https://github.com/apache/spark/commit/91148f428b61b5e44c17bd65ceff74e0a8b4b3f5>

This PR makes more expressions extend NullIntolerant, which can avoid
skewed joins when the join column has many null values and improve query
performance.
<https://github.com/databricks/runtime/wiki/wy#30spark-31859spark-31861spark-31863-fix-thriftserver-session-timezone-issues-209--185>[3.0][SPARK-31859][SPARK-31861][SPARK-31863]
Fix Thriftserver session timezone issues (+209, -185)>
<https://github.com/apache/spark/commit/af35691de449aca2e8db8d6dd7092255a919a04b>

This PR fixes:

   1. SPARK-31861 "Thriftserver collecting timestamp not using
   spark.sql.session.timeZone" by converting the Timestamp values to String
   earlier, in SparkExecuteStatementOperation, using
   HiveResults.toHiveString().
   2. SPARK-31859 "Thriftserver not working with
   spark.sql.datetime.java8API.enabled=true" by using
   HiveResults.toHiveString().

<https://github.com/databricks/runtime/wiki/wy#ml>ML
<https://github.com/databricks/runtime/wiki/wy#api31spark-31734mlpyspark-add-weight-support-in-clusteringevaluator-167--67>[API][3.1][SPARK-31734][ML][PYSPARK]
Add weight support in ClusteringEvaluator (+167, -67)>
<https://github.com/apache/spark/commit/d4007776f2dd85f03f3811ab8ca711f221f62c00>

This PR adds a setWeightCol method to ClusteringEvaluator, as
BinaryClassificationEvaluator, RegressionEvaluator, and
MulticlassClassificationEvaluator already have.
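
A minimal sketch, assuming a predictions DataFrame with "features",
"prediction", and "weight" columns:

import org.apache.spark.ml.evaluation.ClusteringEvaluator

val evaluator = new ClusteringEvaluator()
  .setFeaturesCol("features")
  .setPredictionCol("prediction")
  .setWeightCol("weight")   // the new weight support added by this change

val silhouette = evaluator.evaluate(predictions)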
<https://github.com/databricks/runtime/wiki/wy#api31spark-31768ml-add-getmetrics-in-evaluators-905--591>[API][3.1][SPARK-31768][ML]
Add getMetrics in Evaluators (+905, -591)>
<https://github.com/apache/spark/commit/d0fe433b5ee7ae3ba41838dd30b5dbe9f6c4e713>

Currently, Evaluator.evaluate can only access one metric. This PR adds a
getMetrics method to all the Evaluators to allow users to get multiple
metrics.

For example:

  val trainer = new LinearRegression
  val model = trainer.fit(dataset)
  val predictions = model.transform(dataset)

  val evaluator = new RegressionEvaluator()

  val metrics = evaluator.getMetrics(predictions)
  val rmse = metrics.rootMeanSquaredError
  val r2 = metrics.r2
  val mae = metrics.meanAbsoluteError
  val variance = metrics.explainedVariance

<https://github.com/databricks/runtime/wiki/wy#31spark-31803ml-make-sure-instance-weight-is-not-negative-32--16>[3.1][SPARK-31803][ML]
Make sure instance weight is not negative (+32, -16)>
<https://github.com/apache/spark/commit/50492c0bd353b5bd6c06099a53d55377a91a6705>

This PR adds checks to make sure instance weight is not negative in the
algorithms that support instance weight.
<https://github.com/databricks/runtime/wiki/wy#31spark-31840ml-add-instance-weight-support-in-logisticregressionsummary-134--43>[3.1][SPARK-31840][ML]
Add instance weight support in LogisticRegressionSummary (+134, -43)>
<https://github.com/apache/spark/commit/45cf5e99503b00a6bd83ea94d6d92761db1a00ab>

Add instance weight support in LogisticRegressionSummary to match its
capabilities with those of LogisticRegression,
MulticlassClassificationEvaluator and BinaryClassificationEvaluator.
<https://github.com/databricks/runtime/wiki/wy#ss>SS
<https://github.com/databricks/runtime/wiki/wy#api30spark-31706ss-add-back-the-support-of-streaming-update-mode-54--20>[API][3.0][SPARK-31706][SS]
Add back the support of streaming update mode (+54, -20)>
<https://github.com/apache/spark/commit/34414acfa36d1a91cd6a64761625c8b0bd90c0a7>

In Spark 2.4, all built-in v2 streaming sinks support all streaming output
modes, and v2 sinks are enabled by default (see SPARK-22911
<https://issues.apache.org/jira/browse/SPARK-22911>). To keep backward
compatibility, this PR adds back the support of the streaming UPDATE mode
that was dropped in the unreleased 3.0 branch.
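
A minimal sketch (rate source and console sink assumed) of a streaming
aggregation using the restored UPDATE output mode:

import org.apache.spark.sql.functions.expr

val counts = spark.readStream
  .format("rate")                       // built-in test source
  .load()
  .groupBy(expr("value % 10").as("bucket"))
  .count()

val query = counts.writeStream
  .outputMode("update")                 // the mode added back by this change
  .format("console")
  .start()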
<https://github.com/databricks/runtime/wiki/wy#api30spark-31792ssdocs-introduce-the-structured-streaming-ui-in-the-web-ui-doc-28--0>[API][3.0][SPARK-31792][SS][DOCS]
Introduce the structured streaming UI in the Web UI doc (+28, -0)>
<https://github.com/apache/spark/commit/8ba2b47737f83bd27ebef75dea9e73b3649c2ec1>

This PR adds the structured streaming UI introduction in the Web UI doc.
<https://github.com/databricks/runtime/wiki/wy#31spark-30915ss-compactiblefilestreamlog-avoid-reading-the-metadata-log-file-when-finding-the-latest-batch-id-102--10>[3.1][SPARK-30915][SS]
CompactibleFileStreamLog: Avoid reading the metadata log file when finding
the latest batch ID (+102, -10)>
<https://github.com/apache/spark/commit/5a258b0b67ee7c97a90d8b719c7a2171707c9244>

When the compacted metadata log file becomes huge, writing outputs for the
compact + 1 batch can be slow due to unnecessarily reading the compacted
metadata log file. To get rid of the unnecessary reading, this PR adds a
getLatestBatchId() method to CompactibleFileStreamLog as a complement to
getLatest(). The new method doesn't read the content of the latest batch
metadata log file, and can be applied to both FileStreamSource and
FileStreamSink to avoid unnecessary latency when reading the log file.
<https://github.com/databricks/runtime/wiki/wy#python>PYTHON
<https://github.com/databricks/runtime/wiki/wy#30spark-31788coredstreampython-recover-the-support-of-union-for-different-types-of-rdd-and-dstreams-39--5>[3.0][SPARK-31788][CORE][DSTREAM][PYTHON]
Recover the support of union for different types of RDD and DStreams (+39,
-5)>
<https://github.com/apache/spark/commit/29c51d682b3735123f78cf9cb8610522a9bb86fd>

This PR manually specifies the class for the input array being used in
(SparkContext|StreamingContext).union. It fixes a regression introduced
from SPARK-25737.
<https://github.com/databricks/runtime/wiki/wy#30spark-31849pythonsql-make-pyspark-sql-exceptions-more-pythonic-61--17>[3.0][SPARK-31849][PYTHON][SQL]
Make PySpark SQL exceptions more Pythonic (+61, -17)>
<https://github.com/apache/spark/commit/e69466056fb2c121b7bbb6ad082f09deb1c41063>

This PR makes PySpark exceptions more Pythonic by hiding the JVM stacktrace
by default when the JVM exceptions are the analyzer's exceptions:
AnalysisException, ParseException, StreamingQueryException,
QueryExecutionException, IllegalArgumentException, and PythonException
(thrown by Python UDFs). The JVM stacktrace can be shown again by turning
on the spark.sql.pyspark.jvmStacktrace.enabled configuration.
<https://github.com/databricks/runtime/wiki/wy#31spark-25351sqlpython-handle-pandas-category-type-when-converting-from-python-with-arrow-52--0>[3.1][SPARK-25351][SQL][PYTHON]
Handle Pandas category type when converting from Python with Arrow (+52, -0)
>
<https://github.com/apache/spark/commit/339b0ecadb9c66ec8a62fd1f8e5a7a266b465aef>

This PR adds support for the Pandas category type when converting from
Python with Arrow enabled. The category column will be converted to
whatever type the category elements are, as is the case with Arrow
disabled.
<https://github.com/databricks/runtime/wiki/wy#api31spark-31763pyspark-add-inputfiles-method-in-pyspark-dataframe-class-32--0>[API][3.1][SPARK-31763][PYSPARK]
Add inputFiles method in PySpark DataFrame Class (+32, -0)>
<https://github.com/apache/spark/commit/2f92ea0df4ef1f127d25a009272005e5ad8811fa>

This PR adds inputFiles() method to PySpark DataFrame to enable PySpark
users to list all files constituting a DataFrame.
<https://github.com/databricks/runtime/wiki/wy#ui>UI
<https://github.com/databricks/runtime/wiki/wy#api31spark-29303web-ui-add-ui-support-for-stage-level-scheduling-657--103>[API][3.1][SPARK-29303][WEB
UI] Add UI support for stage level scheduling (+657, -103)>
<https://github.com/apache/spark/commit/b64688ebbaac7afd3734c0d84d1e77b1fd2d2e9d>

This PR adds the UI support for stage level scheduling and ResourceProfiles:

   - Add ResourceProfile Id to Stage page
   - Add ResourceProfile Id to Executor page
   - Add a section with ResourceProfile Id to Environment page

Also, the REST API for the environment page is updated to include the
ResourceProfile information.

<https://github.com/databricks/runtime/wiki/wy#api31spark-31642-add-pagination-support-for-structured-streaming-page-209--91>[API][3.1][SPARK-31642]
Add Pagination Support for Structured Streaming Page (+209, -91)>
<https://github.com/apache/spark/commit/fbb3144a9ce3e7d418d79d9ddfadbb0591779674>

This PR adds the pagination support for the structured streaming page in
order to:

   - Help users to analyze structured streaming queries in a much better way
   - Improve the consistency of the Spark Web UI across various pages
   - Prevent potential OOM

<https://github.com/databricks/runtime/wiki/wy#30spark-31882webui-dag-viz-is-not-rendered-correctly-with-pagination-31--6>[3.0][SPARK-31882][WEBUI]
DAG-viz is not rendered correctly with pagination (+31, -6)>
<https://github.com/apache/spark/commit/271eb26c026490f7307ae888200bfc81a645592f>

This PR fixes a DAG-viz bug, in which rendering fails with pagination when
DAG-viz fetches link urls for each stage of a job from the stage table.
<https://github.com/databricks/runtime/wiki/wy#other>OTHER
<https://github.com/databricks/runtime/wiki/wy#30spark-31786k8sbuild-upgrade-kubernetes-client-to-492-14--11>[3.0][SPARK-31786][K8S][BUILD]
Upgrade kubernetes-client to 4.9.2 (+14, -11)>
<https://github.com/apache/spark/commit/64ffc6649623e3cb568315f57c9e06be3b547c00>
<https://github.com/databricks/runtime/wiki/wy#api31spark-31759deploy-support-configurable-max-number-of-rotate-logs-for-spark-daemons-11--3>[API][3.1][SPARK-31759][DEPLOY]
Support configurable max number of rotate logs for spark daemons (+11, -3)>
<https://github.com/apache/spark/commit/7e2ed40d58a1b4a75b6d596383026bc952215492>
<https://github.com/databricks/runtime/wiki/wy#31spark-31214build-upgrade-janino-to-312-14--12>[3.1][SPARK-31214][BUILD]
Upgrade Janino to 3.1.2 (+14, -12)>
<https://github.com/apache/spark/commit/fe1d1e24bc9a48c0f2638def9a7f9a743069273f>

This PR upgrades Janino to 3.1.2:
http://janino-compiler.github.io/janino/changelog.html
<https://github.com/databricks/runtime/wiki/wy#31spark-31858build-upgrade-commons-io-to-25-in-hadoop-32-profile-2--2>[3.1][SPARK-31858][BUILD]
Upgrade commons-io to 2.5 in Hadoop 3.2 profile (+2, -2)>
<https://github.com/apache/spark/commit/625abca9db7976065d0e0e8ecf96e378fbc35749>

This PR upgrades commons-io from 2.4 to 2.5 for Apache Spark 3.1.
<https://github.com/databricks/runtime/wiki/wy#31spark-31876build-upgrade-to-zstd-145-4--4>[3.1][SPARK-31876][BUILD]
Upgrade to Zstd 1.4.5 (+4, -4)>
<https://github.com/apache/spark/commit/367d94a30d946637529c319eab8817df84e78926>

This PR aims to upgrade to Zstd 1.4.5:
https://github.com/facebook/zstd/releases/tag/v1.4.5


