spark-issues mailing list archives

From "Hyukjin Kwon (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-26745) Non-parsing Dataset.count() optimization causes inconsistent results for JSON inputs with empty lines
Date Fri, 01 Feb 2019 02:22:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-26745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-26745:
---------------------------------
    Fix Version/s: 2.4.1

> Non-parsing Dataset.count() optimization causes inconsistent results for JSON inputs with empty lines
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-26745
>                 URL: https://issues.apache.org/jira/browse/SPARK-26745
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0, 3.0.0
>            Reporter: Branden Smith
>            Assignee: Hyukjin Kwon
>            Priority: Blocker
>              Labels: correctness
>             Fix For: 2.4.1, 3.0.0
>
>
> The optimization introduced by [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959] (improving the performance of {{count()}} for DataFrames read from non-multiline JSON in {{PERMISSIVE}} mode) appears to cause {{count()}} to erroneously include empty lines in its total when it runs before any JSON parsing has taken place.
> For the following input:
> {code:json}
> { "a" : 1 , "b" : 2 , "c" : 3 }
>         { "a" : 4 , "b" : 5 , "c" : 6 }
>      
> { "a" : 7 , "b" : 8 , "c" : 9 }
> {code}
> *+Spark 2.3:+*
> {code:scala}
> scala> val df = spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json")
> df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field]
> scala> df.count
> res0: Long = 3
> scala> df.cache.count
> res3: Long = 3
> {code}
> *+Spark 2.4:+*
> {code:scala}
> scala> val df = spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json")
> df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field]
> scala> df.count
> res0: Long = 7
> scala> df.cache.count
> res1: Long = 3
> {code}
> Since the count is apparently updated and cached when the Jackson parser runs, the optimization also makes the count appear unstable across cache/persist operations, as shown above.
> CSV inputs, also optimized via [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959], do not appear to be affected.
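
The discrepancy described above can be illustrated with a toy model in plain Scala that does not use Spark at all. It is only a sketch under stated assumptions, not Spark's implementation: the object name EmptyLineCount, both helper functions, and the blank-line layout of the sample input are hypothetical, so the totals it produces are not expected to match the 7 reported above. It shows only why counting input lines without parsing them can disagree with counting parsed records.

```scala
// Toy model of the reported behavior; NOT Spark's actual code path.
// All names and the sample layout below are hypothetical.
object EmptyLineCount {
  // Assumed stand-in for the test file: three records plus two blank lines.
  val input: Seq[String] = Seq(
    """{ "a" : 1 , "b" : 2 , "c" : 3 }""",
    "",
    """        { "a" : 4 , "b" : 5 , "c" : 6 }""",
    "     ",
    """{ "a" : 7 , "b" : 8 , "c" : 9 }"""
  )

  // Counting without parsing: every input line is assumed to be one record.
  def countWithoutParsing(lines: Seq[String]): Long = lines.size.toLong

  // Counting after parsing: empty or whitespace-only lines yield no record.
  def countAfterParsing(lines: Seq[String]): Long =
    lines.count(_.trim.nonEmpty).toLong

  def main(args: Array[String]): Unit = {
    println(s"without parsing: ${countWithoutParsing(input)}")
    println(s"after parsing:   ${countAfterParsing(input)}")
  }
}
```

In this model the two counts differ by exactly the number of empty or whitespace-only lines, which is consistent with the report that the count changes once the parser has actually run.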




