spark-issues mailing list archives

From "Suchintak Patnaik (Jira)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-29621) Querying internal corrupt record column should not be allowed in filter operation
Date Wed, 30 Oct 2019 06:32:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-29621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962741#comment-16962741
] 

Suchintak Patnaik commented on SPARK-29621:
-------------------------------------------

[~hyukjin.kwon] count() returns the row count; it has nothing to do with the column.

In both cases, whether it is count() or show(), filter() is performed on the DataFrame first,
based on the _corrupt_record column:

df.filter(df._corrupt_record.isNotNull()).count()    # Error

df.filter(df._corrupt_record.isNotNull()).show()     # No error

> Querying internal corrupt record column should not be allowed in filter operation
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-29621
>                 URL: https://issues.apache.org/jira/browse/SPARK-29621
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0
>            Reporter: Suchintak Patnaik
>            Priority: Major
>              Labels: PySpark, SparkSQL
>
> As per https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126:
> _"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced
columns only include the internal corrupt record column"_
> But querying only the internal corrupt record column is still allowed in the case of a *filter*
operation.
> {code}
> from pyspark.sql.types import *
> schema = StructType([
>     StructField("_corrupt_record", StringType(), False),
>     StructField("Name", StringType(), False),
>     StructField("Colour", StringType(), True),
>     StructField("Price", IntegerType(), True),
>     StructField("Quantity", IntegerType(), True)])
> df = spark.read.csv("fruit.csv", schema=schema, mode="PERMISSIVE")
> df.filter(df._corrupt_record.isNotNull()).show()  # Allowed
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

