spark-issues mailing list archives

From "Sandeep Katta (Jira)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-27695) SELECT * returns null column when reading from Hive / ORC and spark.sql.hive.convertMetastoreOrc=true
Date Wed, 09 Oct 2019 06:24:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-27695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947385#comment-16947385 ]

Sandeep Katta commented on SPARK-27695:
---------------------------------------

Does this JIRA apply to the 2.4 version as well?

> SELECT * returns null column when reading from Hive / ORC and spark.sql.hive.convertMetastoreOrc=true
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-27695
>                 URL: https://issues.apache.org/jira/browse/SPARK-27695
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output, Spark Core
>    Affects Versions: 2.3.1, 2.3.2, 2.3.3
>            Reporter: Oscar Cassetti
>            Priority: Major
>
> If you do 
> {code:java}
> select * from hive.some_table{code}
> and the underlying data does not exactly match the table schema, the last column is returned as null.
> Example 
> {code:java}
> from pyspark import SparkConf
> from pyspark.sql import SparkSession
> from pyspark.sql.types import *
> conf = SparkConf().set('spark.sql.hive.convertMetastoreOrc', 'true')
> spark = SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()
> data = [{'a':i, 'b':i+10, 'd':{'a':i, 'b':i+10}} for i in range(1, 100)]
> data_schema = StructType([StructField('a', LongType(), True),
>                           StructField('b', LongType(), True),
>                           StructField('d', MapType(StringType(), LongType(), True), True)
>                           ])
> rdd = spark.sparkContext.parallelize(data)
> df = rdd.toDF(data_schema)
> df.write.format("orc").save("./sample_data/")
> spark.sql("""create external table tmp(
> a bigint,
> b bigint,
> d map<string, bigint>)
> stored as orc
> location 'sample_data/'
> """)
> spark.sql("select * from tmp").show()
> {code}
> This returns the correct result:
> {noformat}
> +---+---+-------------------+
> |  a|  b|                  d|
> +---+---+-------------------+
> | 85| 95| [a -> 85, b -> 95]|
> | 86| 96| [a -> 86, b -> 96]|
> | 87| 97| [a -> 87, b -> 97]|
> | 88| 98| [a -> 88, b -> 98]|
> | 89| 99| [a -> 89, b -> 99]|
> | 90|100|[a -> 90, b -> 100]|
> {noformat}
> However, if you add a new column to the underlying data without altering the Hive schema, the last column of the Hive schema is set to null:
> {code}
> from pyspark import SparkConf
> from pyspark.sql import SparkSession
> from pyspark.sql.types import *
> conf = SparkConf().set('spark.sql.hive.convertMetastoreOrc', 'true')
> spark = SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()
> data = [{'a':i, 'b':i+10, 'c':i+5, 'd':{'a':i, 'b':i+10, 'c':i+5}} for i in range(1, 100)]
> data_schema = StructType([StructField('a', LongType(), True),
> 	                      StructField('b', LongType(), True),
> 	                      StructField('c', LongType(), True),
> 	                      StructField('d', MapType(StringType(), LongType(), True), True)
> 	                      ])
> rdd = spark.sparkContext.parallelize(data)
> df = rdd.toDF(data_schema)
> df.write.format("orc").mode("overwrite").save("./sample_data/")
> spark.sql("select * from tmp").show()
> spark.read.orc("./sample_data/").show()
> {code}
> The first show() returns 
> {noformat}
> +---+---+----+
> |  a|  b|   d|
> +---+---+----+
> | 85| 95|null|
> | 86| 96|null|
> | 87| 97|null|
> | 88| 98|null|
> | 89| 99|null|
> | 90|100|null|
> | 91|101|null|
> | 92|102|null|
> | 93|103|null|
> | 94|104|null|
> {noformat}
> But the data on disk is correct
> {noformat}
> +---+---+---+--------------------+
> |  a|  b|  c|                   d|
> +---+---+---+--------------------+
> | 85| 95| 90|[a -> 85, b -> 95...|
> | 86| 96| 91|[a -> 86, b -> 96...|
> | 87| 97| 92|[a -> 87, b -> 97...|
> | 88| 98| 93|[a -> 88, b -> 98...|
> | 89| 99| 94|[a -> 89, b -> 99...|
> | 90|100| 95|[a -> 90, b -> 10...|
> | 91|101| 96|[a -> 91, b -> 10...|
> {noformat}
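> A possible workaround (not verified here) is to disable the metastore ORC conversion so the table is read through the Hive SerDe instead of Spark's native ORC reader. The sketch below only reuses the config key from this ticket's title and assumes the {{tmp}} table created in the example above:
> {code:java}
> from pyspark import SparkConf
> from pyspark.sql import SparkSession
>
> # Read metastore ORC tables through the Hive SerDe instead of the native ORC reader.
> conf = SparkConf().set('spark.sql.hive.convertMetastoreOrc', 'false')
> spark = SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()
>
> # With the conversion disabled, the declared Hive columns (a, b, d) should be
> # resolved against the files on disk even though they now contain the extra
> # column 'c' (assumption based on the config's purpose, not tested here).
> spark.sql("select * from tmp").show()
> {code}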



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
