spark-issues mailing list archives

From "Pawan (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-25919) Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned
Date Thu, 01 Nov 2018 20:41:00 GMT
Pawan created SPARK-25919:
-----------------------------

             Summary: Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned
                 Key: SPARK-25919
                 URL: https://issues.apache.org/jira/browse/SPARK-25919
             Project: Spark
          Issue Type: Bug
          Components: Spark Core, Spark Shell, Spark Submit
    Affects Versions: 2.2.1
            Reporter: Pawan


Hi,

I found a strange issue; the steps to reproduce it are below. It occurs only when the table
row format is ParquetHiveSerDe and the target table is partitioned.

*Hive:*

Log in to the Hive terminal on the cluster and create the tables below.
{code:java}
create table t_src(
name varchar(10),
dob timestamp
)
ROW FORMAT SERDE 
 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';

create table t_tgt(
name varchar(10),
dob timestamp
)
PARTITIONED BY (city varchar(10))
ROW FORMAT SERDE 
 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
{code}
Insert data into the source table (t_src):
{code:java}
INSERT INTO t_src VALUES ('p1', '0001-01-01 00:00:00.0'),('p2', '0002-01-01 00:00:00.0'),
('p3', '0003-01-01 00:00:00.0'),('p4', '0004-01-01 00:00:00.0');{code}
*Spark-shell:*

Start spark-shell and run the following commands:
{code:java}
import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)

// Empty the target table, then stage the source rows in a temp view
val q0 = "TRUNCATE table t_tgt"
val q1 = "SELECT CAST(alias.name AS STRING) as a0, alias.dob as a1 FROM DEFAULT.t_src alias"
// NULL for the partition column sends every row to the default partition
val q2 = "INSERT INTO TABLE DEFAULT.t_tgt PARTITION (city) SELECT tbl0.a0 as c0, tbl0.a1 as c1, NULL as c2 FROM tbl0"

sqlContext.sql(q0)
sqlContext.sql(q1).select("a0","a1").createOrReplaceTempView("tbl0")
sqlContext.sql(q2)
{code}
After this, check the contents of the target table t_tgt. The dob value for row p1, "0001-01-01 00:00:00",
has changed to "0002-01-01 00:00:00". The output below shows the contents of both tables:

 
{code:java}
select * from t_src;
+-------------+------------------------+
| t_src.name  | t_src.dob              |
+-------------+------------------------+
| p1          | 0001-01-01 00:00:00.0  |
| p2          | 0002-01-01 00:00:00.0  |
| p3          | 0003-01-01 00:00:00.0  |
| p4          | 0004-01-01 00:00:00.0  |
+-------------+------------------------+

select * from t_tgt;
+-------------+------------------------+-------------+
| t_tgt.name  | t_tgt.dob              | t_tgt.city  |
+-------------+------------------------+-------------+
| p1          | 0002-01-01 00:00:00.0  | __HIVE_DEF  |
| p2          | 0002-01-01 00:00:00.0  | __HIVE_DEF  |
| p3          | 0003-01-01 00:00:00.0  | __HIVE_DEF  |
| p4          | 0004-01-01 00:00:00.0  | __HIVE_DEF  |
+-------------+------------------------+-------------+
{code}
 

Is this a known issue? Is it fixed in any subsequent releases?

Thanks & regards,

Pawan Lawale




