hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Cheng Lian (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-6394) Implement Timestmap in ParquetSerde
Date Tue, 25 Aug 2015 10:19:46 GMT

    [ https://issues.apache.org/jira/browse/HIVE-6394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711046#comment-14711046
] 

Cheng Lian commented on HIVE-6394:
----------------------------------

While testing Spark SQL 1.5-SNAPSHOT for Parquet/Hive compatibility, we hit SPARK-10177. 
In short, Spark SQL and Hive both have their own Julian date conversion code, and their results
don't match.  Currently, we've fixed this issue by making Spark SQL behave the same as Hive
so that we can interoperate (see [Spark PR #8400|https://github.com/apache/spark/pull/8400]).
 However, Hive's behavior looks a little bit weird to me: when converting a calendar timestamp
to a Julian timestamp, Hive always gives a result 12 hours later than the expected result.

This behavior can be verified by the following Spark 1.5-SNAPSHOT shell snippet (I'm using
Spark 1.5-SNAPSHOT shell since it comes with Hive 1.2.1 dependencies):
{code}
import java.sql._
import java.util._

import org.apache.hadoop.hive.ql.io.parquet.timestamp._
import org.apache.spark.sql.catalyst.util._

TimeZone.setDefault(TimeZone.getTimeZone("GMT"))
val timestamp = Timestamp.valueOf("1970-00-00 00:00:00")

val hiveNanoTime = NanoTimeUtils.getNanoTime(timestamp, false)
val hiveJulianDay = hiveNanoTime.getJulianDay
val hiveTimeOfDayNanos = hiveNanoTime.getTimeOfDayNanos

println(
  s"""Hive converts "$timestamp" to Julian timestamp:
     |(julianDay=$hiveJulianDay, timeOfDayNanos=$hiveTimeOfDayNanos)
   """.stripMargin)
{code}
The result is:
{noformat}
Hive converts "1970-01-01 00:00:00.0" to Julian timestamp:
(julianDay=2440588, timeOfDayNanos=0)
{noformat}
According to definition on [this page|http://aa.usno.navy.mil/data/docs/JulianDate.php], Julian
dates count from noon.  Namely "00:00:00" of any calendar date must map to a Julian timestamp
with a fraction of 0.5, i.e. an integral date plus 12 hours.  And the correct Julian timestamp
given by the converter in the aforementioned page is "2440587.500000", which is equivalent
to:
{noformat}
(julianDay=2440587, timeOfDayNanos=43200000000000)
{noformat}
This means, INT96 timestamp values stored in Parquet files written by Hive all have a 12 hr
offset.  (I haven't tried to verify this issue against Impala.)

This shouldn't a big problem though, as long as the read path always correctly decode written
timestamp values.  Just curious, is this 12 hr offset intentional?


> Implement Timestmap in ParquetSerde
> -----------------------------------
>
>                 Key: HIVE-6394
>                 URL: https://issues.apache.org/jira/browse/HIVE-6394
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Serializers/Deserializers
>            Reporter: Jarek Jarcec Cecho
>            Assignee: Szehon Ho
>              Labels: Parquet
>             Fix For: 0.14.0
>
>         Attachments: HIVE-6394.2.patch, HIVE-6394.3.patch, HIVE-6394.4.patch, HIVE-6394.5.patch,
HIVE-6394.6.patch, HIVE-6394.6.patch, HIVE-6394.7.patch, HIVE-6394.patch
>
>
> This JIRA is to implement timestamp support in Parquet SerDe.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message