spark-issues mailing list archives

From "Jason White (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-19561) Pyspark Dataframes don't allow timestamps near epoch
Date Sat, 11 Feb 2017 21:37:42 GMT
Jason White created SPARK-19561:
-----------------------------------

             Summary: Pyspark Dataframes don't allow timestamps near epoch
                 Key: SPARK-19561
                 URL: https://issues.apache.org/jira/browse/SPARK-19561
             Project: Spark
          Issue Type: Bug
          Components: PySpark, SQL
    Affects Versions: 2.1.0, 2.0.1
            Reporter: Jason White


Pyspark does not allow timestamps at or near the epoch to be created in a DataFrame. Related
issue: https://issues.apache.org/jira/browse/SPARK-19299

TimestampType.toInternal converts a datetime object to the number of microseconds since
the epoch. For all times more than ~2148 seconds before or after 1970-01-01T00:00:00+0000,
the absolute value of this number exceeds 2^31, so Py4J automatically serializes it as a
long.

However, for times within that window (roughly 35 minutes on either side of the epoch),
Py4J serializes the value as an int. When the row is constructed on the Scala side, the
int is not recognized as a valid timestamp value and becomes null. This leads to null
values in non-nullable fields and corrupted Parquet files.

The fix is trivial: force TimestampType.toInternal to always return a long.
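A minimal sketch of the conversion described above, showing why only timestamps near the
epoch are affected. The helper name to_internal_us is hypothetical, not Spark's actual
code; it just mirrors the seconds-to-microseconds arithmetic that TimestampType.toInternal
performs, and the 2^31 cutoff is the int/long serialization boundary Py4J used under
Python 2:

```python
import calendar
import datetime

# Py4J's int/long boundary: values at or below 2^31 microseconds were
# serialized as a 32-bit int under Python 2 (assumption for illustration).
PY4J_INT_CUTOFF_US = 2 ** 31

def to_internal_us(dt):
    """Convert a naive UTC datetime to microseconds since the epoch.

    Hypothetical stand-in for what TimestampType.toInternal computes;
    the reported fix was to force this result to a Python 2 `long` so
    Py4J always serializes it as a JVM long, never an int.
    """
    seconds = calendar.timegm(dt.utctimetuple())
    return seconds * 1000000 + dt.microsecond

# 30 minutes after the epoch: 1,800,000,000 us, below the 2^31 cutoff,
# so Py4J would have sent an int and the Scala side nulled the value.
near_epoch = to_internal_us(datetime.datetime(1970, 1, 1, 0, 30, 0))

# 36 minutes after the epoch: 2,160,000,000 us, above the cutoff,
# so Py4J sent a long and the timestamp round-tripped correctly.
past_window = to_internal_us(datetime.datetime(1970, 1, 1, 0, 36, 0))

print(near_epoch < PY4J_INT_CUTOFF_US)   # True: falls in the broken window
print(past_window > PY4J_INT_CUTOFF_US)  # True: safely serialized as long
```

The ~2148-second figure in the description is simply 2^31 microseconds expressed in
seconds (2^31 / 10^6 ≈ 2147.5).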



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

