spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Davies Liu <dav...@databricks.com>
Subject Re: sql timestamp timezone bug
Date Thu, 17 Mar 2016 23:39:51 GMT
On Thu, Mar 17, 2016 at 3:02 PM, Andy Davidson
<Andy@santacruzintegration.com> wrote:
> I am using pyspark 1.6.0 and
> datastax:spark-cassandra-connector:1.6.0-M1-s_2.10 to analyze time series
> data
>
> The data is originally captured by a spark streaming app and written to
> Cassandra. The value of the timestamp comes from
>
> Rdd.foreachRDD(new VoidFunction2<JavaRDD<String>, Time>()
>                 Š});
>
> I am confident the time stamp is stored correctly in cassandra and that
> the clocks on the machines in my cluster are set correctly
>
> I noticed that if I used Cassandra CQLSH to select a data set between two
> points in time the row count did not match the row count I got when I did
> the same select in spark using SQL, It appears the spark sql assumes all
> timestamp strings are in the local time zone.
>
>
> Here is what I expect. (this is what is returned by CQLSH)
> cqlsh> select
>    ...     count(row_key) as num_samples, sum(count) as total, max(count)
> as max
>    ... from
>    ...     notification.json_timeseries
>    ... where
>    ...     row_key in (Œred', Œblue')
>    ...     and created > '2016-03-12 00:30:00+0000'
>    ...     and created <= '2016-03-12 04:30:00+0000'
>    ... allow filtering;
>
>  num_samples | total| max
> -------------+------------------+---------------
>         3242 |11277 |  17
>
>
> Here is  my pyspark select statement. Notice the Œcreated column encodes
> the timezone¹. I am running this on my local mac (in PST timezone) and
> connecting to my data center (which runs on UTC) over a VPN.
>
> rawDF = sqlContext.read\
>         .format("org.apache.spark.sql.cassandra")\
>         .options(table="json_timeseries", keyspace="notification")\
>         .load()
>
>
> rawDF.registerTempTable(tmpTableName)
>
>
>
> stmnt = "select \
>         row_key, created, count, unix_timestamp(created) as unixTimeStamp, \
>         unix_timestamp(created, 'yyyy-MM-dd HH:mm:ss.z') as hack, \
>         to_utc_timestamp(created, 'gmt') as gmt \
>         from \
>         rawTable \
>         where \
>                 (created > '{0}') and (created <= '{1}') \
>                 and \
>                         (row_key = Œred' or row_key = Œblue¹) \
>         )".format('2016-03-12 00:30:00+0000', '2016-03-12 04:30:00+0000')
>
> rawDF = sqlCtx.sql(stmnt).cache()

What's the type of `created`? TimestampType?

If yes, when created is compared to a string, it will be casted into
string, then compared as string, it become

cast(created, as string) > '2016-03-12 00:30:00+0000'

Could you try this

sqlCtx.sql("select created, cast(created as string) from rawTable").show()



>
>
>
> I get a different values for row count, max, Š
>
> If I convert the UTC time stamp string to my local timezone the row count
> matches the count returned by  cqlsh
>
> # pst works, matches cassandra cqlsh
> # .format('2016-03-11 16:30:00+0000', '2016-03-11 20:30:00+0000')
>
> Am I doing something wrong in my pyspark code?
>
>
> Kind regards
>
> Andy
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Mime
View raw message