spark-user mailing list archives

From Adaryl Wakefield <adaryl.wakefi...@hotmail.com>
Subject RE: how do you deal with datetime in Spark?
Date Tue, 03 Oct 2017 20:16:14 GMT
HA! Yeah, in an earlier attempt, I tried to convert everything to unix_timestamp. That went
over like a lead balloon…

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net<http://www.massstreet.net/>
www.linkedin.com/in/bobwakefieldmba<http://www.linkedin.com/in/bobwakefieldmba>
Twitter: @BobLovesData<http://twitter.com/BobLovesData>


From: Steve Loughran [mailto:stevel@hortonworks.com]
Sent: Tuesday, October 3, 2017 2:19 PM
To: Adaryl Wakefield <adaryl.wakefield@hotmail.com>
Cc: user@spark.apache.org
Subject: Re: how do you deal with datetime in Spark?


On 3 Oct 2017, at 18:43, Adaryl Wakefield <adaryl.wakefield@hotmail.com> wrote:

I gave myself a project to start actually writing Spark programs. I’m using Scala and Spark
2.2.0. In my project, I had to do some grouping and filtering by dates. It was awful and took
forever. I was trying to use dataframes and SQL as much as possible. I see that there are
date functions in the dataframe API but trying to use them was frustrating. Even following
code samples was a headache because apparently the code is different depending on which version
of Spark you are using. I was really hoping for a rich set of date functions like you’d
find in T-SQL but I never really found them.

Is there a best practice for dealing with dates and time in Spark? I feel like taking a date/time
string and converting it to a date/time object and then manipulating data based on the various
components of the timestamp object (hour, day, year etc.) should be a heck of a lot easier
than what I’m finding and perhaps I’m just not looking in the right place.

You can see my work here: https://github.com/BobLovesData/Apache-Spark-In-24-Hours/blob/master/src/net/massstreet/hour10/BayAreaBikeAnalysis.scala
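
For what it's worth, the question above (parse a date/time string, then group and filter on its components) might be sketched as follows in Spark 2.2 with Scala. This is only an illustration, not the code from the linked repository: the column names trip_id and start_time, the sample rows, the "yyyy-MM-dd HH:mm:ss" pattern, and the helper tripsPerHour are all assumptions made up for the example. It relies on to_timestamp (available from Spark 2.2.0) plus the year, dayofmonth and hour functions in org.apache.spark.sql.functions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, dayofmonth, hour, to_timestamp, year}

object DateComponentsSketch {
  // Hypothetical helper: expects a string column "start_time"
  // formatted as "yyyy-MM-dd HH:mm:ss" (an assumption for this sketch).
  def tripsPerHour(trips: DataFrame): DataFrame = {
    // Convert the string column into a proper timestamp column.
    val parsed = trips.withColumn(
      "start_ts", to_timestamp(col("start_time"), "yyyy-MM-dd HH:mm:ss"))

    parsed
      .filter(year(col("start_ts")) === 2017)          // filter on a component
      .groupBy(dayofmonth(col("start_ts")).as("day"),  // group on components
               hour(col("start_ts")).as("hour"))
      .agg(count("*").as("trips"))
      .orderBy("day", "hour")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("date-components-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows: (trip id, start time as a string).
    val trips = Seq(
      (1, "2017-10-03 18:43:00"),
      (2, "2017-10-03 18:59:30"),
      (3, "2017-10-04 07:15:10")
    ).toDF("trip_id", "start_time")

    tripsPerHour(trips).show()
    spark.stop()
  }
}
```

Strings that do not match the pattern come back as null from to_timestamp rather than failing, so it is probably worth counting nulls in start_ts before trusting any aggregate.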


Once you've done that one, I have a few hundred MB of London bike stats if you want them. Their
timestamps come in as strings, but "01/01/1970" is by far the most popular dropoff time, which
is 0 in the epoch...

9809600,0,6248,01/01/1970 00:00,0,NA,31/01/2012 19:31,365,City Road: Angel
9806201,0,6422,01/01/1970 00:00,0,NA,31/01/2012 19:32,17,Hatton Wall: Holborn
9802063,0,4096,01/01/1970 00:00,0,NA,31/01/2012 19:34,338,Wellington Street : Strand
9804765,0,5276,01/01/1970 00:00,0,NA,31/01/2012 19:37,93,Cloudesley Road: Angel
9806779,1970,14,31/01/2012 20:11,410,Edgware Road Station: Paddington
9813333,0,5810,01/01/1970 00:00,0,NA,31/01/2012 19:39,114,Park Road (Baker Street): Regent's Park
9803952,0,5682,01/01/1970 00:00,0,NA,31/01/2012 19:41,210,Hinde Street: Marylebone
9818659,0,5572,01/01/1970 00:00,0,NA,31/01/2012 19:41,87,Devonshire Square: Liverpool Street
9808144,0,5244,01/01/1970 00:00,0,NA,31/01/2012 19:42,374,Waterloo Station 1: Waterloo
9814365,0,5422,01/01/1970 00:00,0,NA,31/01/2012 19:48,15,Great Russell Street: Bloomsbury
9816863,0,6079,01/01/1970 00:00,0,NA,31/01/2012 19:49,258,Kensington Gore: Knightsbridge
9818469,0,4903,01/01/1970 00:00,0,NA,31/01/2012 19:50,341,Craven Street: Strand
9811512,0,5572,01/01/1970 00:00,0,NA,31/01/2012 19:50,298,Curlew Street: Shad Thames
9817931,0,708,01/01/1970 00:00,0,NA,31/01/2012 19:51,341,Craven Street: Strand
9816429,0,3210,01/01/1970 00:00,0,NA,31/01/2012 19:59,388,Southampton Street: Strand
9806284,0,4359,01/01/1970 00:00,0,NA,31/01/2012 20:06,335,Tavistock Street: Covent Garden
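
One way to keep the "01/01/1970 00:00" placeholder from skewing any grouping is to parse the strings and treat epoch zero as missing. The sketch below uses plain Scala with java.time and no Spark at all. It assumes the "dd/MM/yyyy HH:mm" layout seen in the rows above and assumes the values can be read as UTC; the object and method names are hypothetical.

```scala
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

object EpochSentinelSketch {
  // Pattern matching the sample rows, e.g. "31/01/2012 19:31".
  private val fmt = DateTimeFormatter.ofPattern("dd/MM/yyyy HH:mm")

  // Parse a timestamp string; throws DateTimeParseException on bad input.
  def parse(s: String): LocalDateTime = LocalDateTime.parse(s, fmt)

  // Returns None when the value is epoch zero (read as UTC, an assumption),
  // otherwise the parsed timestamp.
  def realTimestamp(s: String): Option[LocalDateTime] = {
    val ts = parse(s)
    if (ts.toEpochSecond(ZoneOffset.UTC) == 0L) None else Some(ts)
  }

  def main(args: Array[String]): Unit = {
    println(realTimestamp("01/01/1970 00:00"))                // None
    println(realTimestamp("31/01/2012 19:31").map(_.getHour)) // Some(19)
  }
}
```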



