spark-dev mailing list archives

From Davies Liu <dav...@databricks.com>
Subject Re: Query regarding inferring data types in pyspark
Date Mon, 13 Apr 2015 23:29:27 GMT
Hey Suraj,

You should use "date" as the DataType:

df = df.withColumn("DateCol", df.DateCol.cast("date"))
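
For example, a minimal sketch against the Spark 1.3-era SQLContext API; the
file path flights.json and the new column name FL_DATE_d are illustrative,
not from the original thread:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="cast-date-example")
sqlContext = SQLContext(sc)

# Schema inference reads FL_DATE in as a plain string column.
df = sqlContext.jsonFile("flights.json")  # hypothetical path

# withColumn takes the target column name first, then the column expression.
# Writing to a new name avoids relying on same-name replacement semantics.
df2 = df.withColumn("FL_DATE_d", df.FL_DATE.cast("date"))
df2.printSchema()  # FL_DATE_d should now be of date type

Casting a date-only string like "2015-01-31" to "timestamp" can come back as
null, which may explain the nulls reported below.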

Davies

On Sat, Apr 11, 2015 at 10:57 PM, Suraj Shetiya <surajshetiya@gmail.com> wrote:
> Humble reminder
>
> On Sat, Apr 11, 2015 at 12:16 PM, Suraj Shetiya <surajshetiya@gmail.com>
> wrote:
>>
>> Hi,
>>
>> Below is one line from the JSON file.
>> The field that represents the date is FL_DATE.
>>
>>
>> "YEAR":2015,"QUARTER":1,"MONTH":1,"DAY_OF_MONTH":31,"DAY_OF_WEEK":6,"FL_DATE":"2015-01-31","UNIQUE_CARRIER":"NK","AI
>> RLINE_ID":20416,"CARRIER":"NK","TAIL_NUM":"N614NK","FL_NUM":126,"ORIGIN_AIRPORT_ID":11697,"ORIGIN_AIRPORT_SEQ_ID":1169
>> 703,"ORIGIN_CITY_MARKET_ID":32467,"ORIGIN":"FLL","ORIGIN_CITY_NAME":"Fort
>> Lauderdale, FL","ORIGIN_STATE_ABR":"FL","ORI
>> GIN_STATE_FIPS":12,"ORIGIN_STATE_NM":"Florida","ORIGIN_WAC":33,"DEST_AIRPORT_ID":13577,"DEST_AIRPORT_SEQ_ID":1357702,"
>> DEST_CITY_MARKET_ID":31135,"DEST":"MYR","DEST_CITY_NAME":"Myrtle Beach,
>> SC","DEST_STATE_ABR":"SC","DEST_STATE_FIPS":45    ,"DEST_STATE_NM":"South
>> Carolina","DEST_WAC":37,"CRS_DEP_TIME":2010,"DEP_TIME":2009.0,"DEP_DELAY":-1.0,"DEP_DELAY_NEW"
>> :0.0,"DEP_DEL15":0.0,"DEP_DELAY_GROUP":-1.0,"DEP_TIME_BLK":"2000-2059","TAXI_OUT":17.0,"WHEELS_OFF":2026.0,"WHEELS_ON"
>> :2147.0,"TAXI_IN":5.0,"CRS_ARR_TIME":2149,"ARR_TIME":2152.0,"ARR_DELAY":3.0,"ARR_DELAY_NEW":3.0,"ARR_DEL15":0.0,"ARR_DELAY_GROUP":0.0,"ARR_TIME_BLK":"2100-2159","Unnamed:
>> 47":null}
>>
>> Please let me know if you need access to the dataset.
>>
>> On Sat, Apr 11, 2015 at 11:56 AM, Davies Liu <davies@databricks.com>
>> wrote:
>>>
>>> What's the format you have in the JSON file?
>>>
>>> On Fri, Apr 10, 2015 at 6:57 PM, Suraj Shetiya <surajshetiya@gmail.com>
>>> wrote:
>>> > Hi,
>>> >
>>> > In pyspark, when I read a JSON file using sqlcontext, I find that the
>>> > date field is not inferred as a date; it is converted to a string
>>> > instead. And when I try to convert it to a date using
>>> > df.withColumn("DateCol", df.DateCol.cast("timestamp"))
>>> > it does not parse successfully and puts a null there instead. Should
>>> > I use a UDF to convert the date? Is this expected behaviour (not
>>> > throwing an error when the cast fails)?
>>> >
>>> > --
>>> > Regards,
>>> > Suraj
>>
>>
>>
>>
>> --
>> Regards,
>> Suraj
>
>
>
>
> --
> Regards,
> Suraj


