spark-user mailing list archives

From Aakash Basu <aakash.spark....@gmail.com>
Subject Re: Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type
Date Thu, 17 Aug 2017 19:30:14 GMT
Hi Palwell,

Tried doing that, but it's becoming null for all the dates after the
transformation with functions.

df2 = dflead.select('Enter_Date', f.to_date(dflead.Enter_Date))


[image: Inline image 1]

Any insight?

Thanks,
Aakash.

On Fri, Aug 18, 2017 at 12:23 AM, Patrick Alwell <palwell@hortonworks.com>
wrote:

> Aakash,
>
> I’ve had similar issues with date-time formatting. Try using the functions
> library from pyspark.sql and the DF withColumn() method.
>
> ——————————————————————————————
>
> from pyspark.sql import functions as f
>
> lineitem_df = lineitem_df.withColumn('shipdate', f.to_date(lineitem_df.shipdate))
>
> ——————————————————————————————
>
> You should first ingest the column as a string, and then use the DF API
> to convert it to DateType.
>
> That should work.
>
> Kind Regards
>
> -Pat Alwell
>
>
> On Aug 17, 2017, at 11:48 AM, Aakash Basu <aakash.spark.raj@gmail.com>
> wrote:
>
> Hey all,
>
> Thanks! I discussed this with the author of that package and reported
> the bug; in the meantime, I found a small tweak that gets the job done.
>
> That part is fine now: I'm getting the date as a string by predefining
> the Schema. But when I later convert it to a datetime format, I get this:
>
> >>> from pyspark.sql.functions import from_unixtime, unix_timestamp
> >>> df2 = dflead.select('Enter_Date', from_unixtime(unix_timestamp('Enter_Date', 'MM/dd/yyy')).alias('date'))
>
>
> >>> df2.show()
>
> <image.png>
>
> Which is not correct (it is converting the 15 to 0015 instead of 2015).
> Do you guys think using the DateUtil package will solve this? Or any other
> solution with this built-in package?
>
> Please help!
>
> Thanks,
> Aakash.
>
> On Thu, Aug 17, 2017 at 12:01 AM, Jörn Franke <jornfranke@gmail.com>
> wrote:
>
>> You can use Apache POI DateUtil to convert double to Date (
>> https://poi.apache.org/apidocs/org/apache/poi/ss/usermodel/DateUtil.html).
>> Alternatively you can try HadoopOffice
>> (https://github.com/ZuInnoTe/hadoopoffice/wiki), it supports Spark 1.x or
>> Spark 2.0 ds.
>>
>> On 16. Aug 2017, at 20:15, Aakash Basu <aakash.spark.raj@gmail.com>
>> wrote:
>>
>> Hey Irving,
>>
>> Thanks for the quick reply. In Excel that column is purely a string. I
>> actually want to import it as a String and later work on the DF to
>> convert it back to a date type, but the API itself does not allow me to
>> assign a Schema to the DF dynamically, so I'm forced to use inferSchema,
>> which converts all numeric columns to double (though I don't know how
>> the date column is getting converted to double if it is a string in the
>> Excel source).
>>
>> Thanks,
>> Aakash.
>>
>>
>> On 16-Aug-2017 11:39 PM, "Irving Duran" <irving.duran@gmail.com> wrote:
>>
>> I think there is a difference between the actual value in the cell and
>> how Excel formats that cell. You probably want to import that field as a
>> string or not have it as a date format in Excel.
>>
>> Just a thought....
>>
>>
>> Thank You,
>>
>> Irving Duran
>>
>> On Wed, Aug 16, 2017 at 12:47 PM, Aakash Basu <aakash.spark.raj@gmail.com>
>> wrote:
>>
>>> Hey all,
>>>
>>> Forgot to attach the link to the discussion about overriding the Schema
>>> in the external package.
>>>
>>> https://github.com/crealytics/spark-excel/pull/13
>>>
>>> You can see my comment there too.
>>>
>>> Thanks,
>>> Aakash.
>>>
>>> On Wed, Aug 16, 2017 at 11:11 PM, Aakash Basu <aakash.spark.raj@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am working on PySpark (*Python 3.6 and Spark 2.1.1*) and trying to
>>>> fetch data from an excel file using
>>>> *spark.read.format("com.crealytics.spark.excel")*, but it is inferring
>>>> double for a date type column.
>>>>
>>>> The detailed description is given here (the question I posted) -
>>>>
>>>> https://stackoverflow.com/questions/45713699/inferschema-using-spark-read-formatcom-crealytics-spark-excel-is-inferring-d
>>>>
>>>>
>>>> Found it is a probable bug with the crealytics excel read package.
>>>>
>>>> Can somebody help me with a workaround for this?
>>>>
>>>> Thanks,
>>>> Aakash.
>>>>
>>>
>>>
>>
>>
>
>
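
On the original question at the bottom of the thread: if the doubles that inferSchema produces are Excel date serial numbers (an assumption; the thread does not show the values), the conversion suggested via POI's DateUtil can be approximated in plain Python. The sketch below assumes the workbook uses Excel's default 1900 date system; excel_serial_to_datetime is a hypothetical helper name, not part of any library.

```python
from datetime import datetime, timedelta

# Day 0 of Excel's 1900 date system. Using 1899-12-30 (not 1899-12-31)
# compensates for the nonexistent 1900-02-29 that Excel counts, so the
# result is only valid for serials of 61 or more (1900-03-01 onward).
EXCEL_EPOCH = datetime(1899, 12, 30)


def excel_serial_to_datetime(serial):
    """Convert an Excel serial (whole days plus fractional time of day)."""
    if serial < 61:
        raise ValueError("serials below 61 are affected by Excel's 1900 leap-year quirk")
    return EXCEL_EPOCH + timedelta(days=serial)


print(excel_serial_to_datetime(42964.0))  # 2017-08-17 00:00:00
print(excel_serial_to_datetime(42964.5))  # 2017-08-17 12:00:00
```

In PySpark, a function like this could be wrapped in a UDF and applied to the double column; that wiring is not shown here.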
