spark-user mailing list archives

From Aakash Basu <aakash.spark....@gmail.com>
Subject Re: Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type
Date Thu, 17 Aug 2017 18:48:20 GMT
Hey all,

Thanks! I spoke with the author of that package and reported this bug. In
the meantime, I found a small tweak that gets the job done.

That part now works: by predefining the schema, I get the date as a string.
However, when I later try to convert it to a datetime format, I get this -

>>> from pyspark.sql.functions import from_unixtime, unix_timestamp
>>> df2 = dflead.select(
...     'Enter_Date',
...     from_unixtime(unix_timestamp('Enter_Date', 'MM/dd/yyy')).alias('date'))
>>> df2.show()

[image: Inline image 1]

This is not correct: it converts the year 15 to 0015 instead of 2015.
Do you think using the DateUtil package will solve this? Or is there another
solution with the built-in package?
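
One possible fix with the built-in functions, sketched under the assumption
that Enter_Date holds strings such as 08/17/15 (month/day/two-digit year; that
sample value is hypothetical). In Spark 2.1 these patterns follow Java's
SimpleDateFormat, where a year pattern that is not exactly two letters reads
the digits literally, while 'yy' expands a two-digit year into a nearby
century. The helper names below are illustrative, not part of any library.

```python
from datetime import datetime


def parse_short_us_date(text):
    """Pure-Python reference for the same idea.

    '%y' maps 00-68 to 2000-2068 and 69-99 to 1969-1999.
    """
    return datetime.strptime(text, "%m/%d/%y").date()


def with_parsed_date(df, column="Enter_Date"):
    """Return the column plus a 'date' column parsed with a two-letter year.

    pyspark is imported lazily so the reference helper above still works
    on a machine without Spark installed.
    """
    from pyspark.sql.functions import from_unixtime, unix_timestamp

    # 'yy' (not 'yyy') lets the parser expand 15 to 2015 instead of year 0015
    return df.select(
        column,
        from_unixtime(unix_timestamp(column, "MM/dd/yy")).alias("date"),
    )
```

With the DataFrame from above this would be called as
df2 = with_parsed_date(dflead), followed by df2.show().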

Please help!

Thanks,
Aakash.
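
On the DateUtil suggestion quoted below: if the doubles produced by
inferSchema are Excel serial date numbers (an assumption, since the thread
does not show the values), the conversion is simple arithmetic. This is a
plain-Python sketch of that arithmetic for the default 1900 date system, not
the POI API; the function and constant names are hypothetical.

```python
from datetime import datetime, timedelta

# Day zero of Excel's default (1900) date system, chosen so that serials
# from 61 (1900-03-01) onward line up despite Excel's phantom 1900-02-29.
EXCEL_1900_EPOCH = datetime(1899, 12, 30)


def excel_serial_to_datetime(serial):
    """Convert an Excel serial number to a datetime.

    The integer part counts days; the fraction is the time of day.
    """
    if serial < 61:
        # Earlier serials are off by one because of the phantom leap day
        raise ValueError("serials before 1900-03-01 are not handled here")
    return EXCEL_1900_EPOCH + timedelta(days=float(serial))
```

For example, excel_serial_to_datetime(42964.5) gives noon on 17 August 2017.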

On Thu, Aug 17, 2017 at 12:01 AM, Jörn Franke <jornfranke@gmail.com> wrote:

> You can use Apache POI DateUtil to convert double to Date (
> https://poi.apache.org/apidocs/org/apache/poi/ss/usermodel/DateUtil.html).
> Alternatively you can try HadoopOffice
> (https://github.com/ZuInnoTe/hadoopoffice/wiki); it supports Spark 1.x or Spark 2.0 ds.
>
> On 16. Aug 2017, at 20:15, Aakash Basu <aakash.spark.raj@gmail.com> wrote:
>
> Hey Irving,
>
> Thanks for the quick reply. In Excel that column is purely a string. I
> actually want to import it as a string and later work on the DF to
> convert it to a date type, but the API does not allow me to assign a
> schema to the DF, so I'm forced to use inferSchema, which converts all
> numeric columns to double (though I don't know how the date column ends
> up as double if it is a string in the Excel source).
>
> Thanks,
> Aakash.
>
>
> On 16-Aug-2017 11:39 PM, "Irving Duran" <irving.duran@gmail.com> wrote:
>
> I think there is a difference between the actual value in the cell and
> how Excel formats that cell. You probably want to import that field as a
> string, or not have it formatted as a date in Excel.
>
> Just a thought....
>
>
> Thank You,
>
> Irving Duran
>
> On Wed, Aug 16, 2017 at 12:47 PM, Aakash Basu <aakash.spark.raj@gmail.com>
> wrote:
>
>> Hey all,
>>
>> I forgot to attach the link to the discussion about overriding the schema
>> in the external package.
>>
>> https://github.com/crealytics/spark-excel/pull/13
>>
>> You can see my comment there too.
>>
>> Thanks,
>> Aakash.
>>
>> On Wed, Aug 16, 2017 at 11:11 PM, Aakash Basu <aakash.spark.raj@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I am working on PySpark (*Python 3.6 and Spark 2.1.1*) and trying to
>>> fetch data from an Excel file using
>>> *spark.read.format("com.crealytics.spark.excel")*, but it is inferring
>>> double for a date-type column.
>>>
>>> The detailed description is given here (the question I posted) -
>>>
>>> https://stackoverflow.com/questions/45713699/inferschema-using-spark-read-formatcom-crealytics-spark-excel-is-inferring-d
>>>
>>>
>>> This looks like a probable bug in the crealytics Excel reader package.
>>>
>>> Can somebody help me with a workaround for this?
>>>
>>> Thanks,
>>> Aakash.
>>>
>>
>>
>
>
