spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type
Date Wed, 16 Aug 2017 18:31:15 GMT
You can use Apache POI's DateUtil to convert the double to a Date (https://poi.apache.org/apidocs/org/apache/poi/ss/usermodel/DateUtil.html).
Alternatively, you can try HadoopOffice (https://github.com/ZuInnoTe/hadoopoffice/wiki); it
is available as a data source for Spark 1.x and for Spark 2.0.
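
If calling POI from PySpark is awkward, the same conversion can be sketched in plain Python. This is only a rough equivalent, not POI's API: the function name and the epoch constant are made up for illustration, and it assumes the workbook uses Excel's default 1900 date system and that the doubles are serial dates on or after 1 March 1900.

```python
from datetime import datetime, timedelta

# Assumption: the workbook uses the 1900 date system. Counting from
# 1899-12-30 compensates for Excel treating 1900 as a leap year, so the
# result is correct for serial values of 61 and above.
EXCEL_1900_EPOCH = datetime(1899, 12, 30)


def excel_serial_to_datetime(serial):
    """Convert an Excel serial date (a double) to a Python datetime.

    The integer part is the number of days, the fractional part is the
    time of day. None is passed through so null cells stay null.
    """
    if serial is None:
        return None
    return EXCEL_1900_EPOCH + timedelta(days=float(serial))


print(excel_serial_to_datetime(42963.0))  # 2017-08-16 00:00:00
```

In a DataFrame this could be wrapped in a UDF that returns a timestamp and applied to the column that was inferred as double.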

> On 16. Aug 2017, at 20:15, Aakash Basu <aakash.spark.raj@gmail.com> wrote:
> 
> Hey Irving,
> 
> Thanks for the quick reply. In Excel that column is purely a string. I actually want to
> import it as a String and later work on the DF to convert it back to a date type, but
> the API does not let me assign a schema to the DF, so I am forced to use inferSchema,
> which converts all numeric columns to double (though I don't know how the date column
> ends up as double if it is a string in the Excel source).
> 
> Thanks,
> Aakash.
> 
> 
> On 16-Aug-2017 11:39 PM, "Irving Duran" <irving.duran@gmail.com> wrote:
> I think there is a difference between the actual value in the cell and how Excel formats
> that cell. You probably want to import that field as a string, or not have it in a date
> format in Excel.
> 
> Just a thought....
> 
> 
> Thank You,
> 
> Irving Duran
> 
>> On Wed, Aug 16, 2017 at 12:47 PM, Aakash Basu <aakash.spark.raj@gmail.com> wrote:
>> Hey all,
>> 
>> I forgot to attach the link to the discussion about overriding the schema in the external package.
>> 
>> https://github.com/crealytics/spark-excel/pull/13
>> 
>> You can see my comment there too.
>> 
>> Thanks,
>> Aakash.
>> 
>>> On Wed, Aug 16, 2017 at 11:11 PM, Aakash Basu <aakash.spark.raj@gmail.com> wrote:
>>> Hi all,
>>> 
>>> I am working with PySpark (Python 3.6 and Spark 2.1.1) and trying to read data
>>> from an Excel file using spark.read.format("com.crealytics.spark.excel"), but it is
>>> inferring double for a date type column.
>>> 
>>> The detailed description is given here (the question I posted) -
>>> 
>>> https://stackoverflow.com/questions/45713699/inferschema-using-spark-read-formatcom-crealytics-spark-excel-is-inferring-d
>>> 
>>> 
>>> It looks like a probable bug in the crealytics Excel reader package.
>>> 
>>> Can somebody help me with a workaround for this?
>>> 
>>> Thanks,
>>> Aakash.
>> 
> 
> 
