spark-issues mailing list archives

From "Philip Kahn (Jira)" <j...@apache.org>
Subject [jira] [Created] (SPARK-30239) [Python] Creating a dataframe with Pandas rather than Numpy datatypes fails
Date Thu, 12 Dec 2019 19:41:00 GMT
Philip Kahn created SPARK-30239:
-----------------------------------

             Summary: [Python] Creating a dataframe with Pandas rather than Numpy datatypes fails
                 Key: SPARK-30239
                 URL: https://issues.apache.org/jira/browse/SPARK-30239
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.4.3
         Environment: DataBricks: 48.00 GB | 24 Cores | DBR 6.0 | Spark 2.4.3 | Scala 2.11
            Reporter: Philip Kahn


It's possible to work with DataFrames in Pandas and shuffle them back over to Spark DataFrames
for processing; however, using Pandas extension datatypes like {{Int64}} ([https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html])
throws an error (that long / float can't be converted).

Internally, this is because {{np.nan}} is a float, and {{pd.Int64Dtype()}} allows only integers
except for the single float value {{np.nan}}.
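
As a small illustration of the dtype behavior described above (a sketch that assumes a pandas version with nullable integer support; the variable names are made up for the example):

```python
import numpy as np
import pandas as pd

# np.nan is an ordinary Python float.
nan_is_float = isinstance(np.nan, float)

# With the default NumPy-backed dtype, one missing value turns
# the whole integer column into floats.
as_default = pd.Series([1, 2, np.nan])

# With the nullable extension dtype, the values stay integers
# and the missing entry is tracked separately.
as_nullable = pd.Series([1, 2, np.nan], dtype="Int64")

print(nan_is_float)       # True
print(as_default.dtype)   # float64
print(as_nullable.dtype)  # Int64
```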

 

The current workaround for this is to use the columns as floats, and after conversion to the
Spark DataFrame, to recast the column as {{LongType()}}. For example:

 

{{from pyspark.sql.types import LongType}}

{{sdfC = spark.createDataFrame(kgridCLinked)}}

{{sdfC = sdfC.withColumn("gridID", sdfC["gridID"].cast(LongType()))}}

 

However, this is awkward and redundant.
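
The workaround described above could be wrapped up roughly as follows. This is only a sketch: the helper names {{nullable_ints_to_float}} and {{to_spark_with_long_columns}} are hypothetical, and the step that turns NaN into null before the cast is an extra precaution I am assuming is wanted, not something taken from the report. Note also that going through float64 cannot represent every integer above 2**53 exactly.

```python
from typing import List, Tuple

import pandas as pd


def nullable_ints_to_float(pdf: pd.DataFrame) -> Tuple[pd.DataFrame, List[str]]:
    """Return a copy with nullable-integer columns cast to float64,
    plus the names of the columns that were converted."""
    out = pdf.copy()
    converted = []
    for name in out.columns:
        dtype = out[name].dtype
        # Nullable integers (Int8..Int64, UInt8..UInt64) are extension
        # dtypes that pandas also reports as integer dtypes.
        is_extension = isinstance(dtype, pd.api.extensions.ExtensionDtype)
        if is_extension and pd.api.types.is_integer_dtype(dtype):
            out[name] = out[name].astype("float64")  # missing values become NaN
            converted.append(name)
    return out, converted


def to_spark_with_long_columns(spark, pdf: pd.DataFrame):
    """Hypothetical wrapper: create the Spark DataFrame from the float
    version, then cast the converted columns back to LongType."""
    # Imported lazily so the pandas helper works without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import LongType

    float_pdf, converted = nullable_ints_to_float(pdf)
    sdf = spark.createDataFrame(float_pdf)
    for name in converted:
        col = F.col(name)
        # Assumption: map any NaN to null first so the cast does not
        # have to decide what integer a NaN should become.
        cleaned = F.when(F.isnan(col), F.lit(None)).otherwise(col)
        sdf = sdf.withColumn(name, cleaned.cast(LongType()))
    return sdf
```

With an existing {{SparkSession}} named {{spark}}, the two lines from the workaround would then become {{sdfC = to_spark_with_long_columns(spark, kgridCLinked)}}.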





