spark-issues mailing list archives

From "Hyukjin Kwon (Jira)" <j...@apache.org>
Subject [jira] [Assigned] (SPARK-29188) toPandas gets wrong dtypes when applied on empty DF
Date Thu, 12 Dec 2019 12:00:27 GMT

     [ https://issues.apache.org/jira/browse/SPARK-29188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-29188:
------------------------------------

    Assignee: David Lindelöf

> toPandas gets wrong dtypes when applied on empty DF
> ---------------------------------------------------
>
>                 Key: SPARK-29188
>                 URL: https://issues.apache.org/jira/browse/SPARK-29188
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.0.0, 2.4.4
>         Environment: >> uname -a
> Linux XXXXXXXXXXXXXXXX 4.14.104-95.84.amzn2.x86_64 #1 SMP Sat Mar 2 00:40:20 UTC 2019 x86_64 GNU/Linux
> >> python
> Python 3.6.7 | packaged by conda-forge | (default, Jul 2 2019, 02:18:42)
> [GCC 7.3.0] on linux
> >> conda list
> ...
> openjdk   8.0.192   h1de35cc_1003    conda-forge
> pandas    0.25.1    py36h86efe34_0   conda-forge
> py4j      0.10.7    py_1             conda-forge
> pyspark   2.4.4     py_0             conda-forge
> ....
>            Reporter: Radhwane Chebaane
>            Assignee: David Lindelöf
>            Priority: Major
>             Fix For: 3.0.0
>
>
> When calling toPandas on an empty DataFrame, all dtypes are set to `object`.
> {code:python}
> from datetime import datetime
> spark_df = spark.createDataFrame([(10, "Emy", datetime.today()), (11, "Bob", datetime.today())], ["age", "name", "date"])
> spark.createDataFrame(spark.sparkContext.emptyRDD(), schema=spark_df.schema).toPandas().dtypes
> {code}
> Result: 
> {code:bash}
> age     object
> name    object
> date    object
> dtype: object
> {code}
>  
> However, the dtypes are correct when converting the entire DataFrame (or at least one row of it) to pandas:
> {code:python}
> spark_df = spark.createDataFrame([(10, "Emy", datetime.today()), (11, "Bob", datetime.today())], ["age", "name", "date"])
> spark_df.limit(1).toPandas().dtypes
> {code}
>  Result:
> {code:bash}
> age              int64
> name            object
> date    datetime64[ns]
> dtype: object
> {code}
>  
> Is this intended? Why does toPandas not rely on the Spark DataFrame schema?
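
On affected versions, one possible workaround is to derive the pandas dtypes from the Spark schema by hand. The block below is only a minimal sketch of that idea and is not the change made for 3.0.0. The names `SPARK_TO_PANDAS_DTYPE` and `dtypes_from_schema` are hypothetical, the mapping table is an illustrative subset rather than PySpark's own, and the schema is written out by hand so the example runs without a Spark session.

```python
# Hypothetical lookup table: Spark SQL type names (as produced by
# DataType.simpleString(), e.g. "bigint") to pandas/NumPy dtype strings.
# This is an illustrative subset, not the mapping PySpark uses internally.
SPARK_TO_PANDAS_DTYPE = {
    "tinyint": "int8",
    "smallint": "int16",
    "int": "int32",
    "bigint": "int64",
    "float": "float32",
    "double": "float64",
    "boolean": "bool",
    "timestamp": "datetime64[ns]",
    "string": "object",
}


def dtypes_from_schema(fields):
    """Return {column name: dtype string} for (name, spark_type) pairs.

    Spark types missing from the table fall back to "object".
    """
    return {
        name: SPARK_TO_PANDAS_DTYPE.get(spark_type, "object")
        for name, spark_type in fields
    }


# Schema of the DataFrame from the report, written out by hand:
# Python int is inferred as bigint, str as string, datetime as timestamp.
schema_fields = [("age", "bigint"), ("name", "string"), ("date", "timestamp")]
dtypes = dtypes_from_schema(schema_fields)
print(dtypes)
```

With PySpark available, the pairs could instead be built from `spark_df.schema.fields` using `f.name` and `f.dataType.simpleString()`, and the resulting dict passed to `astype` on the empty pandas DataFrame returned by `toPandas()`. Integer columns that may contain nulls would need separate handling on non-empty data, since a plain `int64` column cannot hold missing values.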



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


