spark-issues mailing list archives

From "Hyukjin Kwon (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-26240) [pyspark] Updating illegal column names with withColumnRenamed does not change schema changes, causing pyspark.sql.utils.AnalysisException
Date Wed, 13 Feb 2019 03:04:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-26240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-26240.
----------------------------------
    Resolution: Incomplete

Resolving this due to no feedback from reporter.

> [pyspark] Updating illegal column names with withColumnRenamed does not change schema
changes, causing pyspark.sql.utils.AnalysisException
> ------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-26240
>                 URL: https://issues.apache.org/jira/browse/SPARK-26240
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1
>         Environment: Ubuntu 16.04 LTS (x86_64/deb)
>  
>            Reporter: Ying Wang
>            Priority: Major
>
> I am unfamiliar with the internals of Spark, but I tried to ingest a Parquet file with
> illegal column names. After calling df = df.withColumnRenamed($COLUMN_NAME, $NEW_COLUMN_NAME)
> and then df.show(), pyspark still raised an error that named the old column as the
> failing attribute.
> Steps to reproduce:
> - Create a Parquet file from Pandas using this dataframe schema:
> ```python
> In [10]: df.info()
> <class 'pandas.core.frame.DataFrame'>
> Int64Index: 1000 entries, 0 to 999
> Data columns (total 16 columns):
> Record_ID 1000 non-null int64
> registration_dttm 1000 non-null object
> id 1000 non-null int64
> first_name 984 non-null object
> last_name 1000 non-null object
> email 984 non-null object
> gender 933 non-null object
> ip_address 1000 non-null object
> cc 709 non-null float64
> country 1000 non-null object
> birthdate 803 non-null object
> salary 932 non-null float64
> title 803 non-null object
> comments 179 non-null object
> Unnamed: 14 10 non-null object
> Unnamed: 15 9 non-null object
> dtypes: float64(2), int64(2), object(12)
> memory usage: 132.8+ KB
> ```
> - Open the pyspark shell with `pyspark` and read in the Parquet file with `spark_df = spark.read.format('parquet').load('/path/to/file.parquet')`.
> - Call `spark_df.show()`. Note the error with column 'Unnamed: 14'.
> - Rename the column, replacing the illegal space character with an underscore: `spark_df = spark_df.withColumnRenamed('Unnamed: 14', 'Unnamed:_14')`.
> - Call `spark_df.show()` again, and note that the error message still names attribute 'Unnamed: 14':
> ```python
> >>> df = spark.read.parquet('/home/yingw787/Downloads/userdata1.parquet')
> >>> newdf = df.withColumnRenamed('Unnamed: 14', 'Unnamed:_14')
> >>> newdf.show()
> Traceback (most recent call last):
>  File "/home/yingw787/anaconda2/envs/scratch/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
>  return f(*a, **kw)
>  File "/home/yingw787/anaconda2/envs/scratch/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o32.showString.
> : org.apache.spark.sql.AnalysisException: Attribute name "Unnamed: 14" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
> ...
> ```
> I would have expected a way to read in Parquet files such that illegal column names
> can be changed after the Spark dataframe has been created, so this looks like
> unintended behavior. Please let me know if I am wrong.
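
One possible way to sidestep the error, sketched here as an assumption rather than a fix confirmed in this issue, is to sanitize the column names on the pandas side before the Parquet file is written. The helper names `sanitize_column_name` and `sanitize_columns` are hypothetical; the set of invalid characters is copied from the AnalysisException message quoted above.

```python
# Characters listed as invalid in the AnalysisException message:
# space, comma, semicolon, braces, parentheses, newline, tab, equals sign.
INVALID_CHARS = " ,;{}()\n\t="


def sanitize_column_name(name, replacement="_"):
    """Replace each character from INVALID_CHARS with `replacement`."""
    return "".join(replacement if ch in INVALID_CHARS else ch for ch in name)


def sanitize_columns(names, replacement="_"):
    """Return an {old_name: new_name} mapping for the names that change."""
    mapping = {}
    for name in names:
        new_name = sanitize_column_name(name, replacement)
        if new_name != name:
            mapping[name] = new_name
    return mapping


if __name__ == "__main__":
    columns = ["Record_ID", "comments", "Unnamed: 14", "Unnamed: 15"]
    mapping = sanitize_columns(columns)
    print(mapping)
    # With pandas available, the mapping could be applied before writing, e.g.
    #   pandas_df = pandas_df.rename(columns=mapping)
    # (pandas_df is a hypothetical variable holding the frame described above).
```

The mapping only contains the columns that need renaming, so it can be passed directly to a rename call. Two different names can collapse to the same sanitized name, so the result should be checked for duplicates before it is applied.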



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

