[ https://issues.apache.org/jira/browse/SPARK-25195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589850#comment-16589850
]
Maxim Gekk commented on SPARK-25195:
------------------------------------
> 1. Does this patch also solve problem 2, as described above?
No, it doesn't.
> 2. Do you know when it will be released?
It should be in the upcoming release 2.4.
> Extending from_json function
> ----------------------------
>
> Key: SPARK-25195
> URL: https://issues.apache.org/jira/browse/SPARK-25195
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, Spark Core
> Affects Versions: 2.3.1
> Reporter: Yuriy Davygora
> Priority: Minor
>
> Dear Spark and PySpark maintainers,
> I hope, that opening a JIRA issue is the correct way to request an improvement. If
it's not, please forgive me and kindly instruct me on how to do it instead.
> At our company, we are currently rewriting a lot of old MapReduce code with SPARK,
and the following use-case is quite frequent: Some string-valued dataframe columns are JSON-arrays,
and we want to parse them into array-typed columns.
> Problem number 1: The from_json function accepts as a schema only StructType or ArrayType(StructType),
but not an ArrayType of primitives. Submitting the schema in a string form like {noformat}{"containsNull":true,"elementType":"string","type":"array"}{noformat}
does not work either, the error message says, among other things, {noformat}data type mismatch:
Input schema array<string> must be a struct or an array of structs.{noformat}
> Problem number 2: Sometimes, in our JSON arrays we have elements of different types.
For example, we might have some JSON array like {noformat}["string_value", 0, true, null]{noformat}
which is JSON-valid with schema {noformat}{"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}{noformat}
(and, for instance the Python json.loads function has no problem parsing this), but such a
schema is not recognized, at all. The error message gets quite unreadable after the words
{noformat}ParseException: u'\nmismatched input{noformat}
> Here is some simple Python code to reproduce the problems (using pyspark 2.3.1 and
pandas 0.23.4):
> {noformat}
> import pandas as pd
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as F
> from pyspark.sql.types import StringType, ArrayType
> spark = SparkSession.builder.appName('test').getOrCreate()
> data = {'id' : [1,2,3], 'data' : ['["string1", true, null]', '["string2", false, null]',
'["string3", true, "another_string3"]']}
> pdf = pd.DataFrame.from_dict(data)
> df = spark.createDataFrame(pdf)
> df.show()
> df = df.withColumn("parsed_data", F.from_json(F.col('data'),
> ArrayType(StringType()))) # Does not work, because not a struct of array of structs
> df = df.withColumn("parsed_data", F.from_json(F.col('data'),
> '{"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}'))
# Does not work at all
> {noformat}
> For now, we have to use a UDF function, which calls python's json.loads, but this is,
for obvious reasons, suboptimal. If you could extend the functionality of the Spark from_json
function in the next release, this would be really helpful. Thank you in advance!
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org
|