spark-user mailing list archives

From Danni Wu <anniea...@gmail.com>
Subject Seeking help of UDF number-float converting
Date Mon, 01 Jul 2019 22:25:11 GMT
Hello:



I am using a UDF to convert data according to a JSON schema. When a key in
the schema has “type” set to “number”, I need to convert the input data to
a float. For example, if “income” has type number and the input data is
“100”, the output should be “100.0”. The problem is that if the original
number is an integer, the output is null. In the example above, the output
is “null”.



My temporary solution is to traverse the schema, find every key whose
“type” is “number”, and store the path from the root to that key in a list.
I then traverse the input data and, following the path list, convert each
of those numbers to a float.



The problem with this algorithm is that when a key is “items”, meaning the
value is nested in an array, the value can contain more than one number.
There are also further cases, such as “items” under “items” or “items”
under “properties”. The algorithm cannot handle all of these corner cases.
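
For illustration, the nested cases above can be handled by recursing over
the schema and the data together instead of storing paths. This is only a
minimal sketch of that idea, and it is still custom logic. It assumes a
JSON-Schema-like dictionary where objects use “properties” and arrays use
“items” holding a single schema. The name coerce_numbers and the sample
schema are hypothetical.

```python
def coerce_numbers(value, schema):
    """Return a copy of `value` with ints turned into floats wherever the
    schema declares "type": "number". Other values are left unchanged."""
    schema_type = schema.get("type")

    if schema_type == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, int) and not isinstance(value, bool):
            return float(value)
        return value

    if schema_type == "object" and isinstance(value, dict):
        props = schema.get("properties", {})
        # Recurse only into keys the schema describes.
        return {
            key: coerce_numbers(item, props[key]) if key in props else item
            for key, item in value.items()
        }

    if schema_type == "array" and isinstance(value, list):
        # Assumption: "items" is one schema applied to every element.
        item_schema = schema.get("items", {})
        return [coerce_numbers(item, item_schema) for item in value]

    return value


# Hypothetical schema with "items" under "properties" and "items" under "items".
schema = {
    "type": "object",
    "properties": {
        "income": {"type": "number"},
        "age": {"type": "integer"},
        "history": {
            "type": "array",
            "items": {
                "type": "array",
                "items": {"type": "number"},
            },
        },
    },
}

record = {"income": 100, "age": 30, "history": [[1, 2.5], [3]]}
result = coerce_numbers(record, schema)
```

Because the function follows the schema at every level, each array element
is visited with the schema of its “items”, so no path list is needed.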



Could anyone suggest another solution to this integer-to-float conversion
problem in the UDF?




=======================

To add context:

   - We have data records loaded into Python dictionaries from JSON, and
   some fields (e.g. “income”) have mixed values – in some records “income” is
   parsed as an integer (e.g. “100”) and in some “income” is parsed as a float
   (e.g. “100.0”)
      - JSON { “income”: “100” }, { “income”: “100.0” } -> Python {
      “income”: 100 }, { “income”: 100.0}
   - We load these records as JSON strings into a dataframe, then we
   convert them into StructType using pyspark.sql.functions.udf. The int/float
   mixed numerical fields are marked as FloatType().
      - Python { “income”: 100 }, { “income”: 100.0} ->
      StructType(FloatType())
   - We have observed that when PySpark 2.3 casts from “int” to
   “FloatType”, it coerces integer values like “100” to “null” instead of to
   “100.0”.
      - Observed: Python 100 -> FloatType null
      - Desired: Python 100 -> FloatType 100.0
      - This behavior may also be true in Scala.
   - We are currently trying to patch this problem by adding logic inside
   the Python function to recursively convert any integers to floats in Python
   before returning from the UDF.
   - We don’t want to introduce this custom and error-prone logic.
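
As a rough sketch of the patch described in the list above, a
schema-agnostic helper could convert every integer in the returned
structure before the UDF returns it. The name ints_to_floats is
hypothetical. Note that this blanket conversion would also change fields
that are meant to stay integers, which is one reason we consider the
approach error-prone.

```python
def ints_to_floats(value):
    """Recursively convert every int in a nested dict/list to a float."""
    # bool is a subclass of int, so check it first and leave it alone.
    if isinstance(value, bool):
        return value
    if isinstance(value, int):
        return float(value)
    if isinstance(value, dict):
        return {key: ints_to_floats(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [ints_to_floats(item) for item in value]
    return value


converted = ints_to_floats(
    {"income": 100, "scores": [1, 2.0], "active": True, "name": "a"}
)
```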



Have members of this community encountered this issue in PySpark or in
Scala? If so, how have you solved it?

Is there a way in PySpark to enable implicit conversion of Python integers
(e.g. “100”) to PySpark FloatType (e.g. “100.0”)?







Thanks a lot!

Sincerely

Danni
