Hello:

I am using a UDF to convert input data to a StructType based on a JSON schema. When a field's "type" in the schema is "number", I need to convert the input value to a float. For example, if "income" has type "number" and the input value is "100", the output should be "100.0". The problem is that when the original value is an integer, the output becomes null: in the example above, the output is "null".

Right now I have a temporary solution: traverse the schema to find every key whose "type" is "number", store each such key's path from the root in a list, and then traverse the input data and, following the path list, convert each of those numbers to a float.
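A minimal sketch of this path-based approach, assuming a schema that only nests objects via "properties" (the helper names `collect_number_paths` and `coerce_numbers` are hypothetical, not from any library):

```python
def collect_number_paths(schema, prefix=()):
    """Collect the path to every field whose JSON-schema type is "number"."""
    paths = []
    for key, sub in schema.get("properties", {}).items():
        if sub.get("type") == "number":
            paths.append(prefix + (key,))
        elif sub.get("type") == "object":
            paths.extend(collect_number_paths(sub, prefix + (key,)))
    return paths

def coerce_numbers(record, paths):
    """Convert the integer at the end of each collected path to a float."""
    for path in paths:
        node = record
        for key in path[:-1]:
            node = node.get(key, {})
        if isinstance(node.get(path[-1]), int):
            node[path[-1]] = float(node[path[-1]])
    return record

schema = {"type": "object",
          "properties": {"income": {"type": "number"},
                         "nested": {"type": "object",
                                    "properties": {"score": {"type": "number"}}}}}
record = {"income": 100, "nested": {"score": 7}}
coerce_numbers(record, collect_number_paths(schema))
# record is now {"income": 100.0, "nested": {"score": 7.0}}
```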

But this algorithm has a problem: when a key is "items", the value is nested in an array, so there can be more than one number under a single path, and there are further cases such as "items" under "items", or "items" under "properties". The algorithm cannot handle all of these corner cases.
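One way around the path bookkeeping is to walk the schema and the data together, so "properties" (objects) and "items" (arrays) can nest to any depth. This is only a sketch under that assumption; `coerce_by_schema` is a hypothetical helper, not a library function:

```python
def coerce_by_schema(schema, value):
    """Recursively coerce ints to floats wherever the schema says "number"."""
    t = schema.get("type")
    if t == "number" and isinstance(value, int) and not isinstance(value, bool):
        return float(value)
    if t == "object" and isinstance(value, dict):
        props = schema.get("properties", {})
        return {k: coerce_by_schema(props[k], v) if k in props else v
                for k, v in value.items()}
    if t == "array" and isinstance(value, list):
        item_schema = schema.get("items", {})
        return [coerce_by_schema(item_schema, v) for v in value]
    return value

# "items" under "items": a matrix of numbers.
schema = {"type": "object",
          "properties": {"matrix": {"type": "array",
                                    "items": {"type": "array",
                                              "items": {"type": "number"}}}}}
coerce_by_schema(schema, {"matrix": [[1, 2], [3]]})
# -> {"matrix": [[1.0, 2.0], [3.0]]}
```

Because the recursion mirrors the schema's own structure, "items" under "items" and "items" under "properties" fall out of the same three cases rather than needing separate handling.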

So may I ask whether there are any other solutions that could help fix this UDF integer-to-float conversion problem?

=======================

• We have data records loaded into Python dictionaries from JSON, and some fields (e.g. "income") have mixed values: in some records "income" is parsed as an integer (e.g. 100) and in some it is parsed as a float (e.g. 100.0).
• JSON { "income": 100 }, { "income": 100.0 } -> Python { "income": 100 }, { "income": 100.0 }
• We load these records as JSON strings into a dataframe, then we convert them into a StructType using pyspark.sql.functions.udf. The int/float mixed numerical fields are marked as FloatType().
• Python { "income": 100 }, { "income": 100.0 } -> StructType(FloatType())
• We have observed that when PySpark 2.3 casts from "int" to "FloatType", it coerces integer values like 100 to "null" instead of to 100.0.
• Observed: Python 100 -> FloatType null
• Desired: Python 100 -> FloatType 100.0
• This behavior may also be true in Scala.
• We are currently trying to patch this problem by adding logic inside the Python function to recursively convert any integers to floats before returning from the UDF.
• We don't want to introduce this custom and error-prone logic.
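For reference, the recursive patch described above looks roughly like the sketch below (the helper name `ints_to_floats` is hypothetical). Note that it converts every int it finds, not just the FloatType fields, which is part of why it feels error-prone:

```python
def ints_to_floats(value):
    """Recursively turn every int into a float before the UDF returns,
    so FloatType fields never receive a Python int. Booleans are left
    alone, since bool is a subclass of int in Python."""
    if isinstance(value, bool):
        return value
    if isinstance(value, int):
        return float(value)
    if isinstance(value, dict):
        return {k: ints_to_floats(v) for k, v in value.items()}
    if isinstance(value, list):
        return [ints_to_floats(v) for v in value]
    return value

ints_to_floats({"income": 100, "tags": [1, 2.5]})
# -> {"income": 100.0, "tags": [1.0, 2.5]}
```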

Have members of this community encountered this issue in PySpark or in Scala? If so, how have you solved it?

Is there a way in PySpark to enable implicit conversion of Python integers (e.g. 100) to PySpark FloatType (e.g. 100.0)?

Thanks a lot!

Sincerely,

Danni