spark-user mailing list archives

From Enrico Minack <m...@Enrico.Minack.dev>
Subject Re: Issue with UDF Int Conversion - Str to Int
Date Mon, 23 Mar 2020 14:23:04 GMT
Ayan,

There is no need for UDFs; the SQL functions API provides everything you need (sha1, substring, conv):
https://spark.apache.org/docs/2.4.5/api/python/pyspark.sql.html

 >>> from pyspark.sql.functions import col, conv, sha1, substring
 >>> df.select(conv(substring(sha1(col("value_to_hash")), 33, 8), 16, 10).cast("long").alias("sha2long")).show()
+----------+
|  sha2long|
+----------+
| 478797741|
|2520346415|
+----------+
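
For comparison, here is a minimal pure-Python sketch of the same three steps, assuming Spark's sha1 hashes the UTF-8 bytes of the string (as your UDF does). The helper name sha1_tail_to_int is hypothetical and only for illustration. Note that SQL substring positions are 1-based, so substring(..., 33, 8) corresponds to the Python slice [32:40]:

```python
import hashlib

def sha1_tail_to_int(value: str) -> int:
    """Hypothetical helper mirroring conv(substring(sha1(x), 33, 8), 16, 10)."""
    # A SHA-1 hex digest is always 40 hex characters long.
    digest = hashlib.sha1(value.encode("utf-8")).hexdigest()
    # SQL substring(..., 33, 8) is 1-based; the equivalent 0-based slice is [32:40].
    tail = digest[32:40]
    # Parse the 8 hex characters as a base-16 number, like conv(..., 16, 10).
    return int(tail, 16)

for value in ("4104003141", "4102859263"):
    print(value, sha1_tail_to_int(value))
```

Eight hex characters can encode any value from 0 to 2**32 - 1, which is worth keeping in mind when choosing the column type.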

This creates a lean query plan:

 >>> df.select(conv(substring(sha1(col("value_to_hash")), 33, 8), 16, 10).cast("long").alias("sha2long")).explain()
== Physical Plan ==
Union
:- *(1) Project [478797741 AS sha2long#74L]
:  +- Scan OneRowRelation[]
+- *(2) Project [2520346415 AS sha2long#76L]
    +- Scan OneRowRelation[]
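
A note on the cast("long") and on the mismatch in your output: IntegerType is a signed 32-bit type with a maximum of 2147483647, while 2520346415 is larger than that. The -1774620881 you see is exactly what 2520346415 becomes when its low 32 bits are read as a signed integer. I have not traced how PySpark converts the UDF result internally, so treat the following as an arithmetic sketch that is consistent with your output, not as a description of Spark's code path. The helper name wrap_to_int32 is hypothetical:

```python
INT32_MAX = 2**31 - 1  # 2147483647, the largest signed 32-bit integer

def wrap_to_int32(n: int) -> int:
    """Hypothetical helper: reinterpret the low 32 bits of n as a signed 32-bit integer."""
    n &= 0xFFFFFFFF  # keep only the low 32 bits
    return n - 2**32 if n > INT32_MAX else n

print(wrap_to_int32(478797741))   # 478797741: fits in 32 bits, unchanged
print(wrap_to_int32(2520346415))  # -1774620881: too large, wraps around
```

LongType is a signed 64-bit type, so every 8-hex-digit value fits without wrapping. The same should apply to a UDF declared with LongType() instead of IntegerType().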


Enrico


On 23.03.20 at 06:13, ayan guha wrote:
> Hi
>
> I am trying to implement simple hashing/checksum logic. The key logic 
> is -
>
> 1. Generate sha1 hash
> 2. Extract last 8 chars
> 3. Convert 8 chars to Int (using base 16)
>
> Here is the cut down version of the code:
>
> ---------------------------------------------------------------------------------------
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> from hashlib import sha1 as local_sha1
> df = spark.sql("select '4104003141' value_to_hash union all select '4102859263'")
> f1 = lambda x: str(int(local_sha1(x.encode('UTF-8')).hexdigest()[32:],16))
> f2 = lambda x: int(local_sha1(x.encode('UTF-8')).hexdigest()[32:],16)
> sha2Int1 = udf( f1 , StringType())
> sha2Int2 = udf( f2 , IntegerType())
> print(f2('4102859263'))
> dfr = df.select(df.value_to_hash,
>                 sha2Int1(df.value_to_hash).alias('1'),
>                 sha2Int2(df.value_to_hash).alias('2'))
> dfr.show(truncate=False)
> ---------------------------------------------------------------------------------------------
>
> I was expecting both columns to provide exactly the same values; however, 
> that is not always the case:
>
> 2520346415
> +-------------+----------+-----------+
> |value_to_hash|1         |2          |
> +-------------+----------+-----------+
> |4104003141   |478797741 |478797741  |
> |4102859263   |2520346415|-1774620881|
> +-------------+----------+-----------+
>
> The function works fine, as shown by the print statement. However, the 
> values in the two columns do not match and differ widely.
>
> Any pointers?
>
> -- 
> Best Regards,
> Ayan Guha


