spark-user mailing list archives

From ayan guha <guha.a...@gmail.com>
Subject Issue with UDF Int Conversion - Str to Int
Date Mon, 23 Mar 2020 05:13:42 GMT
Hi

I am trying to implement simple hashing/checksum logic. The key logic is -

1. Generate sha1 hash
2. Extract last 8 chars
3. Convert 8 chars to Int (using base 16)
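
The three steps above can be sketched in plain Python, outside Spark. This is only an illustration: the helper name `sha1_tail_to_int` is hypothetical and does not appear in my actual code. A SHA-1 hex digest is 40 characters long, so the last 8 characters are the slice `[32:]`.

```python
from hashlib import sha1


def sha1_tail_to_int(value: str) -> int:
    """Hypothetical helper: SHA-1 -> last 8 hex chars -> base-16 integer."""
    # Step 1: generate the SHA-1 hash as a 40-character hex string.
    digest = sha1(value.encode("UTF-8")).hexdigest()
    # Step 2: extract the last 8 characters (positions 32..39).
    tail = digest[32:]
    # Step 3: parse those 8 hex characters as an integer.
    return int(tail, 16)


# 8 hex characters encode 32 bits, so the result lies in [0, 2**32 - 1].
print(sha1_tail_to_int("4102859263"))
```

Python integers are unbounded, so this always returns a non-negative value, including values above 2**31 - 1.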

Here is a cut-down version of the code:

---------------------------------------------------------------------------------------
from pyspark.sql.functions import *
from pyspark.sql.types import *
from hashlib import sha1 as local_sha1

df = spark.sql("select '4104003141' value_to_hash union all  select '4102859263'")

# f1 returns the integer as a string, f2 returns it as a Python int
f1 = lambda x: str(int(local_sha1(x.encode('UTF-8')).hexdigest()[32:],16))
f2 = lambda x: int(local_sha1(x.encode('UTF-8')).hexdigest()[32:],16)

sha2Int1 = udf( f1 , StringType())
sha2Int2 = udf( f2 , IntegerType())

print(f2('4102859263'))

dfr = df.select(df.value_to_hash,
                sha2Int1(df.value_to_hash).alias('1'),
                sha2Int2(df.value_to_hash).alias('2'))
dfr.show(truncate=False)
---------------------------------------------------------------------------------------

I was expecting both columns to contain exactly the same values; however,
that is not always the case.

2520346415
+-------------+----------+-----------+
|value_to_hash|1         |2          |
+-------------+----------+-----------+
|4104003141   |478797741 |478797741  |
|4102859263   |2520346415|-1774620881|
+-------------+----------+-----------+

The function works fine, as shown by the print statement. However, the
values in the two columns do not always match, and when they differ they
differ widely.
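
One observation that may help narrow this down: in the second row the two values differ by exactly 2**32 (2520346415 - 4294967296 = -1774620881), which is consistent with the number being read as a signed 32-bit integer. The sketch below only reproduces that arithmetic in plain Python; the helper `as_signed_32` is hypothetical, and I am not claiming this is what Spark does internally.

```python
def as_signed_32(n: int) -> int:
    """Hypothetical helper: reinterpret n as a signed 32-bit two's-complement value."""
    n &= 0xFFFFFFFF  # keep only the low 32 bits
    # Values with the top bit set map to negative numbers.
    return n - 2**32 if n >= 2**31 else n


print(as_signed_32(2520346415))  # -1774620881, matches column 2 in the second row
print(as_signed_32(478797741))   # 478797741, below 2**31 so it is unchanged
```

The first row's value is below 2**31 - 1 (2147483647), which would explain why only some rows disagree.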

Any pointers?

-- 
Best Regards,
Ayan Guha
