spark-user mailing list archives

From Nicolas Paris <nicolas.pa...@riseup.net>
Subject Re: [SQL] 64-bit hash function, and seeding
Date Tue, 05 Mar 2019 20:47:51 GMT
Hi Huon

Good catch. A 64-bit hash is definitely a useful function.

> the birthday paradox implies >50% chance of at least one for tables
> larger than 77000 rows

Do you know how many rows it takes to reach a 50% chance of a collision
with a 64-bit hash?
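
For reference, the usual birthday approximation gives a rough answer. The
following is a minimal sketch, assuming the hash output is uniformly
distributed; the object and method names are made up for illustration. It
reproduces the ~77000-row figure quoted for 32 bits, and gives roughly 5
billion rows for 64 bits.

```scala
object BirthdayBound {
  // Approximate number of uniformly random `bits`-bit values needed for a
  // probability `p` of at least one collision:
  //   n ~ sqrt(2 * 2^bits * ln(1 / (1 - p)))
  // This is the standard birthday approximation, not an exact count.
  def rowsForCollisionProbability(bits: Int, p: Double): Double =
    math.sqrt(2.0 * math.pow(2.0, bits) * math.log(1.0 / (1.0 - p)))

  def main(args: Array[String]): Unit = {
    println(f"32-bit: ${rowsForCollisionProbability(32, 0.5)}%.0f rows")
    println(f"64-bit: ${rowsForCollisionProbability(64, 0.5)}%.3e rows")
  }
}
```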


About the seed argument: to me there is no need for it, since you can
just add an integer as a regular column.
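
A minimal sketch of passing the seed as a regular column, using the
existing 32-bit `hash` function with `lit`. The helper name, the column
names `col1`/`col2`, and the seed value are assumptions for illustration
only; it needs Spark on the classpath to run.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, hash, lit}

object SeedAsColumn {
  // Hypothetical helper: the seed is a literal column hashed together
  // with the data columns, so the result depends only on (seed, row).
  def withSeededHash(df: DataFrame, seed: Int): DataFrame =
    df.withColumn("h", hash(lit(seed), col("col1"), col("col2")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("seed-as-column")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("col1", "col2")
    withSeededHash(df, seed = 42).show()
    spark.stop()
  }
}
```

Because the seed is just another hashed input, changing it changes every
row's value, while a given row keeps the same value regardless of row order.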

About the process for pull requests, I cannot help much.


On Tue, Mar 05, 2019 at 04:30:31AM +0000, Huon.Wilson@data61.csiro.au wrote:
> Hi,
> 
> I’m working on something that requires deterministic randomness, i.e. a
> row gets the same “random” value no matter the order of the DataFrame. A
> seeded hash seems to be the perfect way to do this, but the existing
> hashes have various limitations:
> 
> - hash: 32-bit output (only 4 billion possibilities will result in a lot
>   of collisions for many tables: the birthday paradox implies >50% chance
>   of at least one for tables larger than 77000 rows)
> - sha1/sha2/md5: single binary column input, string output
> 
> It seems there’s already support for a 64-bit hash function that can work
> with an arbitrary number of arbitrary-typed columns: XxHash64, and
> exposing this for DataFrames seems like it’s essentially one line in
> sql/functions.scala to match `hash` (plus docs, tests, function registry
> etc.):
> 
>     def hash64(cols: Column*): Column = withExpr { new XxHash64(cols.map(_.expr)) }
> 
> For my use case, this can then be used to get a 64-bit “random” column like 
> 
>     val seed = rng.nextLong()
>     hash64(lit(seed), col1, col2)
> 
> I’ve created a (hopefully) complete patch by mimicking ‘hash’ at
> https://github.com/apache/spark/compare/master...huonw:hash64;
> should I open a JIRA and submit it as a pull request?
> 
> Additionally, both hash and the new hash64 already have support for being
> seeded, but this isn’t exposed directly and instead requires something
> like the `lit` above. Would it make sense to add overloads like the
> following?
> 
>     def hash(seed: Int, cols: Column*) = …
>     def hash64(seed: Long, cols: Column*) = …
> 
> Though, it does seem a bit unfortunate to be forced to pass the seed first.
> 
> - Huon
> 
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org


-- 
nicolas


