spark-user mailing list archives

From <Huon.Wil...@data61.csiro.au>
Subject [SQL] 64-bit hash function, and seeding
Date Tue, 05 Mar 2019 04:30:31 GMT
Hi,

I’m working on something that requires deterministic randomness, i.e. a row gets the same
“random” value no matter the order of the DataFrame. A seeded hash seems to be the perfect
way to do this, but the existing hashes have various limitations:

- hash: 32-bit output (only 4 billion possible values, which leads to a lot of collisions for
many tables: the birthday paradox implies a >50% chance of at least one collision for tables
larger than about 77000 rows)
- sha1/sha2/md5: single binary column input, string output
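
As a rough check of the 77000-row figure, here is a minimal sketch of the standard birthday-bound approximation, P(collision) ≈ 1 - exp(-n(n-1) / 2N) for n rows and N possible hash values. The object and method names are hypothetical, and the formula assumes hash values are uniformly distributed.

```scala
object BirthdayBound {
  // Approximate probability of at least one collision among n values
  // drawn uniformly from a space of 2^bits possible values.
  def collisionProbability(n: Double, bits: Int): Double = {
    val space = math.pow(2.0, bits)
    1.0 - math.exp(-n * (n - 1) / (2 * space))
  }

  // Approximate row count at which the collision probability reaches 50%,
  // from solving n^2 / (2 * 2^bits) = ln 2 for n.
  def rowsForHalfChance(bits: Int): Double =
    math.sqrt(2 * math.log(2) * math.pow(2.0, bits))

  def main(args: Array[String]): Unit = {
    println(f"32-bit hash: 50%% collision chance at about ${rowsForHalfChance(32)}%.0f rows")
    println(f"64-bit hash: 50%% collision chance at about ${rowsForHalfChance(64)}%.3e rows")
  }
}
```

Under the same approximation, a 64-bit hash does not reach the 50% point until the table has billions of rows.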

It seems there’s already support for a 64-bit hash function that can work with an arbitrary
number of arbitrary-typed columns: XxHash64, and exposing this for DataFrames seems like it’s
essentially one line in sql/functions.scala to match `hash` (plus docs, tests, function registry
etc.):

    def hash64(cols: Column*): Column = withExpr { new XxHash64(cols.map(_.expr)) }

For my use case, this can then be used to get a 64-bit “random” column like:

    val seed = rng.nextLong()
    hash64(lit(seed), col1, col2)
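
The order-independence property can be illustrated outside Spark with a small sketch. It swaps in the Scala standard library's `scala.util.hashing.MurmurHash3` (a 32-bit hash) as a stand-in for XxHash64, so it shows only the seeding idea, not the proposed function. The `Row` case class and `seededValue` helper are hypothetical names.

```scala
import scala.util.hashing.MurmurHash3

object SeededRowHash {
  // Hypothetical row type standing in for two DataFrame columns.
  final case class Row(col1: String, col2: Long)

  // Seeded hash of the row's values: depends only on the seed and the
  // row's contents, never on the row's position in the collection.
  def seededValue(seed: Int, row: Row): Int =
    MurmurHash3.orderedHash(Seq(row.col1, row.col2), seed)

  def main(args: Array[String]): Unit = {
    val seed = 12345 // fixed here; the message draws it from an RNG
    val rows = Seq(Row("a", 1L), Row("b", 2L), Row("c", 3L))

    // Compute each row's value with the rows in two different orders.
    val forward  = rows.map(r => r -> seededValue(seed, r)).toMap
    val reversed = rows.reverse.map(r => r -> seededValue(seed, r)).toMap

    // Each row maps to the same value regardless of ordering.
    println(forward == reversed)
  }
}
```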

I’ve created a (hopefully) complete patch by mimicking ‘hash’ at https://github.com/apache/spark/compare/master...huonw:hash64;
should I open a JIRA and submit it as a pull request?

Additionally, both hash and the new hash64 already have support for being seeded, but this
isn’t exposed directly and instead requires something like the `lit` above. Would it make
sense to add overloads like the following?

    def hash(seed: Int, cols: Column*) = …
    def hash64(seed: Long, cols: Column*) = …

Though, it does seem a bit unfortunate to be forced to pass the seed first.

- Huon

 
