spark-dev mailing list archives

From <Huon.Wil...@data61.csiro.au>
Subject Re: [SQL] hash: 64-bits and seeding
Date Fri, 08 Mar 2019 03:14:56 GMT
Thanks for the guidance. That was my initial inclination, but I decided that consistency with
the existing ‘hash’ was better. However, like you, I also prefer the specific form.

I’ve opened https://issues.apache.org/jira/browse/SPARK-27099 and submitted the patch (using
‘xxhash64’) at https://github.com/apache/spark/pull/24019.

- Huon

From: Reynold Xin <rxin@databricks.com>
Date: Thursday, 7 March 2019 at 6:33 pm
To: "Wilson, Huon (Data61, Eveleigh ATP)" <Huon.Wilson@data61.csiro.au>
Cc: "dev@spark.apache.org" <dev@spark.apache.org>
Subject: Re: [SQL] hash: 64-bits and seeding


Rather than calling it hash64, it'd be better to just call it xxhash64. The reason is that ten
years from now, we will probably look back and laugh at a specific hash implementation. It'd
be better to just name the expression for what it is.


On Wed, Mar 06, 2019 at 7:59 PM, <Huon.Wilson@data61.csiro.au> wrote:

Hi,

I’m working on something that requires deterministic randomness, i.e. a row gets the same
“random” value no matter the order of the DataFrame. A seeded hash seems to be the perfect
way to do this, but the existing hashes have various limitations:

- hash: 32-bit output (only about 4 billion possible values, which leads to many collisions
for large tables: the birthday paradox implies a >50% chance of at least one collision for
tables larger than about 77,000 rows, and roughly 1.6 billion collisions in a table of 4 billion rows)
- sha1/sha2/md5: single binary column input, string output
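
The collision figures for the 32-bit hash can be checked with the usual birthday approximations. The following is a minimal sketch in plain Scala, assuming hash outputs are uniformly distributed; the object and method names (`BirthdayBound`, `collisionProbability`, `expectedCollisions`) are made up for illustration and are not part of Spark.

```scala
object BirthdayBound {
  /** Approximate probability of at least one collision among n uniform `bits`-bit hashes. */
  def collisionProbability(n: Long, bits: Int): Double = {
    val space = math.pow(2.0, bits)
    // Birthday approximation: p = 1 - exp(-n(n-1) / (2N)).
    // expm1 keeps the result accurate when the probability is tiny.
    -math.expm1(-(n.toDouble * (n - 1).toDouble) / (2.0 * space))
  }

  /** Expected number of rows whose hash repeats an earlier row's hash. */
  def expectedCollisions(n: Long, bits: Int): Double = {
    val space = math.pow(2.0, bits)
    // Expected distinct values: N(1 - (1 - 1/N)^n), approximated by N(1 - exp(-n/N)).
    val expectedDistinct = space * (-math.expm1(-n.toDouble / space))
    n.toDouble - expectedDistinct
  }

  def main(args: Array[String]): Unit = {
    // 32-bit hash, about 77,000 rows: probability close to one half.
    println(collisionProbability(77163L, 32))
    // 32-bit hash, 2^32 rows: about N/e colliding rows.
    println(expectedCollisions(1L << 32, 32))
    // 64-bit hash, same 77,163 rows: a very small probability.
    println(collisionProbability(77163L, 64))
  }
}
```

Under the same assumption, a 64-bit output pushes the 50% point from tens of thousands of rows to a few billion rows, which is the motivation for the wider hash.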

It seems there’s already support for a 64-bit hash function that can work with an arbitrary
number of arbitrary-typed columns (XxHash64), and exposing this for DataFrames seems like
it’s essentially one line in sql/functions.scala to match `hash` (plus docs, tests, function
registry etc.):

def hash64(cols: Column*): Column = withExpr { new XxHash64(cols.map(_.expr)) }

For my use case, this can then be used to get a 64-bit “random” column like

val seed = rng.nextLong()
hash64(lit(seed), col1, col2)

I’ve created a (hopefully) complete patch by mimicking ‘hash’ at https://github.com/apache/spark/compare/master...huonw:hash64;
should I open a JIRA and submit it as a pull request?

Additionally, both hash and the new hash64 already have support for being seeded, but this
isn’t exposed directly and instead requires something like the `lit` above. Would it make
sense to add overloads like the following?

def hash(seed: Int, cols: Column*) = …
def hash64(seed: Long, cols: Column*) = …

Though, it does seem a bit unfortunate to be forced to pass the seed first.
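
For illustration, a seed-first wrapper can be sketched in user code on top of the existing 32-bit `hash` function. This is only a sketch under assumptions: `SeededHashSketch` and `seededHash` are hypothetical names, the seed is mixed in through the `lit` workaround rather than through the expression's own seed parameter, and running it needs spark-sql on the classpath.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, hash, lit}

object SeededHashSketch {
  // Hypothetical helper, not part of Spark: prepend the seed as a literal
  // column so that it is hashed together with the data columns.
  def seededHash(seed: Int, cols: Column*): Column =
    hash((lit(seed) +: cols): _*)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("seeded-hash-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "label")
    val seed = 17

    // Compute the per-row value on the original and on a reordered DataFrame.
    val forward = df.withColumn("r", seededHash(seed, col("id"), col("label")))
    val reversed = df.orderBy(col("id").desc)
      .withColumn("r", seededHash(seed, col("id"), col("label")))

    // Each row's value depends only on the seed and its own columns, so the
    // (id, r) pairs agree regardless of row order.
    val a = forward.collect().map(r => (r.getAs[Int]("id"), r.getAs[Int]("r"))).toSet
    val b = reversed.collect().map(r => (r.getAs[Int]("id"), r.getAs[Int]("r"))).toSet
    assert(a == b)

    spark.stop()
  }
}
```

Because the varargs parameter has to come last, a wrapper of this shape ends up taking the seed first as well.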

(I sent this email to user@spark.apache.org a few days
ago, but didn't get any discussion about the Spark aspects of this, so I'm resending it here;
I apologise in advance if I'm breaking protocol!)

- Huon Wilson
