From <Huon.Wil...@data61.csiro.au>
Subject Re: [SQL] 64-bit hash function, and seeding
Date Tue, 05 Mar 2019 23:33:35 GMT
Hi Nicolas,

On 6/3/19, 7:48 am, "Nicolas Paris" <nicolas.paris@riseup.net> wrote:

    Hi Huon
    
    Good catch. A 64-bit hash is definitely a useful function.
    
    > the birthday paradox implies >50% chance of at least one for tables
    > larger than 77000 rows
    
    Do you know how many rows are needed for a 50% chance with a 64-bit hash?

5 billion: it's roughly equal to the square root of the total number of possible
hash values. There's a detailed table at
https://en.wikipedia.org/wiki/Birthday_problem#Probability_table .
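As a sanity check, the usual approximation for the 50% point is sqrt(2m ln 2) for
m possible hash values, which is easy to evaluate directly (a minimal sketch;
`fiftyPercentRows` is just an illustrative name):

import scala.math.{log, pow, sqrt}

// number of rows for a ~50% chance of at least one collision among
// 2^bits possible hash values, via the approximation sqrt(2m ln 2)
def fiftyPercentRows(bits: Int): Double =
  sqrt(2.0 * pow(2.0, bits) * log(2.0))

fiftyPercentRows(32)  // ~77163, matching the 77000 figure quoted above
fiftyPercentRows(64)  // ~5.06e9, i.e. about 5 billion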

Note, for my application a few collisions are fine. There are a few ways of trying
to quantify this; one is the maximum number of items that all hash to a single
particular hash value: if one has 4 billion rows with a 32-bit hash, the size of
this largest set is likely to be 14 (and there will be many other, smaller sets of
colliding values). With a 64-bit hash, it is likely to be 2, and the table size can
be as large as ~8 trillion before the expected maximum exceeds 3.
(https://en.wikipedia.org/wiki/Balls_into_bins_problem#Random_allocation)
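To get a feel for that largest-set number without the full math, one can simulate
at a smaller scale (an illustrative sketch only; 2^20 bins stands in for the hash
range, not the 4-billion-row case above):

import scala.util.Random

// throw 2^20 "rows" into 2^20 "hash value" bins and report the size of
// the most-collided bin; at this scale the maximum is typically 8-10
val bins = 1 << 20
val counts = new Array[Int](bins)
val rng = new Random(42)
for (_ <- 0 until bins) counts(rng.nextInt(bins)) += 1
println(counts.max)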

Another way is the expected number of collisions: for the three cases above it is
1.6 billion (32-bit hash, 4 billion rows), 0.5 (64-bit, 4 billion), and 2.1 million
(64-bit, 8 trillion). (http://matt.might.net/articles/counting-hash-collisions/)
    
    About the seed column, to me there is no need for such an argument: you
    can just add an integer as a regular column.
   
You are correct that this works, but it increases the amount of computation
(doubling it when hashing just a single column). For multiple columns col1, col2,
..., colN, the `hash` function works approximately like this (in pseudo-Scala,
simplified from Spark's actual implementation):

val InitialSeed = 42L
def hash(col1, col2, ..., colN) = {
  var value = InitialSeed
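  // fold in each column, feeding the running hash back in as the next seed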
  value = hashColumn(col1, seed = value)
  value = hashColumn(col2, seed = value)
  ...
  value = hashColumn(colN, seed = value)
  return value
}

If that starting value can be customized, then a call like `hash(lit(mySeed), column)`
(which has to do the work of hashing two columns) can instead just start at `mySeed`
and hash only one column. That said, for the hashes Spark uses (xxHash and
MurmurHash3), the hashing operation isn't too expensive, especially for ints/longs.
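For reference, the seeded overload could look something like the following,
mirroring the one-line `hash64` definition quoted below (a hypothetical sketch;
whether `XxHash64` has a constructor taking the seed directly like this is an
assumption, based on the expression already supporting one):

// hypothetical overload for sql/functions.scala; the (exprs, seed)
// constructor on XxHash64 is an assumption
def hash64(seed: Long, cols: Column*): Column =
  withExpr { new XxHash64(cols.map(_.expr), seed) }

With overloads like this, the `hash(lit(mySeed), column)` example above becomes
`hash(mySeed, column)`, hashing just the one column.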

Huon
 
    About the process for pull requests, I cannot help much
    
    
    On Tue, Mar 05, 2019 at 04:30:31AM +0000, Huon.Wilson@data61.csiro.au wrote:
    > Hi,
    > 
    > I’m working on something that requires deterministic randomness, i.e. a row
    > gets the same “random” value no matter the order of the DataFrame. A seeded
    > hash seems to be the perfect way to do this, but the existing hashes have
    > various limitations:
    > 
    > - hash: 32-bit output (only 4 billion possibilities will result in a lot of
    >   collisions for many tables: the birthday paradox implies >50% chance of at
    >   least one for tables larger than 77000 rows)
    > - sha1/sha2/md5: single binary column input, string output
    > 
    > It seems there’s already support for a 64-bit hash function that can work
    > with an arbitrary number of arbitrary-typed columns: XxHash64, and exposing
    > this for DataFrames seems like it’s essentially one line in sql/functions.scala
    > to match `hash` (plus docs, tests, function registry etc.):
    > 
    >     def hash64(cols: Column*): Column = withExpr { new XxHash64(cols.map(_.expr)) }
    > 
    > For my use case, this can then be used to get a 64-bit “random” column like 
    > 
    >     val seed = rng.nextLong()
    >     hash64(lit(seed), col1, col2)
    > 
    > I’ve created a (hopefully) complete patch by mimicking ‘hash’ at
    > https://github.com/apache/spark/compare/master...huonw:hash64; should I open
    > a JIRA and submit it as a pull request?
    > 
    > Additionally, both hash and the new hash64 already have support for being
    > seeded, but this isn’t exposed directly and instead requires something like
    > the `lit` above. Would it make sense to add overloads like the following?
    > 
    >     def hash(seed: Int, cols: Column*) = …
    >     def hash64(seed: Long, cols: Column*) = …
    > 
    > Though, it does seem a bit unfortunate to be forced to pass the seed first.
    > 
    > - Huon
    
    -- 
    nicolas
    
