spark-issues mailing list archives

From "Simeon Simeonov (JIRA)" <>
Subject [jira] [Commented] (SPARK-10574) HashingTF should use MurmurHash3
Date Thu, 01 Oct 2015 14:11:27 GMT


Simeon Simeonov commented on SPARK-10574:

[~josephkb] any thoughts on the above?

> HashingTF should use MurmurHash3
> --------------------------------
>                 Key: SPARK-10574
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.5.0
>            Reporter: Simeon Simeonov
>            Priority: Critical
>              Labels: HashingTF, hashing, mllib
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are two significant
problems with this.
> First, per the [Scala documentation|] for {{hashCode}}, the implementation is platform specific. This means that feature vectors
created on one platform may differ from vectors created on another. This can
cause significant problems when a model trained offline is used in another environment for
online prediction. The problem is made harder by the fact that, after a hashing transform,
features lose their human-tractable meaning, so a problem such as this may be extremely difficult
to track down.
> Second, the native Scala hashing function performs badly on longer strings, exhibiting
[200-500% higher collision rates|] than,
for example, [MurmurHash3|$],
which is also included in the standard Scala library and is the hashing choice of fast learners
such as Vowpal Wabbit, scikit-learn, and others. If Spark users apply {{HashingTF}} only to
very short, dictionary-like strings, the choice of hash function will not be a big problem,
but why keep an implementation in MLlib with this limitation when a better implementation
is readily available in the standard Scala library?
> Switching to MurmurHash3 solves both problems. If there is agreement that this is a good
change, I can prepare a PR. 
> Note that changing the hash function would mean that models saved with a previous version
would have to be re-trained. This introduces a problem that is orthogonal to breaking changes
in APIs: breaking changes related to artifacts, e.g., a saved model, produced by a previous
version. Is there a policy or best practice currently in effect about this? If not, perhaps
we should come up with a few simple rules about how we communicate these in release notes.
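A minimal sketch of the change being proposed, assuming the usual term-to-column-index mapping; the object name, helper names, and the default seed below are illustrative, not Spark's actual API:

```scala
import scala.util.hashing.MurmurHash3

object HashingSketch {
  // Map a term to a column index in [0, numFeatures) using MurmurHash3,
  // which is seeded and yields the same value on every platform.
  // The seed value here is an illustrative choice, not anything Spark defines.
  def murmurIndex(term: String, numFeatures: Int, seed: Int = 42): Int = {
    val h = MurmurHash3.stringHash(term, seed)
    ((h % numFeatures) + numFeatures) % numFeatures // non-negative modulo
  }

  // What HashingTF effectively does today: the native `##` hash, whose
  // contract (per the Scala docs) does not guarantee cross-platform stability.
  def nativeIndex(term: String, numFeatures: Int): Int = {
    val h = term.##
    ((h % numFeatures) + numFeatures) % numFeatures
  }
}
```

Because {{MurmurHash3.stringHash}} is seeded and fully specified, the murmur-based index is reproducible across JVMs and platforms, which is exactly the property the offline-training / online-prediction scenario above requires.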

This message was sent by Atlassian JIRA
