spark-issues mailing list archives

From "Joseph K. Bradley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-10574) HashingTF should use MurmurHash3
Date Fri, 09 Oct 2015 04:39:26 GMT

    [ https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949883#comment-14949883 ]

Joseph K. Bradley commented on SPARK-10574:
-------------------------------------------

Hi [~simeons], thanks for your thoughts, and sorry about the slow response!

1. It would be great to support any data type, though I think it would be OK to restrict ourselves
to a subset (e.g., Catalyst types) if necessary.

2. I agree it will be hard to support arbitrary types if we go beyond native hashing.  But
I haven't really heard of anyone using HashingTF for something which MurmurHash3 would not
handle.  I figure we can support a reasonable set of types for now and worry about supporting
fancy types later through pluggability (your 3rd comment).

3. I agree we should do this in the long-term.  Defining a trait, and providing a few default
implementations sounds good.  We could do this right away, or a little farther in the future
(in which case we could still provide MurmurHash3 + native hashing via a String option). 
The main reason to delay adding a trait would be the dev cost of making sure it works with
both Scala and Java; we would need to add a few small unit tests.  For either option, we can
start by providing MurmurHash3 (as the default) and the old native hashing method.  Note that
this means changing the default behavior, but based on the tests in your gist, that seems
pretty reasonable (and users could still choose the old hashing method as an option).

One question about your tests: Do you have a sense of how fast each hashing method is?  Is
one significantly slower?
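
To make the speed question concrete: a rough timing sketch like the one below would give a first-order answer (names and term set here are made up for illustration; a real comparison should use a proper harness such as JMH with warmup to avoid JIT artifacts):

```scala
import scala.util.hashing.MurmurHash3

object HashBench {
  // Crude wall-clock timing; JIT warmup is ignored, so treat results as indicative only.
  def time(label: String)(f: => Unit): Long = {
    val start = System.nanoTime()
    f
    val elapsed = System.nanoTime() - start
    println(s"$label: ${elapsed / 1000000} ms")
    elapsed
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical feature terms, just to exercise both hash functions
    val terms = Array.tabulate(100000)(i => s"feature_token_$i")

    // Native hashing, as HashingTF currently uses via ##
    time("native ##") {
      var i = 0
      while (i < terms.length) { terms(i).##; i += 1 }
    }

    // MurmurHash3 over the same strings
    time("MurmurHash3") {
      var i = 0
      while (i < terms.length) { MurmurHash3.stringHash(terms(i)); i += 1 }
    }
  }
}
```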

Assuming MurmurHash3 is reasonably fast, I'd vote for an initial PR which provides setters
using Strings (no trait), with a follow-up PR to add a trait:
{code}
def setHash(method: String)
{code}

Does that sound good?
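
For reference, the trait-based design from point 3 might look roughly like this. These names (`FeatureHasher`, `MurmurHasher`, `NativeHasher`, `forMethod`) are purely illustrative, not a proposed Spark API; the String setter above could dispatch to a default set of implementations like so:

```scala
import scala.util.hashing.MurmurHash3

// Illustrative sketch only: trait and object names are hypothetical.
trait FeatureHasher extends Serializable {
  def hash(term: Any): Int
}

object MurmurHasher extends FeatureHasher {
  // MurmurHash3 over the term's string form; stable across JVMs
  override def hash(term: Any): Int = MurmurHash3.stringHash(term.toString)
}

object NativeHasher extends FeatureHasher {
  // Scala's native ##, matching HashingTF's current behavior
  override def hash(term: Any): Int = term.##
}

object FeatureHasher {
  // A String-based setter could resolve to one of the defaults:
  def forMethod(method: String): FeatureHasher = method match {
    case "murmur3" => MurmurHasher
    case "native"  => NativeHasher
    case other     => throw new IllegalArgumentException(s"Unknown hash method: $other")
  }
}
```

Third parties could then plug in their own `FeatureHasher` for fancier types without any change to `HashingTF` itself, which is the main appeal of doing the trait eventually even if the first PR only exposes Strings.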

> HashingTF should use MurmurHash3
> --------------------------------
>
>                 Key: SPARK-10574
>                 URL: https://issues.apache.org/jira/browse/SPARK-10574
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.5.0
>            Reporter: Simeon Simeonov
>            Priority: Critical
>              Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are two significant problems with this.
> First, per the [Scala documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for {{hashCode}}, the implementation is platform specific. This means that feature vectors created on one platform may differ from vectors created on another platform. This can create significant problems when a model trained offline is used in another environment for online prediction. The problem is made harder by the fact that, following a hashing transform, features lose human-tractable meaning, so a problem such as this may be extremely difficult to track down.
> Second, the native Scala hashing function performs badly on longer strings, exhibiting [200-500% higher collision rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for example, [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$], which is also included in the standard Scala library and is the hashing choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If Spark users apply {{HashingTF}} only to very short, dictionary-like strings, the hashing function choice will not be a big problem, but why have an implementation in MLlib with this limitation when there is a better implementation readily available in the standard Scala library?
> Switching to MurmurHash3 solves both problems. If there is agreement that this is a good change, I can prepare a PR.
> Note that changing the hash function would mean that models saved with a previous version would have to be re-trained. This introduces a problem that's orthogonal to breaking changes in APIs: breaking changes related to artifacts, e.g., a saved model, produced by a previous version. Is there a policy or best practice currently in effect about this? If not, perhaps we should come up with a few simple rules about how we communicate these in release notes, etc.
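
The collision behavior described in the issue can be sanity-checked with a quick sketch like the one below (illustrative only; the linked gist has the actual measurements). It buckets hashes the way a hashing vectorizer would, assuming distinct input terms, and uses 2^20 buckets, which is HashingTF's default {{numFeatures}}:

```scala
import scala.util.hashing.MurmurHash3

object CollisionCheck {
  // Fraction of terms that land in an already-occupied bucket
  def collisionRate(hashes: Seq[Int], buckets: Int): Double = {
    val occupied = hashes.map(h => ((h % buckets) + buckets) % buckets).distinct.size
    1.0 - occupied.toDouble / hashes.size
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical longer feature strings, where native ## is claimed to do worst
    val terms = (0 until 50000).map(i => s"some_longer_feature_string_number_$i")
    val buckets = 1 << 20

    println(f"native ##  : ${collisionRate(terms.map(_.##), buckets)}%.4f")
    println(f"MurmurHash3: ${collisionRate(terms.map(MurmurHash3.stringHash), buckets)}%.4f")
  }
}
```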



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

