flink-issues mailing list archives

From "Felix Neutatz (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-1735) Add FeatureHasher to machine learning library
Date Thu, 07 May 2015 16:32:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532954#comment-14532954 ]

Felix Neutatz commented on FLINK-1735:
--------------------------------------

I guess you are right. I have put the version with Seq[String] into a pull request. We still
have to think about a nice test case and how to generalize it.

> Add FeatureHasher to machine learning library
> ---------------------------------------------
>
>                 Key: FLINK-1735
>                 URL: https://issues.apache.org/jira/browse/FLINK-1735
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Felix Neutatz
>              Labels: ML
>
> Using the hashing trick [1,2] is a common way to vectorize arbitrary feature values.
> The hash of the feature value is used to calculate its index for a vector entry. In order
> to mitigate possible collisions, a second hash function is used to calculate the sign of
> the update value that is added to the vector entry. This way, the contributions of
> colliding features tend to cancel out rather than accumulate.
> A feature hasher would also be helpful for NLP problems, where it could be used to vectorize
> bag-of-words or n-gram features.
> Resources:
> [1] [https://en.wikipedia.org/wiki/Feature_hashing]
> [2] [http://scikit-learn.org/stable/modules/feature_extraction.html#feature-extraction]
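
For illustration, the signed hashing trick described above could be sketched in plain Scala roughly as follows. This is only a minimal sketch under stated assumptions, not the code from the pull request and not the FlinkML API: the object name FeatureHasherSketch, the method hash, the parameter numFeatures, and the two seeds are hypothetical, and MurmurHash3 with two different seeds merely stands in for the two hash functions.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical helper (not part of Flink): hashes a Seq[String] of feature
// values into a dense vector of fixed size using the signed hashing trick.
object FeatureHasherSketch {
  // Assumption: two arbitrary seeds act as two different hash functions.
  private val IndexSeed = 42
  private val SignSeed = 1337

  def hash(features: Seq[String], numFeatures: Int): Array[Double] = {
    require(numFeatures > 0, "numFeatures must be positive")
    val vector = new Array[Double](numFeatures)
    for (feature <- features) {
      // First hash: pick the vector index; floorMod keeps it in [0, numFeatures).
      val index = java.lang.Math.floorMod(
        MurmurHash3.stringHash(feature, IndexSeed), numFeatures)
      // Second hash: pick the sign of the update, +1.0 or -1.0.
      val sign =
        if ((MurmurHash3.stringHash(feature, SignSeed) & 1) == 0) 1.0 else -1.0
      vector(index) += sign
    }
    vector
  }
}
```

A call such as FeatureHasherSketch.hash(Seq("the", "quick", "the"), 16) would return a 16-element vector. Repeated tokens land on the same index with the same sign, while two different tokens that collide on an index may carry opposite signs and offset each other.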



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
