flink-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Till Rohrmann (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-1735) Add FeatureHasher to machine learning library
Date Thu, 07 May 2015 15:19:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532840#comment-14532840

Till Rohrmann commented on FLINK-1735:

I don't know of any usage for hashing a vector on a vector. Why would you do that instead
of doing some feature selection? But I can also be wrong if you know of a good use case.

> Add FeatureHasher to machine learning library
> ---------------------------------------------
>                 Key: FLINK-1735
>                 URL: https://issues.apache.org/jira/browse/FLINK-1735
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Felix Neutatz
>              Labels: ML
> Using the hashing trick [1,2] is a common way to vectorize arbitrary feature values.
The hash of the feature value is used to calculate its index for a vector entry. In order
to mitigate possible collisions, a second hashing function is used to calculate the sign for
the update value which is added to the vector entry. This way, it is likely that collision
will simply cancel out.
> A feature hasher would also be helpful for NLP problems where it could be used to vectorize
bag of words or ngrams feature vectors.
> Resources:
> [1] [https://en.wikipedia.org/wiki/Feature_hashing]
> [2] [http://scikit-learn.org/stable/modules/feature_extraction.html#feature-extraction]

This message was sent by Atlassian JIRA

View raw message