flink-issues mailing list archives

From "Till Rohrmann (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-1735) Add FeatureHasher to machine learning library
Date Thu, 07 May 2015 08:40:59 GMT

    [ https://issues.apache.org/jira/browse/FLINK-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532251#comment-14532251 ]

Till Rohrmann commented on FLINK-1735:
--------------------------------------

Hi [~Felix Neutatz],

great to see that you have already implemented a feature hasher!

The feature hasher should probably go into a `feature.extraction` package. I'll move the
`PolynomialBase` transformer into the `preprocessing` package where it belongs.

There is no test data for the feature hasher yet, so you should create some.

Usually the result of the feature hasher is very sparse; otherwise the chosen number of
features is too small and the resulting feature vectors won't be meaningful. However, one
could also introduce a threshold defining how many entries have to be non-zero for the
vector to be stored as a `DenseVector`. If the threshold is not exceeded, a `SparseVector`
is used instead.
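The density threshold idea could be sketched like this (plain Python for illustration, not Flink's actual `Vector` API; the function name and the 0.5 threshold are made up for the example):

```python
def to_vector(values, density_threshold=0.5):
    """Return ('dense', list) or ('sparse', {index: value}) depending on
    the fraction of non-zero entries. The 0.5 threshold is an arbitrary
    choice for illustration."""
    non_zero = [(i, v) for i, v in enumerate(values) if v != 0.0]
    if len(non_zero) / len(values) >= density_threshold:
        # Dense enough: store every entry explicitly.
        return ("dense", list(values))
    # Mostly zeros: store only the non-zero entries by index.
    return ("sparse", dict(non_zero))
```

A hashed feature vector with one non-zero out of four entries would come back as `("sparse", {3: 1.0})`, while a mostly filled vector stays dense.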

I have some comments on your implementation: the idea of the feature hasher is to transform
non-numerical data (images, text) into a numerical representation. Thus, defining the `FeatureHasher`
as a `Transformer[Vector, Vector]` is not really useful. It would be better to define it for
textual input or to introduce a type parameter there.

For further comments, it would be good to open a PR, then I can directly comment on the code.

> Add FeatureHasher to machine learning library
> ---------------------------------------------
>
>                 Key: FLINK-1735
>                 URL: https://issues.apache.org/jira/browse/FLINK-1735
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Felix Neutatz
>              Labels: ML
>
> Using the hashing trick [1,2] is a common way to vectorize arbitrary feature values.
The hash of the feature value is used to calculate its index into the vector. To mitigate
possible collisions, a second hash function is used to calculate the sign of the update
value that is added to the vector entry. This way, collisions are likely to simply cancel
out.
> A feature hasher would also be helpful for NLP problems, where it could be used to vectorize
bag-of-words or n-gram features.
> Resources:
> [1] [https://en.wikipedia.org/wiki/Feature_hashing]
> [2] [http://scikit-learn.org/stable/modules/feature_extraction.html#feature-extraction]
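The signed hashing trick described in the issue could be sketched as follows (plain Python for illustration; the helper names and the MD5-based hash are made up for the example and are not Flink's implementation):

```python
import hashlib

def _stable_hash(token, seed):
    # Deterministic hash so the sketch is reproducible across runs.
    digest = hashlib.md5((seed + token).encode("utf-8")).hexdigest()
    return int(digest, 16)

def feature_hash(tokens, num_features=16):
    """Map a list of string tokens to a fixed-size numeric vector.
    One hash picks the index, a second independent hash picks the sign,
    so colliding features tend to cancel out rather than accumulate."""
    vec = [0.0] * num_features
    for token in tokens:
        index = _stable_hash(token, "index") % num_features
        sign = 1.0 if _stable_hash(token, "sign") % 2 == 0 else -1.0
        vec[index] += sign
    return vec
```

A repeated token always lands on the same index with the same sign, so its magnitude grows with its count, while two different tokens that collide on an index have a 50% chance of opposite signs and cancel.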



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
