flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-1735) Add FeatureHasher to machine learning library
Date Sat, 09 May 2015 18:44:59 GMT

    [ https://issues.apache.org/jira/browse/FLINK-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14536825#comment-14536825 ]

ASF GitHub Bot commented on FLINK-1735:

GitHub user ChristophAl opened a pull request:


    [FLINK-1735] Feature Hasher

    The prototype of the feature hasher.
    - The implementation is based on the scikit-learn feature hasher
    - Test vectors have been generated by scikit-learn as well
    - Currently the implementation only works on Seq[String]

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ChristophAl/flink FLINK-1735_FeatureHasher

Alternatively you can review and apply these changes as the patch at:


To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #665

commit e5ad7e842f443dd4b15fe21f3d1d89c238c882d1
Author: Christoph Alt <christoph.alt@posteo.de>
Date:   2015-05-06T22:10:24Z

    Initial commit Issue #1735

commit 1e9312fdc46b741faea6bdfb26fc4ce359cd1cfa
Author: Christoph Alt <christoph.alt@posteo.de>
Date:   2015-05-08T13:54:53Z

    Added basic testcase for FeatureHasher

commit a0c6ee6251edc4d0e556ba98886a783a072bd27b
Author: Christoph Alt <christoph.alt@posteo.de>
Date:   2015-05-08T13:58:59Z

    FeatureHasher prototype
    - Added a prototype of Feature Hasher, currently accepts Seq[String] only

commit c55eb11fa21943dd8451256755bc707a59c3f5d3
Author: Christoph Alt <christoph.alt@posteo.de>
Date:   2015-05-08T14:09:48Z

    Corrected typos

commit 7002ab9e18a6cca5b55d700967accb375538faad
Author: Christoph Alt <christoph.alt@posteo.de>
Date:   2015-05-09T14:25:42Z

    Moved Featurehasher to feature.extraction package

commit 15b868f08806b375fff564f851f668122d363457
Author: Christoph Alt <christoph.alt@posteo.de>
Date:   2015-05-09T14:31:19Z

    Readded FeatureHasher.scala

commit 38e0650ebdec305c4a51e788699da0809a3b1973
Author: Christoph Alt <christoph.alt@posteo.de>
Date:   2015-05-09T18:36:00Z

    Reformated test vectors


> Add FeatureHasher to machine learning library
> ---------------------------------------------
>                 Key: FLINK-1735
>                 URL: https://issues.apache.org/jira/browse/FLINK-1735
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Felix Neutatz
>              Labels: ML
> Using the hashing trick [1,2] is a common way to vectorize arbitrary feature values.
> The hash of the feature value is used to calculate its index for a vector entry. In order
> to mitigate possible collisions, a second hashing function is used to calculate the sign for
> the update value which is added to the vector entry. This way, it is likely that collisions
> will simply cancel out.
> A feature hasher would also be helpful for NLP problems, where it could be used to vectorize
> bag-of-words or n-gram features.
> Resources:
> [1] [https://en.wikipedia.org/wiki/Feature_hashing]
> [2] [http://scikit-learn.org/stable/modules/feature_extraction.html#feature-extraction]
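
For illustration, the hashing trick described in the issue can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the code from the pull request: the names `hash_features` and `_hash` and the `num_features` parameter are hypothetical, and two salted MD5 digests stand in for the two hash functions (scikit-learn's hasher uses MurmurHash3 instead).

```python
import hashlib
from typing import Iterable, List


def _hash(token: str, salt: str) -> int:
    """Deterministic 32-bit hash of salt + token.

    MD5 is an assumption made for this sketch; it is a stand-in,
    not the hash function used by the pull request or scikit-learn.
    """
    digest = hashlib.md5((salt + token).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little")


def hash_features(tokens: Iterable[str], num_features: int = 16) -> List[float]:
    """Map a sequence of string features to a fixed-size vector."""
    vector = [0.0] * num_features
    for token in tokens:
        # First hash: choose the vector entry for this feature value.
        index = _hash(token, "index:") % num_features
        # Second hash: choose the sign of the update, so that colliding
        # features are likely to cancel out rather than accumulate.
        sign = 1.0 if _hash(token, "sign:") % 2 == 0 else -1.0
        vector[index] += sign
    return vector


if __name__ == "__main__":
    print(hash_features(["apache", "flink", "feature", "hasher", "flink"], 8))
```

The output length depends only on `num_features`, not on the vocabulary, which is what makes the approach attractive for bag-of-words input of unbounded size.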

This message was sent by Atlassian JIRA
