flink-issues mailing list archives

From Ronny Bräunlich (JIRA) <j...@apache.org>
Subject [jira] [Commented] (FLINK-1999) TF-IDF transformer
Date Mon, 11 May 2015 11:13:59 GMT

    [ https://issues.apache.org/jira/browse/FLINK-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537822#comment-14537822 ]

Ronny Bräunlich commented on FLINK-1999:
----------------------------------------

Are you sure that the input type should be DataSet[Seq[String]]?
To me that looks like we would always calculate the idf for a single document, which would
be log(1/1) = 0. Or is each element of the sequence supposed to be one document?
If so, would it be wise to always load the full document into memory, or is the DataSet smart
enough to read the file stream-wise?
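
To illustrate the concern, here is a minimal sketch in plain Python (not Flink code). It assumes
the common definition idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the
number of documents containing term t. The function name `idf` and the sample tokens are made up
for illustration only.

```python
import math


def idf(documents):
    """Return {term: log(N / df)} for a list of tokenized documents.

    N is the number of documents, df the number of documents
    in which the term occurs at least once.
    """
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# A corpus of exactly one document: N = 1 and df = 1 for every term,
# so every idf value is log(1/1) = 0.
single = idf([["flink", "is", "fast"]])

# Several documents, each one a sequence of tokens: idf now varies per term.
corpus = idf([
    ["flink", "is", "fast"],
    ["flink", "streams"],
    ["spark", "is", "here"],
])
```

With a single document every weight collapses to zero, so the idf only carries information if the
input holds many documents, each represented as its own token sequence.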

> TF-IDF transformer
> ------------------
>
>                 Key: FLINK-1999
>                 URL: https://issues.apache.org/jira/browse/FLINK-1999
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Ronny Bräunlich
>            Assignee: Alexander Alexandrov
>            Priority: Minor
>              Labels: ML
>
> Hello everybody,
> we are a group of three students from TU Berlin (I guess we're not the first group creating
> an issue) and we want to/have to implement a tf-idf transformer for Flink.
> Our lecturer Alexander told us that we could get some guidance here and that you could
> point us to an old version of a similar transformer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
