spark-issues mailing list archives

From "yuhao yang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-8703) Add CountVectorizer as a ml transformer to convert document to words count vector
Date Wed, 01 Jul 2015 08:00:10 GMT

    [ https://issues.apache.org/jira/browse/SPARK-8703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609703#comment-14609703 ]

yuhao yang commented on SPARK-8703:
-----------------------------------

Thanks Joseph. 

It's true that CountVectorizer and HashingTF share similar input and output, yet currently
CountVectorizer does not actually inherit anything useful from HashingTF. I also like
the current clean separation among the feature transformers, so I'm inclined to undo the extension.

As for code reuse, HashingTF simply invokes the version in mllib, and the implementation is
quite straightforward, so refactoring for code reuse may not be necessary.
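
To make the comparison concrete, here is a rough sketch in plain Python (not Spark code) of the two
approaches. The function names, the num_features parameter and the use of crc32 as the hash are
assumptions for illustration only; the real HashingTF in mllib uses its own hash function and
returns a sparse vector rather than a dict.

```python
import zlib
from collections import Counter


def hashing_tf(tokens, num_features=16):
    """Hashing approach: bucket index comes from a hash, no vocabulary is stored."""
    counts = Counter()
    for tok in tokens:
        # crc32 is only a deterministic stand-in for the real hash function.
        idx = zlib.crc32(tok.encode("utf-8")) % num_features
        counts[idx] += 1
    return dict(counts)


def count_vectorize(tokens, vocabulary):
    """Vocabulary approach: index comes from a fixed term list; unknown terms are dropped."""
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(index[tok] for tok in tokens if tok in index)
    return dict(counts)
```

Both functions take a token list and return index-to-count pairs, which is the shared input and
output shape mentioned above, but the core logic of each is only a few lines and has little in common.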

[~viirya] and [~fliang], thanks for your opinions. I'd like to know your thoughts on this.

> Add CountVectorizer as a ml transformer to convert document to words count vector
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-8703
>                 URL: https://issues.apache.org/jira/browse/SPARK-8703
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: yuhao yang
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Converts a text document to a sparse vector of token counts. Similar to http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
> I can further add an estimator to extract vocabulary from corpus if that's appropriate.
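
A minimal sketch of what such a vocabulary-extracting estimator could do, written in plain Python
rather than against the Spark ML API. The name fit_vocabulary and the vocab_size parameter are
hypothetical and are not taken from the issue or from Spark.

```python
from collections import Counter


def fit_vocabulary(corpus, vocab_size=None):
    """Return corpus terms ordered by total frequency, most frequent first.

    corpus is an iterable of token lists; ties are broken alphabetically
    so that the resulting term order is deterministic.
    """
    totals = Counter()
    for doc in corpus:
        totals.update(doc)
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    terms = [term for term, _ in ranked]
    return terms if vocab_size is None else terms[:vocab_size]
```

The returned term list would then serve as the fixed vocabulary that the transformer uses to turn
each document into a count vector.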



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

