spark-issues mailing list archives

From "Joseph K. Bradley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-18374) Incorrect words in StopWords/english.txt
Date Thu, 15 Dec 2016 01:31:58 GMT

    [ https://issues.apache.org/jira/browse/SPARK-18374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15750041#comment-15750041 ]

Joseph K. Bradley commented on SPARK-18374:
-------------------------------------------

Oh nice, I didn't realize that was in use.  I'll start doing that.

> Incorrect words in StopWords/english.txt
> ----------------------------------------
>
>                 Key: SPARK-18374
>                 URL: https://issues.apache.org/jira/browse/SPARK-18374
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.0.1
>            Reporter: nirav patel
>            Assignee: yuhao yang
>            Priority: Minor
>              Labels: releasenotes
>             Fix For: 2.2.0
>
>
> I was just double-checking english.txt for the list of stop words, as I felt it was taking out valid tokens like 'won'. I think the issue is that the english.txt list is missing the apostrophe character and all characters after the apostrophe. So "won't" became "won" in that list; "wouldn't" is "wouldn".
> Here are some incorrect tokens in this list:
> won
> wouldn
> ma
> mightn
> mustn
> needn
> shan
> shouldn
> wasn
> weren
> I think the ideal list should have both styles, i.e. "won't" and "wont" should both be part of english.txt, as some tokenizers might remove special characters. But 'won' obviously shouldn't be in this list.
> Here's the list of Snowball English stop words:
> http://snowball.tartarus.org/algorithms/english/stop.txt
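
The effect described in the report can be illustrated with a minimal sketch. It assumes a plain set-based filter rather than Spark's actual stop-word removal code; the two word sets, the tokenizers, and the function names are hypothetical excerpts made up for illustration, not the contents of english.txt.

```python
import re

# Hypothetical excerpt of a list whose contractions were cut at the apostrophe,
# so "won't" ended up as "won" and "wouldn't" as "wouldn".
TRUNCATED_STOP_WORDS = {"the", "a", "won", "wouldn", "mustn"}

# Hypothetical corrected excerpt containing both styles of each contraction,
# as the report suggests ("won't" and "wont"), and no bare "won".
CORRECTED_STOP_WORDS = {
    "the", "a",
    "won't", "wont",
    "wouldn't", "wouldnt",
    "mustn't", "mustnt",
}


def tokenize_keep_apostrophes(text):
    """Lowercase the text and split it into words, keeping apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def tokenize_strip_apostrophes(text):
    """Lowercase the text, drop apostrophes, then split it into words."""
    return re.findall(r"[a-z]+", text.lower().replace("'", ""))


def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in stop_words]


tokens = tokenize_keep_apostrophes("The team won a medal")

# The truncated list drops the valid token "won".
with_truncated = remove_stop_words(tokens, TRUNCATED_STOP_WORDS)

# The corrected list keeps "won".
with_corrected = remove_stop_words(tokens, CORRECTED_STOP_WORDS)

# Both tokenizer styles are covered by the corrected list.
kept_style = remove_stop_words(
    tokenize_keep_apostrophes("We won't go"), CORRECTED_STOP_WORDS
)
stripped_style = remove_stop_words(
    tokenize_strip_apostrophes("We won't go"), CORRECTED_STOP_WORDS
)

print(with_truncated)
print(with_corrected)
print(kept_style, stripped_style)
```

Listing both "won't" and "wont" means the filter behaves the same whether or not the tokenizer strips apostrophes, while leaving "won" out keeps the ordinary verb in the output.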



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

