spark-issues mailing list archives

From "Abou Haydar Elias (JIRA)" <>
Subject [jira] [Commented] (SPARK-5874) How to improve the current ML pipeline API?
Date Wed, 18 Mar 2015 12:54:38 GMT


Abou Haydar Elias commented on SPARK-5874:

As of now, the Tokenizer converts the input string to lowercase and then splits it on
whitespace only.
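
For illustration, the behaviour described above is roughly the following. This is a plain-Python sketch, not the actual Spark implementation; the name `simple_tokenize` is hypothetical, and collapsing runs of whitespace is an assumption of this sketch.

```python
def simple_tokenize(text):
    """Lowercase the input, then split it on whitespace."""
    # str.split() with no arguments splits on any run of whitespace
    # and never yields empty strings.
    return text.lower().split()


print(simple_tokenize("The Bikes  are\tFAST"))
```

Nothing beyond lowercasing and splitting happens here, which is why punctuation, accents and plural forms all survive into the tokens.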

I suggest making the Tokenizer pipeline stage more flexible, so that we can eventually add
stemming and other text analysis directly to the Tokenizer.

There are many post-tokenization steps that can be done, including (but not limited to):

- Stemming – Replacing words with their stems. For
instance, with English stemming "bikes" is replaced with "bike", so a query for "bike" can find
both documents containing "bike" and those containing "bikes".
- Stop Words Filtering – Common words like "the", "and" and "a" rarely add any value to
a search. Removing them shrinks the index size and increases performance. It may also reduce
some "noise" and actually improve search quality.
- Text Normalization – Stripping accents
and other character markings can make for better searching.
- Synonym Expansion – Adding in synonyms at the same token position as the current word
can mean better matching when users search with words in the synonym set.
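
A minimal sketch of how the four steps above could be chained after tokenization. It is plain Python rather than a Spark pipeline stage, and everything named here (`STOP_WORDS`, `SYNONYMS`, `strip_accents`, `naive_stem`, `analyze`) is hypothetical. The stemmer is a toy that only strips a trailing plural "s"; a real stage would use a proper stemming algorithm.

```python
import unicodedata

STOP_WORDS = {"the", "and", "a"}    # illustrative subset only
SYNONYMS = {"bike": ["bicycle"]}    # hypothetical synonym table


def strip_accents(token):
    """Text normalization: drop combining marks after decomposition."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def naive_stem(token):
    """Toy stemmer: strips one trailing plural 's'. Not a real algorithm."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyze(text):
    """Return one list of alternatives per surviving token position."""
    positions = []
    for raw in text.lower().split():
        token = strip_accents(raw)
        if token in STOP_WORDS:
            continue  # stop words filtering
        stem = naive_stem(token)
        # Synonyms are placed at the same position as the word they expand.
        positions.append([stem] + SYNONYMS.get(stem, []))
    return positions


print(analyze("The bikes and a Café"))
```

Returning a list of alternatives per position is just one way to model "same token position"; a flat token list with explicit position offsets would work as well.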

So, what do you think?

> How to improve the current ML pipeline API?
> -------------------------------------------
>                 Key: SPARK-5874
>                 URL:
>             Project: Spark
>          Issue Type: Brainstorming
>          Components: ML
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Critical
> I created this JIRA to collect feedback about the ML pipeline API we introduced in Spark
1.2. The target is to graduate this set of APIs in 1.4 with confidence, which requires valuable
input from the community. I'll create sub-tasks for each major issue.
