lucene-dev mailing list archives

From "Greg Pendlebury (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SOLR-5722) Add catenateShingles option to WordDelimiterFilter
Date Fri, 13 Feb 2015 03:24:12 GMT

    [ https://issues.apache.org/jira/browse/SOLR-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13903781#comment-13903781 ]

Greg Pendlebury edited comment on SOLR-5722 at 2/13/15 3:23 AM:
----------------------------------------------------------------

The link to the doco is working for me today, so I took a quick look. I think the other reason
that the HyphenatedWordsFilter is not suitable is that it removes the hyphen from the material,
assuming that it can only have one meaning. The specific circumstance I am considering is
when the hyphen is part of a legitimately hyphenated word that just happens to break across
a line wrap, e.g. 'up-\{\n\}to-date'.

The HyphenatedWordsFilter would turn this into 'upto-date', so user searches for 'up
to date' would not match, since no filter later in the chain can really pull 'upto' apart again.
The 'catenateShingles' option, by contrast, is intended to preserve the word delimiter and provide
all the permutations a user might type to find that term: "up to date", "upto date", "up todate",
"uptodate".
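
To make those permutations concrete, here is a minimal Java sketch that enumerates them.
It is an illustration only, not the code in WDFconcatShingles.patch: the class name
ShingleVariants and the methods splitOnHyphens and variants are hypothetical, and it works
on plain strings rather than on a Lucene TokenStream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; hypothetical names, not part of the attached patch. */
final class ShingleVariants {

    private ShingleVariants() {}

    /** Splits on hyphens, swallowing any line break or spaces that follow a hyphen. */
    static List<String> splitOnHyphens(String text) {
        return Arrays.asList(text.split("-\\s*"));
    }

    /**
     * Returns every way of typing the parts where each boundary between two
     * neighbouring parts is either kept as a space or closed up.
     */
    static List<String> variants(List<String> parts) {
        List<String> out = new ArrayList<>();
        if (parts.isEmpty()) {
            return out;
        }
        int boundaries = parts.size() - 1;
        // Each bit of mask decides one boundary: 0 = keep a space, 1 = join.
        for (int mask = 0; mask < (1 << boundaries); mask++) {
            StringBuilder sb = new StringBuilder(parts.get(0));
            for (int i = 0; i < boundaries; i++) {
                if ((mask & (1 << i)) == 0) {
                    sb.append(' ');
                }
                sb.append(parts.get(i + 1));
            }
            out.add(sb.toString());
        }
        return out;
    }
}
```

Calling ShingleVariants.variants(ShingleVariants.splitOnHyphens("up-\nto-date")) returns
"up to date", "upto date", "up todate" and "uptodate", in that order. The real filter would
emit tokens with positions and offsets rather than whole strings, so treat this as a picture
of the intended matches, not of the token stream.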


was (Author: gpendleb):
The link to the doco is working for me today so I took a quick look. I think the other reason
that the HyphenatedWordsFilter is not suitable is that it removes the hyphen from the material
assuming that it can only have one meaning. The specific circumstances I am considering is
when the hyphen is part of a legitimately hyphenated word that just happen to break across
a line wrap. eg. 'up-\{\n\}to-date'

The HyphenatedWordsFilter would turn this into 'upto-date', and cause user searches of 'up
to date' to not match, since no filters later in the change can really pull 'upto' apart again.
Whereas the 'catenateShingles' option is intended to preserve the word delimiter and provide
all the permutations a user might type to find that term: "up to date", "upto date", "up todate",
"uptodate"

> Add catenateShingles option to WordDelimiterFilter
> --------------------------------------------------
>
>                 Key: SOLR-5722
>                 URL: https://issues.apache.org/jira/browse/SOLR-5722
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Greg Pendlebury
>            Priority: Minor
>              Labels: filter, newbie, patch
>         Attachments: WDFconcatShingles.patch
>
>
> Apologies if I put this in the wrong spot. I'm attaching a patch (against current trunk)
> that adds support for a 'catenateShingles' option to the WordDelimiterFilter.
> We (National Library of Australia - NLA) are currently maintaining this as an internal
> modification to the Filter, but I believe it is generic enough to contribute upstream.
> Description:
> =========
> {code}
> /**
>  * NLA Modification to the standard word delimiter to support various
>  * hyphenation use cases. Primarily driven by requirements for
>  * newspapers where words are often broken across line endings.
>  *
>  *  e.g. "hyphenated-surname" is printed across a line ending and
>  *         turns out like "hyphen-ated-surname" or "hyphenated-sur-name".
>  *
>  *  In this scenario the stock filter, with 'catenateAll' turned on, will
>  *  generate individual tokens plus one combined token, but not
>  *  sub-tokens like "hyphenated surname" and "hyphenatedsur name".
>  *
>  *  So we add a new 'catenateShingles' to achieve this.
> */
> {code}
> Includes unit tests and, as is noted in one of them, CATENATE_WORDS and CATENATE_SHINGLES
> are logically considered mutually exclusive for sensible usage and can cause duplicate tokens
> (although they should have the same positions etc).
> I'm happy to work on it more if anyone finds problems with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

