lucene-solr-user mailing list archives

From Michael Della Bitta <michael.della.bi...@appinions.com>
Subject Re: WordDelimiter filter, expanding to multiple words, unexpected results
Date Tue, 02 Sep 2014 16:59:49 GMT
Hi Jonathan,

I'm a little confused by this line:

> And, what I think it's trying to do, is match text indexed as "d elalain"
> as well as text indexed by "delalain".

In this case, I don't know how WordDelimiterFilter will help, as you're
likely tokenizing on spaces somewhere, and that input text has a space. I
could be wrong. It's probably best if you post your field definition from
your schema.

Also, is this a free-text field, or something that's more like a short
string?

Thanks,


Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Tue, Sep 2, 2014 at 12:41 PM, Jonathan Rochkind <rochkind@jhu.edu> wrote:

> Hello, I'm running into a case where a query is not returning the results
> I expect, and I'm hoping someone can offer some explanation that might help
> me fine tune things or understand what's up.
>
> I am running Solr 4.3.
>
> My filter chain includes a WordDelimiterFilter and, later a filter that
> downcases everything for case-insensitive searching. It includes many other
> things too, but I think these are the pertinent facts.
>
> For query "dELALAIN", the WordDelimiterFilter splits into:
>
> text: d
> start: 0
> position: 1
>
> text: ELALAIN
> start: 1
> position: 2
>
> text: dELALAIN
> start: 0
> position: 2
>
> Note the duplication/overlap of the tokens -- one version with "d" and
> "ELALAIN" split into two tokens, and another with just one token.
>
> Later, all the tokens are lowercased by another filter in the chain.
> (actually an ICU filter which is doing something more complicated than just
> lowercasing, but I think we can consider it lowercasing for the purposes of
> this discussion).
>
> If I understand right what the WordDelimiterFilter is trying to do here,
> it's probably doing something special because of the lowercase "d" followed
> by an uppercase letter, a special case for that. (I don't get this behavior
> with other mixed case queries not beginning with 'd').
>
> And, what I think it's trying to do, is match text indexed as "d elalain"
> as well as text indexed by "delalain".
>
> The problem is, it's not accomplishing that -- it is NOT matching text
> that was indexed as "delalain" (one token).
>
> I don't entirely understand what the "position" attribute is for -- but I
> wonder if in this case, the position on "dELALAIN" is really supposed to be
> 1, not 2?  Could that be responsible for the bug?  Or is position
> irrelevant in this case?
>
> If that's not it, then I'm at a loss as to what may be causing this bug --
> or even if it's a bug at all, or I'm just not understanding intended
> behavior. I expect a query for "dELALAIN" to match text indexed as
> "delalain" (because of the forced lowercasing in the filter chain). But
> it's not doing so. Are my expectations wrong? Bug? Something else?
>
> Thanks for any advice,
>
> Jonathan
>
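
For readers following the thread, the token listing in the quoted message can be reproduced with a small toy model. This is only a sketch in Python, not Lucene's WordDelimiterFilter code: the function names are hypothetical, the only rule modelled is a split where a lowercase letter is followed by an uppercase one, and placing the whole-word token at the position of the last part is an assumption chosen to match the listing above.

```python
def split_on_case_change(token):
    """Return (text, start_offset) parts, splitting at lower->upper transitions."""
    parts, start = [], 0
    for i in range(1, len(token)):
        if token[i - 1].islower() and token[i].isupper():
            parts.append((token[start:i], start))
            start = i
    parts.append((token[start:], start))
    return parts


def toy_word_delimiter(token):
    """Toy stand-in for the filter: emit each part, then the whole word.

    Assumption: the whole-word token shares the position of the last part,
    which is what the listing in the quoted message shows.
    """
    parts = split_on_case_change(token)
    tokens = [
        {"text": text, "start": offset, "position": pos}
        for pos, (text, offset) in enumerate(parts, start=1)
    ]
    if len(parts) > 1:
        tokens.append({"text": token, "start": 0, "position": len(parts)})
    return tokens


def lowercase(tokens):
    """Stand-in for the later lowercasing filter in the chain."""
    return [dict(t, text=t["text"].lower()) for t in tokens]


query_tokens = lowercase(toy_word_delimiter("dELALAIN"))
index_tokens = lowercase(toy_word_delimiter("delalain"))

for t in query_tokens:
    print("query:", t)
for t in index_tokens:
    print("index:", t)
```

In this toy model the query side yields "d" at position 1, then "elalain" and "delalain" both at position 2, while text indexed as plain "delalain" yields a single token at position 1. Whether that difference explains the missed match depends on how the query parser combines the tokens, which the thread does not settle at this point.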
